Implementing the Steepest Descent Algorithm in Python from Scratch
By Nicolo Cosimo Albanese, February 2023


Image by author.
  1. Introduction
  2. The steepest descent algorithm
    2.1 The search direction
    2.2 The step size
    2.3 The algorithm
  3. Implementation
    3.1 Constant step size
    3.2 Line search with the Armijo condition
  4. Conclusions

Optimization is the process of finding the set of variables x that minimizes or maximizes an objective function f(x). Since maximizing a function is equivalent to minimizing its negative, we may focus on minimization problems alone:

minₓ f(x)

For our example, let us define a quadratic, multivariable objective function f(x) as follows:

f(x) = 0.5 (x₀ - 4.5)² + 2.5 (x₁ - 2.3)²

Its gradient ∇f(x) is

∇f(x) = [x₀ - 4.5, 5 (x₁ - 2.3)]ᵀ

import numpy as np

def f(x):
    '''Objective function'''
    return 0.5*(x[0] - 4.5)**2 + 2.5*(x[1] - 2.3)**2

def df(x):
    '''Gradient of the objective function'''
    return np.array([x[0] - 4.5, 5*(x[1] - 2.3)])
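
As a quick sanity check (not part of the original walkthrough), we can verify the analytical gradient against a central finite-difference approximation; the helper below is a minimal sketch, and the name check_gradient is ours:

def check_gradient(f, df, x, eps=1e-6):
    '''Compare the analytical gradient df(x) with a central finite-difference estimate.'''
    numerical = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        h = np.zeros_like(x, dtype=float)
        h[i] = eps
        numerical[i] = (f(x + h) - f(x - h)) / (2 * eps)
    return np.allclose(numerical, df(x), atol=1e-5)

check_gradient(f, df, np.array([1.0, -2.0]))  # expected: True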

One may leverage the helpful scipy.optimize.minimize function from the popular SciPy library to rapidly find the optimum:

from scipy.optimize import minimize

result = minimize(
    f, np.zeros(2), method='trust-constr', jac=df)

result.x

array([4.5, 2.3])

We can plot the objective function and its minimizer:

import matplotlib.pyplot as plt

# Prepare the objective function between -10 and 10
X, Y = np.meshgrid(np.linspace(-10, 10, 20), np.linspace(-10, 10, 20))
Z = f(np.array([X, Y]))

# Minimizer
min_x0, min_x1 = np.meshgrid(result.x[0], result.x[1])
min_z = f(np.stack([min_x0, min_x1]))

# Plot
fig = plt.figure(figsize=(15, 20))

# First subplot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.contour3D(X, Y, Z, 60, cmap='viridis')
ax.scatter(min_x0, min_x1, min_z, marker='o', color='red', linewidth=10)
ax.set_xlabel('$x_{0}$')
ax.set_ylabel('$x_{1}$')
ax.set_zlabel('$f(x)$')
ax.view_init(40, 20)

# Second subplot
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.contour3D(X, Y, Z, 60, cmap='viridis')
ax.scatter(min_x0, min_x1, min_z, marker='o', color='red', linewidth=10)
ax.set_xlabel('$x_{0}$')
ax.set_ylabel('$x_{1}$')
ax.set_zlabel('$f(x)$')
ax.axes.zaxis.set_ticklabels([])
ax.view_init(90, -90);

Image by author.

Let us now introduce the steepest descent algorithm and implement it from scratch. Our goal is to solve the optimization problem and find the minimizer [4.5, 2.3].

To solve the optimization problem minₓ f(x), we start from some point x₀ in the coordinate space. From there, we move iteratively towards a better approximation of the minimum of f(x) along a search direction p:

xₖ₊₁ = xₖ + αₖ pₖ

In this expression:

  • x is the input variable;
  • p is the search direction;
  • α > 0 is the step size or step length; it describes how much we should move along the direction p in each iteration k.

This approach requires a proper selection of the step size α and the search direction p.

2.1. The search direction

As search direction, the steepest descent algorithm uses the negative gradient -∇f(xₖ) evaluated at the current iterate xₖ. This is a natural choice, since the negative gradient of a function points in the direction in which the function decreases most rapidly.

Therefore, we can rewrite our update rule as:

xₖ₊₁ = xₖ - αₖ ∇f(xₖ)

Since the minimum is a stationary point, it is reasonable to stop the algorithm when the norm of the gradient is smaller than a given tolerance: if the gradient reaches zero, we may have found a minimum.

2.2. The step size

As we are trying to minimize f(x), the ideal step size α is the minimizer of the following objective function φ(α):

φ(α) = f(xₖ + α pₖ),  α > 0

Unfortunately, this would require solving an additional optimization problem, min_α φ(α), nested inside each iteration of the outer problem minₓ f(x). Moreover, although φ(α) is univariate, finding its exact minimizer may require too many evaluations of f(x) and its gradient.
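
As a side note not in the original article: for our specific quadratic, the exact minimizer of φ(α) along the steepest descent direction happens to have a closed form, α = ∇f(x)ᵀ∇f(x) / (∇f(x)ᵀ H ∇f(x)), where H = diag(1, 5) is the constant Hessian of f. A minimal sketch, useful only as a reference for this example:

# Hessian of our example objective (assumed known and constant)
H = np.array([[1.0, 0.0],
              [0.0, 5.0]])

def exact_step(gradient_x, H=H):
    '''Closed-form minimizer of phi(alpha) along -gradient_x for a quadratic objective.'''
    return (gradient_x @ gradient_x) / (gradient_x @ H @ gradient_x)

exact_step(df(np.array([-9.0, -9.0])))  # ≈ 0.21 at the starting point used later

In general, however, the Hessian is unknown or too expensive to use, which is exactly why we turn to an inexact condition.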

In brief, we are looking for a trade-off between the quality of the chosen α and the time needed to choose it. To this aim, we can simply pick a value of the step size that guarantees a decrease of the objective function by at least a certain amount. Indeed, a popular inexact line search condition states that α should lead to a sufficient decrease of f(x) through the following inequality:

f(xₖ + α pₖ) ≤ f(xₖ) + c α ∇f(xₖ)ᵀ pₖ

In the literature, this is known as the sufficient decrease or Armijo condition, and it belongs to the set of Wolfe conditions.

The constant c is chosen to be small; a common value is 10^-4.

Since steepest descent uses the negative gradient -∇f(xₖ) as search direction pₖ, the term ∇f(xₖ)ᵀ pₖ equals the negative squared norm of the gradient (in Python: -np.linalg.norm(df(x))**2). Therefore, our sufficient decrease condition becomes:

f(xₖ - α ∇f(xₖ)) ≤ f(xₖ) - c α ||∇f(xₖ)||²
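
Purely as an illustration (the full backtracking routine appears in Section 3.2), this condition can be expressed as a small boolean helper; the name sufficient_decrease is our own and is not used in the rest of the code:

def sufficient_decrease(x, alpha, c=1e-4):
    '''Check the Armijo condition for the steepest descent direction p = -df(x).'''
    gradient_x = df(x)
    return f(x - alpha * gradient_x) <= f(x) - c * alpha * np.linalg.norm(gradient_x)**2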

2.3. The algorithm

We can now write down the steps of the steepest descent method:

  1. Select a starting point x = x₀
  2. Select a maximum number of iterations M
  3. Select a tolerance tol close to zero to evaluate the gradient
  4. Set the iteration counter n = 1
  5. Repeat in loop:
    5.1 Update α through the Armijo condition (line search)
    5.2 Construct the next point x = x - α ⋅ ∇f(x)
    5.3 Evaluate the new gradient ∇f(x)
    5.4 Update the iteration counter n = n + 1
    5.5 If the norm of the current gradient is sufficiently small ||∇f(x)|| < tol or the maximum number of iterations has been reached n = M, then exit the loop
  6. Return x

Let us implement it in Python.

In this section, we share an implementation of the steepest descent algorithm. In particular, we proceed in two steps:

  1. We start with a constant step size, and then
  2. we add the line search with the Armijo condition.

3.1 Constant step size

Let us begin by implementing a simplified version of our iterative scheme,

xₖ₊₁ = xₖ - α ∇f(xₖ),

where we apply a constant value of the step size α through all iterations. Our purpose is to verify in practice that convergence is not guaranteed for just any constant α, which motivates the line search:

def steepest_descent(gradient, x0 = np.zeros(2), alpha = 0.01, max_iter = 10000, tolerance = 1e-10):
    '''
    Steepest descent with constant step size alpha.

    Args:
      - gradient: gradient of the objective function
      - x0: initial guess for x_0 and x_1 (default values: zero) <numpy.ndarray>
      - alpha: constant step size (default: 0.01)
      - max_iter: maximum number of iterations (default: 10000)
      - tolerance: minimum gradient magnitude at which the algorithm stops (default: 1e-10)

    Out:
      - results: <numpy.ndarray> of size (n_iter, 2) with x_0 and x_1 values at each iteration
    '''

    # Initialize the iteration counter
    iter_count = 1

    # Prepare array to store results at each iteration
    results = np.array([])

    # Evaluate the gradient at the starting point
    gradient_x = gradient(x0)

    # Set the initial point
    x = x0
    results = np.append(results, x, axis=0)

    # Iterate until the gradient is below the tolerance or the maximum number of iterations is reached
    # Stopping criterion: inf norm of the gradient (max abs)
    while any(abs(gradient_x) > tolerance) and iter_count < max_iter:

        # Update the current point by moving in the direction of the negative gradient
        x = x - alpha * gradient_x

        # Store the result
        results = np.append(results, x, axis=0)

        # Evaluate the gradient at the new point
        gradient_x = gradient(x)

        # Increment the iteration counter
        iter_count += 1

    # Return the points obtained at each iteration
    return results.reshape(-1, 2)

Let us use this function to solve our optimization task:

estimate = steepest_descent(df, x0 = np.array([-9, -9]), alpha=0.30)

print('Final results: {}'.format(estimate[-1]))
print('N° iterations: {}'.format(len(estimate)))

Final results: [4.5 2.3]
N° iterations: 73

With α = 0.3, the minimum is reached in 73 iterations. We can plot the points x at each iteration:

# Steepest descent steps
X_estimate, Y_estimate = estimate[:, 0], estimate[:, 1]
Z_estimate = f(np.array([X_estimate, Y_estimate]))

# Plot
fig = plt.figure(figsize=(20, 20))

# First subplot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.contour3D(X, Y, Z, 60, cmap='viridis')
ax.plot(X_estimate, Y_estimate, Z_estimate, color='red', linewidth=3)
ax.scatter(min_x0, min_x1, min_z, marker='o', color='red', linewidth=10)
ax.set_xlabel('$x_{0}$')
ax.set_ylabel('$x_{1}$')
ax.set_zlabel('$f(x)$')
ax.view_init(20, 20)

# Second subplot
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.contour3D(X, Y, Z, 60, cmap='viridis')
ax.plot(X_estimate, Y_estimate, Z_estimate, color='red', linewidth=3)
ax.scatter(min_x0, min_x1, min_z, marker='o', color='red', linewidth=10)
ax.set_xlabel('$x_{0}$')
ax.set_ylabel('$x_{1}$')
ax.set_zlabel('$f(x)$')
ax.axes.zaxis.set_ticklabels([])
ax.view_init(90, -90);

Image by author.

We observe that steepest descent follows a distinctive “zig-zag” path towards the minimum. This is due to the choice of the negative gradient -∇f(x) as search direction p.

As we discussed, convergence is not guaranteed for any value of the step size. In the previous example, we applied a constant step size α = 0.3, but what would have happened with a different choice?

# Step sizes to be tested
alphas = [0.01, 0.25, 0.3, 0.35, 0.4]

# Store the iterations for each step size
X_estimates, Y_estimates, Z_estimates = [], [], []

# Plot f(x) at each iteration for different step sizes
fig, ax = plt.subplots(len(alphas), figsize=(8, 9))
fig.suptitle('$f(x)$ at each iteration for different $α$')

# For each step size
for i, alpha in enumerate(alphas):

    # Steepest descent
    estimate = steepest_descent(
        df, x0 = np.array([-5, -5]), alpha=alpha, max_iter=3000)

    # Print results
    print('Input alpha: {}'.format(alpha))
    print('\t- Final results: {}'.format(estimate[-1].round(1)))
    print('\t- N° iterations: {}'.format(len(estimate)))

    # Store for 3D plots
    X_estimates.append(estimate[:, 0])
    Y_estimates.append(estimate[:, 1])
    Z_estimates.append(f(np.array([estimate[:, 0], estimate[:, 1]])))

    # Subplot of f(x) at each iteration for the current alpha
    ax[i].plot([f(var) for var in estimate], label='alpha: '+str(alpha))
    ax[i].axhline(y=0, color='r', alpha=0.7, linestyle='dashed')
    ax[i].set_xlabel('Number of iterations')
    ax[i].set_ylabel('$f(x)$')
    ax[i].set_ylim([-10, 200])
    ax[i].legend(loc='upper right')

Input alpha: 0.01
- Final results: [4.5 2.3]
- N° iterations: 2517
Input alpha: 0.25
- Final results: [4.5 2.3]
- N° iterations: 89
Input alpha: 0.3
- Final results: [4.5 2.3]
- N° iterations: 72
Input alpha: 0.35
- Final results: [4.5 2.3]
- N° iterations: 94
Input alpha: 0.4
- Final results: [4.5 9.6]
- N° iterations: 3000
Image by author.

When the step size is too large (α = 0.4), the algorithm does not converge: xₖ keeps oscillating without reaching the minimum until the maximum number of iterations is hit.
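
This behavior can also be derived by hand for our separable quadratic (a remark of ours, not in the original article): with a constant step α, each update multiplies the error in x₀ by (1 - α) and the error in x₁ by (1 - 5α). Convergence requires both factors to be below 1 in absolute value, i.e. 0 < α < 0.4. At exactly α = 0.4, the second factor equals -1, so x₁ bounces forever between -5 and 9.6, which matches the spurious result printed above. A short sketch of this check:

# Contraction factors of the constant-step update for the separable quadratic:
#   x0_{k+1} - 4.5 = (1 - alpha)   * (x0_k - 4.5)
#   x1_{k+1} - 2.3 = (1 - 5*alpha) * (x1_k - 2.3)
for alpha in [0.01, 0.25, 0.3, 0.35, 0.4]:
    factors = np.abs([1 - alpha, 1 - 5 * alpha])
    print(alpha, factors, 'converges' if factors.max() < 1 else 'does not converge')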

We can better appreciate this behavior by observing the points x at each iteration:

fig = plt.figure(figsize=(25, 60))

# For each step size
for i in range(0, len(alphas)):

    # First subplot
    ax = fig.add_subplot(5, 2, (i*2)+1, projection='3d')
    ax.contour3D(X, Y, Z, 60, cmap='viridis')
    ax.plot(X_estimates[i], Y_estimates[i], Z_estimates[i], color='red', label='alpha: '+str(alphas[i]), linewidth=3)
    ax.scatter(min_x0, min_x1, min_z, marker='o', color='red', linewidth=10)
    ax.set_xlabel('$x_{0}$')
    ax.set_ylabel('$x_{1}$')
    ax.set_zlabel('$f(x)$')
    ax.view_init(20, 20)
    plt.legend(prop={'size': 15})

    # Second subplot
    ax = fig.add_subplot(5, 2, (i*2)+2, projection='3d')
    ax.contour3D(X, Y, Z, 60, cmap='viridis')
    ax.plot(X_estimates[i], Y_estimates[i], Z_estimates[i], color='red', label='alpha: '+str(alphas[i]), linewidth=3)
    ax.scatter(min_x0, min_x1, min_z, marker='o', color='red', linewidth=10)
    ax.set_xlabel('$x_{0}$')
    ax.set_ylabel('$x_{1}$')
    ax.set_zlabel('$f(x)$')
    ax.axes.zaxis.set_ticklabels([])
    ax.view_init(90, -90)
    plt.legend(prop={'size': 15})

Image by author.

To guarantee the convergence of steepest descent, we need to update α iteratively through a line search.

3.2 Line search with the Armijo condition

Let us modify our previous method by adding the Armijo rule. In the steepest descent loop, before computing the next point x - α ⋅ ∇f(x), we need to choose a suitable step size: we start from the initial guess α = 1 and halve its value until the Armijo condition is met. This procedure is called backtracking line search.

def line_search(step, x, gradient_x, c = 1e-4, tol = 1e-8):
    '''
    Inexact line search where the step length is updated through the Armijo condition:
    f(x_k + α p_k) ≤ f(x_k) + c α ∇f(x_k)^T p_k

    Args:
      - step: starting alpha value
      - x: current point
      - gradient_x: gradient of the current point
      - c: constant value; default: 1e-4
      - tol: tolerance value
    Out:
      - New value of step: the first value found respecting the Armijo condition
    '''
    f_x = f(x)
    gradient_square_norm = np.linalg.norm(gradient_x)**2

    # Until the sufficient decrease condition is met
    while f(x - step * gradient_x) >= (f_x - c * step * gradient_square_norm):

        # Update the step size (backtracking)
        step /= 2

        # If the step size falls below a certain tolerance, exit the loop
        if step < tol:
            break

    return step

def steepest_descent(gradient, x0 = np.zeros(2), max_iter = 10000, tolerance = 1e-10):
    '''
    Steepest descent with alpha updated through a line search (Armijo condition).

    Args:
      - gradient: gradient of the objective function
      - x0: initial guess for x_0 and x_1 (default values: zero) <numpy.ndarray>
      - max_iter: maximum number of iterations (default: 10000)
      - tolerance: minimum gradient magnitude at which the algorithm stops (default: 1e-10)

    Out:
      - results: <numpy.ndarray> of size (n_iter, 2) with x_0 and x_1 values at each iteration
    '''

    # Initialize the iteration counter
    iter_count = 1

    # Prepare array to store results at each iteration
    results = np.array([])

    # Evaluate the gradient at the starting point
    gradient_x = gradient(x0)

    # Set the initial point
    x = x0
    results = np.append(results, x, axis=0)

    # Iterate until the gradient is below the tolerance or the maximum number of iterations is reached
    # Stopping criterion: inf norm of the gradient (max abs)
    while any(abs(gradient_x) > tolerance) and iter_count < max_iter:

        # Update the step size through the Armijo condition
        # Note: the first value of alpha is commonly set to 1
        alpha = line_search(1, x, gradient_x)

        # Update the current point by moving in the direction of the negative gradient
        x = x - alpha * gradient_x

        # Store the result
        results = np.append(results, x, axis=0)

        # Evaluate the gradient at the new point
        gradient_x = gradient(x)

        # Increment the iteration counter
        iter_count += 1

    # Return the points obtained at each iteration
    return results.reshape(-1, 2)
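
Before running the full algorithm, we can call the backtracking routine on its own to see it at work. For instance, at the starting point [-9, -9] used below, halving from α = 1 stops at α = 0.25, the first value satisfying the sufficient decrease condition (this value follows from applying the functions defined above):

alpha = line_search(1, np.array([-9, -9]), df(np.array([-9, -9])))
print(alpha)  # 0.25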

Let us now optimize the objective function using steepest descent with line search:

# Steepest descent
estimate = steepest_descent(
    df, x0 = np.array([-9, -9]), max_iter=3000)

# Print results
print('- Final results: {}'.format(estimate[-1].round(1)))
print('- N° iterations: {}'.format(len(estimate)))

# Steepest descent steps
X_estimate, Y_estimate = estimate[:, 0], estimate[:, 1]
Z_estimate = f(np.array([X_estimate, Y_estimate]))

# Plot
fig = plt.figure(figsize=(20, 20))

# First subplot
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.contour3D(X, Y, Z, 60, cmap='viridis')
ax.plot(X_estimate, Y_estimate, Z_estimate, color='red', linewidth=3)
ax.scatter(min_x0, min_x1, min_z, marker='o', color='red', linewidth=10)
ax.set_xlabel('$x_{0}$')
ax.set_ylabel('$x_{1}$')
ax.set_zlabel('$f(x)$')
ax.view_init(20, 20)

# Second subplot
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.contour3D(X, Y, Z, 60, cmap='viridis')
ax.plot(X_estimate, Y_estimate, Z_estimate, color='red', linewidth=3)
ax.scatter(min_x0, min_x1, min_z, marker='o', color='red', linewidth=10)
ax.set_xlabel('$x_{0}$')
ax.set_ylabel('$x_{1}$')
ax.set_zlabel('$f(x)$')
ax.axes.zaxis.set_ticklabels([])
ax.view_init(90, -90);

- Final results: [4.5 2.3]
- N° iterations: 56
Image by author.

We reached the minimum in 56 iterations, while a constant step size α either required more iterations or did not converge at all.

In this post, we introduced and implemented the steepest descent method, and tested it on a quadratic function of two variables. In particular, we showed how to iteratively update the step size with the sufficient decrease (Armijo) condition.

The steepest descent with the Armijo line search is guaranteed to converge, but it is, in general, quite slow, as it requires a large number of iterations.

In future posts, we will explore a modification of this basic algorithm by changing the line search strategy and providing a different initialization approach for the step size.

