Gradient Descent Algorithm 101
by Pol Marin · April 2023


Beginner-friendly guide

Understand the optimization algorithm widely used in Machine and Deep Learning

The slope of a mountain — Photo by Ralph (Ravi) Kayden on Unsplash

Imagine you are a drop of water on top of a mountain, and your goal is to get to the lake right at the base of the mountain. That tall mountain has different slopes and obstacles, so going straight down might not be the best solution. How would you approach this problem? Arguably the best solution is to take little steps, one at a time, always heading in the direction that brings you closer to your end goal.

Gradient Descent (GD) is the algorithm that does just that, and it is essential for any data scientist to understand. It’s basic and rather simple but crucial, and anyone willing to enter the field should be able to explain what it is.

In this post, my goal is to write a complete, beginner-friendly guide that explains what GD is, what it's used for, how it works, and what some of its variations are.

As always, you’ll find the resources section at the end of the post.

But first things first.

Introduction

Using Wikipedia’s definition [1], gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. Even though it’s not always the most efficient method, it’s commonly used in Machine Learning and Deep Learning, especially in Neural Networks.

It’s basically used to minimize the value of a function by updating a set of parameters on each iteration. Mathematically speaking, it uses the derivative (the gradient) to gradually decrease (descend) the function’s value.

But there’s a catch: not every function can be optimized this way. We need a function — either univariate or multivariate — that is differentiable, meaning derivatives exist at each point in its domain, and ideally convex (U-shaped or similar). If it isn’t convex — as in the example below — gradient descent only guarantees reaching a local minimum.
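To make this concrete, the generic update for a parameter vector θ at iteration k is:

θ_{k+1} = θ_k − 𝜆 ∇f(θ_k)

where 𝜆 is a small step size called the learning rate and ∇f(θ_k) is the gradient of the function at the current parameters. We’ll see this same rule, written out for two concrete variables, later in the post.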

Now, after this simple introduction, we can start digging a little bit deeper into the math behind it.

Practical Case

Because everything gets clearer when we go beyond the theory, let’s use real numbers and values to understand what gradient descent does.

Let’s use a common data science case in which we want to develop a regression model.

Disclaimer: I made this example up; there’s no particular reasoning behind these functions, they were chosen arbitrarily. The goal is to show the process itself.

The cost function or loss function in any data science problem is the function we want to optimize. As we’re using regression, we’re going to use this one:

Random regression function — Image by the author
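Written out, the function we’ll work with — the same one implemented in the code later in the post, with a = 1 (the value used in the plot below) — is:

f(x, y) = (x² − 1)² + y²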

The goal is to find the minimum of f(x,y). Let me plot what it looks like:

f(x,y) plotted with a=1 — Image by the author

Now our goal is to get the values of “x” and “y” that minimize this cost function. We can already read them off the plot:

  • y=0
  • x being either -1 or 1

Now, on to GD itself, because we want our machine to learn to do the same.

The Algorithm

As said, gradient descent is an iterative process in which we compute the gradient and move in the opposite direction. The reasoning is that the gradient of a function points in the direction of steepest ascent. Since we want to move down, not up, we move the opposite way.

It’s a simple process in which we update x and y on each iteration, following the rule below:

Parameter update in gradient descent — Image by the author
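Written out for our two variables, the update rule is:

x_{k+1} = x_k − 𝜆 · ∂f/∂x(x_k, y_k)
y_{k+1} = y_k − 𝜆 · ∂f/∂y(x_k, y_k)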

Explained in words, at iteration k:

  1. Compute the gradient using the values of x and y at that iteration.
  2. For each of the variables — x and y — multiply its partial derivative by lambda (𝜆), a float called the learning rate.
  3. Subtract the values computed in step 2 from x and y, respectively.
  4. Use these new values of x and y in the next iteration.

This process is repeated until a certain stopping condition is met (we’ll define a simple one below). Once that happens, the optimization finishes, and we are (or should be) at a minimum — either local or global.

Now, let’s put this theory into practice.

The first thing we need to do is compute the gradient of f(x,y). The gradient corresponds to a vector of partial derivatives:

Gradient of f(x,y) — Image by the author
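Written out, and matching the df_x and df_y functions in the code below, the gradient is:

∇f(x, y) = ( ∂f/∂x , ∂f/∂y ) = ( 4x(x² − 1) , 2y )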

Now, using Python, all I’m going to do is create a loop that iteratively computes the gradient — using the corresponding x and y — and updates these parameters as specified above.

Before that, I’ll define two more values:

  • The learning rate (𝜆) can be fixed or adaptive (changing over the iterations). For this simple tutorial, it’ll be fixed at 0.01.
  • I’ll also use a value called eps (epsilon) to determine when to stop iterating: once both partial derivatives are below this threshold in absolute value, the gradient descent stops. I’m setting it to 0.0001.

Now, let’s do some code:

import random

# Define constants
eps = 0.0001  # stopping threshold (absolute value of each partial derivative)
lr = 0.01     # learning rate (lambda)

# Initialize x and y with random values
x = random.uniform(-2, 4)
y = random.uniform(-1, 1)

# Cost function f(x, y) = (x^2 - 1)^2 + y^2
def f(x, y):
    return (x**2 - 1)**2 + y**2

# Partial derivative of f with respect to x
def df_x(x):
    return 4 * x * (x**2 - 1)

# Partial derivative of f with respect to y
def df_y(y):
    return 2 * y

# Perform gradient descent until both partial derivatives
# are smaller than eps in absolute value
while max(abs(df_x(x)), abs(df_y(y))) >= eps:
    x = x - lr * df_x(x)
    y = y - lr * df_y(y)

# Print optimal values found
print(f'x = {x}, y = {y}')

The output of one random run was:

Sample GD output — Image by the author

We can see these values are pretty close to x = 1 and y = 0, which is indeed one of the function’s minima (depending on the random initialization, a run can converge to x = -1 instead).

One thing I forgot to mention is the initialization of x and y. I chose to generate random numbers within arbitrary ranges. In real-world problems, it’s worth spending more time on this choice. The same goes for the learning rate, the stopping condition, and many other hyperparameters.

But for our case, this was more than enough.
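As a small extra — just a sketch reusing the f, df_x, df_y, lr and eps defined above, not something you strictly need here — one simple way to be less dependent on a single random initialization is to run gradient descent from several random starting points and keep the one with the lowest cost:

# Hypothetical extension: several random restarts, keeping the lowest cost
best = None
for _ in range(10):
    x = random.uniform(-2, 4)
    y = random.uniform(-1, 1)
    while max(abs(df_x(x)), abs(df_y(y))) >= eps:
        x = x - lr * df_x(x)
        y = y - lr * df_y(y)
    if best is None or f(x, y) < best[0]:
        best = (f(x, y), x, y)

print(f'best cost = {best[0]:.6f} at x = {best[1]:.4f}, y = {best[2]:.4f}')

For a non-convex function like ours, different starts can end up in different minima, so this is a cheap way to explore both of them.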

Variations of Gradient Descent

I’m sure you now understand the basic algorithm. However, several variants of it are used in practice, and I think some of them are worth mentioning.

  • Stochastic Gradient Descent (SGD). SGD is the variation that randomly picks one data point from the whole dataset at each iteration. This greatly reduces the number of computations per step, but it has its downsides: the updates are noisy, so it tends to oscillate around the minimum rather than settle exactly on it.
  • Batch Gradient Descent (BGD). BGD uses the whole dataset on each iteration. This isn’t ideal for big datasets, as it can be computationally expensive and slow but, on the other hand, for convex functions convergence to the global minimum is theoretically guaranteed.
  • Mini-Batch Gradient Descent (MBGD). This can be considered a middle ground between SGD and BGD. It uses neither one data point at a time nor the whole dataset, but a subset of it: on each iteration, we pick a batch of samples (of a previously defined size) and perform the gradient step using only those (see the sketch after this list).
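To make the mini-batch idea concrete, here’s a minimal sketch — not tied to the f(x, y) example above; the data, names and values are made up for illustration — that fits a 1-D linear regression y ≈ w·x + b on synthetic data with mini-batch gradient descent:

import numpy as np

# Synthetic data: y = 3x - 0.5 plus a little noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y_true = 3.0 * X - 0.5 + rng.normal(0, 0.1, size=200)

# Parameters and hyperparameters (illustrative values)
w, b = 0.0, 0.0
lr, batch_size, epochs = 0.1, 16, 100

for _ in range(epochs):
    idx = rng.permutation(len(X))          # shuffle the indices once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y_true[batch]
        err = (w * xb + b) - yb            # prediction error on the batch
        grad_w = 2 * np.mean(err * xb)     # d(MSE)/dw estimated on the batch
        grad_b = 2 * np.mean(err)          # d(MSE)/db estimated on the batch
        w -= lr * grad_w
        b -= lr * grad_b

print(f'w = {w:.3f}, b = {b:.3f}')  # should end up close to 3.0 and -0.5

Setting batch_size to 1 turns this into SGD, and setting it to len(X) turns it into batch gradient descent: the three variants differ only in how much data each gradient estimate uses.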

Conclusion

The Gradient Descent algorithm is widely used in machine and deep learning, but in other areas as well. That’s why understanding it is a must for anyone aiming to become a data scientist.

I hope this post clarified what it is, what it does, and how it does it.

Thanks for reading the post! I really hope you enjoyed it and found it insightful.

Follow me for more content like this one; it helps a lot!
@polmarin

If you’d like to support me further, consider subscribing to Medium’s Membership through the link below: it won’t cost you an extra penny, but it’ll help me through this process. Thanks a lot!

Resources

[1] Gradient descent — Wikipedia

