# Gradient Descent: Optimisation and Initialisation Explained | by Jamie McGowan | Jan, 2023

## A high-level introduction to optimisation in 7 minutes

Training a deep learning model involves taking a set of model **parameters** and shifting these towards some **optimum set of values**. The optimum set of values is defined as the point at which the model becomes the best version of itself with respect to performing some task.

Intuitively, this can be thought of as when we learn a new skill. For example, if you decide to take up a new sport, the likelihood is that you’ll be pretty bad the first time you play (disregarding any child prodigies of course).

However, over time you will improve and learn to shift your own parameters (in your brain) towards some optimum value, with respect to playing this sport.

## How do these parameters move?

Let’s imagine we have a **metric value** which defines how **bad** we are at a sport. The *higher* this value, the *worse* we are; the *lower* the value, the *better* we are. Sort of like a handicap in golf.

We can further imagine that shifting these parameters will have some effect on this metric i.e. as we move towards the optimum set of parameters, the metric gets **lower** (we get better at the task).

Hopefully this makes sense… but if it doesn’t then don’t worry! We’ll take a look at a diagram to try and explain this situation.

## Visualising Optimisation

Take the diagram above: our golf handicap is at **point A** (pretty bad, beginner-level stuff). This is where we will begin our journey to Tiger Woods level!

Which way should we move to reach point B (the professional level of Golfing)?

To the left you say? **Correct!**

Mathematically, this involves finding the **gradient** at point A and moving in the direction of *steepest descent*.

**“Wait, wait… remind me what gradients are”**

The gradient gives the direction and rate of a function’s steepest *increase* at a point. To make our metric smaller, we move in the *opposite* direction, which is why this is called steepest *descent*.

This gradient is only calculated **locally**, i.e. the gradient at point A is only correct *at* point A, so we shouldn’t trust it too far away from that point. For example, in the picture, points X and Y have very **different** gradients.

Therefore, in practice, we multiply the gradient by a **learning rate**, which controls how far we step towards point B. (We will come back to this later!)

It is this **locality** argument that is one of the pillars of modern gradient descent optimisation algorithms.

**Locality and Learning Rates**

Imagine you are driving a car: you know where you want to end up, but you don’t know the way (and you don’t have a sat nav).

The best thing you can do is rely on sign posts to **guide** your way towards your goal.

However, these sign posts are only valid at the point where they appear. For example, a *continue straight* instruction will not necessarily be correct two miles further down the road.

These sign posts are a bit like our gradient calculations in optimisation algorithms. They contain local **information** about the direction of travel (or the shape of the function) at that specific point along the journey.

Depending on how cautious (or adventurous) you are, you may prefer to have sign posts every 200 metres, or you may be happy with one every two miles. It entirely depends on what the journey is like!

For example, if it is a long straight road, we can get away with very few sign posts. But if it is a complicated journey with many turns, we will likely need a lot **more** sign posts.

This is how we can think about **learning rates**. If we have a function like the one on the left, we may be able to use a *large* learning rate (similar to a long straight road).

However, if we have the one on the right, this will require a *smaller* learning rate as we may end up **overshooting** our destination (missing a turn).
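The overshooting effect can be seen on a toy function. In this sketch (the function and learning rates are my own illustrative choices, not taken from the article’s figures), each gradient step on f(x) = x² scales x by (1 − 2·lr), so a learning rate larger than 1 makes every step land further from the minimum than the last:

```python
# One gradient step on f(x) = x**2, whose gradient is 2x:
# x <- x - lr * 2x, i.e. x is scaled by (1 - 2*lr) each step.
def step(x, lr):
    return x - lr * 2 * x

x_small, x_large = 1.0, 1.0
for _ in range(10):
    x_small = step(x_small, lr=0.1)  # |factor| = 0.8: converges towards 0
    x_large = step(x_large, lr=1.1)  # |factor| = 1.2: overshoots and diverges

print(abs(x_small) < 1.0, abs(x_large) > 1.0)  # prints: True True
```

The same intuition carries over to higher dimensions: too large a learning rate bounces across the valley instead of descending into it.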

It is also worth mentioning that it is very unlikely we can journey from A to B with a single direction at point A (unless we’re already very close). So in practice, **gradient descent** tends to be an **iterative** procedure where we receive directions at *waypoints* along the journey (*A to B* becomes *A to C to D to E to B*).
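This iterative procedure fits in a few lines of Python. The toy function, starting point, and learning rate below are illustrative choices of mine, not anything from the article:

```python
# Gradient descent on the toy function f(x) = (x - 3)^2,
# whose minimum sits at x = 3.
def f(x):
    return (x - 3) ** 2

def grad_f(x):
    # Analytic derivative: df/dx = 2 * (x - 3)
    return 2 * (x - 3)

def gradient_descent(x0, learning_rate=0.1, steps=100):
    x = x0
    for _ in range(steps):
        # Step against the gradient: downhill on f.
        x = x - learning_rate * grad_f(x)
    return x

print(round(gradient_descent(x0=10.0), 4))  # prints: 3.0
```

Each pass through the loop is one “waypoint”: the gradient is re-evaluated at the current position before the next step is taken.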

Therefore, hopefully we can build some **intuition** about how closely related the learning rate and number of waypoints are.

## Putting it all together…

OK, so hopefully we have a good idea of what **optimisation** is trying to achieve and some of the concepts we need to consider!

Using the above information, we can now define the gradient descent **algorithm**.

Going back to our picture from earlier, we will label the parameters at point A as ⍬₀ and the final parameters at point B as ⍬.

In the first iteration from point A to the first **waypoint** (point C) we can write down an equation to describe the parameter update. For this we will need to consider:

- The gradient of a **performance** metric L at point A (with respect to the parameters)
- A learning rate
- The initial parameters ⍬₀
- The updated parameters ⍬₁

The following parameter **updates** are calculated similarly and therefore we are able to write down the general formula as:
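Using the quantities listed above (writing η for the learning rate, ∇L for the gradient of the performance metric, and θ in place of the ⍬ glyph), the first update and the general step are the standard gradient descent rule:

```latex
\theta_1 = \theta_0 - \eta \, \nabla_{\theta} L(\theta_0),
\qquad
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
```

The minus sign encodes the “steepest descent” direction: we step against the gradient so the metric L decreases.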

## Initialisation

OK, so the title promised some talk about initialisation.

For those of you absolutely fuming that nothing has been mentioned on this so far… sorry about that! But hopefully this section will satisfy you!

From all the descriptions above, it is fairly easy to think about how initialisation fits into the picture.

Remember the child prodigy I mentioned earlier? Let’s call her Pam. In terms of the first image in this article, this would be somewhat equivalent to Pam having her **initial** parameters at point P, not at point A. By the way, Pam is the one with the crown and the smug smile. She knows she’s good!

At a high level, **initialisation** is simply where you *start* your optimisation from.

A good initialisation can take a lot of **pressure** off optimisation algorithms and a good optimisation algorithm can do the **same** for initialisation. In practice, a good initialisation can save hundreds of hours of compute time when training a deep learning model with many parameters.

Due to this, there are many different areas of research solely focussed on developing better **initialisation techniques**. The reason this is very difficult is that, in essence, it’s like trying to **predict** the future without knowing much about the environment we are in.

Another reason initialisation is important to consider is related to where we might end up *after* optimisation.

Consider this new optimisation surface above. This has many different **minima**, some of which are better than others!

Clearly in this picture, our starting point will **heavily** affect where we end up. This is one of the reasons why it is so important for ML practitioners to experiment with **different** initialisations, as well as tuning hyperparameters (such as the learning rate) when attempting to find the best model for a specific task.
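A quick self-contained sketch makes this concrete. The double-well function below is my own illustrative choice: it has two minima, and the starting point alone decides which basin gradient descent settles into:

```python
# f(x) = x^4 - 2x^2 has two minima, at x = -1 and x = +1.
# Gradient descent started on either side of x = 0 stays in that basin.
def grad(x):
    return 4 * x ** 3 - 4 * x  # derivative of x^4 - 2x^2

def descend(x0, lr=0.05, steps=200):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

print(round(descend(-0.5), 3), round(descend(0.5), 3))  # ≈ -1.0 and ≈ 1.0
```

Both runs use the identical update rule and learning rate; only the initialisation differs, yet they end at different minima.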

## Conclusion

In this article we have walked through some high-level **explanations** related to gradient descent, optimisation and initialisation. We have visualised the goals of optimisation and initialisation, investigated these **graphically**, introduced the concept of a learning rate, and even wrote down a formula for gradient descent!

Hopefully this has helped build your **intuition** behind these important concepts and solidified your understanding of where the formula for gradient descent comes from!

Thanks for reading and stay tuned for more articles related to **optimisation techniques**!

As always, let me know of any issues or comments!
