
Training a deep learning model involves taking a set of model parameters and shifting them towards some optimum set of values: the point at which the model becomes the best version of itself at performing some task.

Intuitively, this is a lot like learning a new skill. For example, if you decide to take up a new sport, the likelihood is that you'll be pretty bad the first time you play (disregarding any child prodigies, of course).

However, over time you will improve and learn to shift your own parameters (in your brain) towards some optimum value, with respect to playing this sport.

## How do these parameters move?

Let's imagine we have a metric that measures how bad we are at a sport: the higher the value, the worse we are, and the lower the value, the better we are. Sort of like a handicap in golf.

We can further imagine that shifting these parameters will have some effect on this metric i.e. as we move towards the optimum set of parameters, the metric gets lower (we get better at the task).

Hopefully this makes sense… but if it doesn’t then don’t worry! We’ll take a look at a diagram to try and explain this situation.

## Visualising Optimisation

In the diagram above, our golf handicap starts at point A (it's pretty bad: beginner-level stuff). This is where we will begin our journey to Tiger Woods level!

Which way should we move to reach point B (the professional level of golfing)?

To the left you say? Correct!

Mathematically, this involves finding the gradient at point A and moving in the direction of steepest descent.

“Wait, wait… remind me what gradients are”

The gradient gives the rate of change of a function; it points in the direction of steepest ascent, which is why, when minimising, we move against it.

This gradient is only valid locally: the gradient at point A describes the function at point A, so we shouldn't trust it too far away from that point. For example, in the picture, points X and Y have very different gradients.
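To see how local a gradient really is, here is a minimal sketch using a finite-difference estimate. The function `f` and the sample points are illustrative choices, not taken from the article:

```python
# Finite-difference gradient: a sketch showing that gradients are local.

def grad(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 4 - 3 * x ** 2  # a curve whose slope varies with x

# The gradient at one point says nothing about the gradient elsewhere:
steep = grad(f, -2.0)   # roughly -20: a steep downhill
gentle = grad(f, 0.5)   # roughly -2.5: a much gentler slope
```

Two points on the same curve, two very different slopes: exactly why we can't blindly follow one gradient for the whole journey.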

Therefore, in practice, we multiply the gradient by a learning rate which tells us how far to move towards point B. (We will come back to this later!)
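A single parameter update is then just the gradient scaled by the learning rate and subtracted from the current parameters. A minimal sketch (the names `theta`, `grad_value` and `lr` are illustrative):

```python
# One gradient-descent step: move against the gradient,
# scaled by the learning rate.

def step(theta, grad_value, lr):
    return theta - lr * grad_value

theta = 4.0        # current parameter (think: point A)
grad_value = 6.0   # gradient of the metric at theta
theta = step(theta, grad_value, lr=0.1)  # a small move downhill
```

A larger `lr` would take a bigger stride from the same gradient; the gradient supplies the direction, the learning rate supplies the distance.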

It is this locality argument that is one of the pillars of modern gradient descent optimisation algorithms.

## Locality and Learning Rates

Imagine you are driving a car: you know where you want to end up, but you don't know the way (and you don't have a sat nav).

The best thing you can do is rely on sign posts to guide your way towards your goal.

However, these sign posts are only valid at the point where they appear. For example, a "continue straight" instruction will not necessarily still be correct two miles down the road.

These sign posts are a bit like our gradient calculations in optimisation algorithms. They contain local information about the direction of travel (or the shape of the function) at that specific point along the journey.

Depending on how cautious (or adventurous) you are, you may prefer to have sign posts every 200 metres, or you may be happy to have them every two miles. It entirely depends on what the journey is like!

For example, if it is a long straight road, we can get away with very few sign posts. But if it is a complicated journey with many turns, we will likely need a lot more sign posts.

This is how we can think about learning rates. If we have a function like the one on the left, we may be able to use a large learning rate (similar to a long straight road).

However, if we have the one on the right, this will require a smaller learning rate as we may end up overshooting our destination (missing a turn).

It is also worth mentioning that it is very unlikely we can journey from A to B with a single direction at point A (unless we’re already very close). So in practice, gradient descent tends to be an iterative procedure where we receive directions at waypoints along the journey (A to B becomes A to C to D to E to B).
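The waypoint idea translates directly into a loop: re-read the sign post (recompute the gradient) at every stop. A small sketch with an illustrative function, learning rate and step count:

```python
# Iterative gradient descent as a journey of waypoints:
# recompute the gradient at each stop and take a scaled step.

def f(x):
    return (x - 3.0) ** 2   # our "journey": minimum (point B) at x = 3

def df(x):
    return 2.0 * (x - 3.0)  # gradient of f

x = 0.0      # point A
lr = 0.25    # learning rate
for _ in range(20):          # waypoints C, D, E, ...
    x = x - lr * df(x)       # read the local sign post, move accordingly

print(round(x, 4))  # → 3.0
```

Each iteration uses only local information, yet together the waypoints carry us from A to B.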

Therefore, hopefully we can build some intuition about how closely related the learning rate and number of waypoints are.

## Putting it all together…

OK, so hopefully we have a good idea of what optimisation is trying to achieve and some of the concepts we need to consider!

Using the above information, we can now define the gradient descent algorithm.

Going back to our picture from earlier, we will label the parameters at point A as θ₀ and the final parameters at point B as θ.

In the first iteration from point A to the first waypoint (point C) we can write down an equation to describe the parameter update. For this we will need to consider:

• The gradient of a performance metric L at point A (with respect to the parameters)
• A learning rate
• The initial parameters θ₀
• The updated parameters θ₁

Subsequent parameter updates are calculated in exactly the same way, so we can write down the general formula:
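The formula itself appears to be missing from the text (it was likely an image). The standard gradient-descent update, consistent with the quantities listed above and writing η for the learning rate (a symbol introduced here), is:

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
```

For the first step this reads θ₁ = θ₀ − η ∇θ L(θ₀), taking us from point A to the first waypoint, point C.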

## Initialisation

OK, so the title promised some talk about initialisation.

For those of you absolutely fuming that nothing has been mentioned on this so far… sorry about that! But hopefully this section will satisfy you!

From all the descriptions above, it is fairly easy to think about how initialisation fits into the picture.

Remember the child prodigy I mentioned earlier? Let's call her Pam. In terms of the first image in this article, Pam would start with initial parameters at point P rather than point A. (By the way, Pam is the one with the crown and the smug smile: she knows she's good!)

At a high level, initialisation is simply where you start your optimisation from.

A good initialisation can take a lot of pressure off optimisation algorithms and a good optimisation algorithm can do the same for initialisation. In practice, a good initialisation can save hundreds of hours of compute time when training a deep learning model with many parameters.

Because of this, there are many areas of research solely focussed on developing better initialisation techniques. The reason this is so difficult is that, in essence, it's like trying to predict the future without knowing much about the environment we are in.

Another reason initialisation is important to consider is related to where we might end up after optimisation.

Consider this new optimisation surface above. This has many different minima — some of which are better than others!

Clearly in this picture, our starting point will heavily affect where we end up. This is one of the reasons why it is so important for ML practitioners to experiment with different initialisations, as well as tuning hyperparameters (such as the learning rate) when attempting to find the best model for a specific task.
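We can see this effect with a tiny experiment: the same function, learning rate and step count, but two different starting points, end up in two different minima. The function here is an illustrative double-well, not the article's surface:

```python
# Same descent, two initialisations, two different minima.

def df(x):
    # derivative of f(x) = x**4 - 2*x**2, which has minima at x = -1 and x = 1
    return 4 * x ** 3 - 4 * x

def descend(x, lr=0.1, steps=100):
    for _ in range(steps):
        x = x - lr * df(x)
    return x

left = descend(-0.5)   # settles near the minimum at x = -1
right = descend(0.5)   # settles near the minimum at x = +1
```

Nothing about the algorithm changed between the two runs; only the initialisation did, and it alone decided which minimum we reached.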

## Conclusion

In this article we have walked through some high-level explanations related to gradient descent, optimisation and initialisation. We have visualised the goals of optimisation and initialisation, investigated these graphically, introduced the concept of a learning rate, and even written down a formula for gradient descent!

Hopefully this has helped build your intuition behind these important concepts and solidified your understanding of where the formula for gradient descent comes from!

Thanks for reading and stay tuned for more articles related to optimisation techniques!

As always, let me know of any issues or comments!
