
## To even better understand Gradient Boosting

Gradient Boosting is an ensemble method that is usually applied to decision trees, so often that “Gradient Boosting” is commonly used to mean Gradient Boosted Decision Trees. However, as an ensemble method, it can be applied to other base models such as linear regression. There is a trivial conclusion that you may already know:

Gradient Boosted Linear Regression is, well, Linear Regression.

Still, it is interesting to implement, and we will do it in Excel, so even if you are not used to programming complex algorithms, you can follow the algorithmic steps.

In an earlier article, I argued that an effective way to learn machine learning is to always distinguish three steps. Let’s apply that principle to Gradient Boosted Linear Regression. Here are the three steps:

• 1. Model: Linear Regression is a machine learning model in the sense that it takes input (features) to predict an output
• 1bis. Ensemble method: Gradient Boosting is an ensemble method and it is not a model itself (in the sense that it does not take inputs to predict output for a target variable). It has to be applied to some base model to create a meta-model. Here we will create a meta-model that is Gradient Boosted Linear Regression.
• 2. Model fitting: Linear Regression has to be fitted, which means its coefficients have to be optimized for a given training dataset. Gradient Descent is one fitting algorithm that can be applied to linear regression, but it is not the only one: linear regression also has an exact solution that can be expressed in mathematical formulas. It is also worth noting that there is no fitting algorithm for ensemble methods.
• 3. Model tuning consists of optimizing the hyperparameters of the model or the meta-model. Here, we will encounter two: the learning rate of the gradient boosting algorithm, and the number of steps.

That is machine learning made easy in three steps!

Here we will use a simple linear regression as the base model, with a simple dataset of ten observations. We will focus on the gradient boosting part. For the fitting, we will use the LINEST function in Google Sheets (it also works in Excel) to estimate the coefficients of the linear regression.
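For readers who prefer code to spreadsheet functions, here is a minimal Python sketch of the exact solution for a one-feature linear regression, which is the kind of slope and intercept estimate LINEST returns for a straight-line fit. The ten data points and the helper name `fit_line` are made up for illustration; they are not the values from the sheet.

```python
import numpy as np

# Hypothetical toy data: ten observations, NOT the values from the author's sheet.
x = np.arange(1.0, 11.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9])


def fit_line(x, y):
    """Closed-form least squares for y ~ a * x + b with a single feature."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x.
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - a * x_mean
    return a, b


a, b = fit_line(x, y)
print(f"slope={a:.4f}, intercept={b:.4f}")
```

NumPy's `np.polyfit(x, y, 1)` should give the same two numbers, which makes it a convenient stand-in for LINEST when experimenting outside a spreadsheet.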

I will use a Google Sheet to demonstrate the implementation of gradient boosting in this article. If you’d like to access this sheet, as well as others I’ve developed (linear regression with gradient descent, logistic regression, neural networks with backpropagation, KNN, k-means, and more to come), please consider supporting me on Ko-fi. You can find all of these resources at the following link: https://ko-fi.com/s/4ddca6dff1

Here are the main steps of the Gradient Boosting algorithm:

1. Initialization: we take the average of the target values as the initial prediction, the first step of the gradient boosting algorithm.
2. Residual Errors Calculation: we compute the residual errors between the actual values in the training data and the predicted values (for the first step, the average value).
3. Fitting Linear Regression to Residuals: we fit a linear regression model to the residuals.
4. Adding the New Model to the Ensemble: we combine the previous model and the new model into a new ensemble model. Here, a learning rate (or shrinkage), which is a hyperparameter, is applied to the weak model.
5. Repeating the process: we repeat steps 2–4 until the specified number of boosting stages is reached or the error has converged.

That’s it! This is the basic procedure for applying Gradient Boosting to Linear Regression. I wanted to keep the description simple, and we can now write some equations to illustrate each step:

• Step 1: f0 = average value of actual y
• Step 2: resd1 = y - f0
• Step 3: resdfit1 = a0 x + b0, fitted to predict y - f0
• Step 4: f1 = f0 + learning_rate * (a0 x + b0)
• Step 2 (second iteration): resd2 = y - f1
• Step 3 (second iteration): resdfit2 = a1 x + b1, fitted to predict y - f1
• Step 4 (second iteration): f2 = f1 + learning_rate * (a1 x + b1), which can be expanded as: f0 + learning_rate * (a0 x + b0) + learning_rate * (a1 x + b1)
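The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the data are made up, `boost_linear` is a hypothetical helper name, and `np.polyfit` stands in for LINEST. Because the residual is defined as y minus the current prediction, the scaled fit is added to the previous model at each stage.

```python
import numpy as np

# Hypothetical toy data: ten observations, NOT the values from the author's sheet.
x = np.arange(1.0, 11.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9])


def boost_linear(x, y, learning_rate=0.5, n_steps=2):
    """Return the list of predictions f0, f1, ..., f_n_steps on the training x."""
    pred = np.full_like(y, y.mean())        # Step 1: f0 is the average of y
    history = [pred.copy()]
    for _ in range(n_steps):
        resid = y - pred                    # Step 2: residuals of the current model
        a, b = np.polyfit(x, resid, 1)      # Step 3: fit a line to the residuals
        pred = pred + learning_rate * (a * x + b)  # Step 4: add the shrunken fit
        history.append(pred.copy())
    return history


history = boost_linear(x, y)
for k, f_k in enumerate(history):
    print(f"f{k}: mean squared error = {np.mean((y - f_k) ** 2):.4f}")
```

With a learning rate between 0 and 1, the mean squared error printed for each stage should shrink from one stage to the next.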

If you look carefully at the algorithm above, you may notice two bizarre things.

First, in step 3, we fit a linear regression to the residuals. That fitting takes time and algorithmic steps of its own, yet instead of fitting a linear regression to the residuals, we could fit one directly to the actual values of y and immediately obtain the final optimal model!

Second, adding one linear regression to another still gives a linear regression.

For example, we can rewrite f2 as: f2 = f0 + learning_rate * (b0 + b1) + learning_rate * (a0 + a1) x

It is a linear regression!
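A small Python check makes this concrete. Instead of storing one model per stage, the sketch below keeps a single running slope and intercept and compares them, after many stages, with a line fitted directly to y. The data, the learning rate of 0.5, and the 50 stages are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical toy data: ten observations, NOT the values from the author's sheet.
x = np.arange(1.0, 11.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9])

learning_rate = 0.5
# f0 is itself a line: slope 0 and intercept equal to the average of y.
slope, intercept = 0.0, y.mean()

for _ in range(50):
    resid = y - (slope * x + intercept)
    a, b = np.polyfit(x, resid, 1)      # line fitted to the current residuals
    slope += learning_rate * a          # the slopes simply add up
    intercept += learning_rate * b      # and so do the intercepts

a_direct, b_direct = np.polyfit(x, y, 1)  # one regression fitted directly to y
print(f"boosted: slope={slope:.6f}, intercept={intercept:.6f}")
print(f"direct : slope={a_direct:.6f}, intercept={b_direct:.6f}")
```

The whole ensemble is described by two numbers at every stage, and those two numbers approach the coefficients of the direct fit.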

For decision trees, these two bizarre things won’t happen, since adding a tree to another is not the same as growing a tree a step further.

Before we go to the implementation part, one more question: what if we set the learning rate to 1? What happens to Gradient Boosted Linear Regression?

The implementation of these formulas is straightforward in Google Sheets or Excel.

The table below shows the training dataset along with the successive steps of the gradient boosting algorithm.

For each fitting step, we use the Excel function LINEST.

We will only do 2 iterations, and we can guess how it goes for more. The graphic below shows the model at each iteration. The different shades of red illustrate the convergence of the model, and we also show the final model obtained by fitting directly to y.

There are two hyperparameters we can tune: the number of iterations and the learning rate.

For the number of iterations, we only implemented two, but it is easy to imagine more, and we can decide when to stop by examining the magnitude of the residuals.
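One possible way to turn this stopping idea into code is sketched below. It is an assumption on my part rather than what the sheet does: since the residuals of noisy data never reach exactly zero, the loop stops when the mean squared residual no longer improves by more than a tolerance. The names `boost_until_converged`, `tol` and `max_steps` are hypothetical, and the data are made up.

```python
import numpy as np

# Hypothetical toy data: ten observations, NOT the values from the author's sheet.
x = np.arange(1.0, 11.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9])


def boost_until_converged(x, y, learning_rate=0.5, tol=1e-8, max_steps=1000):
    """Boost until the mean squared residual improves by less than tol."""
    pred = np.full_like(y, y.mean())
    prev_mse = np.mean((y - pred) ** 2)
    for step in range(1, max_steps + 1):
        resid = y - pred
        a, b = np.polyfit(x, resid, 1)
        pred = pred + learning_rate * (a * x + b)
        mse = np.mean((y - pred) ** 2)
        if prev_mse - mse < tol:        # negligible improvement: stop here
            return step, mse
        prev_mse = mse
    return max_steps, prev_mse


steps, final_mse = boost_until_converged(x, y)
print(f"stopped after {steps} iterations, mean squared residual = {final_mse:.6f}")
```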

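For the learning rate, we can change it in Google Sheets and see what happens. When the learning rate is small, the “learning process” will be slow. And if the learning rate is 1, we can see that convergence is achieved at step 1.

After that single step, the model coincides with the linear regression fitted directly to y, so every later iteration fits a regression with zero coefficients and leaves the model unchanged (the residuals themselves are all zero only if the data lie exactly on a line).

If the learning rate is higher than 1, each step overshoots the optimal model. Between 1 and 2, the model oscillates around the solution but still converges; above 2, it diverges.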
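The effect of the learning rate can be checked numerically. In the sketch below (made-up data, hypothetical helper `gap_after`), the gap between the boosted slope and the directly fitted slope is measured after a given number of iterations. With an exact least squares fit at every stage, that gap is multiplied by |1 - learning_rate| at each iteration, which is why a rate of 1 converges in one step and rates above 2 blow up.

```python
import numpy as np

# Hypothetical toy data: ten observations, NOT the values from the author's sheet.
x = np.arange(1.0, 11.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9])

a_direct, b_direct = np.polyfit(x, y, 1)  # reference: line fitted directly to y


def gap_after(learning_rate, n_steps):
    """Absolute gap between the boosted slope and the direct slope."""
    slope, intercept = 0.0, y.mean()
    for _ in range(n_steps):
        resid = y - (slope * x + intercept)
        a, b = np.polyfit(x, resid, 1)
        slope += learning_rate * a
        intercept += learning_rate * b
    return abs(slope - a_direct)


for lr in (0.1, 1.0, 1.5, 2.5):
    print(f"learning_rate={lr}: gap after 1 step = {gap_after(lr, 1):.3e}, "
          f"after 10 steps = {gap_after(lr, 10):.3e}")
```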

The way the learning rate and the number of iterations work in Gradient Boosting is very similar to Gradient Descent… Oh wait, they are not just similar, they are actually the same algorithm! In classic Gradient Descent, the algorithm is applied to the parameters of the model, such as the weights or coefficients of the linear regression. In Gradient Boosting, the algorithm is applied to models.

Even the word “boosting” only means “adding”, and it is exactly the procedure of the classic gradient descent algorithm, which adds descent steps one by one from an initial (randomly chosen) starting point.
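For comparison, here is a minimal sketch of classic gradient descent applied to the two coefficients of a linear regression. Several choices are assumptions made for the illustration: the data are made up, x is centered so that a single step size works well for both coefficients, and the starting point is zero rather than random so that the run is reproducible.

```python
import numpy as np

# Hypothetical toy data: ten observations, NOT the values from the author's sheet.
x = np.arange(1.0, 11.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9])

xc = x - x.mean()        # centered feature (assumption, for stable steps)
a, b = 0.0, 0.0          # starting point for the parameters
learning_rate = 0.05

for _ in range(500):
    err = (a * xc + b) - y
    grad_a = 2 * np.mean(err * xc)   # derivative of the mean squared error w.r.t. a
    grad_b = 2 * np.mean(err)        # derivative of the mean squared error w.r.t. b
    a -= learning_rate * grad_a      # step against the gradient
    b -= learning_rate * grad_b

a_ref, b_ref = np.polyfit(xc, y, 1)  # exact least squares on the same centered x
print(f"gradient descent: a={a:.6f}, b={b:.6f}")
print(f"exact solution  : a={a_ref:.6f}, b={b_ref:.6f}")
```

The loop has the same shape as the boosting loop: start somewhere, compute what is still wrong, and add a scaled correction. The difference is that the correction is added to two numbers here, and to a model in gradient boosting.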

I hope that you gained more insights into how Gradient Boosting works. Here are the main takeaways.

• Excel is an excellent way to understand how algorithms work.
• Gradient Boosting is an Ensemble Method that can be applied to any base model.
• Gradient Boosting is Gradient Descent in the sense that they are the same algorithm but applied to different objects: parameters vs. functions or models.
• Gradient Boosting can be applied to Linear Regression, but only for the purpose of understanding the algorithm: in practice there is no need, because Gradient Boosted Linear Regression is Linear Regression.
