
Strength in Numbers: Why Does a Gradient Boosting Machine Work So Well?
by Paul Hiemstra, December 2022



Where we learn why solving complex problems using simple basis functions is such a powerful concept

Gradient Boosting algorithms such as xgboost are among the best performing models for tabular data. Together with other models such as Random Forests, gradient boosting falls under the category of ensemble models. The name derives from a core feature of this category: instead of fitting a single large model, they fit a whole ensemble of smaller models that together make up the final model. Ensemble models are strongly linked to the concept of a basis function: both use simpler building blocks that combine to solve a much more complex problem. In this article I will start out by introducing the concept of basis functions, and then explain how basis functions apply to gradient boosting models. Finally, I elaborate on a key insight: the basis functions in gradient boosting need to be relatively simple for the algorithm to work effectively.

This article is heavily inspired by chapters 8 to 10 of the excellent ‘The Elements of Statistical Learning’ (EOSL) book by Hastie et al. So, for a much more in-depth discussion, including all the math, I gladly refer the reader to this book.

At its most basic, any supervised learning algorithm tries to fit a function between features and labels. Instead of using one big function, we can construct this function from many smaller underlying basis functions. The following example illustrates the concept of a basis function: the goal is to reconstruct a time series using sine waves. Below we have the underlying ‘real’ function comb, which is a combination of a 5 Hz and a 2 Hz signal, and the observations we drew from it (black dots):

import numpy as np
import pandas as pd
from plotnine import *

np.random.seed(1234)

# Build a 5 Hz and a 2 Hz sine wave and combine them into the 'real' signal
time_step = 0.02
time_vec = np.arange(0, 5, time_step)
sine_wave = np.sin(2 * np.pi * 5 * time_vec)
sine_wave2 = np.sin(2 * np.pi * 2 * time_vec)

reference_data = pd.DataFrame({'5hz': sine_wave, '2hz': sine_wave2, 'comb': sine_wave + sine_wave2, 'time': time_vec})
# Draw 100 observations from the combined signal
observations = reference_data.sample(100)

# Plot the two component waves, the combined wave, and the sampled observations
(
    ggplot(aes(x='time'))
    + geom_point(observations, aes(y='comb'))
    + geom_line(reference_data.melt(id_vars='time', value_vars=['5hz', '2hz', 'comb']), aes(y='value', color='variable'))
    + facet_wrap('~variable', ncol=1)
)

A combined sine wave and the component waves that make up that combined wave (self created)

A simple approach to reconstructing our underlying function comb would be to use a single sine function: we can change the frequency and amplitude of the sine wave. But whatever we do, no single sine wave will perfectly fit the observations. This is logical, as the observations are drawn from the combined 5 Hz and 2 Hz line.

To get a good fit, we need to fit a model that combines the 2Hz and 5Hz functions. In this case the 2Hz and 5Hz functions are the basis functions from which we construct the overall function. Which basis functions apply to this signal can be determined using Fourier Analysis. In a sense, we use an ensemble of sine functions to construct one overall more complex function. In addition to Fourier Transforms, basis functions are used in wavelets, Neural Networks, Principal Component Analysis, and Taylor expansions.
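
To make this concrete, the sketch below is my own addition, not part of the original code: it reuses the observations sampled above and fits both a single sine wave and a sum of two sine waves with scipy.optimize.curve_fit. The functional forms and starting values are assumptions chosen for illustration; the point is simply that a single sine cannot match the combined signal, while the sum of two sines can.

# Minimal sketch (my own addition): compare a single sine fit against a
# two-sine fit on the sampled observations. Requires scipy and reuses the
# `observations` DataFrame created in the snippet above.
from scipy.optimize import curve_fit

def one_sine(t, a, f):
    return a * np.sin(2 * np.pi * f * t)

def two_sines(t, a1, f1, a2, f2):
    return a1 * np.sin(2 * np.pi * f1 * t) + a2 * np.sin(2 * np.pi * f2 * t)

t_obs = observations['time'].values
y_obs = observations['comb'].values

# Starting values are my own guesses; with a free frequency the optimizer is
# sensitive to them, so we start close to the true 5 Hz and 2 Hz.
p_one, _ = curve_fit(one_sine, t_obs, y_obs, p0=[1.0, 5.0])
p_two, _ = curve_fit(two_sines, t_obs, y_obs, p0=[1.0, 5.0, 1.0, 2.0])

mse_one = np.mean((y_obs - one_sine(t_obs, *p_one)) ** 2)
mse_two = np.mean((y_obs - two_sines(t_obs, *p_two)) ** 2)
print(f"MSE single sine: {mse_one:.3f}, MSE two sines: {mse_two:.3f}")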

The basis functions used in gradient boosting models are small decision trees. The first basis function is the best single-node tree, i.e. the model consists of a single yes/no decision (EOSL, page 360). Consecutive basis functions take the form of trees of size J, and are fitted on the residuals of the previously fitted basis functions. So, the first subtree explains a little bit of the variance in the data, and each additional subtree explains more and more of the variance. Each individual subtree is not very effective at modeling the data, as we saw with our 2 Hz and 5 Hz example, and is therefore considered a weak learner. The ensemble of weak learners, however, is a very strong learner, and this is the core mechanism on which gradient boosting's success is based.
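
To make the residual-fitting loop concrete, here is a minimal sketch of the idea, my own illustration rather than the implementation used by xgboost: each shallow tree is fitted on whatever the current ensemble still gets wrong. The toy data, learning rate, and tree depth are assumptions chosen for demonstration.

# Sketch of the boosting mechanism (my own illustration): repeatedly fit a
# shallow tree to the residuals of the current ensemble and add its shrunken
# predictions to the running total.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(300, 1))      # toy one-dimensional feature
y = np.sin(3 * X[:, 0])                   # toy smooth target

prediction = np.full_like(y, y.mean())    # start from a constant model
learning_rate = 0.1
for i in range(200):
    residuals = y - prediction            # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # deliberately weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    if (i + 1) in (1, 50, 200):
        print(f"after {i + 1:3d} trees, MSE: {np.mean((y - prediction) ** 2):.4f}")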

Note that we deliberately create weak learners by limiting each subtree to size J (EOSL, page 361, section 10.11). This prevents the first batch of trees from becoming very large, which would limit the flexibility we have in fitting later subtrees. We want our subtrees to be weak and let the boosting algorithm do much of the heavy lifting, rather than relying on large individual subtrees. This is very much in line with how neural nets solve problems: very simple mathematical operations that, in a large ensemble, solve complex problems. The complex solution emerges from the underlying simple basis functions.
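
As an illustration of this size limit, the sketch below (my own, using scikit-learn's GradientBoostingRegressor rather than xgboost) compares base trees of different depths on a synthetic regression dataset; max_depth here plays the role of the size limit J discussed above, and the dataset and parameter values are assumptions chosen for demonstration.

# Sketch (my own): vary the size of the base trees and compare cross-validated
# error. Small base trees are the usual recommendation for boosting.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

for depth in [1, 2, 8]:  # stumps, small trees, fairly large trees
    model = GradientBoostingRegressor(max_depth=depth, n_estimators=300,
                                      learning_rate=0.1, random_state=0)
    score = cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    print(f"max_depth={depth}: CV MSE {-score:.2f}")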

How a gradient boosting machine calculates the residuals after each subtree sheds further light on why weak learners are a feature and not a bug. My explanation here is heavily inspired by this SO post and gets a bit mathematical, so bear with me. Let's assume that for a given observation we have a true value y and a value ŷ that contains the total prediction of all the previous subtrees we already fitted. Next we construct a new subtree that yields a result z, ending up with the new total prediction ŷ + z. The new subtree should minimize the loss between ŷ + z and y. Essentially, z = y − ŷ would give us the perfect answer.

If we use the squared error as our loss function (L = 1/2 (y − ŷ)²), the gradient of this loss with respect to the current prediction is y − ŷ = −∂L/∂ŷ. We already showed that y − ŷ is the optimal answer, so this negative gradient also provides the optimal answer. By using this gradient as the residuals the new subtree should fit, we effectively perform gradient descent in the prediction space of ŷ: which small change in ŷ leads to the greatest reduction in the loss function? This is what the gradient in gradient boosting refers to: a series of weak learners that slowly build towards a good overall model. Each subtree provides the best step towards minimizing our loss, i.e. it tackles the largest issue the model is struggling with at that point in the boosting process. I refer to this link for a more mathematical derivation of this property of gradient boosting.
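
A quick numerical sanity check of this statement, added by me for illustration: for the squared loss, the negative gradient with respect to ŷ indeed equals the residual y − ŷ.

# Tiny numeric check (my own addition): for L = 1/2 * (y - y_hat)^2, the
# negative gradient with respect to y_hat equals the residual y - y_hat.
import numpy as np

y, y_hat = 3.0, 2.2
residual = y - y_hat

eps = 1e-6
loss = lambda pred: 0.5 * (y - pred) ** 2
numeric_grad = (loss(y_hat + eps) - loss(y_hat - eps)) / (2 * eps)

print(f"residual:            {residual:.6f}")
print(f"-dL/dy_hat (approx): {-numeric_grad:.6f}")  # matches the residual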

Keeping our steps in the prediction space small during boosting, i.e. using small subtrees, ensures we slowly probe the prediction surface and don't get stuck in a local minimum. In addition, using small subtrees ensures our consecutive basis functions can specialise in solving for specific challenging observations. Using very large trees would promote focusing on overall mean performance instead of focusing attention on smaller details. Using small trees, and thus weak learners, allows gradient boosting to be very flexible in how it solves the given problem.
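
To watch these small steps in action, the sketch below (again my own, scikit-learn based) uses staged_predict to track how the test error shrinks as more small trees are added; the dataset and parameter values are illustrative assumptions.

# Sketch (my own): staged_predict exposes the ensemble's predictions after
# each boosting step, showing the gradual improvement one small tree at a time.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                  learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)

for i, y_pred in enumerate(model.staged_predict(X_te), start=1):
    if i in (1, 10, 50, 200):
        print(f"after {i:3d} trees, test MSE: {np.mean((y_te - y_pred) ** 2):.2f}")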

The core concept underlying gradient boosting is to boost the effectiveness of a consecutive series (or ensemble) of weak learners by choosing each next weak learner such that it minimizes the loss towards the true value. Each of the subtrees in the ensemble needs to be relatively weak to allow gradient descent the flexibility to work towards a good solution. This makes gradient boosting a very effective method that often works quite well out-of-the-box without a lot of tuning.

This article can also be read on GitHub, including all the code.

My name is Paul Hiemstra, and I work as a teacher and data scientist in the Netherlands. I am a mix between a scientist and a software engineer, and have a broad interest in everything related to data science. You can follow me here on Medium, or on LinkedIn.

If you enjoyed this article, you might also enjoy some of my other articles:

