
XGBoost: Theory and Hyperparameter Tuning

By Jorge Martín Lasaosa



Photo by Joanne Francis on Unsplash

In a few months, I will have been working as a Data Scientist for 3 years. I know it is not a long career yet, but together with my academic experience, it has allowed me to work on several machine learning projects in different sectors (energy, customer experience…). All of them were fed by tabular data, that is, structured data organised in rows and columns. In contrast, projects fed by unstructured data such as images or text are more closely related to machine learning fields like Computer Vision or Natural Language Processing (NLP).

Based on my experience, XGBoost usually performs well on tabular data projects. Although the No Free Lunch Theorem [1] states that any two algorithms are equivalent when their performance is averaged across all possible problems, on Bojan Tunguz’s Twitter [2] you can read frequent discussions with other professionals about why tree-based models (and especially XGBoost) are often the best candidates for tackling tabular data projects, even with the growing research into the use of Deep Learning techniques for this type of data [3].

Also, it is quite funny to see how a Kaggle Grandmaster [4] jokes about being an XGBoost evangelist.

Bojan Tunguz’s pinned tweet.

Despite the great success of XGBoost, when I have wanted to get the best performance out of the model, I have not found complete guides where all the necessary knowledge is centralised. There are great theoretical and practical explanations that I will reference throughout the reading, but none of them gives a holistic view. That is why I have decided to write this article. Furthermore, I am going to apply what I have collected here to a well-known Kaggle competition called House Prices — Advanced Regression Techniques.

The rest of the article is divided into two sections:

  • Theory: after a short introduction, we will dive into the original paper to understand the theory behind this great model. Then, a brief visual explanation will help us to better understand the theory.
  • Practice: after an overview of the XGBoost parameters, I will present a step-by-step guide for tuning the hyperparameters.

All images unless otherwise noted are by the author.

XGBoost stands for eXtreme Gradient Boosting and was officially published by Tianqi Chen and Carlos Guestrin in 2016 [5]. Even before its publication, it had established itself as one of the best algorithms to use in Kaggle competitions.

Nevertheless, and despite the fact that Deep Learning is achieving incredible success in fields such as Computer Vision and NLP, XGBoost and other tree-based models (CatBoost [6] or LightGBM [7]) remain among the best options for predicting tabular data [8]. All these tree-based algorithms are based on Gradient Boosting, and if you want to understand how this technique works, I recommend you check my article on tree ensembles [9].

Do you want to know what makes XGBoost special? I will explain it in two different subsections: Original Paper and Visual Explanation. Let’s get started!

Original Paper

As mentioned before, XGBoost is based on Gradient Boosting, so several trees are trained sequentially on the residuals of the previous trees. However, a combination of minor improvements allowed XGBoost to outperform the existing tree ensemble models by preventing overfitting:

  • Regularized Learning Objective (similar to RGF [10]). In gradient boosting, each tree is trained in the best possible way to achieve the learning objective: reduce the difference between the predictions and the target. In XGBoost, this learning objective is replaced by a regularised learning objective, which adds a regularisation term to the calculation of the difference. In plain English, this term penalises the complexity of each tree as it learns and is intended to reduce the prediction’s sensitivity to individual observations. If the regularisation term is set to zero, the objective falls back to traditional gradient boosting. The objective from the paper is written out after this list.
  • Shrinkage (introduced by Friedman for gradient boosting). A technique that limits the weight that each trained tree has in the final prediction. Thus, the influence of each tree is reduced and there is more room for future trees to improve the predictions. It works like a learning rate (and is exposed as such in the parameters section).
  • Column Subsampling (borrowed from Random Forests [11]). It randomly selects a subsample of features for each tree, tree level and/or tree node. Therefore, a feature that is very important for the first tree/level/node may not be available for the second one, pushing it to use other features.
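
For reference, the regularised objective introduced in the original paper can be written as follows, where l is the loss between prediction and target, T is the number of leaves of a tree f and w are its leaf weights:

$$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$$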

One of the biggest problems in tree learning is finding the best split. If there is a continuous feature that varies from 0 to 100, what value should be used to split on? 20? 32.5? 70? In order to find the best split, XGBoost can apply different algorithms:

  • Exact greedy algorithm: the most common choice in older tree-boosting implementations, it consists of testing all possible splits of all features. Doing this for continuous features is computationally demanding and therefore takes more time as the data sample grows.
  • Weighted quantile sketch algorithm: proposes candidate splitting points according to percentiles of the feature distributions (a small sketch of percentile-based candidates follows this list). The algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among the proposals based on the aggregated statistics. It can also handle weighted data. In addition, XGBoost applies sparsity-aware split finding: each tree node learns a default direction so that the algorithm is aware of the sparsity patterns in the data. The sparsity may be due to missing values, frequent zero entries or the consequences of feature engineering (e.g. One Hot Encoding).
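
To give an intuition of the candidate-proposal idea, here is an illustrative sketch (not XGBoost’s internal implementation; the feature values are hypothetical):

```python
import numpy as np

# Illustrative sketch: propose candidate split points for a continuous feature
# from its percentiles, in the spirit of the quantile sketch approach.
def candidate_splits(feature_values: np.ndarray, n_candidates: int = 10) -> np.ndarray:
    percentiles = np.linspace(0, 100, n_candidates + 2)[1:-1]  # skip 0% and 100%
    return np.unique(np.percentile(feature_values, percentiles))

square_meters = np.array([70, 85, 95, 110, 120, 150, 175, 210, 300])  # hypothetical values
print(candidate_splits(square_meters, n_candidates=4))
```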

In addition, the algorithm is designed to interact efficiently with the system. Without going into detail, the data is stored in in-memory units called blocks, which help the algorithm sort the data and allow parallelisation. Moreover, for large datasets, cache-aware access is defined to avoid slowing down the search for splits when some computations do not fit in the CPU cache. Finally, it uses block compression and block sharding to handle data that does not fit in main memory.

Visual Explanation

This section is highly inspired by Josh Starmer’s YouTube channel and Shreya Rao’s article. My goal is to explain with a small example how XGBoost works and link it to the paper theory seen before. To do this, let’s follow the steps below.

Step 1: Create synthetic data

A small synthetic dataset about house pricing is used. It has two features (Square Meters and Has Garage?) and a target (Price).

Synthetic data about house pricing used for the visual explanation.

Step 2: Calculate the residuals

As you may have noticed, residuals have already been calculated in the table shown above. The calculation is simple: you only have to subtract the predicted price of the previous tree from the real price (remember that XGBoost trains several trees sequentially). However, this is the first tree, so we do not have predicted prices yet. In that case, the mean of the price is used as the initial prediction.

Calculation of the first residuals.
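
As a minimal sketch of this step (the prices below are hypothetical, not the ones in the table):

```python
import numpy as np

# Minimal sketch of step 2 with hypothetical prices: the first "prediction"
# is the mean of the target, and the first residuals are price minus that mean.
price = np.array([200_000, 350_000, 500_000, 125_000])  # hypothetical target values
prediction_t0 = price.mean()
residuals_0 = price - prediction_t0
print(prediction_t0, residuals_0)
```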

Step 3: Build the first tree of XGBoost

The first tree is going to be trained with all the residuals as the target. So the first thing to do is to calculate the similarity score for all the residuals. This is the score that the tree splits try to increase.

Calculation of the Similarity Score for the first tree.

Do you see the lambda (λ) character? As explained in the paper theory section, this is the regularisation term that helps to prevent overfitting: by sitting in the denominator it shrinks the similarity scores and reduces the sensitivity to individual observations. Thus, the XGBoost learning objective (increase the similarity score through the gain) is actually a regularised learning objective. The default value for the regularisation term in XGBoost is 1. Now, let’s see how the feature Has Garage? influences the similarity score and the gain.

Similarity scores calculated for the split with the garage feature.
Final gain for garage split.

Once the similarity scores are calculated, the gain for the feature Has Garage? is 7448.58. If the final gain is positive, then it is a good split; otherwise, it is not. As it is positive, we can conclude that it is a good split.
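
For the squared-error case used in this example, the similarity score and gain can be sketched as follows (a simplified view of the paper’s formulas; any residual values you pass in are your own):

```python
import numpy as np

# Simplified similarity score and gain for the squared-error regression case,
# matching the visual explanation (lam is the lambda regularisation term).
def similarity(residuals: np.ndarray, lam: float = 1.0) -> float:
    return residuals.sum() ** 2 / (len(residuals) + lam)

def gain(left: np.ndarray, right: np.ndarray, lam: float = 1.0) -> float:
    root = np.concatenate([left, right])
    return similarity(left, lam) + similarity(right, lam) - similarity(root, lam)
```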

For a continuous feature like Square Meters, the process is slightly different. First, we need to find the best split for the continuous feature. For that purpose, the exact greedy algorithm explained in the paper theory section will be used. That is to say, every possible split is going to be tested.

Sorting the feature and finding all the possible splits (greedy).

The sorting shown in the figure above can be computationally expensive when datasets are big. As discussed in the paper theory section, XGBoost uses block units that allow parallelisation and help with this problem. Also, remember that XGBoost can use the weighted quantile sketch algorithm to propose candidate splitting points according to percentiles of the feature distributions. This is not explained here, but it is one of the main advantages of XGBoost. That being said, let’s calculate the similarity scores and the gain for each possible split.

Calculations of the gain for each Square Meters split.
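
A sketch of that exact greedy enumeration, reusing the gain() helper from above (the feature and residual arrays would be whatever your data contains):

```python
import numpy as np

# Exact greedy search over one continuous feature: sort it, try every midpoint
# between consecutive values as a threshold and keep the split with the best gain.
def best_split(feature: np.ndarray, residuals: np.ndarray, lam: float = 1.0):
    order = np.argsort(feature)
    feature, residuals = feature[order], residuals[order]
    best_threshold, best_gain = None, -np.inf
    for i in range(1, len(feature)):
        threshold = (feature[i - 1] + feature[i]) / 2
        g = gain(residuals[:i], residuals[i:], lam)
        if g > best_gain:
            best_threshold, best_gain = threshold, g
    return best_threshold, best_gain
```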

The split of the Square Meters feature at 175 has the greatest gain (even greater than the Has Garage? split), so it should be the first split. Only one residual is left in the right leaf (156). So let’s focus on the left leaf and look for another split. After all the calculations, the best split is made with the Has Garage? feature.

Best second split on garage feature.

After new calculations, the left leaf can be split again.

Final tree with every possible split done.

Step 4: Pruning the tree

In order to avoid overfitting, XGBoost performs another process called tree pruning. From bottom to top, every gain is validated. How? If an XGBoost parameter called gamma (γ) is greater than the gain, then the split is removed. The default value of γ in XGBoost is 0, but a value of 200 is used here to illustrate a case in which a split is removed. Why? Because in that case, 200 (γ) is greater than the last gain (198). So the tree is pruned and the final result is:

Last version of the first tree after pruning.

Step 5: Make predictions with the tree

In order to make predictions, the first step is to have a single value (the output value) in each final leaf. For that, we use the following formula:

Calculation of the output value (value of each leaf).
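
In the squared-error case of this example, that formula reduces to the following, where the r_i are the residuals in the leaf, n is their number and λ is the regularisation term:

$$\text{output value} = \frac{\sum_i r_i}{n + \lambda}$$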

The result is:

First tree with an output value for each leaf.

Finally, the output values of the final leaves can be used to make new predictions with the formula shown below, where i is the observation we want to predict, prediction_t0 is the first prediction (the mean of the observed prices), ɛ is the learning rate and leaf_i is the leaf value reached by observation i when following the tree rules.
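
$$\text{prediction}_i = \text{prediction}_{t0} + \varepsilon \cdot \text{leaf}_i$$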

Note that the learning rate (ɛ) corresponds to the shrinkage described in the paper theory section. That being said, let’s predict all the observations:

Predictions for each value.

Now, we use the predictions to calculate new residuals (Residuals_1):

Residuals of the first tree.

As we can see above, every residual (except the first one) is closer to zero. That means that the first tree provides information that improves the first prediction (the mean). The next step would be to create another tree that takes another step towards reducing the residuals.

Step 6: Train new trees

If we want to create a new tree, the only thing to do is to redo steps 3–5 with the new residuals (Residuals_1) as the target. Then, the final prediction would be:

Calculation of the final predictions with two trees.
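
With two trees, the prediction accumulates both leaf values, each scaled by the learning rate:

$$\text{prediction}_i = \text{prediction}_{t0} + \varepsilon \cdot (\text{leaf}_i^{\text{ tree 1}} + \text{leaf}_i^{\text{ tree 2}})$$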

XGBoost can sequentially train trees using these steps. Personally, I find that the visual explanation is an effective way to comprehend the model and its theory. I hope it was helpful for you as well. Be that as it may, now it’s time to proceed with the practical section.

First, I would like to emphasize that while I am not an expert in the practical side of XGBoost, I have used certain techniques that have proven to be successful. As a result, I think that I can provide a brief guide to help you tune the hyperparameters.

To achieve this goal, it is essential to divide the practice into two subsections. The first one will focus on defining the hyperparameters of the algorithm and compiling them into a concise cheat sheet. The second subsection, called Hands-On Implementation, will provide a step-by-step guide that will walk you through the process of training XGBoost.

Hyperparameters

All the information explained here can be found in the official XGBoost documentation [12]. One of the general parameters is known as booster. As I explained in my previous article [9], gradient boosting is a technique that is not restricted to decision trees: it can boost other models too. That is why XGBoost accepts three values for the booster parameter:

  • gbtree: gradient boosting with decision trees (default value).
  • dart: gradient boosting with decision trees that uses the method proposed by Vinayak and Gilad-Bachrach (2015) [13], which adds dropout techniques from the deep neural net community to boosted trees.
  • gblinear: gradient boosting with linear functions.

While gblinear is the best option for capturing linear relationships between the predictors and the outcome, the tree-based boosters (gbtree and dart) are much better at capturing non-linear relationships. As gbtree is the most used value, the rest of the article is going to use it. If you are interested in using the others, check the documentation [12], because the available parameters change depending on your choice.

Once we set gbtree as the booster, there are several parameters that we can tune in order to get the best model. The most important ones are explained here:

  • eta (a.k.a. learning_rate): shown in the visual explanation section as ɛ, it limits the weight each trained tree has in the final prediction, making the boosting process more conservative.
  • gamma: shown in the visual explanation section as γ, it marks the minimum gain required to make a further partition on a leaf node of the tree.
  • max_depth: sets the maximum depth of a tree.
  • n_estimators: number of trees trained.
  • min_child_weight: sets the minimum sum of instance weights (the hessian; for squared-error regression this is simply the number of instances) needed in a child to allow a split.
  • subsample: sets the fraction of training rows sampled (randomly) before growing each tree.
  • colsample_by[]: a family of parameters for subsampling columns. The subsampling can be done for each tree (colsample_bytree), for each depth level reached in a tree (colsample_bylevel) or every time a new split is evaluated (colsample_bynode). Note that these parameters work cumulatively: if every parameter has a value of 0.5 and you have 32 columns, then each split would use 4 columns (32 / 2³).
  • lambda (L2 regularization): shown in the visual explanation as λ. It decreases the output value (step 5 in the visual explanation) smoothly as it increases the denominator. [14]
  • alpha (L1 regularization): It decreases the output value and promotes sparsity by forcing output values to be 0. [14]
  • tree_method: sets the construction algorithm used to find the splits. As discussed in the paper theory section, exact or approximate algorithms can be used. In practice, there are five possible values for this parameter: auto lets a heuristic select the fastest option among the following ones, exact applies the exact greedy algorithm that enumerates all split candidates, approx applies an approximate greedy algorithm that uses quantile sketch and gradient histograms, hist uses a histogram-optimised version of the approximate algorithm, and gpu_hist uses the GPU implementation of hist.
  • scale_pos_weight: it controls the balance of positive and negative weights, which can be useful for unbalanced classes. A typical value to consider is sum(negative instances) / sum(positive instances).
  • max_leaves: sets the maximum number of nodes to be added; it is not used when tree_method is set to exact.

There are more parameters to tune: updater, refresh_leaf, process_type, grow_policy, max_bin, predictor, num_parallel_tree, monotone_constraints or interaction_constraints. However, the ones above are enough to make the most of your XGBoost model, and that is why they were explained. A short sketch mapping them to the scikit-learn style estimator follows.
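
The values below are placeholders rather than recommendations; the sketch only shows how the parameters discussed above appear in the scikit-learn style constructor:

```python
from xgboost import XGBRegressor

# Illustrative mapping of the parameters discussed above; values are placeholders.
model = XGBRegressor(
    booster="gbtree",
    n_estimators=100,       # number of trees
    learning_rate=0.3,      # eta
    max_depth=6,
    gamma=0.0,              # minimum gain required to keep a split
    min_child_weight=1.0,
    subsample=1.0,          # row subsampling per tree
    colsample_bytree=1.0,   # column subsampling per tree
    reg_lambda=1.0,         # L2 regularisation (lambda)
    reg_alpha=0.0,          # L1 regularisation (alpha)
    tree_method="auto",
)
```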

That said, how does tuning these parameters affect the model? The cheat sheet shown below helps us understand the effect of tuning each parameter in different ways. It also shows how each specific change affects the results, in terms of whether the variance and the bias improve or worsen.

Cheat sheet of main parameters and their effects.

Disclaimer: decreasing or increasing a value means doing it from the perfect bias-variance balance of the specific parameter, not from its default value. Also, there are many interactions between the parameters, so decreasing some values and increasing others may produce results different from those described in the table.

Hands-on Implementation

For the implementation, we are going to use two resources. First, we are going to use the data from the Kaggle competition called House Prices [15]. Secondly, and given that the focus of the article is XGBoost and not the rest of the tasks related to a machine learning project, we will borrow code from Nanashi’s notebook [16] to perform the data processing.

Although you will be able to see the most relevant code fragments throughout the reading, all the code can be seen in my GitHub Repository dedicated to Medium articles.

As mentioned before, I am not a great connoisseur of XGBoost training, so I encourage you to share your tricks and techniques in the comments section, or even criticise this way of training if you really think it is wrong. In my experience though, following the next steps is a good way to iteratively improve the model.

Step 0: Data read and processing

As mentioned before, the data is read and processed. It is divided into two parts:

  • Train set (85% of the data), which contains X_train and y_train. Steps 1–4 will use this data set to build the best possible XGBoost model. During the building process, the mean test score will be discussed; note that this “test” score does not really correspond to the test set, but to the validation folds generated during the cross-validation process.
  • Test set (15% of the data), which contains X_test and y_test. It is commonly known as the holdout. When we get the best possible XGBoost model, we should check that the model behaves on the test set in the same way as on the train set.
Code for reading and processing data from House Prices Kaggle competition.

If you want to see the processing details, check the GitHub repository.
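As a rough sketch of what this step looks like (the actual feature engineering follows Nanashi’s notebook and lives in the repository; the file name and target column below are those of the Kaggle competition):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rough sketch of step 0: read the competition's training file, separate the
# target and hold out 15% of the rows as a test set (the holdout).
data = pd.read_csv("train.csv")
X = data.drop(columns=["SalePrice"])   # features, before the notebook's processing
y = data["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
```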

Step 1: Create a baseline

The first step is to create a baseline. For that purpose, a simple XGBoost model is created without tuning any parameter. In order to better compare the results, cross-validation [17] with 5 folds is used. The metric used is the Mean Absolute Error (MAE).

Code for creating an XGBoost baseline
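
A minimal sketch of that baseline, assuming the X_train/y_train split from step 0 (the random_state is an arbitrary choice):

```python
from sklearn.model_selection import cross_validate
from xgboost import XGBRegressor

# Baseline: a default XGBoost regressor evaluated with 5-fold cross-validation
# and the Mean Absolute Error as the metric.
baseline = XGBRegressor(random_state=42)
cv_results = cross_validate(
    baseline, X_train, y_train,
    cv=5,
    scoring="neg_mean_absolute_error",
    return_train_score=True,
)
print("mean test MAE:", -cv_results["test_score"].mean())
print("mean train MAE:", -cv_results["train_score"].mean())
```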

That gives us a mean test score of 0.0961 and a mean train score of 0.0005. It seems that the bias is low but the variance is high, which can be translated as the presence of overfitting. However, the baseline is only a reference; let’s move on and see what’s next in the second step.

Step 2: Use GridSearchCV for improving the baseline

Now, we will use GridSearchCV [18] to search for a good combination of parameters in a parameter grid. On the first try, we will use parameter values that are close to those used by XGBoost by default:

Code for using GridSearchCV with XGBoost
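
A sketch of that first search; the grid values here are illustrative choices close to XGBoost’s defaults, not necessarily the exact grid used in the article:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# First grid search around the default parameter values, scored with MAE.
param_grid = {
    "n_estimators": [100, 125, 150],
    "learning_rate": [0.1, 0.3],
    "max_depth": [4, 6, 8],
    "min_child_weight": [1, 3],
}
grid = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)
```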

The best combinations are:

Best parameters combinations found by GridSearchCV.

The best mean test score obtained is 0.0840. It is better than the one obtained with the first model (0.0961), so from now on it will be considered the metric to improve.

Step 3: Tune parameters individually

Once we have beaten our previous baseline score, we can try to understand how tuning the parameters individually affects the results. With the code provided, several parameters can be tested at once. Let’s see, for example, how these parameter values affect the results:

  • n_estimators: [125, 150, 175, 200, 225]

Each candidate value is used in a cross-validated training run. The results are compiled and shown in the plot below.

Code for tuning parameters individually and plotting the scores.
Plot showing how changes in the number of estimators affect the results.
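
One possible way to produce that kind of plot (not necessarily the article’s exact code) is scikit-learn’s validation_curve:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from xgboost import XGBRegressor

# Vary a single parameter (here n_estimators) with 5-fold CV and plot the
# mean train and test MAE for each candidate value.
param_range = [125, 150, 175, 200, 225]
train_scores, test_scores = validation_curve(
    XGBRegressor(random_state=42), X_train, y_train,
    param_name="n_estimators", param_range=param_range,
    scoring="neg_mean_absolute_error", cv=5,
)
plt.plot(param_range, -train_scores.mean(axis=1), label="mean train MAE")
plt.plot(param_range, -test_scores.mean(axis=1), label="mean test MAE")
plt.xlabel("n_estimators")
plt.ylabel("MAE")
plt.legend()
plt.show()
```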

The graph shows that the mean test score can be lower if n_estimators is higher. However, the mean train score is getting close to zero, so overfitting is knocking on the door.

Step 4: Repeat steps 2 and 3.

With the information gathered in step 3, we can redefine the parameter grid in GridSearchCV and try to get a better model. In this case, higher n_estimators values are tested. This means that the complexity of the model is going to be higher, so the parameter grid also includes some values that can help to avoid overfitting (lower learning_rate, higher lambda, higher gamma…). The cheat sheet defined in the hyperparameters section can be very useful here.

Code of the second GridSearchCV with XGBoost.
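
A sketch of what such a refined grid could look like (again, illustrative values rather than the article’s exact grid):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Second, refined grid: more trees plus values that counteract the added
# complexity (smaller learning rate, stronger regularisation).
param_grid_2 = {
    "n_estimators": [200, 300, 400],
    "learning_rate": [0.03, 0.05, 0.1],
    "reg_lambda": [1, 5, 10],
    "gamma": [0, 1, 5],
}
grid_2 = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid_2,
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,
)
grid_2.fit(X_train, y_train)
print(grid_2.best_params_, -grid_2.best_score_)
```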

Using this code we achieve a new best mean test score of 0.0814 (previous: 0.0840).

Best parameters combinations found by the second GridSearchCV.

Steps 2–3 should be repeated as many times as needed in order to improve the model.

Step 5: Use the test set to validate the model.

Finally, we need to validate the model built against the test set (the holdout, not the validation folds) defined in step 0.

Code for calculating the holdout predictions MAE.
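
A sketch of that final check, assuming grid_2 is the last grid search from step 4:

```python
from sklearn.metrics import mean_absolute_error

# Refit the best configuration on the full training set and measure the MAE
# on the holdout test set.
best_model = grid_2.best_estimator_
best_model.fit(X_train, y_train)
holdout_mae = mean_absolute_error(y_test, best_model.predict(X_test))
print("holdout MAE:", holdout_mae)
```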

The MAE obtained is 0.0808. As it is in line with (and even slightly lower than) our best cross-validated MAE (0.0814), we can say that our model generalises well and has been trained properly.

In this article, I aimed to provide a comprehensive guide for using XGBoost with full proficiency. After hours of research on papers, guides, posts and articles, I believe I have produced a complete article that can help you comprehend and use XGBoost in one go.

I hope you have found the reading useful and enjoyable. And above all, I am happy to receive any kind of feedback. So feel free to share your thoughts!

[1] Wolpert, D. H., & Macready, W. G. «No free lunch theorems for optimization». IEEE transactions on evolutionary computation (1997). https://ieeexplore.ieee.org/abstract/document/585893

[2] Bojan Tunguz Twitter Account. https://twitter.com/tunguz

[3] Raschka, Sebastian (2022). Deep Learning for Tabular Data. https://sebastianraschka.com/blog/2022/deep-learning-for-tabular-data.html

[4] Bojan Tunguz Kaggle Profile. https://www.kaggle.com/tunguz

[5] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).

[6] Dorogush, A. V.; Ershov, V.; Gulin, A. «CatBoost: Gradient Boosting with Categorical Features Support». arXiv:1810.11363 (2018). http://arxiv.org/abs/1810.11363.

[7] Ke, G.; Meng, Q.; Finley, T; Wang, T; Chen, W; Ma, W; Ye, Q; Liu, T. «LightGBM: A Highly Efficient Gradient Boosting Decision Tree». Advances in Neural Information Processing Systems, 20 (2017). https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html.

[8] Tunguz, Bojan. https://twitter.com/tunguz/status/1620048813686923266?s=20&t=BxzKnvn7G0ieo1I7SfDnSQ

[9] Martín Lasaosa, Jorge. «Tree Ensembles: Bagging, Boosting and Gradient Boosting». In Towards Data Science (Medium). (2022) https://medium.com/r/?url=https%3A%2F%2Ftowardsdatascience.com%2Ftree-ensembles-theory-and-practice-1cf9eb27781

[10] T. Zhang and R. Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014. https://ieeexplore.ieee.org/abstract/document/6583153

[11] Breiman, L. «Random Forests». Machine Learning 45, (2001): 5–32. https://doi.org/10.1023/A:1010933404324.

[12] XGBoost Documentation (Parameters section) https://xgboost.readthedocs.io/en/stable/parameter.html

[13] Vinayak, R. K.; Gilad-Bachrach, R. «Dart: Dropouts meet multiple additive regression trees.» In Artificial Intelligence and Statistics. PMLR. (2015) https://proceedings.mlr.press/v38/korlakaivinayak15.html

[14] Um, Albert. «L1, L2 Regularization in XGBoost Regression» In Medium (2021) https://albertum.medium.com/l1-l2-regularization-in-xgboost-regression-7b2db08a59e0

[15] House Prices Kaggle Competition. https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

[16] Nanashi’s Notebook in Kaggle https://www.kaggle.com/code/jesucristo/1-house-prices-solution-top-1

[17] Yiu, Tony. «Understanding Cross Validation» In Towards Data Science (Medium) (2020) https://towardsdatascience.com/understanding-cross-validation-419dbd47e9bd

[18] «Hyperparameter Tuning with GridSearchCV» In Great Learning (2022) https://www.mygreatlearning.com/blog/gridsearchcv/




