
How Does XGBoost Handle Multiclass Classification?
by Saupin Guillaume, January 2023



Photo by Andrew Coop on Unsplash

In this article, we will see how ensembles of decision trees trained with Gradient Boosting libraries such as XGBoost, LightGBM, and CatBoost perform multiclass classification.

Indeed, an ensemble of decision trees associates a real value to a set of features, so the question is: how do decision tree ensembles turn a scalar value into a multiclass label?

Understanding the underlying workings of classification using this kind of model is crucial, as it impacts performance.

We will enter progressively into the subject following the plan below:

  • Reminder and toy example of binary classification in Python
  • First binary classification using XGBoost as a regressor
  • Second binary classification using XGBoost as a classifier
  • Multiclass classification using XGBoost

XGBoost, LightGBM, and CatBoost are libraries that share (by default) the same kind of underlying model: decision trees.

These decision trees are combined iteratively using Gradient Boosting: new nodes are added to the current tree so that a non-linear objective, usually the squared error, is minimized. To handle the non-linearity, the objective is locally approximated using its Gradient and Hessian.
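Concretely, this is the standard second-order expansion from the XGBoost paper, written here in my own notation rather than taken from the article:

```latex
% Second-order approximation used when growing the tree f_t added at step t:
% g_k and h_k are the first and second derivatives of the loss with respect
% to the current prediction, and \Omega is a regularization term.
\mathcal{L}^{(t)} \approx \sum_k \left[ g_k\, f_t(x_k) + \tfrac{1}{2}\, h_k\, f_t(x_k)^2 \right] + \Omega(f_t)
```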

Hence the name Gradient Boosting. More details can be found in my previous article on the subject.

As a reminder, the prediction process is relatively simple: given a row of data, each decision tree of the ensemble is traversed.

Depending on the values of the features, each tree then associates a unique value to the row, attached to the leaf that is reached.

The individual predictions of the trees are then simply summed up to give the overall prediction.

The figure below illustrates this in a simple example, where an ensemble of decision trees models the identity function for an integer between 1 and 4:

A simple ensemble of decision trees. Schema from the author.

For instance, when the input is 1, the first tree generates 8, the second tree -6, and the last one -1. Summing these three values gives 1, which is the expected output.
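As a toy illustration of this summation (the lambdas below just hard-code the leaf values quoted above for the input 1; they are not real fitted trees):

```python
# A toy illustration (not the author's actual trees): an ensemble prediction
# is simply the sum of the value returned by each individual tree.
def predict(trees, x):
    """Sum the leaf value produced by every tree for the input x."""
    return sum(tree(x) for tree in trees)

# Hypothetical leaf values matching the example above for the input 1:
trees = [lambda x: 8, lambda x: -6, lambda x: -1]
print(predict(trees, 1))  # 8 - 6 - 1 = 1, the expected output
```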

This example is extracted from my book on gradient boosting, Practical Gradient Boosting.

Using a single scalar value, the best we can do is perform a binary classification, labelling negative predictions with one class and positive ones with the other.

Binary classification without XGBoost

Before exploring that first option, i.e. binary classification with XGBoost as a regressor, let's look in detail at how binary classification is done without it.

The problem we are trying to solve here is simple: we want to estimate a student's probability of success as a function of the number of hours they spend studying the subject.

The figure below shows the collected data, i.e. the number of hours of work and the result: pass or fail.

The probability of success depends on study hours. Plot by the author.

The standard model used for this kind of classification is the logistic function. It is similar to a linear model, except that instead of taking values in ℝ, it only generates values in the range [0, 1]. Its formula is worth knowing:

Logistic function. Formula by the author.
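The formula image is not reproduced here; in a standard one-parameter form (the exact parametrization used in the article is an assumption), the model reads:

```latex
% Logistic model for the probability of success as a function of the number
% of study hours x, with a single threshold parameter m (assumed form).
f(x) = \frac{1}{1 + e^{-(x - m)}}
```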

As always in machine learning, finding the best parameters for a model, here the logistic function, is done by minimizing an error. Facing a binary problem, where the positive output can be modelled by a 1 and the negative output by a 0, it’s possible to combine both errors in a single expression:

Simple error function. Formula by the author.

where y_k are the observed labels and f(x_k) are the predictions made by the model f.
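The original formula image is not reproduced here; a plausible form, consistent with the description of the log loss given further below, is:

```latex
% Assumed form of the simple combined error: each positive sample (y_k = 1)
% contributes 1 - f(x_k), each negative sample (y_k = 0) contributes f(x_k).
E(m) = \sum_k \Big[ y_k \big(1 - f(x_k)\big) + (1 - y_k)\, f(x_k) \Big]
```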

The difficulty with this basic error is that, due to the saturating nature of the logistic function, which takes values close to zero or one almost everywhere, the error as a function of the model parameter m will also take essentially two values. Hence, outside the vicinity of the optimal parameter, the error is flat.

Saturation of the error when the simple error is used directly. Plot by the author.

We could use this formula, and it would work, as long as we provide a pretty good initial estimate of the optimal parameter. If that is not the case, we risk ending up in the flat zone where the error is almost constant. In this area, the gradient is almost zero, and the convergence of steepest descent is agonizingly slow.

We need a way to map the per-sample error, which is limited to the range [0, 1], to ℝ+ so that there is no more saturation.

With the additional constraint that a null error must remain a null error after the transformation.

The trick is to realize that log(1) is zero, whereas log(0) is –∞.

Therefore the log-loss is used:

Log-loss. Formula by the author.

where y_k are the observed labels and f(x_k) are the predictions made by the model f. Note the minus sign in front of the sum and the swap of 1-f(x_k) with f(x_k). This is because log(1)=0.
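Written out, this is the standard binary log loss the text describes (reconstructed, since the original formula image is not reproduced):

```latex
% Binary log loss: a perfectly predicted sample contributes log(1) = 0,
% while a completely wrong one contributes -log(0) = +infinity.
E(m) = -\sum_k \Big[ y_k \log f(x_k) + (1 - y_k) \log\big(1 - f(x_k)\big) \Big]
```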

Using the log loss, the error is no longer saturated:

The plot of the error using the log loss. Plot by the author.

The simplest way to minimize this error is steepest descent, which only requires computing the gradient of the error. Many options exist to do that; here we are going to use symbolic differentiation with sympy:

Logistic regression. Code by the author.
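The original listing is not reproduced here; below is a minimal sketch of what it might look like, assuming a single threshold parameter m and illustrative data (so the estimate printed at the end will not match the 14.77 reported in the article):

```python
# A minimal sketch (not the author's original listing): fit the threshold
# parameter m of a logistic model with steepest descent, using sympy to
# compute the gradient of the log loss symbolically.
import numpy as np
import sympy as sp

# Toy data: hours studied and pass (1) / fail (0). Values are illustrative.
hours = np.array([4, 6, 8, 10, 12, 14, 16, 18, 20, 22], dtype=float)
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

m = sp.Symbol("m")                        # model parameter (decision threshold)
x, y = sp.symbols("x y")                  # one sample: feature and label
f = 1 / (1 + sp.exp(-(x - m)))            # logistic model
loss = -(y * sp.log(f) + (1 - y) * sp.log(1 - f))  # log loss for one sample
grad = sp.diff(loss, m)                   # symbolic gradient w.r.t. m
grad_fn = sp.lambdify((x, y, m), grad, "numpy")

m_val, lr = 0.0, 0.5                      # starting point and learning rate
for _ in range(2000):                     # plain steepest descent
    m_val -= lr * grad_fn(hours, passed, m_val).mean()

print(f"estimated threshold m: {m_val:.2f}")
```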

The algorithm finds a value of 14.77, which is very close to the theoretical one.

Let’s now go back to our subject, binary classification with decision trees and gradient boosting.

Binary classification with XGBoost

Let’s start with a simple example using the Cleveland Heart Disease Dataset (CC BY 4.0), where the classification is done with a regression. As we are performing binary classification, a simple regression works: we can attach a positive value, +1.0, to positive labels and a negative value, -1.0, to negative labels:

Performing classification using a regressor. Code by the author.
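The original listing is not reproduced here; the sketch below illustrates the idea, using synthetic stand-in data instead of the Cleveland dataset and a sign threshold instead of rounding:

```python
# A minimal sketch: binary classification with XGBRegressor by regressing
# on +1 / -1 targets (stand-in data, not the Cleveland Heart Disease Dataset).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
y_signed = np.where(y == 1, 1.0, -1.0)        # positive class -> +1.0, negative -> -1.0
X_train, X_test, y_train, y_test = train_test_split(X, y_signed, random_state=0)

reg = XGBRegressor(objective="reg:squarederror")  # default squared-error objective
reg.fit(X_train, y_train)

pred = np.where(reg.predict(X_test) > 0, 1.0, -1.0)  # map raw scores back to labels
print(confusion_matrix(y_test, pred))
```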

The default objective used by the XGBoost regressor is the squared error. The predictions are rounded to integers and, as the confusion matrix shows, the model makes no prediction error.

A similar result can be achieved using directly an XGBoost classifier:

Performing classification using a classifier. Code by the author.
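A sketch of the equivalent classifier version (again with stand-in data rather than the author's exact listing):

```python
# The same task with XGBClassifier: labels are the usual 0 / 1 classes and
# no rounding is needed.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier()
clf.fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
```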

In this case, there is no need to round the predictions to get the corresponding class; all the work is done natively by the XGBClassifier. Let’s see how XGBoost handles that.

XGBClassifier trains multiple models

When you do classification with XGBoost, using the XGBClassifier (or xgb.train with the right parameters for classification), XGBoost in fact trains multiple models, one for each class.

The snippet of code below shows how to get more insight into the internals of XGBoost.

Getting the individual probabilities for each class. Code by the author.
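The original snippet is not reproduced here; a minimal sketch of the same idea, using the iris dataset as a stand-in multiclass problem:

```python
# A sketch of inspecting the per-class probabilities produced by XGBoost.
from sklearn.datasets import load_iris
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
clf = XGBClassifier(objective="multi:softprob")
clf.fit(X, y)

proba = clf.predict_proba(X[:3])   # one probability per class, per row
print(proba)                       # shape (3, 3); each row sums to 1
print(proba.argmax(axis=1))        # predicted class = highest probability
```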

More specifically, the predict_proba method gives access to the raw data generated by the internal models. This clearly shows that, when doing classification, XGBoost makes a probability prediction for each class.

The predicted class is then the one with the highest probability.

Looking at the code that integrates XGBoost into sklearn, we get confirmation that XGBoost makes multiple predictions:

Extract from the open source code of XGBoost. See https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py#L1541

As can be seen at line 25 of this extract, argmax is used to retrieve the index of the class with the highest probability when softprob is used. When the objective is softmax, the predictions are simply cast to integers.

How does XGBoost perform multiclass classification?

Usually, the explanations of how XGBoost handles multiclass classification state that it trains multiple trees, one for each class.

This is not exactly the case. In fact, all the trees are constructed at the same time, using a vector objective function instead of a scalar one, i.e. there is one objective term per class.

The XGBoost documentation gives an example of such an objective:

Extract from the XGBoost documentation. See https://xgboost.readthedocs.io/en/stable/python/examples/custom_softmax.html
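The documentation listing is not reproduced here; below is a small numpy sketch, in the spirit of that example, of the quantities such a custom objective computes (the plumbing into xgb.train through the obj argument is omitted; the factor of 2 and the epsilon clipping on the second derivative follow the documentation example):

```python
# A numpy sketch of what the custom objective in the linked documentation
# example computes: per-class gradient and second derivative of the softmax
# log loss, given the raw margin scores of each group of trees.
import numpy as np

def softprob_grad_hess(raw_scores: np.ndarray, labels: np.ndarray):
    """raw_scores: (n_samples, n_classes) raw margins; labels: (n_samples,) ints."""
    # Row-wise softmax, shifted for numerical stability.
    e = np.exp(raw_scores - raw_scores.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    onehot = np.eye(raw_scores.shape[1])[labels]
    grad = p - onehot                              # d(log loss)/d(raw score)
    hess = np.maximum(2.0 * p * (1.0 - p), 1e-6)   # clipped second derivative
    return grad, hess

# Tiny usage example: 2 samples, 3 classes.
scores = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
g, h = softprob_grad_hess(scores, labels)
print(g)   # for the true class the gradient is p - 1, otherwise p
```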

There are two very interesting things in this snippet of code:

  1. The objective name is multi:softprob when using the built-in objective in XGBoost. This is quite confusing, as the objective is not really the softprob itself but the log loss of the softmax. This appears clearly in the code, as the gradient is directly the softmax. But the softmax is not the gradient of the softmax itself; it is the gradient of its log loss (the formulas are written out after this list):
The gradient of the log loss of softmax. Formulas by the author.

2. The other point is that the code uses a variable hess that stands for the Hessian. Mathematically speaking, however, what is used is not the full Hessian matrix but only its diagonal, i.e. the second derivative of the loss with respect to each raw prediction.
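For reference, here are the formulas behind point 1, written in my own notation (raw scores z, true class c, and p the softmax of z); these are the quantities the custom objective computes as grad and hess:

```latex
% Softmax, its log loss for a sample of true class c, and the first and
% second derivatives of that loss with respect to each raw score z_i.
p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad L = -\log p_c
\frac{\partial L}{\partial z_i} = p_i - \mathbb{1}[i = c], \qquad
\frac{\partial^2 L}{\partial z_i^2} = p_i \, (1 - p_i)
```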

