
Can I Trust My Model’s Probabilities? A Deep Dive into Probability Calibration | by Eduardo Blancas | Nov, 2022



Statistics for Data Science

Photo by Edge2Edge Media on Unsplash

Suppose you have a binary classifier and two observations; the model scores them as 0.6 and 0.99, respectively. Is there a higher chance that the sample with the 0.99 score belongs to the positive class? For some models, this is true, but for others it might not be.

This blog post will dive deeply into probability calibration, an essential tool for every data scientist and machine learning engineer. Probability calibration allows us to ensure that higher scores from our model correspond to a higher likelihood of belonging to the positive class.

The post will provide reproducible code examples with open-source software so you can run it with your data! We’ll use sklearn-evaluation for plotting and Ploomber to execute our experiments in parallel.

Hi! My name is Eduardo, and I like writing about all things data science. If you want to keep up to date with my content, follow me on Medium or Twitter. Thanks for reading!

When training a binary classifier, we’re interested in finding out whether a specific observation belongs to the positive class. What the positive class means depends on the context. For example, if working on an email filter, it can mean that a particular message is spam; if working on content moderation, it can mean a harmful post.

Using a number in a real-valued range provides more information than a Yes/No answer. Fortunately, most binary classifiers can output scores (note that here I’m using the word scores and not probabilities, since the latter has a strict definition).

Let’s see an example with a logistic regression:
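As a minimal sketch (the synthetic dataset from make_classification and the variable names are assumptions for illustration, not the post’s original code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic binary classification data (an assumption for this sketch)
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit a logistic regression on the training split
clf = LogisticRegression(max_iter=1_000)
clf.fit(X_train, y_train)
```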

The predict_proba function allows us to output the scores (in logistic regression’s case, these are indeed probabilities):
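Continuing the sketch above (clf and X_test are the assumed names from the previous snippet):

```python
# one row per observation: P(class 0), P(class 1)
probs = clf.predict_proba(X_test)
print(probs[:5])
```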

Console output (1/1):

Each row in the output represents the probability of belonging to class 0 (first column) or class 1 (second column). As expected, the rows add up to 1.

Intuitively, we expect a model to give a higher probability when it’s more confident about specific predictions. For example, if the probability of belonging to class 1 is 0.6, we might assume the model isn’t as confident as it is with an example whose probability estimate is 0.99. This is a property exhibited by well-calibrated models.

This property is advantageous because it allows us to prioritize interventions. For example, if working on content moderation, we might have a model that classifies content as not harmful or harmful; once we obtain the predictions, we might decide to only ask the review team to check the ones flagged as harmful, and ignore the rest. However, teams have limited capacity, so it’d be better to only pay attention to posts with a high probability of being harmful. To do that, we could score all new posts, take the top N with the highest scores, and then hand over these posts to the review team.

However, models do not always exhibit this property, so we must ensure our model is well-calibrated if we want to prioritize predictions depending on the output probability.

Let’s see if our logistic regression is calibrated.
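One way to check, sketched here with the assumed clf, X_test, and y_test from above, is to pair each test observation’s positive-class probability with its true label:

```python
import pandas as pd

# probability of the positive class for each test observation
scores = clf.predict_proba(X_test)[:, 1]
df = pd.DataFrame({"prob": scores, "target": y_test})
df.head()
```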

Console output (1/1):

Let’s now group by probability bin and check the proportion of samples within that bin that belong to the positive class:
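A sketch of that binning step, assuming the df built above:

```python
# ten equal-width bins: (0.0, 0.1], (0.1, 0.2], ..., (0.9, 1.0]
df["prob_bin"] = pd.cut(df["prob"], bins=[i / 10 for i in range(11)])

# proportion of positive-class samples within each bin
df.groupby("prob_bin", observed=True)["target"].mean()
```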

Console output (1/1):

We can see that the model is reasonably calibrated. No sample belongs to the positive class for outputs between 0.0 and 0.1. For the remaining bins, the proportion of actual positive-class samples falls close to the bin’s boundaries. For example, of the samples scored between 0.3 and 0.4, 29% belong to the positive class. A logistic regression returns well-calibrated probabilities because of its loss function.

It’s hard to evaluate the numbers in a table; this is where a calibration curve comes in, allowing us to assess calibration visually.

A calibration curve is a graphical representation of a model’s calibration. It allows us to benchmark our model against a target: a perfectly calibrated model.

A perfectly calibrated model will output a score of 0.1 when it’s 10% confident that the observation belongs to the positive class, 0.2 when it’s 20%, and so on. So if we draw this, we’d have a straight line:

A perfectly calibrated model. Image by author.

Furthermore, a calibration curve allows us to compare several models. For example, if we want to deploy a well-calibrated model into production, we might train several models and then deploy the one that is better calibrated.

We’ll use a notebook to run our experiments and change the model type (e.g., logistic regression, random forest, etc.) and the dataset size. You can see the source code here.

The notebook is straightforward: it generates sample data, fits a model, scores out-of-sample predictions, and saves them. After running all the experiments, we’ll download the model’s predictions and use them to plot the calibration curve along with other plots.

To accelerate our experimentation, we’ll use Ploomber Cloud, which allows us to parametrize and run notebooks in parallel.

Note: the commands in this section are bash commands. Run them in a terminal or add the %%sh magic if you execute them in Jupyter.

Let’s download the notebook:

Console output (1/1):

Now, let’s run our parametrized notebook. This will trigger all our parallel experiments:

Console output (1/1):

After a minute or so, we’ll see that all our 28 experiments are finished executing:

Console output (1/1):

Let’s download the probability estimations:

Console output (1/1):

Each experiment stores the model’s predictions in a .parquet file. Let’s load the data to generate a data frame with the model type, sample size, and path to the model’s probabilities (as generated by the predict_proba method).
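Assuming, purely for illustration, that each experiment saved its predictions under experiments/<model-name>/<n_samples>/output.parquet (the notebook’s real layout may differ), building that data frame could look like this:

```python
from pathlib import Path

import pandas as pd

rows = []
# hypothetical layout: experiments/<model-name>/<n_samples>/output.parquet
for path in Path("experiments").glob("*/*/output.parquet"):
    rows.append(
        {
            "name": path.parts[-3],            # model name
            "n_samples": int(path.parts[-2]),  # sample size
            "path": str(path),                 # model's predictions
        }
    )

experiments = pd.DataFrame(rows)
experiments.head()
```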

Console output (1/1):

name is the model name. n_samples is the sample size, and path is the path to the output data generated by each experiment.

Logistic regression is a special case since it’s well-calibrated by design given that its objective function minimizes the log-loss function.

Let’s see its calibration curve:
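The post draws this with sklearn-evaluation; a rough equivalent using scikit-learn’s built-in CalibrationDisplay (reusing the assumed clf, X_test, and y_test) looks like this:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay

# reliability diagram: mean predicted probability vs. observed fraction of positives
CalibrationDisplay.from_estimator(clf, X_test, y_test, n_bins=10)
plt.show()
```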

Console output (1/1):

Logistic regression calibration curve. Image by author.

You can see that the probability curve closely resembles that of a perfectly calibrated model.

In the previous section, we showed that logistic regression is designed to produce calibrated probabilities. But beware of the sample size. If you don’t have a large enough training set, the model might not have enough information to calibrate the probabilities. The following plot shows the calibration curves for a logistic regression model as the sample size increases:
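A sketch of that experiment (the sample sizes, synthetic data, and plotting details are assumptions rather than the post’s exact code):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

fig, ax = plt.subplots()

for n_samples in (1_000, 10_000, 100_000):
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
    # overlay one calibration curve per sample size
    CalibrationDisplay.from_estimator(
        model, X_te, y_te, n_bins=10, ax=ax, name=f"n={n_samples:,}"
    )

plt.show()
```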

Console output (1/1):

Logistic regression calibration curve for different sample sizes. Image by author.

You can see that with 1,000 samples, the calibration is poor. However, once you pass 10,000 samples, more data does not significantly improve the calibration. Note that this effect depends on the dynamics of your data; you might need more or less data in your use case.

While a logistic regression is designed to produce calibrated probabilities, other models do not exhibit this property. Let’s look at the calibration plot for an AdaBoost classifier:
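For instance, a sketch that overlays AdaBoost and the logistic regression from the earlier snippets (same assumed synthetic data):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(random_state=0).fit(X_train, y_train)

fig, ax = plt.subplots()
CalibrationDisplay.from_estimator(ada, X_test, y_test, n_bins=10, ax=ax, name="AdaBoost")
CalibrationDisplay.from_estimator(clf, X_test, y_test, n_bins=10, ax=ax, name="Logistic regression")
plt.show()
```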

Console output (1/1):

Calibration curve for AdaBoost with different sample sizes. Image by author.

You can see that the calibration curve looks highly distorted: the fraction of positives (y-axis) is far from its corresponding mean predicted value (x-axis); furthermore, the model doesn’t even produce values along the full 0.0 to 1.0 axis.

Even at a sample size of 1,000,000, the curve could be better. In upcoming sections, we’ll see how to address this problem, but for now, remember this: not all models will produce calibrated probabilities by default. In particular, maximum-margin methods such as boosting (AdaBoost is one of them), SVMs, and Naive Bayes yield uncalibrated probabilities (Niculescu-Mizil and Caruana, 2005).

AdaBoost (unlike logistic regression) has an optimization objective that does not produce calibrated probabilities. However, this does not imply an inaccurate model, since accuracy is computed on the thresholded (binary) predictions rather than the raw scores. Let’s compare the performance of both models.

Now we plot and compare the classification metrics. AdaBoost’s metrics are displayed in the upper half of each square, while the logistic regression’s are in the lower half. We’ll see that both models have similar performance:
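The post produces this comparison with sklearn-evaluation; as a simpler stand-in (reusing the assumed ada and clf models), the usual metrics can be compared directly:

```python
from sklearn.metrics import classification_report

# precision, recall, and F1 for both models on the same test set
print("AdaBoost:\n", classification_report(y_test, ada.predict(X_test)))
print("Logistic regression:\n", classification_report(y_test, clf.predict(X_test)))
```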

Console output (1/1):

AdaBoost and logistic regression metrics comparison. Image by author.

Until now, we’ve solely used the calibration curve to judge whether a classifier is calibrated. However, another crucial factor to take into account is the distribution of the model’s predictions. That is, how common or rare score values are.

Let’s look at the random forest calibration curve:

Console output (1/1):

Random forest vs logistic regression calibration curve. Image by author.

The random forest follows a similar pattern as the logistic regression: the larger the sample size, the better the calibration. Random forests are known to provide well-calibrated probabilities (Niculescu-Mizil and Caruana, 2005).

However, this is only part of the picture. First, let’s look at the distribution of the output probabilities:
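A sketch of that comparison, assuming a random forest trained on the same synthetic data as before:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# histogram of the positive-class scores for each model
plt.hist(rf.predict_proba(X_test)[:, 1], bins=20, alpha=0.5, label="Random forest")
plt.hist(clf.predict_proba(X_test)[:, 1], bins=20, alpha=0.5, label="Logistic regression")
plt.xlabel("Predicted probability of the positive class")
plt.ylabel("Count")
plt.legend()
plt.show()
```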

Console output (1/1):

Random forest vs logistic regression distribution of probabilities. Image by author.

We can see that the random forest pushes the probabilities towards 0.0 and 1.0, while the probabilities from the logistic regression are less skewed. While the random forest is calibrated, there aren’t many observations in the 0.2 to 0.8 region. On the other hand, the logistic regression has support along the entire 0.0 to 1.0 range.

An even more extreme example is when using a single tree: we’ll see an even more skewed distribution of probabilities.

Console output (1/1):

Decision tree distribution of probabilities. Image by author.

Let’s look at the probability curve:

Console output (1/1):

Decision tree probability curves for different sample sizes. Image by author.

You can see that the two points we have (0.0 and 1.0) are calibrated (they’re reasonably close to the dotted line). However, there are no other points on the curve because the model never outputs intermediate probabilities.

Training/Calibration/Test split. Image by author.

There are a few techniques to calibrate classifiers. They work by using your model’s uncalibrated predictions as input for training a second model that maps the uncalibrated scores to calibrated probabilities. We must use a new set of observations to fit the second model. Otherwise, we’ll introduce bias in the model.

There are two widely used methods: Platt’s method and Isotonic regression. Platt’s method is recommended when the calibration data is small. In contrast, Isotonic regression is better when we have enough data to prevent overfitting (Niculescu-Mizil and Caruana, 2005).

Keep in mind that applying calibration won’t automatically produce a perfectly calibrated model. The models whose predictions benefit the most from calibration are boosted trees, random forests, SVMs, bagged trees, and neural networks (Niculescu-Mizil and Caruana, 2005).

Remember that calibrating a classifier adds more complexity to your development and deployment process. So before attempting to calibrate a model, ensure there aren’t more straightforward approaches, such as better data cleaning or using logistic regression.

Let’s see how we can calibrate a classifier with Platt’s method, using a train, calibrate, and test split:
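A sketch with scikit-learn’s CalibratedClassifierCV (the split sizes, data, and base model are assumptions). With cv="prefit", the base model is treated as already trained, so the calibration set is used only to fit the sigmoid (Platt) mapping:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

# split into train (base model), calibration (second model), and test (evaluation)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1) fit the uncalibrated base model on the training set
base = AdaBoostClassifier(random_state=0).fit(X_train, y_train)

# 2) fit the calibration mapping (Platt's method / sigmoid) on the calibration set
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
calibrated.fit(X_cal, y_cal)

# 3) evaluate the calibrated probabilities on the held-out test set
probs = calibrated.predict_proba(X_test)[:, 1]
```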

Console output (1/1):

Uncalibrated vs calibrated model. Image by author.

Alternatively, you might use cross-validation and the test fold to evaluate and calibrate the model. Let’s see an example using cross-validation and Isotonic regression:

Using cross-validation for calibration. Image by author.
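A sketch of the cross-validated variant, reusing the assumed data split from the previous snippet: with an integer cv, CalibratedClassifierCV trains the base model and fits the isotonic mapping on complementary folds internally, so no separate calibration split is needed:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import AdaBoostClassifier

# each of the 5 folds trains a base model and calibrates it on the held-out fold
calibrated_cv = CalibratedClassifierCV(
    AdaBoostClassifier(random_state=0), method="isotonic", cv=5
)
calibrated_cv.fit(X_train, y_train)

probs_cv = calibrated_cv.predict_proba(X_test)[:, 1]
```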

Console output (1/1):

Uncalibrated vs calibrated model (using cross-validation). Image by author.

In the previous section, we discussed methods for calibrating a classifier (Platt’s method and Isotonic regression), which only support binary classification.

However, calibration methods can be extended to support multiple classes by following the one-vs-all strategy as shown in the following example:
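For instance, a sketch with an assumed three-class synthetic dataset; scikit-learn’s CalibratedClassifierCV handles the multi-class case by calibrating each class one-vs-rest and renormalizing the probabilities:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# assumed three-class synthetic dataset
X, y = make_classification(
    n_samples=20_000, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# calibrates one-vs-rest internally and renormalizes so rows sum to 1
calibrated = CalibratedClassifierCV(
    AdaBoostClassifier(random_state=0), method="sigmoid", cv=5
)
calibrated.fit(X_train, y_train)

calibrated.predict_proba(X_test)[:5]
```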

Console output (1/1):

Uncalibrated vs calibrated multi-class model. Image by author.

In this blog post, we took a deep dive into probability calibration, a practical tool that can help you develop better predictive models. We also discussed why some models exhibit calibrated predictions without extra steps while others need a second model to calibrate their predictions. Through some simulations, we also demonstrated the sample size’s effect and compared several models’ calibration curves.

To run our experiments in parallel, we used Ploomber Cloud, and to generate our evaluation plots, we used sklearn-evaluation. Ploomber Cloud has a free tier, and sklearn-evaluation is open-source, so you can grab this post in notebook format from here, get an API Key and run the code with your data.

If you have questions, feel free to join our community!

Here are the versions we used for the code examples:

Console output (1/1):



