
Finding the Best Classification Threshold for Imbalanced Classifications with the Interactive Confusion Matrix and Line Charts | by Luca Zavarella | Jun, 2022



Take your analysis of binary classification problems to the next level using the amazing binclass-tools Python package

Image by the author

Data Scientists who train classification models often find themselves analyzing the resulting confusion matrix to see whether the model’s performance is satisfactory. Let’s take a closer look at what this is all about.

Introducing the Confusion Matrix

The term “confusion” refers to the fact that an observation can be correctly or incorrectly predicted by the model, which can therefore be “confused” in its classification. If we consider a binary classification model (whose possible outcomes are only two, e.g., true or false), the confusion matrix of the model is a matrix that organizes the predictions obtained for the target variable of a test dataset (given as input to the model) against the true values that this variable takes on in the dataset. In this way, it is possible to identify the number of correct predictions and the number of wrong ones (false positives and false negatives). Having defined the class of interest in the problem as positive (e.g., the transaction is fraudulent, so true is the positive class), false positives are those observations predicted to be true (positive class predicted) that in fact are not (wrong prediction); false negatives, on the other hand, are those observations predicted to be false (negative class predicted) that in fact are not (wrong prediction). Having understood these definitions, you can derive for yourself the meaning of true positives and true negatives.

This arrangement of the model’s results allows us to understand whether its performance is as expected. Generally, the confusion matrix of a binary model looks as follows:

Fig. 1 — Example of confusion matrix of a binary classification (image by the author)

Once the values of the four quadrants of the confusion matrix have been identified, it is possible to derive quantitative and qualitative indicators that measure the model’s performance, as shown in the references (see ref. 1).

At first glance it might appear that the values of the four quadrants of the confusion matrix are invariable for the model under analysis. Behind the scenes, however, those values are calculated based on a very specific assumption.

Be Aware of the Classification Threshold

A binary classification model primarily returns a probability-like score for each class of the target variable, which gives a measure of how likely it is that the observation belongs to the positive class. If you’re using Python to train classification models, the score we’re talking about is typically obtained via scikit-learn’s predict_proba method. This score has the following properties:

  • It’s a real number between 0 and 1
  • The score that associates an observation with the positive class and the score that associates the same observation with the negative class sum to 1

Anyone who has studied a bit of mathematics will recognize that these properties are the same ones that contribute to the definition of a probability measure. Our score, however, is not a true probability measure, even if it is possible to make our score approximate a probability measure by applying a transformation called model calibration. That said, the assumption about the values of the confusion matrix that we talked about earlier is as follows:

In general, the values of TP, TN, FP, and FN shown by a confusion matrix associated with a binary classification model are calculated assuming that a prediction is a positive class if the score is greater than or equal to 0.5, and a negative class if it is less than 0.5

The 0.5 value of the previous statement is called the classification threshold and can vary between 0 and 1. To understand what the threshold is for, imagine that you have trained a binary classification model for detecting smoke from a house fire. Your model will be implemented in a real smoke detector to be installed in the kitchen. Now suppose that the lab testing the product mounts the detector near a stove on which meat is being grilled. If the sensor threshold is set to a low value, it is very likely that the detector will report a fire (false positive) when grilling meat. To make the detector more reliable, lab technicians should increase the threshold to a value deemed appropriate to identify a real fire after several experiments.

Varying the threshold modifies the predicted classes. For example, an observation that is classified as true with a score of 0.54 and a threshold of 0.5 (0.54 is greater than 0.5, so the observation is true) becomes false if the threshold changes to 0.6 (0.54 is less than 0.6, so it’s false). Consequently, by varying the threshold, all the quantitative and qualitative indicators we mentioned earlier (for example, precision and recall) also vary. So, it is evident that:

Classifier performance varies as the threshold changes

It should be clear, therefore, that taking for granted a classifier that uses the default threshold of 0.5 can lead to low-quality results, especially when dealing with imbalanced datasets, as the probability distribution for imbalanced data tends to be biased toward the majority class.
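To make this concrete, here is a minimal sketch of how scikit-learn scores can be turned into class predictions with a non-default threshold; the dataset, the model, and the 0.3 cut-off are arbitrary choices made only for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (5% positives), for illustration only
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the score of the positive class
scores = model.predict_proba(X_test)[:, 1]

# model.predict(X_test) implicitly uses a 0.5 threshold;
# here the cut-off is chosen explicitly
threshold = 0.3
y_pred = (scores >= threshold).astype(int)

The later snippets in this article reuse scores, y_test, and y_pred from this hypothetical sketch.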

At this point, once the model has been trained, the question is: “Okay, now what threshold value should I use for my model?”. The bad news, which often applies to any complex decision-making situation, is:

There is no optimal threshold value for all situations. It depends on the business need to be met.

Before we go into detail about the different cases of threshold tuning, let’s do a quick refresher on the metrics most commonly used for imbalanced classifications, which are more complex to train and measure.

Let us recall below some basic concepts that are useful for beginners to measure the performance of an imbalanced binary classifier. The same concepts will then be useful in interpreting the optimal threshold values as the cases under analysis vary.

The Precision-Recall tradeoff

Suppose you need to train a binary classification model to detect fraudulent credit card transactions. This is a typical imbalanced problem, in which the number of labels in the class of interest (fraudulent transaction = positive class) is far smaller than the number of labels in the negative class. In this case, the metrics that are often considered in order to evaluate the classifier are Precision and Recall (in the case of a balanced problem, Sensitivity and Specificity are often considered).

Now, the question that needs to be asked is: “Is it more serious to classify a healthy transaction as fraudulent, or a fraudulent transaction as healthy?”. If a check is forced on a healthy transaction, the cost you have is simply that pertaining to the time required for the check. On the other hand, if a fraudulent transaction is misclassified as healthy, the cost of fraud is obviously higher than the previous case.

A fraudulent transaction (positive class) misclassified as healthy (negative class) belongs to the False Negative (FN) set. One metric that increases (leaving the number of TPs constant) when FNs decrease is precisely Recall. In fact, looking at the definition of Recall we have:

Fig. 2 — Definition of Recall (image by the author)
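In formula form, the standard definition is:

Recall = TP / (TP + FN)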

From the above formula, it is evident that if we want to minimize the number of FNs by making it tend to 0, the value of Recall will tend to TP/TP = 1.

So, the first thing that comes to mind is to move the threshold to the value that pushes Recall to 1. Unfortunately, you will find that to have a Recall tending to 1, you must move the threshold very close to 0. But that would mean classifying virtually all transactions as fraudulent! In practice, the number of False Positives (FPs), which do not appear in the Recall formula, grows disproportionately, and the ability of the classifier to distinguish, among all observations predicted as fraudulent, those that actually are fraudulent is nullified. In other words, the value of the metric called Precision drops dramatically toward zero, as can be seen from the formula that defines it:

Fig. 3 — Definition of Precision (image by the author)
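Its formula is:

Precision = TP / (TP + FP)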

It can be seen from the above formula that if FP tends to a number N much larger than TP, Precision will tend to TP/N ≈ 0.

What you have just seen is the famous Precision-Recall tradeoff, which consists of the following:

It isn’t possible to have both high Precision and high Recall, unless you are dealing with a perfect model (FP = FN = 0). For less-than-perfect models, if you increase Precision, you reduce Recall and vice versa.

Below is the Precision-Recall curve of an example model, illustrating the above tradeoff:

Fig. 4 — Precision-Recall curve (image by the author)

In order to account for both Precision and Recall, we can consider the metric given by the harmonic mean of the two, also called F1-score:

Fig. 5 — Definition of F1 (image by the author)
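Spelled out, the harmonic mean is:

F1 = 2 · (Precision · Recall) / (Precision + Recall)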

Should you wish to give more weight to one or the other of the defining metrics, you can generalize the formula by introducing an “unbalance” coefficient β:

Fig. 6 — Definition of F-beta (image by the author)
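The generalized formula is:

F_beta = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)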

For example, if you believe that Recall is twice as important as Precision, you can set β to 2. Conversely, if you believe that Precision is twice as important as Recall, then β will be ½.

Returning to our problem of fraudulent transactions, having established that Recall is much more important than Precision, you could consider β > 1 and optimize the threshold so that F-score takes the maximum value.
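Sticking with the hypothetical scores and y_test defined in the earlier sketch, a simple scan over candidate thresholds that keeps the one maximizing the F2-score could look like this:

import numpy as np
from sklearn.metrics import fbeta_score

thresholds = np.arange(0.0, 1.0, 0.01)
f2_scores = [
    # zero_division=0 avoids warnings at extreme thresholds
    fbeta_score(y_test, (scores >= t).astype(int), beta=2, zero_division=0)
    for t in thresholds
]
best_f2_threshold = thresholds[int(np.argmax(f2_scores))]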

From Precision, Recall and F-score to Matthews Correlation Coefficient

There is an important fact to take into account. Precision and Recall (and, therefore, also the F-score, which is a function of the two) consider the positive class to be the one of interest, answering the following questions:

“Out of all the positive predicted examples (FP + TP), how many positive detections (TP) were correct?” (Precision)

“Out of all actual positive examples (FN + TP), how many positive (TP) were we able to identify?” (Recall)

If you look closely at the above definitions, True Negatives (TN) never appear. This means that, whether our model correctly identifies all the negative observations or none of them, the Precision and Recall values remain the same.

Hence the need to introduce new metrics that also take TNs into account.

Those who have studied some statistics have certainly come across Cramér’s correlation coefficient (Cramér’s V), which measures the strength of association between two categorical variables. Well, in 1975 Brian W. Matthews introduced a correlation coefficient that is a special case of Cramér’s V applied to the 2×2 confusion matrix. We are talking about the Matthews Correlation Coefficient (MCC) defined as follows:

Fig. 7 — Definition of Matthews correlation coefficient (image by the author)
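In formula form:

MCC = (TP · TN - FP · FN) / sqrt((TP + FP) · (TP + FN) · (TN + FP) · (TN + FN))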

Like the usual correlation coefficients, MCC takes values in the range [-1, 1]. In the case of a classification, the value -1 indicates that the model predicts positive and negative classes in exactly the opposite way to the actual values they take on in the target variable. If MCC takes the value 0, the model predicts positive and negative classes no better than random. If it takes the value 1, the model is perfect. Moreover, MCC is invariant if the positive class is renamed negative and vice versa. The main feature of the MCC is the following:

MCC is the only metric that scores high only if the model was able to correctly predict most positive observations and most negative observations.
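A toy example (numbers invented purely for illustration) makes the point: on a dataset where 95% of the observations are positive, a degenerate classifier that always predicts the positive class gets an excellent F1-score, while MCC correctly reports that it has no discriminative power:

import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

y_true_toy = np.array([1] * 95 + [0] * 5)   # 95 positives, 5 negatives
y_pred_toy = np.ones_like(y_true_toy)       # always predict the positive class

print(f1_score(y_true_toy, y_pred_toy))           # about 0.97: looks great
print(matthews_corrcoef(y_true_toy, y_pred_toy))  # 0.0: no real skill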

After this quick recap of the metrics most in use in imbalanced classification problems (which are the most complex to handle), we can return to the main subject of the article, that is how to set the threshold value in these cases.

Identifying the “best value” of the threshold means finding the value that maximizes or minimizes a specific objective function, which measures the goodness of the model and fits the business problem to be solved. Ultimately, it is an optimization problem.

Basically, there are two approaches to optimizing the threshold value:

  • Optimization based on specific metrics
  • Optimization based on costs

In the first case, one selects the metric of interest (e.g., F2-score, or MCC) and identifies the threshold value for which the selected metric indicates the maximum performance of the model (meaning that one must maximize or minimize the metric depending on its nature).

In the second case, on the other hand, the so-called cost matrix is used, thanks to which a cost can be associated with each category of the confusion matrix. In this way, the optimal threshold is given by minimizing the sum of the costs associated with each category.
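As a minimal sketch of this idea (the cost figures are invented, and y_test and y_pred come from the earlier snippet), the total cost for a given set of predictions can be computed directly from the four counts of the confusion matrix:

from sklearn.metrics import confusion_matrix

# Hypothetical cost matrix: cost per observation in each category
cost_per_FP = 10    # e.g., a useless manual check
cost_per_FN = 200   # e.g., the average loss caused by a missed fraud
cost_per_TP = 0
cost_per_TN = 0

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
total_cost = (fp * cost_per_FP + fn * cost_per_FN
              + tp * cost_per_TP + tn * cost_per_TN)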

Once the objective function of interest has been identified, whether it is a metric or a cost, and whether it needs to be maximized or minimized, you can find the optimal threshold value by applying what is known as threshold shifting (a minimal sketch follows the steps below):

  1. Vary the threshold value from 0 to 1, with a step of, for example, 0.01, recording all the values of the objective function for each threshold.
  2. Select the threshold value that maximizes (or minimizes) the function.
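Here is the minimal sketch promised above, applied to cost minimization (still reusing the hypothetical scores and cost figures from the previous snippets):

import numpy as np
from sklearn.metrics import confusion_matrix

thresholds = np.arange(0.0, 1.0, 0.01)
total_costs = []
for t in thresholds:
    pred_t = (scores >= t).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even for degenerate predictions
    tn, fp, fn, tp = confusion_matrix(y_test, pred_t, labels=[0, 1]).ravel()
    total_costs.append(fp * cost_per_FP + fn * cost_per_FN)

best_cost_threshold = thresholds[int(np.argmin(total_costs))]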

However, the value identified in this way is strictly dependent on the training dataset, and it may not be a good estimate of the optimal threshold.

A generalized method, still based on objective-function values obtained from the training dataset, named GHOST (Generalized tHreshOld ShifTing procedure), has recently been developed. In short, it draws N subsets of the training dataset using stratified random sampling, so that they preserve the class distribution. Then, for each subset, it evaluates the objective function at each threshold value. In this way, for each threshold there are N values of the objective function, one per subset. It then computes, for each threshold, the median of these values, obtaining a single “stable” value associated with that threshold. You can find more details in the references (ref. 7).
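The snippet below is only a simplified sketch of that idea, not the reference implementation of GHOST: the subset fraction, the number of subsets, and the use of Cohen's kappa as the objective are assumptions made for illustration (it reuses the model, X_train, and y_train from the first sketch):

import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedShuffleSplit

thresholds = np.arange(0.05, 1.0, 0.05)
n_subsets = 50
splitter = StratifiedShuffleSplit(n_splits=n_subsets, train_size=0.2, random_state=42)

# The method works on the training data only
train_scores = model.predict_proba(X_train)[:, 1]

# kappa_values[i, j] = objective on subset i at threshold j
kappa_values = np.zeros((n_subsets, len(thresholds)))
for i, (subset_idx, _) in enumerate(splitter.split(X_train, y_train)):
    y_sub, s_sub = y_train[subset_idx], train_scores[subset_idx]
    for j, t in enumerate(thresholds):
        kappa_values[i, j] = cohen_kappa_score(y_sub, (s_sub >= t).astype(int))

# Median across subsets gives one "stable" value per threshold
median_kappa = np.median(kappa_values, axis=0)
best_ghost_threshold = thresholds[int(np.argmax(median_kappa))]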

Now that the big picture of an in-depth analysis is clearer, you can see that one of the main difficulties is having the most important metrics at hand as the threshold changes. Only then will you be able to quickly tell when the performance of the model approaches the desired performance based on the selected business criterion.

It was this need that prompted me to develop a new Python package that contains some useful tools for this type of analysis: the binclass-tools package.

Let’s look at some of the most important tools provided by the package.

The Interactive Confusion Matrix

One of the most interesting tools in the package is the Interactive Confusion Matrix, an interactive plot that allows you to see how the most important metrics for a binary classification vary as the threshold changes, including any amounts and costs associated with the categories in the matrix:

Fig. 8 — Live example of an Interactive Confusion Matrix

As can be seen from Figure 8, the threshold value is associated with a slider ranging from 0 to 1, with a user-chosen step (in the example, 0.02). As the slider moves, some of the displayed measurements and metrics change accordingly.

The plot is divided into two parts:

  • At the bottom is the confusion matrix, which highlights both the number of observations that fall into the TP, TN, FP, and FN categories and possible amount or cost measures associated with the individual observations. This is because it might be interesting to see how the total amount associated with each category varies, by summing the amounts of the individual observations taken from a column of the training dataset. Similarly, it might be interesting to see how the total cost associated with each category varies, by summing either an average cost per observation or the values from a list of costs for each observation. These values obviously vary as the threshold changes, and each shows next to it the percentage of the total it represents.
  • At the top are three tables. The first, on the left, contains all the metrics that depend on the threshold value and therefore vary as the slider moves (e.g., Accuracy, F1-score, etc.). The middle table contains all the metrics that are invariant as the threshold changes (e.g., ROC AUC or PR AUC). The third and final table, on the right, appears only if you request it in the call to the function that generates the plot, and contains the threshold values that are optimal with respect to the metrics shown in the left column. The calculation of optimal threshold values is done via GHOST (as described in the previous section). Since this calculation can take a non-negligible amount of time, it is possible to select specific metrics to be optimized, as well as all of them. Currently the metrics that can be optimized are: Cohen’s kappa, Matthews correlation coefficient, ROC curve, F1-score, F2-score, F0.5-score, and Costs.

The convenience of using the Interactive Confusion Matrix for the analysis of the predictions of a binary classifier is unquestionable. Similarly, a plot displaying the trend of amounts or costs as the threshold changes could also be useful.

The Interactive Confusion Line Chart

From the analysis done through the Interactive Confusion Matrix, the analyst may be interested not only in seeing a value, but also in visualizing the trend of possible amounts or costs associated with each category of the confusion matrix as the threshold value changes. That is why the binclass-tools package also allows you to plot the Interactive Confusion Line Chart:

Fig. 9 — Live example of an Interactive Confusion Line Chart

You can observe from Figure 9 that, in addition to a dot on each of the 4 plots representing the amount/cost at the selected threshold value, there are black “diamonds” indicating the threshold values at which the amount and cost curves swap. There can be more than one such swapping point.

Should the analyst wish to focus on total amount or cost values for any combination of categories in the confusion matrix, there is another very useful plot available.

The Interactive Amount-Cost Line Chart

Suppose you want to help a company’s team analyze possible frauds using a fraud detection classifier. Assuming that the classifier labels fraudulent transactions as the positive class, the analyst might make the following points:

  • If the model detects TPs, the transaction amount is “saved” from what could have been fraud, so it can be considered a gain for the company.
  • All those observations that are classified as good but are actually frauds (the FNs) are, for all intents and purposes, losses equal to the total amount of the related transactions. They are therefore costs.
  • The checks that the team has to perform on all those transactions that the model predicts as fraud, but which in fact are not (the FPs), are also costs, albeit smaller in magnitude than those associated with fraud. A fixed cost per inspection is often assumed in these cases.
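Before looking at the dedicated chart, here is a back-of-the-envelope version of this bookkeeping at a single threshold; the transaction amounts and the fixed check cost are invented, and y_test and y_pred come from the earlier snippets:

import numpy as np

# Hypothetical transaction amounts, aligned with y_test
amounts = np.random.default_rng(42).uniform(10, 500, size=len(y_test))
fixed_check_cost = 15  # assumed cost of one manual inspection

is_tp = (y_test == 1) & (y_pred == 1)
is_fn = (y_test == 1) & (y_pred == 0)
is_fp = (y_test == 0) & (y_pred == 1)

gain = amounts[is_tp].sum()                                    # fraud amounts "saved"
loss = amounts[is_fn].sum() + is_fp.sum() * fixed_check_cost   # missed fraud + useless checks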

If the analyst now wants to compare the behavior of what they consider a gain (amount of TPs) with the behavior of what they consider a loss (amount of FNs + fixed cost per FP), they can do so thanks to the Interactive Amount-Cost Line Chart:

Fig. 10 — Example of an Interactive Amount-Cost Line Chart

Also in this plot, the black “diamonds” indicate the first threshold value at which a swap of the amount and cost curves occurs.

At this point, profit and loss analyses based on the classifier can be easily carried out using the above-mentioned plots.

It had been on my mind for some time to put together, in a library, functions that would allow dynamic analysis of the results provided by a binary classifier. The implementation in Python (or R) was always the blocking step, due to my lack of available time. Carrying out this project became more realistic once a colleague committed to helping me develop the above functions. That is why the realization of all these ideas was possible thanks to Greta Villa’s expertise in Python.

That said, I decided to make the project available as open source on GitHub for two reasons:

  • I wanted to share with the entire Data Science community a tool that makes the analysis of binary classifiers easier.
  • I count on help from the community to suggest new features, improve the existing code, or help us develop future versions of the package.

The binclass-tools package is published on PyPI, so you just need the following line of code to install it in your Python environment:

pip install binclass-tools

For more details on how to use the functions contained in the package, refer to its GitHub page here:

In future releases, we intend to add interactive ROC and Precision-Recall plots, as well as wrappers that simplify model calibration.

Any feedback on the package is welcome!

  1. How to determine the quality and correctness of classification models? Part 2 — Quantitative quality indicators
  2. What Is Balanced And Imbalanced Dataset?
  3. Tour of Evaluation Metrics for Imbalanced Classification
  4. Precision-Recall Tradeoff in Real-World Use Cases
  5. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation
  6. The Matthews Correlation Coefficient (MCC) is More Informative Than Cohen’s Kappa and Brier Score in Binary Classification Assessment
  7. GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning

