
Essential Evaluation Metrics for Classification Problems in Machine Learning | by Aaron Zhu | Mar, 2023



A Comprehensive Guide to Understanding and Evaluating the Performance of Classification Models

Photo by Markus Winkler on Unsplash

Are you confused about the terms used in evaluating the performance of machine learning models? Do you get lost in the sea of confusion when you come across the terms confusion matrix, precision, recall, specificity and sensitivity? Well, worry no more, because in this blog post we will dive deep into these evaluation metrics and help you make sense of these terms.

What is a Confusion Matrix?

Let’s start with the confusion matrix. It is a table that is used to evaluate the performance of a classification model. It contains four values: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). A true positive is when the model correctly predicts the positive class, a true negative is when the model correctly predicts the negative class, a false positive is when the model predicts the positive class but it is actually negative, and a false negative is when the model predicts the negative class but it is actually positive.

Image by author
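For a concrete illustration, here is a minimal sketch of obtaining these four counts with scikit-learn's confusion_matrix; the y_true and y_pred labels below are made up for the example.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = positive, 0 = negative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# For binary 0/1 labels, scikit-learn orders the flattened matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=3, FP=1, FN=1
```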

What is Accuracy?

Accuracy is a measure of how well a model is performing overall. It is the proportion of correct predictions made by the model out of all the predictions made. In other words, it is the number of true positives and true negatives divided by the total number of predictions.

Accuracy = (TP+TN)/(TP+TN+FP+FN)
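Continuing the small example above, accuracy can be computed directly from the four counts and cross-checked with scikit-learn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

accuracy = (tp + tn) / (tp + tn + fp + fn)        # (3 + 3) / 8 = 0.75
print(accuracy, accuracy_score(y_true, y_pred))   # both 0.75
```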

What is Precision?

Precision is a measure of how accurate the positive predictions of a model are. It is calculated as the ratio of true positives to the sum of true positives and false positives. A high precision indicates that when the model predicts the positive class, it is usually correct.

Precision = TP / (TP + FP)

What is Recall?

Recall is a measure of how well the model is able to identify the positive class. It is calculated as the ratio of true positives to the sum of true positives and false negatives. A high recall indicates that the model is able to identify most of the positive instances.

Recall = TP / (TP + FN)
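Using the same toy counts, precision and recall follow the same pattern; precision_score and recall_score give matching numbers:

```python
from sklearn.metrics import precision_score, recall_score

precision = tp / (tp + fp)   # 3 / (3 + 1) = 0.75
recall = tp / (tp + fn)      # 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))   # 0.75 0.75
```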

What is Specificity?

Specificity is a measure of how well the model is able to identify the negative class. It is calculated as the ratio of true negatives to the sum of true negatives and false positives. A high specificity indicates that the model is able to identify most of the negative instances.

Specificity = TN / (TN + FP)

You can think of specificity as recall with the positive and negative labels swapped, i.e., recall computed as if the negative class were the positive class.
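In code, this means specificity can be obtained by asking recall_score to treat the negative label as the positive one via pos_label=0 (a minimal sketch continuing the toy example above):

```python
from sklearn.metrics import recall_score

specificity = tn / (tn + fp)                        # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred, pos_label=0))    # 0.75, the same value
```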

What is Sensitivity?

Sensitivity is another term used for recall, especially in medical contexts where it refers to the ability of a medical test to detect a disease or condition in people who actually have the disease or condition.

Sensitivity = TP / (TP + FN)

Summary Table:

Image by author

When is Precision-Recall a better measure than Accuracy?

Precision-Recall is used instead of accuracy when the data is imbalanced, meaning there are significantly more samples of one class than the other.

In such cases, accuracy can be misleading, as a model can achieve high accuracy by simply predicting the majority class. For example, in a binary classification problem with 90% of the samples belonging to the negative class, a model that always predicts negative will have an accuracy of 90%, even if it is not making any correct positive predictions.

When the data is imbalanced, a model can therefore have very high accuracy and still be useless, which would show up as low precision and recall values.

Precision and recall are better measures of a model’s performance in such cases because they focus on true positives, false positives, and false negatives, which are the quantities that matter most in imbalanced datasets, and they are not inflated by the large number of true negatives. Precision measures the proportion of true positive predictions out of all positive predictions made, while recall measures the proportion of true positive predictions out of all actual positive samples.

Precision-Recall is more suitable for evaluating the performance of a model in imbalanced datasets, while accuracy is more appropriate when the classes are balanced.
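A tiny simulation of the 90%-negative example above makes the point: a classifier that always predicts the negative class scores 90% accuracy yet has zero precision and recall (the data here is synthetic).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1] * 10 + [0] * 90   # 10 positive samples, 90 negative samples
y_pred = [0] * 100             # a "model" that always predicts the majority (negative) class

print(accuracy_score(y_true, y_pred))                    # 0.9
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 (no positive predictions at all)
print(recall_score(y_true, y_pred))                      # 0.0 (no actual positives found)
```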

What is Precision-Recall Trade-off?

Ideally, we would like to have both high precision and high recall for our model, but achieving both simultaneously is often not possible.

The Precision-Recall trade-off arises because optimizing one metric often comes at the expense of the other.

Here is why:

  • If a model is more conservative in its predictions, it may achieve a higher precision by reducing the number of false positives, but this may also result in a lower recall, since it may miss some true positive instances.
  • Conversely, suppose a model is more aggressive in its predictions. In that case, it may achieve a higher recall by capturing more true positive instances, but this may also result in a lower precision, since it may make more false positive predictions. (A short sketch of this threshold effect follows this list.)
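The sketch below illustrates this behavior by sweeping the decision threshold over made-up predicted probabilities: as the threshold rises, precision tends to rise while recall falls.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])                          # made-up labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.55, 0.9, 0.65, 0.3])  # made-up scores

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)   # more aggressive at low thresholds
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.3  precision=0.56  recall=1.00
# threshold=0.5  precision=0.80  recall=0.80
# threshold=0.7  precision=1.00  recall=0.40
```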

Therefore, to determine whether we should prioritize the precision or recall value, we need to evaluate the cost of false positives and false negatives.

In case 1, a medical test is designed to detect a disease in people.

  • The cost of false negative cases might be — patients who are sick don’t receive the right treatment, which might cause more people to be infected if the disease is contagious.
  • On the other hand, the cost of false positive cases might be — wasting resources treating healthy people and quarantining them unnecessarily.

Therefore, the cost of false negative cases is much higher than the cost of false positive cases. In this case, paying more attention to the recall value makes more sense.

Image by author

In case 2, a bank designs an ML model to detect credit card fraud.

  • The cost of false negative cases might be — the bank loses money on the fraudulent transactions.
  • The cost of false positive cases might be — the false fraud alert hurts the customers’ experience, which causes a decrease in customer retention.

Therefore, the cost of false positive cases is much higher than the cost of false negative cases. According to one study of credit card fraud, false positives cost 13 times more in lost income than true fraud does. In this case, paying more attention to the precision value makes more sense.

Image by author

What is F1-Score?

In the above examples, we prioritize either recall or precision at the expense of the other. But there are also many situations where recall and precision are equally important. In such cases, we should use another measure, the F1 score.

The F1 score takes into account both precision and recall, and provides a single score that summarizes the model’s overall performance. It ranges from 0 to 1, with a score of 1 indicating perfect precision and recall.

The formula for calculating the F1 score is:

F1 = 2TP / (2TP + FP + FN)

It can also be calculated as the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
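As a quick check, here is a minimal sketch computing F1 both by hand and with scikit-learn's f1_score on the earlier toy labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)          # 0.75
r = recall_score(y_true, y_pred)             # 0.75
f1_manual = 2 * p * r / (p + r)              # 0.75
print(f1_manual, f1_score(y_true, y_pred))   # both 0.75
```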

What are Receiver Operating Characteristic Curve (ROC Curve) and Area Under the Curve (AUC)?

A Receiver Operating Characteristic (ROC) Curve is a graphical representation of the performance of a binary classification model that predicts the probability of an event occurring. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

Let’s consider an example of a binary classification problem where we want to predict whether an email is spam or not spam. Let’s say we have a model that predicts the probability that an email is spam, and we want to use an ROC curve to evaluate its performance.

To create the ROC curve, we need to set a threshold value (i.e., from 0 to 1) for the predicted probability above which we classify an email as spam, and below which we classify it as not spam. The threshold value is a decision boundary that determines the trade-off between the true positive rate (TPR) and the false positive rate (FPR).

For example, if we set the threshold to 0.5, then any email with a predicted probability greater than 0.5 would be classified as spam, and any email with a predicted probability less than or equal to 0.5 would be classified as not spam. This threshold value would give us a certain TPR and FPR.

In general, as the threshold increases, TPR and FPR decrease. In the most extreme case, when the threshold value is 0, all predicted values are positive, therefore, TPR=FPR=1. Conversely, when the threshold value is 1, all predicted values are negative, therefore, TPR=FPR=0.

Suppose that, for a given dataset, we calculate the TPR and FPR for various threshold values, and we get the following results:

Image by author

We can plot these values on an ROC curve, with TPR on the y-axis and FPR on the x-axis, as shown below:

Image by author

As we can see from the plot, the ROC curve is a trade-off between the TPR and FPR for different threshold values.

The area under the curve (AUC) measures the overall performance of the model, with an AUC of 1 indicating perfect performance, and an AUC of 0.5 indicating random guessing (i.e., the diagonal line which represents a classifier making random predictions).
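In practice, the curve and its area are rarely computed by hand. A minimal sketch with scikit-learn's roc_curve and roc_auc_score, using made-up labels and predicted probabilities, looks like this:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.55, 0.9, 0.65, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # one (FPR, TPR) point per threshold
print("AUC-ROC =", round(roc_auc_score(y_true, y_prob), 3))

# Optional plotting (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.plot(fpr, tpr); plt.plot([0, 1], [0, 1], "--")
# plt.xlabel("FPR"); plt.ylabel("TPR"); plt.show()
```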

Key takeaways of ROC:

  • ROC works better for comparing models when the classes are balanced and the costs of false positives and false negatives are similar. The higher the AUC-ROC, the better the model.
  • From the ROC curve, we can pick a threshold value suited to how the classifier will be applied. If the costs of false positives and false negatives are similar, the threshold closest to the upper-left corner of the curve is a good choice (a sketch of this follows after this list). If false positives are more costly, we can pick a higher threshold; conversely, if false negatives are more costly, we can pick a lower threshold.
  • ROC is threshold-invariant. It measures the performance of a model across a range of thresholds. It means we don’t need to determine a threshold using ROC in advance, unlike precision, recall, accuracy, and F1 score which are based on a specific threshold.
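When the costs of false positives and false negatives are similar, one simple way to find the threshold closest to the upper-left corner is to maximize Youden's J statistic (TPR minus FPR), continuing the roc_curve sketch above:

```python
import numpy as np

# fpr, tpr, thresholds come from the roc_curve call in the previous sketch
j_scores = tpr - fpr                   # Youden's J statistic for each candidate threshold
best_idx = int(np.argmax(j_scores))
print("suggested threshold:", thresholds[best_idx])
```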

What is Precision-Recall Curve (PRC)?

In contrast to the ROC curve, which plots TPR against FPR, the Precision-Recall Curve (PRC) plots precision on the y-axis and recall on the x-axis. The PRC shows how well the model can identify positive cases while avoiding false positives.

The area under the PRC can measure the performance of a model. The higher the AUC-PRC, the better the model. A perfect model has an AUC-PRC of 1.0, while a random classifier's AUC-PRC equals the proportion of positive samples in the data (unlike ROC, the baseline is not 0.5 unless the classes are balanced).
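Here is a minimal sketch of the PRC and its summary score with scikit-learn, on made-up, imbalanced labels and probabilities; average_precision_score is a common way to summarize the area under the PRC:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 0, 1, 0, 0, 0, 1, 1, 0])                          # 3 positives out of 10
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.2, 0.7, 0.9, 0.65, 0.3])   # made-up scores

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
print("AP =", round(average_precision_score(y_true, y_prob), 3))
print("random-classifier baseline =", y_true.mean())   # the positive-class prevalence (0.3 here)
```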

In general, as the threshold increases, precision tends to increase and recall decreases. In the most extreme case, when the threshold value is 0, all predictions are positive, so recall = 1 and precision equals the proportion of positive samples. Conversely, when the threshold value is 1, all predictions are negative, so recall = 0 and precision is undefined (by convention it is often reported as 1).

Suppose that, for a given dataset, we calculate the precision and recall for various threshold values, and we get the following results:

Image by author
Image by author

Key takeaways of PRC:

  • The choice between ROC and PRC curves depends on the problem at hand. The ROC curve is useful when the classes are balanced and the cost of false positives and false negatives are similar. The PRC curve is useful when the classes are imbalanced or the cost of false positives and false negatives are different.
  • By looking at the PRC, you can choose an optimal threshold that balances precision and recall according to your specific use case.

We’ve covered many evaluation metrics for classification problems. These metrics are interrelated, and each has its strengths and weaknesses in measuring a model’s performance. Overall, understanding these metrics is crucial in developing effective machine learning models and making informed decisions based on their predictions.

If you would like to explore more posts related to Statistics, please check out my articles:

If you enjoy this article and would like to Buy Me a Coffee, please click here.

You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.

