Classification Metrics: The Complete Guide For Aspiring Data Scientists | by Federico Trotta | May, 2023


The first metric we take into account is accuracy. Let’s see the formula:

Accuracy = (number of correct predictions) / (total number of predictions)

So, accuracy is a measure of how often our ML model is correct in its predictions.

For example, let’s say we have a dataset of emails that are labeled as either spam or not spam. We can use ML to predict whether new emails are spam or not. If the model correctly predicts that 80 out of 100 emails are spam, and correctly predicts that 90 out of 100 emails are not spam, then its accuracy would be:

Accuracy = (80 + 90) / 200 = 0.85

This means that our model is able to correctly predict the class of an email 85% of the time. A high accuracy score (near 1) indicates that the model is performing well, while a low accuracy score (near 0) indicates that the model needs to be improved. However, accuracy alone may not always be the best metric to evaluate a model’s performance, especially in imbalanced datasets.
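In code, the same calculation is just one line (the counts below are the made-up ones from the email example above, not real data):

correct_spam = 80       # spam emails correctly classified
correct_not_spam = 90   # non-spam emails correctly classified
total_emails = 200      # total number of emails

accuracy = (correct_spam + correct_not_spam) / total_emails
print(accuracy)  # 0.85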

This is understandable because the prevalent class has more samples labeled with it, so a model can score a high accuracy simply by predicting the prevalent class most of the time. In other words, our model may be biased toward the prevalent class.

Let’s work through an example in Python, creating a dataset for this purpose:

import numpy as np
import pandas as pd

# Random seed for reproducibility
np.random.seed(42)

# Create samples
n_samples = 1000
fraud_percentage = 0.05  # Fraction of fraudulent transactions

# Create features and (imbalanced) binary labels
X = np.random.rand(n_samples, 10)
y = np.random.binomial(n=1, p=fraud_percentage, size=n_samples)

# Create data frame
df = pd.DataFrame(X)
df['fraudulent'] = y

We have created a simple data frame with 1000 samples that could represent, for example, credit card transactions. We have then created a label for fraudulent transactions, which make up about 5% of all the observations. So, this dataset is clearly imbalanced.

If our model appears accurate, it is largely because it is biased by the roughly 95% of observations that belong to the non-fraudulent class. So let’s split the dataset, make predictions with a Logistic Regression model, and print the accuracy:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit logistic regression model to train set
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

>>>
Accuracy: 0.95

So, our model is 95% accurate: hooray! Now…let’s define the other metrics and see what they tell us about this dataset.
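Before moving on, it’s worth comparing that 95% with a trivial baseline that always predicts the majority class. This is only a sketch, reusing the train/test split from above together with scikit-learn’s DummyClassifier:

from sklearn.dummy import DummyClassifier

# Baseline that always predicts the most frequent class (non-fraud)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

print('Baseline accuracy:', baseline.score(X_test, y_test))
# Roughly 0.95 as well: the logistic regression is doing no better
# than always guessing "non-fraudulent"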

Precision measures the ability of a classifier to not label as positive a sample that is negative. In other words, it measures the fraction of true positives among all positive predictions. Simplifying, precision tells us how accurate the positive predictions of our model are. Here’s the formula:

Precision = TP / (TP + FP), where TP is the number of true positives and FP the number of false positives.

Considering an email spam classification problem, precision measures how many of the emails that the model classified as spam are actually spam.

Let’s use it in our imbalanced dataset:

from sklearn.metrics import precision_score

# Calculate and print precision
precision = precision_score(y_test, y_pred)

print('Precision:', precision)

>>>

Precision: 0.0

Ouch! 95% accuracy and 0% precision: what does it mean? It means that the model is predicting every sample as negative, that is, non-fraudulent, which is of course useless for catching fraud. In fact, a high precision score would indicate that the model is correctly identifying a high proportion of fraudulent transactions among all transactions it predicts as fraudulent.

Then, we have the recall metric that measures the fraction of true positives among all actual positives. In other words, it measures how many of the actual positives are correctly predicted. Simplifying, recall tells us how well our model is able to find all the positive instances in our data. Here’s the formula:

Recall = TP / (TP + FN), where FN is the number of false negatives.

Considering an email spam classification problem, recall measures how many of the actual spam emails in the dataset are correctly identified as spam emails by our ML classifier.

Let’s say that we have a dataset of 1000 emails, where 200 of them are spam and the rest are legitimate. We train a machine learning model to classify emails as spam or not spam, and it predicts that 100 of the emails are spam.

Precision would tell us how many of those 100 predicted spam emails are actually spam. For example, if 90 out of the 100 predicted spam emails are actually spam, then the precision would be 90%. This means that out of all emails that the model predicted as spam, 90% of them are actually spam.

Recall, on the other hand, tells us how many of the actual spam emails the model correctly identified as spam. For example, if out of the 200 actual spam emails, the model correctly identified 150 of them as spam, then the recall would be 75%. This means that out of all actual spam emails, the model correctly identified 75% of them as spam.
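Making those two worked examples explicit in code (the counts are the hypothetical ones from the spam scenario above):

# Hypothetical counts from the spam example above
predicted_spam = 100    # emails the model flagged as spam
true_positives = 90     # flagged emails that really are spam
actual_spam = 200       # spam emails in the dataset
found_spam = 150        # spam emails the model correctly caught

precision = true_positives / predicted_spam  # 90 / 100 = 0.90
recall = found_spam / actual_spam            # 150 / 200 = 0.75
print('Precision:', precision, 'Recall:', recall)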

Now, let’s use recall in our imbalanced dataset:

from sklearn.metrics import recall_score

# Calculate and print recall
recall = recall_score(y_test, y_pred)

print('Recall:', recall)

>>>

Recall: 0.0

Again: we have 95% accuracy and 0% recall. What does it mean? As before, it means that the model is not correctly identifying any fraudulent transactions, and is instead predicting all transactions as non-fraudulent. In fact, a high recall score would indicate that the model is correctly identifying a high proportion of fraudulent transactions among all actual fraudulent transactions.

So, in practice, we want to achieve a balance between precision and recall that depends on the problem we’re studying. To do so, we often refer to two other tools that consider both of them: the F1-score and the confusion matrix. Let’s see them.

The F1-score is an evaluation metric in Machine Learning that combines precision and recall into a single value in the range 0–1. An F1-score close to 0 indicates low performance, while an F1-score close to 1 indicates high performance.

This metric balances precision and recall by calculating their harmonic mean. This is a type of average that is more sensitive to low values, and this is why this metric is particularly suitable for imbalanced datasets.
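A tiny numerical sketch (with made-up precision and recall values) shows why the harmonic mean is a sensible choice here: if one of the two values collapses, the harmonic mean collapses with it, while the arithmetic mean can still look decent.

precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2                     # 0.5
harmonic_mean = 2 * precision * recall / (precision + recall)  # 0.18

print(arithmetic_mean, harmonic_mean)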

Let’s see its formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Now, we already know the result we’ll get for our imbalanced dataset (the F1-score will be 0, since both precision and recall are 0). But let’s see how to use it in Python:

from sklearn.metrics import f1_score

# Calculate and print f1-score
f1 = f1_score(y_test, y_pred)

print('F1 score:', f1)

>>>

F1 score: 0.0

In the context of a spam classifier, let’s say we have a dataset of 1000 emails, where 200 of them are spam and the rest are legitimate. We train a machine learning model to classify emails as spam or not spam, and it predicts that 100 of the emails are spam.

To calculate the F1-score of the spam classifier, we first need to calculate its precision and recall. Let’s say that out of the 100 predicted spam emails, 80 are actually spam. So, the precision is 80%. Also, let’s say that out of the 200 actual spam emails, the model correctly identified 150 of them as spam. So, the recall is 75%.

Now we can calculate the f1-score:

F1 = 2 × (0.80 × 0.75) / (0.80 + 0.75) ≈ 0.77

This is a pretty good result, as it’s fairly close to 1.
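For completeness, here is the same calculation spelled out in Python:

precision, recall = 0.80, 0.75

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # about 0.774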

The confusion matrix is a table that summarizes the performance of a classification model by showing the number of true positives, false positives, true negatives, and false negatives.

In a binary classification problem, the confusion matrix has two rows and two columns and it’s displayed like so:

                      Predicted negative     Predicted positive
  Actual negative     TN (true negative)     FP (false positive)
  Actual positive     FN (false negative)    TP (true positive)

(Rows hold the actual classes and columns the predicted classes; with the negative class listed first, this matches the layout scikit-learn uses.)

Using the spam email classification example, let’s say that our model predicted 100 emails as spam, out of which 80 were actually spam, and 900 emails as not spam, out of which 20 were actually spam.

The confusion matrix for this example would look like this:

                      Predicted not spam     Predicted spam
  Actual not spam     TN = 880               FP = 20
  Actual spam         FN = 20                TP = 80

Now, this is a very useful visualization tool for classification, for two reasons:

  1. It can help us calculate precision and recall visually (see the short sketch below)
  2. It immediately tells us what matters, without any calculations. What we want in a classification problem, in fact, is for TN and TP to be as high as possible while FP and FN are as low as possible (as close to 0 as possible). So, if the values on the main diagonal are high and the values in the other positions are low, then our ML model has good performance.

This is the reason why I love the confusion matrix: we just need to look at the main diagonal (from top-left to bottom-right) and the off-diagonal values to evaluate the performance of an ML classifier.
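For instance, using the hypothetical spam matrix above (TP = 80, FP = 20, FN = 20, TN = 880), precision and recall fall straight out of the table:

# Hypothetical counts from the spam confusion matrix above
tp, fp, fn, tn = 80, 20, 20, 880

precision = tp / (tp + fp)  # 80 / 100 = 0.80
recall = tp / (tp + fn)     # 80 / 100 = 0.80
print('Precision:', precision, 'Recall:', recall)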

Considering our imbalanced dataset, we obtained 0 for precision and recall and we said that it means that the model is not correctly identifying any fraudulent transactions, and is instead predicting all transactions as non-fraudulent.

This can be hard to picture from the formulas of precision and recall alone: we have to keep them clearly in mind. Since that kind of mental visualization is not easy, let’s apply the confusion matrix to our example and see what happens:

from sklearn.metrics import confusion_matrix

# Calculate and print confusion matrix
cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix:\n', cm)

>>>

Confusion matrix:
 [[285   0]
 [ 15   0]]

See what happens? We can clearly say that our model is not performing well: while it captures 285 TNs, it captures 0 TPs and misses all 15 actual frauds! That’s the visual power of the confusion matrix!

There is also another way to display the confusion matrix, and I really love it because it improves the visualization experience. Here’s the code:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot and show the confusion matrix
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()
plt.show()

The visualization of our confusion matrix. Image by Author.

This kind of visualization is very useful in the case of multi-class classification problems. Let’s see one example:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Generate random data with 3 classes
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_clusters_per_class=1, n_informative=5,
                           class_sep=0.5, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train a logistic regression model on the training data
clf = LogisticRegression(random_state=42).fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=['Class 0', 'Class 1', 'Class 2'])
disp.plot()
plt.show()

The visualization of our confusion matrix for a three-class problem. Image by Author.

In these cases it is not easy to tell what the TPs, TNs, and so on are, because we have three classes. Still, we can simply compare the values on the main diagonal with the off-diagonal ones. In this case, the main diagonal holds 49, 52, and 44, values much higher than the off-diagonal ones, telling us that this model is performing well (also note that we’ve calculated the confusion matrix on the test set!).
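If we want per-class numbers instead of eyeballing the diagonal, scikit-learn’s classification_report prints precision, recall, and F1 for each class. A short sketch, reusing y_test and y_pred from the snippet above:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the three-class model above
print(classification_report(y_test, y_pred,
                            target_names=['Class 0', 'Class 1', 'Class 2']))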

There are a couple of metrics that, in my opinion, are more suitable in some particular cases: sensitivity and specificity. Let me define them, and then we’ll discuss when they are useful.

Sensitivity is the ability of a classifier to find all the positive samples:

Sensitivity = TP / (TP + FN)

Wait a second! But isn’t it the recall?!?

Yes, it is. It’s not a mistake. This is why I’m telling you that these metrics are more suitable for particular cases. But let me go on.

We define specificity as the ability of a classifier to find all the negative samples:

Specificity = TN / (TN + FP)

So both of them describe how trustworthy a test is: sensitivity is the probability that the test comes back positive when the condition is actually present, while specificity is the probability that it comes back negative when the condition is absent.

In my experience, these metrics are more suitable for classifiers used in the medical field, biology, and so on.
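As far as I know, scikit-learn does not ship a dedicated specificity function, but for a binary problem both metrics can be read directly off the confusion matrix. A minimal sketch, with made-up labels and predictions just for illustration:

import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up binary labels and predictions, just for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0])

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # identical to recall
specificity = tn / (tn + fp)
print('Sensitivity:', sensitivity, 'Specificity:', specificity)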

For example, let’s consider a COVID test. Consider this approach (which can be considered Bayesian, but let’s skip that): you take a COVID test and the result is positive. Question: how likely is the test to come back positive when you really are infected? And how likely is it to come back negative when you are not?

In other words: what are the sensitivity and the specificity of the tool you used to get the result?

Well, you may ask yourself: what kind of question are you asking, Federico?

Let me share an example I lived through last summer.

Here in Italy, a positive COVID test had to be certified by someone (let’s skip the reasons for that): typically a hospital or a pharmacy. So, when we had symptoms, what we generally did was test for COVID at home (a 3–5€ test), then go to a pharmacy to confirm (a 15€ test).

Last July I had symptoms after my wife and daughters tested positive. I tested at home and the result was positive. Then I immediately went to the pharmacy to confirm, and…the result was negative!

How is that possible? Easy: the test I used at home was more sensitive than the one used by the pharmacist (or, conversely, the pharmacist’s test was more specific than mine).

So, in my experience, these metrics are particularly suitable for measuring instruments of any kind (mechanical, electrical, and so on) and/or for particular fields (like biology and medicine). Also, remember that these metrics are built from TP, TN, FP, and FN, just like precision and recall: this stresses again that they are most suitable for binary classification problems.

Of course, I’m not telling you that sensitivity and specificity must be used only in the above-mentioned cases. They’re just more suitable, in my experience.

Log loss — sometimes called cross-entropy — is an important metric in classification, and is based on probability. This score compares the predicted probability for each class to the actual class labels.

Let’s see the formula:

Log Loss = −(1/n) × Σᵢ [ yᵢ × ln(pᵢ) + (1 − yᵢ) × ln(1 − pᵢ) ]

Where we have:

  • n is the total number of observations, and i is a single observation.
  • y is the true value.
  • p is the predicted probability.
  • ln is the natural logarithm.

To calculate the predicted probability p, we need to use an ML model that can actually calculate probabilities, like Logistic Regression, for example. In this case, we need to use the predict_proba() method like so:

from sklearn.linear_model import LogisticRegression

# Instantiate the logistic regression model
model = LogisticRegression()
# Fit the model on the train set
model.fit(X_train, y_train)

# Calculate class probabilities for new samples
# (X_new stands for the data to score, e.g. X_test from the split above)
y_prob = model.predict_proba(X_new)

So, suppose we have a binary classification problem, we calculate the probabilities via the Logistic Regression model, and the following table represents our results:

A table showing actual labels and probabilities calculated via the Logistic Regression model. Image by Author.

The calculation we’d perform to obtain the Log Loss is as follows:
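As a sketch of that calculation, with made-up labels and predicted probabilities (not the ones in the table above), we can compute the Log Loss both by hand with the formula and with scikit-learn’s log_loss, and check that the two results match:

import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities of the positive class,
# just to illustrate the formula
y_true = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Manual calculation with the formula above
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Same calculation with scikit-learn
sk = log_loss(y_true, p)

print(manual, sk)  # the two values match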

