
Mastering the Art of Regression Analysis: 5 Key Metrics Every Data Scientist Should Know

by Federico Trotta | Feb 2023



Image created by Author on Dall-E by the prompt “A futuristic robot teaching math at a blackboard”.

In the case of Supervised Learning, we can subdivide the ML problems into two subgroups: regression and classification problems.

In this article, we’ll discuss the five metrics we use in regression analysis to understand whether a model is good or bad at solving a particular ML problem.

But, first of all, let’s refresh what regression analysis is.

Regression analysis is a mathematical technique used to find a functional relationship between a dependent variable and one or more independent variables.

In ML we call the independent variables the “features” and the dependent variable the “label” (or “target”), so the aim of regression analysis is to find a good estimate of the relationship between the features and the label.

Table Of Contents

The residuals
1. The mean squared error (MSE)
2. The root mean square error (RMSE)
3. The mean absolute error (MAE)
4. The Coefficient of Determination (R²)
5. The adjusted R²
Calculating all the Metrics in Python

Before talking about the metrics, we need to talk about the residuals.

For the sake of simplicity, let’s consider the linear regression model (but the results can be generalized for any other ML model).

So, suppose we have a dataset where the data are distributed roughly linearly. We typically find a situation like the following:

A regression line. Image by Author.

The red line is called the regression line, and it is the line we use to make our predictions. As we can see, the data points do not lie perfectly on the regression line; so we define the residuals as the errors between the regression line (the predictions) and the actual data points, measured in the vertical direction.

So, with respect to the above image, we mathematically define a residual as:

The definition of a residual. Image by Author.
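In symbols, for the i-th data point:

e_i = y_i - \hat{y}_i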

What we would like is e_i = 0 for every point, because that would mean all the data points lie exactly on the regression line. Unfortunately, this is generally not possible, and this is why we use the following metrics to validate our ML models in the case of a regression problem.

We define “hat” y as the fitted or predicted value (fitted/predicted by the model: in this case, the linear regression model), while y is the true value. So, the predicted values can be calculated as:

How to calculate the predicted values. Image by Author.
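For the simple linear regression model discussed here, this is:

\hat{y}_i = w\,x_i + b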

where the coefficients w (called the weight) and b (called the bias or constant) are estimated values, meaning they are learned by the ML model during the training process.

This knowledge is important because now we can define the Residual Sum of Squares (RSS) as:

The formula for the Residuals Sum of Squares. Image by Author.
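That is, the sum of the squared residuals:

\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2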

Now, if we substitute inside the parenthesis the formula for the predicted values we’ve seen before we get:

The extended formula for the Residuals Sum of Squares. Image by Author.
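Which gives:

\mathrm{RSS} = \sum_{i=1}^{n} (y_i - w\,x_i - b)^2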

Where the estimated coefficients w and b are the ones that minimize the RSS.

In fact, we have to remember that the process of learning requires that the chosen metrics (also called cost functions or loss functions) must be minimized.

In mathematics, minimizing a function means calculating its derivative and setting it equal to 0. So, we need to solve something like this:

The derivative of the RSS function with respect to w. Image by Author.

and

The derivative of the RSS function with respect to b. Image by Author.
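In other words, we solve the system:

\frac{\partial\,\mathrm{RSS}}{\partial w} = 0 \qquad \text{and} \qquad \frac{\partial\,\mathrm{RSS}}{\partial b} = 0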

We won’t do the calculations here; the results are:

The values that minimize the RSS function Image by Author.
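For the record, these are the standard ordinary least squares estimates:

w = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b = \bar{y} - w\,\bar{x}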

Where, in the above formula, \bar{x} and \bar{y} (x and y with a “bar” above) are the mean values, calculated as:

The mean value of x (it also applies to y). Image by Author.
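That is:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

(and similarly for \bar{y}).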

Now, with all this in mind, we’ll define and calculate the five metrics.

We’ll use 5 numbers in a table to show the differences between the various metrics. The table contains:

  • The true values.
  • The predicted values (the values predicted by the linear regression model).

The table we’ll refer to for the following calculations. Image by Author.

NOTE: consider these data as calculated on the train set. In the following calculations we’ll take for granted that we refer just to the train set, and we won’t discuss the test set.

We define the mean squared error (MSE) as follows:

The definition of the MSE. Image by Author.
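In formula form, consistent with the residuals defined above:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2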

Where n is the number of observations, i.e., how many values in total we have. In our case, since our table has just 5 values, n=5.

The MSE measures the average squared difference between the predicted and the actual values. In other words, it tells us how far our predictions are from the actual values, on average.

Let’s calculate it, with respect to the tabled values:

The calculation of MSE with the given numbers. Image by Author.

And we get: MSE = 51.2

The root mean square error (RMSE) is simply the square root of the MSE; so its formula is:

The definition of the RMSE. Image by Author.
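That is:

\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}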

Now, let’s consider the values in the above table, and calculate the RMSE:

The calculation of RMSE with the given numbers. Image by Author.
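Since we already found MSE = 51.2 on these values, this gives:

\mathrm{RMSE} = \sqrt{51.2} \approx 7.15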

There is not a big difference between MSE and RMSE. They refer to the same quantities, and the only mathematical difference is that RMSE has a square root. However, RMSE is easier to interpret, as it is in the same units as the input values (the predicted and the true values), so it is more directly comparable to them.

Let’s look at an example to understand this.

Imagine that we have trained a linear regression model to predict the price of a house based on its size and number of bedrooms. We calculate the values of the MSE and RMSE and compare them.

Suppose the model predicts that a house with 1000 square feet and 2 bedrooms will have a price of 200,000 USD. However, the actual price of the house is 250,000 USD. We’ll have:

MSE for the price of the house (n=1 in this case because we calculated just one value). Image by Author.

and

RMSE for the price of the house (n=1 in this case because we calculated just one value). Image by Author.
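With n = 1, the two values work out to:

\mathrm{MSE} = (250{,}000 - 200{,}000)^2 = 2.5 \times 10^9 \ \mathrm{USD}^2 \qquad \mathrm{RMSE} = \sqrt{2.5 \times 10^9} = 50{,}000 \ \mathrm{USD}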

So, here’s the point: RMSE is directly comparable with the input data, while MSE is not: how would we interpret USD² as a unit of measure? It is not intuitive, even though it is the mathematically correct unit!

So, this is the difference between these two metrics.

The mean absolute error (MAE) is another way to calculate the distance between the actual data point and the predicted one. Its formula is:

The definition of the MAE. Image by Author.
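Written out:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|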

Here the distance between the actual and the estimated values is calculated with the absolute value (the L1 norm, sometimes called the “Manhattan distance”).

As we can see from the formula, MAE is also in the same units as the input values, so it is easy to interpret.

Now, let’s consider the values in the table, and calculate MAE:

The calculation of MAE with the given numbers. Image by Author.

And we get: MAE = 5.6.

Now, before explaining the other two metrics, we have to say something about the three above.

We have seen that MAE and RMSE are more easily interpretable than MSE because the results have the same units as the input data, but this is not the only thing we can say.

One other thing to note is that a value near 0 for any of these metrics indicates that the model’s predictions are close to the actual values; in other words, the model predicts the data pretty well.

Instead, values far from 0 indicate that the model’s predictions are far from the actual values; in other words, the model predicts the data badly.

Another thing we can say is that MSE and RMSE are sensitive to outliers, because they are based on the squared differences between the predicted and true values. When there are a few large errors between the actual and the predicted values, the squared errors become very large, and this significantly affects MSE and RMSE. In these cases, it may be more appropriate to use MAE, which is less sensitive to outliers.

If we analyze the above table, we can see that the prediction for the fifth data point is very far off (the true value is 50 while the predicted value is 64), and this has a significant impact on the MSE but a smaller impact on MAE, as we can see from the results we have obtained.
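To see this sensitivity concretely, here is a minimal sketch in NumPy. The first four value pairs are hypothetical, chosen just for illustration; only the fifth pair, 50 vs. 64, comes from the table above:

import numpy as np

# hypothetical true and predicted values; only the fifth pair (50 vs 64)
# matches the table discussed above, and it acts as the outlier
y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_pred = np.array([11.0, 19.0, 31.0, 39.0, 64.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)     # the squared 14-unit error (196) dominates the sum
mae = np.mean(np.abs(errors))  # the same error contributes only 14

print("MSE:", mse)  # 40.0
print("MAE:", mae)  # 3.6

Dropping the fifth point would bring the MSE from 40.0 down to 1.0, while the MAE would only go from 3.6 to 1.0: the squared metric reacts far more strongly to a single large error.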

So, one of the first things we should always do is to correctly treat the outliers (and here’s an article explaining how you can do so).

Another thing to take into account is that we won’t use a single model to solve our ML problems: typically, we start with 4–5, refine their hyperparameters and, in the end, we’ll choose the best model.

But, as a starting point, we can’t calculate MAE, MSE, and RMSE for all 4–5 models, because it would be time-consuming.

So, let’s look at a situation we typically face: we have decided to use a pool of 5 ML models and, for example, we have calculated MAE and gotten the following results:

  • MAE for ML_1: 115
  • MAE for ML_2: 351
  • MAE for ML_3: 78
  • MAE for ML_4: 1103
  • MAE for ML_5: 3427

We know that the value of MAE (and this applies to MSE and RMSE too) has to be as near as possible to 0; so we immediately understand that ML_1 and ML_3 are the best among the 5 we have chosen. But the question is: how good are they?

Each of these metrics can take any value, even 1 million or more. We only know that the nearer to 0 the better; but how near to 0 must the result be? Is an MAE of 78 enough to say that ML_3 is very good at solving this ML problem?

So, because each of these metrics can take any value, statisticians have defined two other metrics whose values are bounded between 0 and 1. This can be more helpful for Data Scientists when comparing metric results between different models.

We define the coefficient of determination (or R²) as follows:

The definition of the coefficient of determination. Image by Author.
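That is:

R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}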

where RSS is the residual sum of squares we defined before. Then we have the Total Sum of Squares, which is defined as:

The definition of the Total Sum of Squares. Image by Author.
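In symbols:

\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2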

The TSS is simply n times the variance of the target variable y; in fact, if we multiply both the numerator and the denominator of the ratio by 1/n (which leaves it unchanged), we get:

The modified definition of the coefficient of determination. Image by Author.
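Explicitly:

R^2 = 1 - \frac{\frac{1}{n}\,\mathrm{RSS}}{\frac{1}{n}\,\mathrm{TSS}} = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}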

Now, the numerator is exactly the MSE and the denominator is the variance of y; so we can write:

Another form to define the coefficient of determination. Image by Author.
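That is:

R^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{var}(y)}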

If R²=1, then MSE=0, so the model perfectly fits the data. Instead, R²=0 indicates that our model fits the data no better than simply predicting the mean of y.

On the train set, R² is bounded between 0 and 1, as we wanted: there, TSS>RSS, or equivalently var(y)>MSE. On the test set, instead, R² can become negative, which means that our model fits the test set badly (but we won’t discuss this any further here).

Now, let’s recall what we have done before. Using the table provided above we had:

  • MSE = 51.2
  • RMSE = 7.15
  • MAE = 5.6

So, judging from RMSE and MAE the (only) model we are using for these calculations seems good, because we are near 0.

But one could also argue that 5.6 is far from 0: the problem is that we have no reference scale to judge against.

Now, let’s see what happens if we calculate R².

Let’s calculate the mean value of y:

The mean value of y with the provided values in the table. Image by Author.

Now we can calculate the variance:

The variance of y with the provided values in the table. Image by Author.

We calculated MSE before (MSE = 51.2) so, finally, we have:

The calculation of the coefficient of determination with the provided values. Image by Author.

Remembering that, on the train set, R² is bounded between 0 and 1 and that the more we are near to 1 the better the model, an R² of 0.7 or higher is generally considered to be a good fit.

So we can say with confidence that our model fits the data pretty well: we know that the best possible value is 1, and we found 0.739, which by comparison is a pretty good result.

The problem with R² is that it tends to increase when we add extra explanatory variables to our model. This happens for a simple reason: additional variables can potentially improve the fit of the model. As we add more explanatory variables, the model has more information about the target variable and can make more accurate predictions; this decreases the residual sum of squares, which in turn increases R².

To determine if a variable is explanatory for our model, we have to consider if it is likely to have an effect on the dependent variable. For example, if we are studying the relationship between income and happiness, the money spent on holidays may be considered an explanatory variable because it is likely to have an effect on happiness. On the other hand, the color of the car of the people interviewed may not be considered an explanatory variable in this context, because it is unlikely to have an effect on happiness.

To deal with this behavior of R², statisticians have defined the adjusted R².

The adjusted R² is a modified form of R² that corrects for the overestimation that can be caused by adding new explanatory variables to the model. We can define it as follows:

The definition of the adjusted R-squared. Image by Author.
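That is:

R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}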

where:

  • n is the number of samples in our data.
  • p is the number of features (sometimes called predictors in the case of a regression problem: this is why we use the letter p).

Let’s say we have a model with 2 independent variables and a sample size of 10, and R² for this model is 0.8. We have:

The calculation of the adjusted R-squared. Image by Author.
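Working through the numbers:

R^2_{\mathrm{adj}} = 1 - (1 - 0.8)\,\frac{10 - 1}{10 - 2 - 1} = 1 - 0.2 \cdot \frac{9}{7} \approx 0.74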

In general, it is recommended to use the adjusted R² when we have a large number of independent variables in the model, because it gives a more accurate measure than the “standard” R².

Luckily, in Python we don’t have to calculate these metrics by hand: the sklearn library does it for us, except for the adjusted R²; in that case, we have to code the formula ourselves.

Let’s see an example. We generate some random data, fit the train set with a linear regression model, and print the results of all the metrics.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# generate random data
np.random.seed(42)
X = np.random.rand(100, 5)
y = 2*X[:,0] + 3*X[:,1] + 5*X[:,2] + np.random.rand(100)

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit linear regression model to the train set
reg = LinearRegression()
reg.fit(X_train, y_train)

# make predictions on the train set
y_pred_train = reg.predict(X_train)

# calculate metrics on the train set with 2 decimals
mae_train = round(mean_absolute_error(y_train, y_pred_train), 2)
mse_train_raw = mean_squared_error(y_train, y_pred_train)
mse_train = round(mse_train_raw, 2)
rmse_train = round(np.sqrt(mse_train_raw), 2) # take the root before rounding
r2_train = round(r2_score(y_train, y_pred_train), 2)

# calculate adjusted r-squared on the train set with 2 decimals
n = X_train.shape[0] # number of samples
p = X_train.shape[1] # number of features (predictors)
adj_r2_train = round(1 - (1 - r2_train) * (n - 1) / (n - p - 1), 2)

# print the results
print("Train set - MAE:", mae_train)
print("Train set - MSE:", mse_train)
print("Train set - RMSE:", rmse_train)
print("Train set - r-squared:", r2_train)
print("Train set - adjusted r-squared:", adj_r2_train)

>>>

Train set - MAE: 0.23
Train set - MSE: 0.07
Train set - RMSE: 0.26
Train set - r-squared: 0.98
Train set - adjusted r-squared: 0.98

Now, in this case, there is no difference between R² and the adjusted R², because the data were generated on purpose and we have just 5 features compared to 80 training samples.

This code was just a way to show how we can use the knowledge we got in this article in a practical case, in Python.

Also, here we can clearly see what it means for MAE, MSE, and RMSE to be near 0: since R² is 0.98, these metrics are all “0.xx”, which is much nearer to 0 than the 5.6 we found in the tabled example.

So far, we’ve seen a complete overview of the main metrics related to regression analysis in Machine Learning.

Even though this turned out to be a very long article, we hope it can help the reader better understand what’s under the hood of these metrics, how to use them, and the differences between them.



