Mastering Linear Regression: The Definitive Guide For Aspiring Data Scientists | Federico Trotta


Image by Dariusz Sankowski on Pixabay

If you’re approaching Machine Learning, one of the first models you may encounter is Linear Regression. It’s probably the easiest model to understand, but don’t underestimate it: there are a lot of things to understand and master.

If you’re a beginner in Data Science or an aspiring Data Scientist, you’re probably facing some difficulties because there are a lot of resources out there, but they are fragmented. I know how you’re feeling, and this is why I created this complete guide: I want to give you all the knowledge you need without you having to search for anything else.

So, if you want to have complete knowledge of Linear Regression, this article is for you. You can study it deeply and re-read it whenever you need it the most. Also, consider that, to cover this topic, we’ll need some knowledge generally associated with regression analysis: we’ll cover it in depth.

And…you’ll excuse me if I link a resource you’ll need: in the past, I’ve written an article on some topics related to Linear Regression, so, to have a complete overview, I advise you to read it (I’ll link it later, when we need it).

Table of Contents:

What do we mean by "regression analysis"?
Understanding correlation
The difference between correlation and regression
The Linear Regression model
Assumptions for the Linear Regression model
Finding the line that best fits the data
Graphical methods to validate your model
An example in Python

Here we’re studying Linear Regression, but what do we mean by “regression analysis”? Paraphrasing from Wikipedia:

Regression analysis is a mathematical technique used to find a functional relationship between a dependent variable and one or more independent variable(s).

In other words, we know that in mathematics we can define a function like so: y=f(x). Generally, y is called the dependent variable and x the independent one. So, we express y in relationship with x, using a certain function f. The aim of regression analysis is, then, to find the function f.

Now, this seems easy, but it’s not. And I know you know it. The reason why it’s not easy is:

  • We know x and y. For example, if we are working with tabular data (with Pandas, for example) x are the features and y is the label.
  • Unfortunately, the data rarely follow a very clear path. So our job is to find the best function f that approximates the relationship between x and y.

So, let me summarize it: regression analysis aims to find an estimated relationship (a good one!) between the dependent and the independent variable(s).

Now, let’s visualize why this process may be difficult. Consider the following code and its outcome:

import numpy as np
import matplotlib.pyplot as plt

# Create random linear data
a = 130

x = 6*np.random.rand(a,1)-3
y = 0.5*x+5+np.random.rand(a,1)

# Labels
plt.xlabel('x')
plt.ylabel('y')

# Plot a scatterplot
plt.scatter(x,y)

The outcome of the above code. Image by Author.

Now, tell me: can the relationship between x and y be a line? So…can this data be approximated by a line? Like the following, for example:

A line approximating the given data. Image by Author.

Stop reading for a moment and think about that.

Well, it could. And how about the following one?

A curve approximating the given data. Image by Author.

Well, even this could! So, what’s the best one? And why not another one?

This is the aim of regression: to find the best-estimated function that can approximate the given data. And it does so using some methodologies: we’ll cover them later in this article. We’ll apply them to the Linear Regression model but some of them can be used with any other regression technique. Don’t worry: I’ll be very specific so you don’t get confused.

Quoting from Wikipedia:

In statistics, correlation is any statistical relationship, whether causal or not, between two random variables. Although in the broadest sense, “correlation” may indicate any type of association, in statistics it usually refers to the degree to which a pair of variables are linearly related.

In other words, correlation is a statistical measure that expresses the linear relationship between variables.

We can say that two variables are correlated if each value of the first variable corresponds to a value of the second variable, following a path. If two variables are highly correlated, that path will be close to linear, because correlation describes the linear relation between the variables.

The math behind the correlation

This is a comprehensive guide, as promised. So, I want to cover the math behind the correlation, but don’t worry: we’ll make it easy so that you can understand it even if you’re not specialized in math.

We generally refer to the correlation coefficient, also known as the Pearson correlation coefficient. This gives an estimate of the correlation between two variables. Suppose we have two variables, a and b, each taking n values. We can calculate the correlation coefficient as follows:

The definition of the Pearson coefficient, powered by embed-dot-fun by the Author.

Where we have:

  • the mean value of a (the same definition applies to both variables, a and b):
The definition of the mean value, powered by embed-dot-fun by the Author.
  • the standard deviation of a (again, the same applies to b), which is the square root of the variance:
The definitions of the standard deviation and the variance, powered by embed-dot-fun by the Author.

So, putting it all together:

The definition of the Pearson coefficient, powered by embed-dot-fun by the Author.
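
Written out explicitly with the definitions above (using the population mean and standard deviation), the coefficient is:

r_{ab} = \frac{\frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{a})(b_i - \bar{b})}{\sigma_a \, \sigma_b},
\qquad
\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i,
\qquad
\sigma_a = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{a})^2}

where the bars denote the means and the sigmas the standard deviations of a and b.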

As you may know:

  • the mean is the sum of all the values of a variable divided by the number of values. So, for example, if our variable a has the values 1, 3, 7, 13, 25, the mean value of a will be:
The calculation of the mean for 5 values, powered by embed-dot-fun by the Author.
  • the standard deviation is an index of statistical dispersion and is an estimate of the variability of a variable (or of a population, as we would say in statistics). It is one of the ways to express the dispersion of data around an index; in the case of the correlation coefficient, the index around which we calculate the dispersion is the mean (see the above formula). The higher the standard deviation, the higher the dispersion around the mean: the majority of the data points are far from the mean value. The short snippet after this list checks both definitions numerically.
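
As a quick numerical check of these two definitions, using the five values from the example above:

import numpy as np

# The five values used in the example above
a = np.array([1, 3, 7, 13, 25])

print(f'mean: {a.mean()}')
print(f'std. dev.: {a.std():.2f}') # population standard deviation

>>>

mean: 9.8
std. dev.: 8.63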

Numerically speaking, we have to remember that the value of the correlation coefficient is constrained between 1 and -1; this means that:

  • if r=1: the variables are perfectly positively correlated; if one variable increases its value, the other does the same, following a linear path.
  • if r=-1: the variables are perfectly negatively correlated; if one variable increases its value, the other one decreases its value, following a linear path.
  • if r=0: there is no linear correlation between the variables.

Finally, two variables are generally considered highly correlated if |r| > 0.75.
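
To see these numbers in practice, we can compute r directly with NumPy on two small made-up arrays (np.corrcoef returns the whole correlation matrix, so we read the off-diagonal entry):

import numpy as np

# Two small made-up variables
a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 4, 5, 4, 6])

# np.corrcoef returns a 2x2 matrix; the off-diagonal element is r
r = np.corrcoef(a, b)[0, 1]
print(f'r = {r:.2f}')

>>>

r = 0.85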

Correlation is not causation

We need to be very clear about the fact that “correlation is not causation”; let’s make an example that might be useful to remember it.

It is a hot summer; we don’t like the high temperatures in our city, so we go to the mountains. Once we get to the mountain top, we measure the temperature and find it’s lower than in our city. We get a little suspicious, and we decide to go to a higher mountain, finding that the temperature is even lower than on the previous mountain.

We try mountains with different heights, measure the temperature, and plot a graph; we find that with the height of the mountain increasing, the temperature decreases, and we can see a linear trend.

What does it mean? It means that the temperature is related to the height of the mountains, following a linear path: so there is a correlation between the decrease in temperature and the height of the mountains. It doesn’t mean the height of the mountain itself caused the decrease in temperature; in fact, if we got to the same height, at the same latitude, with a hot air balloon, we’d measure the same temperature.

The correlation matrix

So, how do we calculate the correlation coefficient in Python? Well, we generally calculate the correlation matrix. Suppose we have two variables, x and y; we store them in a data frame called df, and we can plot the correlation matrix using seaborn like so:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Create the dataframe
df = pd.DataFrame({'x':x, 'y':y})

# Plot heat map for correlation coefficient
sns.heatmap(df.corr(), annot=True, fmt="0.2")

The correlation matrix for the above code. Image by Author.

If we have a 0 correlation coefficient, it means that the data points do not tend to increase or decrease following a linear path, because we have no correlation.

Let us have a look at some plots of correlation coefficients with different values (image from Wikipedia here):

Data distribution with different correlation values. Image rights for distribution here.

As we can see, when the correlation coefficient is equal to 1 or -1 the tendency of the data points is clearly to be along a line. But, as the correlation coefficient deviates from the two extreme values, the distribution of the data points deviates from a linear path. Finally, for the correlation coefficient of 0, the distribution of the data can be anything.

So, when we get a correlation coefficient of 0 we can’t say anything about the distribution of the data, but we can investigate it (if needed) with a regression analysis.

So, correlation and regression are linked but are different:

  • Correlation analyzes the tendency of variables to be linearly distributed.
  • Regression is the study of the relationship between variables.

We have two kinds of Linear Regression models: the Simple and the Multiple ones. Let’s see them both.

The Simple Linear Regression model

The goal of the Simple Linear Regression is to model the relationship between a single feature and a continuous label. This is the mathematical equation that describes this ML model:

y = wx + b

The parameter b (also called “bias”) represents the y-axis intercept (the value of y when x=0), and w is the weight coefficient. Our goal is to learn the weight w (together with the bias b) that describes the relationship between x and y. This weight will later be used to predict the response for new values of x.

Let’s consider a practical example:

import numpy as np
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Show scatterplot
plt.scatter(x, y)

The output of the above code. Image by Author.

The question is: can this data distribution be approximated with a line? Well, we could create something like that:

import numpy as np
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Create basic scatterplot
plt.plot(x, y, 'o')

# Obtain m (slope) and b (intercept) of a line
m, b = np.polyfit(x, y, 1)

# Add linear regression line to scatterplot
plt.plot(x, m*x+b)

# Labels
plt.xlabel('x variable')
plt.ylabel('y variable')

The output of the above code. Image by Author.

Well, as in the example we’ve seen above, it could be a line, but it could also be a general curve.

And, in a moment we’ll see how we can say if the data distribution can be better described by a line or by a general curve.

The Multiple Linear Regression model

Since reality is complex, the typical cases we’ll face are Multiple Linear Regression cases: the feature x is not a single one, because we’ll have multiple features. For example, if we work with tabular data, a data frame with 9 columns has 8 features and 1 label: this means that our problem is eight-dimensional.

As we can understand, this case is very complicated to visualize and the equation of the line has to be expressed with vectors and matrices, becoming:

The equation of the Multiple Linear Regression model powered by embed-dot-fun by the Author.

So, the equation becomes the sum of all the weights (w) multiplied by the corresponding independent variables (x), and it can even be written as the product of two matrices.
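
Written out explicitly, for n features x_1, …, x_n the model reads:

y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w}^{\top}\mathbf{x} + b

where \mathbf{w} is the vector of weights and \mathbf{x} the vector of features (the bias b can be folded into \mathbf{w} by adding a constant feature equal to 1).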

Now, to apply the Linear Regression model, our data should respect some assumptions. These are:

  1. Linearity: the relationship between the dependent variable and independent variables should be linear. This means that a change in the independent variable should result in a proportional change in the dependent variable, following a linear path.
  2. Independence: the observations in the dataset should be independent of each other. This means that the value of one observation should not depend on the value of another observation.
  3. Homoscedasticity: the variance of the residuals should be constant across all levels of the independent variable. In other words, the spread of the residuals should be roughly the same across all levels of the independent variable.
  4. Normality: the residuals should be normally distributed. In other words, the distribution of the residuals should be a normal (or bell-shaped) curve.
  5. No multicollinearity: the independent variables should not be highly correlated with each other. If two or more independent variables are highly correlated, it can be difficult to distinguish the individual effects of each variable on the dependent variable.

Unfortunately, testing all these hypotheses is not always possible, especially in the case of the Multiple Linear Regression model. Anyway, there is a way to test them: the p-value test, which you may have heard of before. We won’t cover this test here for two reasons:

  1. It’s a general test, not specifically related to the Linear Regression model. So, it needs a specific treatment in a dedicated article.
  2. I’m one of those (maybe one of the few) who believe that calculating the p-value is not always a must when we need to analyze data. For this reason, I’ll write a dedicated article on this controversial topic in the future. Just for the sake of curiosity: since I’m an engineer, I have a very practical approach, and I like applied mathematics. I wrote an article on this topic here:

So, above we were reasoning about which one of the following could be the best fit:

A comparison between models. Image by Author.

To understand if the best model is the left one (the line) or the right one (a general curve) we proceed as follows:

  • We split the data we have into the training and the test set.
  • We validate both models on both sets, testing how well our models generalize their learning.

We won’t cover the polynomial model here (useful for general curves), but consider that there are two approaches to validate ML models:

  • The analytical one.
  • The graphical one.

Generally speaking, we’ll use both to get a better understanding of the performance of the model. Anyway, generalizing means that our ML model learns from the training set and correctly applies what it has learned to the test set. If it doesn’t, we try another ML model. Here’s the process:

The workflow of training and validating ML models. Image by Author.

This means that an ML model generalizes well when it has good performance on both the training and the test set.

I’ve discussed the analytical way to validate an ML model in the case of linear regression in the following article:

I advise you to read it because we’ll use some metrics discussed there in the example at the end of this article.

Of course, the metrics discussed can be applied to any ML model in the case of a regression problem. But you’re lucky: I’ve used the linear model as an example.

The graphical ways to validate an ML model in the case of a regression problem are discussed in the next paragraph.

Let’s see three graphical ways to validate our ML models.

1. The residual analysis plot

This method is specific to the Linear Regression model and consists of visualizing how the residuals are distributed. Here’s what we expect:

A residual analysis plot. Image by Author.

To plot this we can use the built-in function sns.residplot() in Seaborn (here’s the documentation).
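
As a minimal sketch (reusing, as an assumption, the small x and y arrays from the Simple Linear Regression example above), the call looks like this:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Same small arrays used in the Simple Linear Regression example
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Plot the residuals of a simple linear fit of y on x
sns.residplot(x=x, y=y)

# Labels
plt.xlabel('x')
plt.ylabel('Residuals')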

A plot like that is good because we want to see randomly distributed data points along the horizontal axis. One of the assumptions of the linear regression model, in fact, is that the residuals must be normally distributed (assumption n°4 listed above). If the residuals are normally distributed, the errors between the observed values and the predicted ones are randomly distributed around zero, with no clear pattern or trend; and this is exactly the case in our plot. So, in these cases, our ML model may be a good one.

Instead, if there is a particular pattern in our residual plot, our model is not good for our ML problem. For example, consider the following:

A parabolical residuals analysis plot. Image by Author.

In this case, we can see that there is a parabolic trend: this means that our model (the Linear model) is not good to solve our ML problem.

2. The actual vs. predicted values plot

Another plot we may use to validate our ML model is the actual vs. predicted plot. In this case, we plot a graph with the actual values on the horizontal axis and the predicted values on the vertical axis. The goal is to see the data points lying as close as possible to a line: if the predictions were perfect, they would fall exactly on the line where the predicted value equals the actual one. The same method can be used with a polynomial regression, with the same expectation: the closer the points are to that line, the better the predictions.

Suppose we have a result as follows:

An actual vs. predicted values plot in the case of linear regression. Image by Author.

The above graph shows that the predicted data points are distributed along a line. It is not a perfect linear distribution, so the linear model may not be ideal.

If, for our specific problem, we have y_train (the label on the training set) and we’ve calculated y_train_pred (the prediction on the training set), we can plot the graph like so:

import matplotlib.pyplot as plt

# Scatterplot of y_train and y_train_pred
plt.scatter(y_train, y_train_pred)
plt.plot(y_train, y_train, color='r') # Plot the reference line (predicted = actual)

# Labels
plt.title('ACTUAL VS PREDICTED VALUES')
plt.xlabel('ACTUAL VALUES')
plt.ylabel('PREDICTED VALUES')

3. The Kernel Density Estimation (KDE) plot

The last graph we want to talk about to validate our ML models is the Kernel Density Estimation (KDE) plot. This is a general method and can be used to validate both regression and classification models.

The KDE is the application of a kernel smoother for probability density estimation. A kernel smoother is a statistical method that is used to estimate a function as the weighted average of the neighbor observed data. The kernel defines the weight, giving a higher weight to closer data points.
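
Just to make the idea concrete, here is a tiny sketch using SciPy’s gaussian_kde on some made-up observations (the article itself will use Seaborn’s sns.kdeplot(), which produces the same kind of smoothed estimate):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Some made-up observations
data = np.array([1.2, 1.9, 2.1, 2.8, 3.0, 3.1, 4.5, 5.2])

# Fit a Gaussian kernel smoother to the data
kde = gaussian_kde(data)

# Evaluate the smoothed density on a grid and plot it
grid = np.linspace(0, 7, 200)
plt.plot(grid, kde(grid), label='Estimated density')
plt.scatter(data, np.zeros_like(data), color='r', label='Observations')
plt.legend()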

To understand the usefulness of a smoother function, see the graph below:

The idea behind KDE. Image by Author.

It is helpful to approximate our data points with a smoothing function if we want to compare two quantities. In the case of an ML problem, in fact, we typically like to see the comparison between the actual labels and the labels predicted by our model, so we use the KDE to compare two smoothed functions.

Let’s say we have predicted our labels using a linear regression model. We want to compare the KDE for our training set’s actual and predicted labels. We can do so with Seaborn invoking the method sns.kdeplot() (here’s the documentation).

Suppose we have the following result:

A KDE plot. Image by Author.

As we can see, the comparison between the actual and the predicted label is easy to do, since we are comparing two smoothed functions; in a case like that, our model is good because the curves are very similar.

In fact, what we expect from a “good” ML model is:

  1. curves that are as similar as possible to bell curves;
  2. two curves that are as similar as possible to each other.

Now, let’s apply all the things we’ve learned so far. We’ll use the famous “Ames Housing” dataset, which is perfect for our purpose.

This dataset has 80 features, but for simplicity, we’ll work with just a subset of them which are:

  • Overall Qual: the rating of the overall material and finish of the house, on a scale from 1 (bad) to 10 (excellent).
  • Overall Cond: the rating of the overall condition of the house, on a scale from 1 (bad) to 10 (excellent).
  • Gr Liv Area: the above-ground living area, measured in square feet.
  • Total Bsmt SF: the total basement area, measured in square feet.
  • SalePrice: the sale price, in USD.

We’ll consider our SalePrice column as the target (label) variable, and the other columns as the features.

Exploratory Data Analysis (EDA)

Let’s import our data, create a subset with the mentioned features, and display some statistics:

import pandas as pd

# Define the columns
columns = ['Overall Qual', 'Overall Cond', 'Gr Liv Area',
'Total Bsmt SF', 'SalePrice']

# Create dataframe
df = pd.read_csv('http://jse.amstat.org/v19n3/decock/AmesHousing.txt',
sep='\t', usecols=columns)

# Show statistics
df.describe()

Statistics of the dataset. Image by Author.

An important observation here is that the mean values of the columns have very different ranges (the Overall Qual mean is 6.09, while the Gr Liv Area mean is 1499.69). This tells us an important fact: we have to scale the features.

Data preparation

What does “features scaling” mean?

Scaling a feature means bringing its values into a small, fixed range, typically between 0 and 1 or between -1 and 1. There are two typical methods to scale the features:

  • Mean normalization (here in its simplest, min-max form): a method of scaling numeric data so that the minimum value becomes zero and the maximum value becomes one. Suppose c is a value reached by our feature; c′ is its new value after the normalization process (written out just below):
The formula for the mean normalization, powered by embed-dot-fun by the Author.
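
The scaling applied in the Python example below (the min-max form) can be written as:

c' = \frac{c - \min(c)}{\max(c) - \min(c)}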

Let’s see an example in Python:

import numpy as np

# Create a list of numbers
data = [1, 2, 3, 4, 5]

# Find min and max values
data_min = min(data)
data_max = max(data)

# Normalize the data
data_normalized = [(x - data_min) / (data_max - data_min) for x in data]

# Print the normalized data
print(f'normalized data: {data_normalized}')

>>>

normalized data: [0.0, 0.25, 0.5, 0.75, 1.0]

  • Standardization (or z-score normalization): This method transforms a variable so that it has a mean of zero and a standard deviation of one. The formula is the following (c′ is the new value of c after the standardization process):
The formula for the standardization, powered by embed-dot-fun by the Author.
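
In symbols, with \mu the mean and \sigma the standard deviation of the feature:

c' = \frac{c - \mu}{\sigma}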

Let’s see an example in Python:

import numpy as np

# Original data
data = [1, 2, 3, 4, 5]

# Calculate mean and standard deviation
mean = np.mean(data)
std = np.std(data)

# Standardize the data
data_standardized = [(x - mean) / std for x in data]

# Print the standardized data
print(f'standardized values: {data_standardized}')
print(f'mean of standardized values: {np.mean(data_standardized)}')
print(f'std. dev. of standardized values: {np.std(data_standardized): .2f}')

>>>

standardized values: [-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
mean of standardized values: 0.0
std. dev. of standardized values: 1.00

As we can see, the standardized data have a mean of 0 and a standard deviation of 1, as we wanted. The good news is that we can use the scikit-learn library to standardize the features, and we’re going to do it in a moment.

Feature scaling is an important thing to do when working on an ML problem, for a simple reason:

  • If we perform exploratory data analysis with features that are not scaled, the numbers we compute (the mean values, for example) will be on very different scales. If we take a look at the statistics we got above when we invoked the df.describe() method, we can see that, for each column, we get a very different value of the mean. If we scale or standardize the features, instead, all of them will lie on comparable scales (centered around 0 with unit variance): and this will help us mathematically.

Now, this dataset has some NaN values. We won’t show them for brevity (try it on your own), but we’ll remove them. Also, we’ll calculate the correlation matrix:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Drop NaNs from dataframe
df = df.dropna(axis=0)

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(df.corr()))

# Heat map for correlation coefficient
sns.heatmap(df.corr(), annot=True, fmt="0.1", mask=mask)

The correlation matrix for our data frame. Image by Author.

So, with np.triu(np.ones_like(df.corr())) we have created a mask that is useful to display a triangular correlation matrix, which is more readable (especially when we have many more features than in this case).

So, there is a moderate correlation (0.6) between Total Bsmt SF and SalePrice, quite a high correlation (0.7) between Gr Liv Area and SalePrice, and a high correlation (0.8) between Overall Qual and SalePrice. Also, there is a moderate correlation (0.6) between Overall Qual and Gr Liv Area, and a 0.5 correlation between Overall Qual and Total Bsmt SF.

Here there’s no multicollinearity: no features are highly correlated with each other (so, our features satisfy hypothesis n°5 listed above). If we’d found some highly correlated features, we could delete one of them, because two highly correlated features have the same effect on the label (this applies to any ML model: if two features are highly correlated, we can drop one of the two).

Finally, we subdivide the data frame df into X (the features) and y (the label) and scale the features:

from sklearn.preprocessing import StandardScaler

# Define the features
X = df.iloc[:,:-1]

# Define the label
y = df.iloc[:,-1]

# Scale the features
scaler = StandardScaler() # Instantiate the scaler
X = scaler.fit_transform(X) # Fit the scaler and transform the features

Fitting the linear regression model

Now we split the data into the training and the test set, fit the Linear Regression model on the training set, and calculate R² for both sets:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the LR model
reg = LinearRegression().fit(X_train, y_train)

# Calculate R^2
coeff_det_train = reg.score(X_train, y_train)
coeff_det_test = reg.score(X_test, y_test)

# Print metrics
print(f" R^2 for training set: {coeff_det_train}")
print(f" R^2 for test set: {coeff_det_test}")

>>>

R^2 for training set: 0.77
R^2 for test set: 0.73

Notes:
1) your results can be slightly different due to the random
train/test split.

2) here we can see generalization in action:
we fitted the Linear Regression model to the training set with
reg = LinearRegression().fit(X_train, y_train).
Then, we calculated R^2 on the training and test sets with:
coeff_det_train = reg.score(X_train, y_train)
coeff_det_test = reg.score(X_test, y_test)

In other words: we don't fit the model to the test set.
We fit the model to the training set and we calculate the scores
and predictions (see the next snippet of code with the KDE) on both sets
to see how our model generalizes to new, unseen data
(the data of the test set).

So we get an R² of 0.77 on the training set and 0.73 on the test set, which are quite good values, suggesting the Linear model is a good one for this ML problem.
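
Besides R², we can also compute, as a quick sketch, a couple of the analytical metrics discussed in the article linked above, such as the mean absolute error (MAE) and the mean squared error (MSE) on the test set (the exact numbers will vary with the random split):

from sklearn import metrics

# Predictions on the test set
y_test_pred = reg.predict(X_test)

# Analytical metrics on the test set
mae = metrics.mean_absolute_error(y_test, y_test_pred)
mse = metrics.mean_squared_error(y_test, y_test_pred)

print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")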

Let’s see the KDE plots for both sets:

# Calculate predictions
y_train_pred = reg.predict(X_train) # train set
y_test_pred = reg.predict(X_test) # test set

# KDE train set
ax = sns.kdeplot(y_train, color='r', label='Actual Values') #actual values
sns.kdeplot(y_train_pred, color='b', label='Predicted Values', ax=ax) #predicted values

# Show title
plt.title('Actual vs Predicted values')
# Show legend
plt.legend()

KDE for the training set. Image by Author.
# KDE test set
ax = sns.kdeplot(y_test, color='r', label='Actual Values') #actual values
sns.kdeplot(y_test_pred, color='b', label='Predicted Values', ax=ax) #predicted values

# Show title
plt.title('Actual vs Predicted values')
# Show legend
plt.legend()

KDE for the test set. Image by Author.

Regardless of the fact that we’ve obtained an R² of 0.73 on the test set, which is good (but remember: the higher, the better), this plot shows us that the linear model is indeed a good model to solve this ML problem. This is why I love the KDE plot: it is a very powerful tool, as we can see.

Also, this shows why we shouldn’t rely on just one method to validate our ML model: a combination of an analytical method with a graphical one generally gives us the right insights to decide whether to change our ML model or not. In this case, the Linear Regression model is a good choice for making predictions.

I hope you’ll find this article useful. I know it’s very long, but I wanted to give you all the knowledge you need on this topic, so that you can return to it whenever you need it the most.

Some of the things we’ve discussed here are general topics, while others are specific to the Linear Regression model. Let’s summarize them:

  • The definition of regression is, of course, a general definition.
  • Correlation generally refers to the Linear model. In fact, as we said before, correlation is the tendency of two variables to be linearly related. There are ways to define non-linear correlations, but we leave them for other articles (just know, for now, that they exist).
  • We’ve discussed the Simple and the Multiple Linear Regression models with their assumptions (the assumptions apply to both models).
  • When talking about how to find the line that best fits the data, we’ve referred to the article “Mastering the Art of Regression Analysis: 5 Key Metrics Every Data Scientist Should Know”. There, we find all the metrics we need to know to solve a regression analysis. So, this is a general topic that applies to any regression model, including the Linear one, of course.
  • We’ve shown three graphical methods to validate our ML models: 1) the residual analysis plot, which applies to Linear Regression models; 2) the actual vs. predicted values plot, which can be applied to Linear and Polynomial models; 3) the KDE plot, which can be applied to any ML model, even in the case of a classification problem.

Finally, I want to remind you that we’ve spent a couple of lines stressing the fact that we can avoid using p-values to test the hypotheses of our ML models. I’m writing an article on this topic very soon but, as you can see, the KDE has shown us that our Linear model is good for solving this ML problem, and we haven’t validated our hypotheses with p-values.

So far in this article, we’ve used some plots. You can clone this repo I’ve created so that you can import the code and use it to easily plot the graphs. If you have some difficulties, you can find usage examples in my projects on GitHub. If you have any other difficulties, you can contact me and I’ll help you.

  • Subscribe to my newsletter to get more on Python & Data Science.
  • Found it useful? Buy me a Ko-fi.
  • Liked the article? Join Medium through my referral link: unlock all the content on Medium for 5$/month (with no additional fee).
  • Find/contact me here.


Image by Dariusz Sankowski on Pixabay

If you’re approaching Machine Learning, one of the first models you may encounter is Linear Regression. It’s probably the easiest model to understand, but don’t underestimate it: there are a lot of things to understand and master.

If you’re a beginner in Data Science or an aspiring Data Scientist, you’re probably facing some difficulties because there are a lot of resources out there, but are fragmented. I know how you’re feeling, and this is why I created this complete guide: I want to give you all the knowledge you need without searching for anything else.

So, if you want to have complete knowledge of Linear Regression this article is for you. You can study it deeply and re-read it whenever you need it the most. Also, consider that, to cover this topic, we’ll need some knowledge generally associated with regression analysis: we’ll cover it in deep.

And…you’ll excuse me if I’ll link a resource you’ll need: in the past, I’ve created an article on some topics related to Linear Regression so, to have a complete overview, I advise you to read it (I’ll link later when we’ll need it).

Table of Contents:

What do we mean by "regression analysis"?
Understanding correlation
The difference between correlation and regression
The Linear Regression model
Assumptions for the Linear Regression model
Finding the line that best fits the data
Graphical methods to validate your model
An example in Python

Here we’re studying Linear Regression, but what do we mean by “regression analysis”? Paraphrasing from Wikipedia:

Regression analysis is a mathematical technique used to find a functional relationship between a dependent variable and one or more independent variable(s).

In other words, we know that in mathematics we can define a function like so: y=f(x). Generally, y is called the dependent variable and x the independent. So, we express y in relationship with x, using a certain function f. The aim of regression analysis is, then, to find the function f .

Now, this seems easy but is not. And I know you know it. And the reason why is not easy is:

  • We know x and y. For example, if we are working with tabular data (with Pandas, for example) x are the features and y is the label.
  • Unfortunately, the data rarely follow a very clear path. So our job is to find the best function f that approximates the relationship between x and y.

So, let me summarize it: regression analysis aims to find an estimated relationship (a good one!) between the dependent and the independent variable(s).

Now, let’s visualize why this process may be difficult. Consider the following code and its outcome:

import numpy as np
import matplotlib.pyplot as plt

# Create random linear data
a = 130

x = 6*np.random.rand(a,1)-3
y = 0.5*x+5+np.random.rand(a,1)

# Labels
plt.xlabel('x')
plt.ylabel('y')

# Plot a scatterplot
plt.scatter(x,y)

The outcome of the above code. Image by Author.

Now, tell me: can the relationship between x and y be a line? So…can this data be approximated by a line? Like the following, for example:

A line approximating the given data. Image by Author.

Stop reading for a moment and think about that.

Well, it could. And how about the following one?

A curve approximating the given data. Image by Author.

Well, even this could! So, what’s the best one? And why not another one?

This is the aim of regression: to find the best-estimated function that can approximate the given data. And it does so using some methodologies: we’ll cover them later in this article. We’ll apply them to the Linear Regression model but some of them can be used with any other regression technique. Don’t worry: I’ll be very specific so you don’t get confused.

Quoting from Wikipedia:

In statistics, correlation is any statistical relationship, whether causal or not, between two random variables. Although in the broadest sense, “correlation” may indicate any type of association, in statistics it usually refers to the degree to which a pair of variables are linearly related.

In other words, correlation is a statistical measure that expresses the linear relationship between variables.

We can say that two variables are correlated if each value of the first variable corresponds to a value for the second variable, following a path. If two variables are highly correlated, the path would be linear, because the correlation describes the linear relation between the variables.

The math behind the correlation

This is a comprehensive guide, as promised. So, I want to cover the math behind the correlation, but don’t worry: we’ll make it easy so that you can understand it even if you’re not specialized in math.

We generally refer to the correlation coefficient, also known as the Pearson correlation coefficient. This gives an estimate of the correlation between two variables. Suppose we have two variables, a and b and they can reach n values. We can calculate the correlation coefficient as follows:

The definition of the Pearson coefficient, powered by embed-dot-fun by the Author.

Where we have:

  • the mean value of a(but it applies to both variables, a and b):
The definition of the mean value, powered by embed-dot-fun by the Author.
The definitions of the standard deviation and the variance, powered by embed-dot-fun by the Author.

So, putting it all together:

The definition of the Pearson coefficient, powered by embed-dot-fun by the Author.

As you may know:

  • the mean is the sum of all the values of a variable divided by the number of values. So, for example, if our variable a has the values 1,3,7,13,25 the mean value of a will be:
The calculation of the mean for 5 values, powered by embed-dot-fun by the Author.
  • the standard deviation is an index of statistical dispersion and is an estimate of the variability of a variable (or of a population, as we would say in statistics). It is one of the ways to express the dispersion of data around an index; in the case of the correlation coefficient, the index around which we calculate the dispersion is the mean (see the above formula). The more the standard deviation is high, the more the dispersion around the mean is high: the majority of the data points are distant from the mean value.

Numerically speaking, we have to remember that the value of the correlation coefficient is constrained between 1 and -1; this means that:

  • if r=1: the variables are highly positively correlated; it means that if one variable increases its value, the other does the same, following a linear path.
  • if r=-1: the variables are highly negatively correlated; it means that if one variable increases its value, the other one decreases its value, following a linear path.
  • if r=0: there is no correlation between the variables.

Finally, two variables are generally considered highly correlated if r>0.75.

Correlation is not causation

We need to have very clear in our mind the fact that “correlation is not causation”; we want to make an example that might be useful to remember it.

It is a hot summer; we don’t like the high temperatures in our city, so we go to the mountain. Luckily, we get to the mountain top, measure the temperature and find it’s lower than in our city. We get a little suspicious, and we decide to go to a higher mountain, finding that the temperature is even lower than the one on the previous mountain.

We try mountains with different heights, measure the temperature, and plot a graph; we find that with the height of the mountain increasing, the temperature decreases, and we can see a linear trend.

What does it mean? It means that the temperature is related to the height of the mountains, with a linear path: so there is a correlation between the decrease in temperature and the height (of the mountains). It doesn’t mean the height of the mountain caused the decrease in temperature; in fact, if we get to the same height, at the same latitude, with a hot air balloon we’d measure the same temperature.

The correlation matrix

So, how do we calculate the correlation coefficient in Python? Well, we generally calculate the correlation matrix. Suppose we have two variables, X and y; we store them in a data frame called df and we can plot the correlation matrix using seaborn like so:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Create the dataframe
df = pd.DataFrame({'x':x, 'y':y})

# Plot heat map for correlation coefficient
sns.heatmap(df.corr(), annot=True, fmt="0.2")

The correlation matrix for the above code. Image by Author.

If we have a 0 correlation coefficient, it means that the data points do not tend to increase or decrease following a linear path, because we have no correlation.

Let us have a look at some plots of correlation coefficients with different values (image from Wikipedia here):

Data distribution with different correlation values. Image rights for distribution here.

As we can see, when the correlation coefficient is equal to 1 or -1 the tendency of the data points is clearly to be along a line. But, as the correlation coefficient deviates from the two extreme values, the distribution of the data points deviates from a linear path. Finally, for the correlation coefficient of 0, the distribution of the data can be anything.

So, when we get a correlation coefficient of 0 we can’t say anything about the distribution of the data, but we can investigate it (if needed) with a regression analysis.

So, correlation and regression are linked but are different:

  • Correlation analyzes the tendency of variables to be linearly distributed.
  • Regression is the study of the relationship between variables.

We have two kinds of Linear Regression models: the Simple and the Multiple ones. Let’s see them both.

The Simple Linear Regression model

The goal of the Simple Linear Regression is to model the relationship between a single feature and a continuous label. This is the mathematical equation that describes this ML model:

y = wx + b

The parameter b (also called “bias”) represents the y-axis intercept (is the value of ywhen X=0), and w is the weight coefficient. Our goal is to learn the weight w that describes the relationship between x and y. This weight will later be used to predict the response for new values of x.

Let’s consider a practical example:

import numpy as np
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Show scatterplot
plt.scatter(x, y)

The output of the above code. Image by Author.

The question is: can this data distribution be approximated with a line? Well, we could create something like that:

import numpy as np
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Create basic scatterplot
plt.plot(x, y, 'o')

# Obtain m (slope) and b (intercept) of a line
m, b = np.polyfit(x, y, 1)

# Add linear regression line to scatterplot
plt.plot(x, m*x+b)

# Labels
plt.xlabel('x variable')
plt.ylabel('y variable')

The output of the above code. Image by Author.

Well, as in the example we’ve seen above, it could be a line but it could be a general curve.

And, in a moment we’ll see how we can say if the data distribution can be better described by a line or by a general curve.

The Multiple Linear Regression model

Since reality is complex, the typical cases we’ll face are related to the Multiple Linear Regression case. We mean that the feature x is not a single one: we’ll have multiple features. For example, if we work with tabular data, a data frame with 9 columns has 8 features and 1 label: this means that our problem is eight-dimensional.

As we can understand, this case is very complicated to visualize and the equation of the line has to be expressed with vectors and matrices, becoming:

The equation of the Multiple Linear Regression model powered by embed-dot-fun by the Author.

So, the equation of the line becomes the sum of all the weights (w) multiplied by the independent variable (x) and it can even be written as the product of two matrices.

Now, to apply the Linear Regression model, our data should respect some assumptions. These are:

  1. Linearity: the relationship between the dependent variable and independent variables should be linear. This means that a change in the independent variable should result in a proportional change in the dependent variable, following a linear path.
  2. Independence: the observations in the dataset should be independent of each other. This means that the value of one observation should not depend on the value of another observation.
  3. Homoscedasticity: the variance of the residuals should be constant across all levels of the independent variable. In other words, the spread of the residuals should be roughly the same across all levels of the independent variable.
  4. Normality: the residuals should be normally distributed. In other words, the distribution of the residuals should be a normal (or bell-shaped) curve.
  5. No multicollinearity: the independent variables should not be highly correlated with each other. If two or more independent variables are highly correlated, it can be difficult to distinguish the individual effects of each variable on the dependent variable.

Unfortunately, testing all these hypotheses is not always possible, especially in the case of the Multiple Linear Regression model. Anyway, there is a way to test all the hypotheses. It’s called the p-value test, and maybe you heard of that before. Anyway, we won’t cover this test here for two reasons:

  1. It’s a general test, not specifically related to the Linear Regression model. So, it needs a specific treatment in a dedicated article.
  2. I’m one of those (maybe one of the few) who believes that calculating the p-value is not always a must when we need to analyze data. For this reason, I’ll create in the future a dedicated article on this controversial topic. But just for the sake of curiosity, since I’m an engineer I have a very practical approach, and I like applied mathematics. I wrote an article on this topic here:

So, above we were reasoning which one of the following can be the best fit:

A comparison between models. Image by Author.

To understand if the best model is the left one (the line) or the right one (a general curve) we proceed as follows:

  • We split the data we have into the training and the test set.
  • We validate both models on both sets, testing how well our models generalize their learning.

We won’t cover the polynomial model here (useful for general curves), but consider that there are two approaches to validate ML models:

  • The analytical one.
  • The graphical one.

Generally speaking, we’ll use both to get a better understanding of the performance of the model. Anyway, generalizing means that our ML model learns from the training set and applies correctly its learning to the test set. If it doesn’t, we try another ML model. Here’s the process:

The workflow of training and validating ML models. Image by Author.

This means that an ML model generalizes well when it has good performances on both the training and the test set.

I’ve discussed the analytical way to validate an ML model in the case of linear regression in the following article:

I advise you to read it because we’ll use some metrics discussed there in the example at the end of this article.

Of course, the metrics discussed can be applied to any ML model in the case of a regression problem. But you’re lucky: I’ve used the linear model as an example.

The graphical ways to validate an ML model in the case of a regression problem are discussed in the next paragraph.

Let’s see three graphical ways to validate our ML models.

1. The residual analysis plot

This method is specific to the Linear Regression model and consists in visualizing how the residuals are distributed. Here’s what we expect:

A residual analysis plot. Image by Author.

To plot this we can use the built-in function sns.residplot() in Seaborn (here’s the documentation).

A plot like that is good because we want to see randomly distributed data points along the horizontal axis. One of the assumptions of the linear regression model, in fact, is that the residuals must be normally distributed (assumption n°4 listed above). If the residuals are normally distributed, it means that the errors of the observed values from the predicted ones are randomly distributed around zero, with no clear pattern or trend; and this is exactly the case in our plot. So, in these cases, our ML model may be a good one.

Instead, if there is a particular pattern in our residual plot, our model is not good for our ML problem. For example, consider the following:

A parabolical residuals analysis plot. Image by Author.

In this case, we can see that there is a parabolic trend: this means that our model (the Linear model) is not good to solve our ML problem.

2. The actual vs. predicted values plot

Another plot we may use to validate our ML model is the actual vs. predicted plot. In this case, we plot a graph having the actual values on the horizontal axis and the predicted values on the vertical axis. The goal is to find the data points distributed as much as possible to a line, in the case of Linear Regression. We can even use the method in the case of a polynomial regression: in this case, we’d expect the data distributed as much as possible to a generic curve.

Suppose we have a result as follows:

An actual vs. predicted values plot in the case of linear regression. Image by Author.

The above graph shows that the predicted data points are distributed along a line. It is not a perfect linear distribution, so the linear model may not be ideal.

If, for our specific problem, we havey_train (the label on the training set) and we’ve calculated y_train_pred (the prediction on the training set), we can plot the following graph like so:

import matplotlib.pyplot as plt

# Scatterplot of y_train and y_train_pred
plt.scatter(y_train, y_train_pred)
plt.plot(y_test, y_test, color='r') # Plot the line

# Labels
plt.title('ACTUAL VS PREDICTED VALUES')
plt.xlabel('ACTUAL VALUES')
plt.ylabel('PREDICTED VALUES')

3. The Kernel Density Estimation (KDE) plot

The last graph we want to talk about to validate our ML models is the Kernel Density Estimation (KDE) plot. This is a general method and can be used to validate both regression and classification models.

The KDE is the application of a kernel smoother for probability density estimation. A kernel smoother is a statistical method that is used to estimate a function as the weighted average of the neighbor observed data. The kernel defines the weight, giving a higher weight to closer data points.

To understand the usefulness of a smoother function, see the graph below:

The idea behind KDE. Image by Author.

It is helpful to approximate our data points with a smoothing function if we want to compare two quantities. In the case of an ML problem, in fact, we typically like to see the comparison between the actual labels and the labels predicted by our model, so we use the KDE to compare two smoothed functions.

Let’s say we have predicted our labels using a linear regression model. We want to compare the KDE for our training set’s actual and predicted labels. We can do so with Seaborn invoking the method sns.kdeplot() (here’s the documentation).

Suppose we have the following result:

A KDE plot. Image by Author.

As we can see, the comparison between the actual and the predicted label is easy to do, since we are comparing two smoothed functions; in a case like that, our model is good because the curves are very similar.

In fact, what we expect from a “good” ML model are:

  1. The curves are similar to bell curves, as much as possible.
  2. The two curves are similar between them, as much as possible.

Now, let’s apply all the things we’ve learned so far here. We’ll use the famous “Ames Housing” dataset, which is perfect for our scopes.

This dataset has 80 features, but for simplicity, we’ll work with just a subset of them which are:

  • Overall Qual: it is the rating of the overall material and finish of the house on a scale from 1 (bad) to 10 (excellent).
  • Overall Cond: it is the rating of the overall condition of the house on a scale from 1 (bad) to 10 (excellent).
  • Gr Liv Area: it is the above-ground living area, measured in squared feet.
  • Total Bsmt SF: it is the total basement area, measured in squared feet.
  • SalePrice: it is the sale price, in USD $.

We’ll consider our SalePrice column as the target (label) variable, and the other columns as the features.

Exploratory Data Analysis EDA

Let’s import our data, create a subset with the mentioned features, and display some statistics:

import pandas as pd

# Define the columns
columns = ['Overall Qual', 'Overall Cond', 'Gr Liv Area',
'Total Bsmt SF', 'SalePrice']

# Create dataframe
df = pd.read_csv('http://jse.amstat.org/v19n3/decock/AmesHousing.txt',
sep='\t', usecols=columns)

# Show statistics
df.describe()

Statistics of the dataset. Image by Author.

An important observation here is that the mean values for all labels have a different range (the Overall Qual mean value is 6.09 while Gr Liv Area mean value is 1499.69). This tells us an important fact: we have to scale the features.

Data preparation

What does “features scaling” mean?

Scaling a feature implies that the feature range is scaled between 0 and 1 or between 1 and -1. There are two typical methods to scale the features:

  • Mean normalization: Mean normalization is a method of scaling numeric data so that it has a minimum value of zero and a maximum value of one and all the values are normalized around the mean value. Suppose c is a value reached by our feature; to scale around the mean (c′ is the new value of c after the normalization process):
The formula for the mean normalization, powered by embed-dot-fun by the Author.

Let’s see an example in Python:

import numpy as np

# Create a list of numbers
data = [1, 2, 3, 4, 5]

# Find min and max values
data_min = min(data)
data_max = max(data)

# Normalize the data
data_normalized = [(x - data_min) / (data_max - data_min) for x in data]

# Print the normalized data
print(f'normalized data: {data_normalized}')

>>>

normalized data: [0.0, 0.25, 0.5, 0.75, 1.0]

  • Standardization (or z-score normalization): This method transforms a variable so that it has a mean of zero and a standard deviation of one. The formula is the following (c′c’c′ is the new value of ccc after the normalization process):
The formula for the standardization, powered by embed-dot-fun by the Author.

Let’s see an example in Python:

import numpy as np

# Original data
data = [1, 2, 3, 4, 5]

# Calculate mean and standard deviation
mean = np.mean(data)
std = np.std(data)

# Standardize the data
data_standardized = [(x - mean) / std for x in data]

# Print the standardized data
print(f'standardized values: {data_standardized}')
print(f'mean of standardized values: {np.mean(data_standardized)}')
print(f'std. dev. of standardized values: {np.std(data_standardized): .2f}')

>>>

standardized values: [-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
mean of standardized values: 0.0
std. dev. of standardized values: 1.00

As we can see, the normalized data have a mean of 0 and a standard deviation of 1, as we wanted. The good news is that we can use the library scikit-learn to standardize the features, and we’re going to do it in a moment.

Features scaling is an important thing to do when working on an ML problem, for a simple reason:

  • If we perform exploratory data analysis with features that are not scaled, when calculating the mean values (for example, during the calculation of the coefficient of correlation) we’ll get numbers that are very different from each other. If we take a look at the statistics we’ve got above when we’ve invoked the df.describe() method, we can see that, for each column, we get a very different value of the mean. If we scale or normalize the features, instead, we’ll get 0s, 1s, and -1s: and this will help us mathematically.

Now, this dataset has some NaN values. We won’t show it for brevity (try it on your own), but we’ll remove them. Also, we’ll calculate the correlation matrix:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Drop NaNs from dataframe
df = df.dropna(axis=0)

# Apply mask
mask = np.triu(np.ones_like(df.corr()))

# Heat map for correlation coefficient
sns.heatmap(df.corr(), annot=True, fmt="0.1", mask=mask)

The correlation matrix for our data frame. Image by Author.

So, with np.triu(np.ones_like(df.corr())) we have created a mask that it’s useful to display a triangular correlation matrix, which is more readable (especially when we have much more features than in this case).

So, there is a moderate correlation (0.6) between Total Bsmt SF and SalePrice, quite a high correlation (0.7) between Gr Liv Area and SalePrice, and a high correlation (0.8) between Overall Qual and SalePrice. Also, there is a moderate correlation (0.6) between Overall Qual and Gr Liv Area, and one of 0.5 between Overall Qual and Total Bsmt SF.

Here there is no multicollinearity: no features are highly correlated with each other (so our features satisfy assumption n°5 listed above). If we had found some highly correlated features, we could have dropped one of them, because two highly correlated features carry essentially the same information about the label (this applies to ML models in general: if two features are highly correlated, we can drop one of the two).
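As a sketch of how that check could be automated (this is not part of the article’s pipeline; the 0.9 threshold and the helper function below are illustrative assumptions), we can scan the upper triangle of the correlation matrix and collect one feature from every pair whose absolute correlation exceeds the threshold:

import numpy as np

def highly_correlated_features(df, threshold=0.9):
    # Absolute values of the correlation matrix
    corr = df.corr().abs()
    # Keep only the upper triangle, excluding the diagonal
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # One feature from each pair whose correlation exceeds the threshold
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# For our data frame this should print an empty list, since the heat map
# shows no pair of columns with a correlation above 0.9
print(highly_correlated_features(df))

Any column returned by such a helper would be a candidate to drop before fitting the model.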

Finally, we subdivide the data frame df into X (the features) and y (the label) and scale the features:

from sklearn.preprocessing import StandardScaler

# Define the features
X = df.iloc[:,:-1]

# Define the label
y = df.iloc[:,-1]

# Scale the features
scaler = StandardScaler() # Instantiate the scaler
X = scaler.fit_transform(X) # Fit the scaler on the features and transform them

Fitting the linear regression model

Now we split the data into a training set and a test set, fit the Linear Regression model on the training set, and calculate R² for both sets:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the LR model
reg = LinearRegression().fit(X_train, y_train)

# Calculate R^2
coeff_det_train = reg.score(X_train, y_train)
coeff_det_test = reg.score(X_test, y_test)

# Print metrics
print(f" R^2 for training set: {coeff_det_train}")
print(f" R^2 for test set: {coeff_det_test}")

>>>

R^2 for training set: 0.77
R^2 for test set: 0.73

Notes:
1) Your results can be slightly different due to the random
nature of the train/test split.

2) Here we can see generalization in action:
we fitted the Linear Regression model to the training set with
reg = LinearRegression().fit(X_train, y_train).
Then, we calculated R^2 on the training and test sets with:
coeff_det_train = reg.score(X_train, y_train)
coeff_det_test = reg.score(X_test, y_test)

In other words: we don't fit the model to the test set.
We fit the model to the training set, and we calculate the scores
and predictions (see the next snippet of code with the KDE) on both sets
to see how our model generalizes to new, unseen data
(the data of the test set).

So we get an R² of 0.77 on the training set and 0.73 on the test set, which are quite good values, suggesting the Linear Regression model is a good one for this ML problem.
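Since we’ve already imported sklearn’s metrics module, here is a minimal sketch of a couple of extra error metrics computed on the test set (it reuses reg, X_test, and y_test from the snippet above; the exact numbers depend on your random split):

import numpy as np

# Predictions on the test set
y_test_pred = reg.predict(X_test)

# Mean Absolute Error and Root Mean Squared Error
mae = metrics.mean_absolute_error(y_test, y_test_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))

print(f"MAE on the test set: {mae:.2f}")
print(f"RMSE on the test set: {rmse:.2f}")

Both metrics are expressed in the same unit as SalePrice, which often makes them easier to communicate than R².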

Let’s see the KDE plots for both sets:

# Calculate predictions
y_train_pred = reg.predict(X_train) # train set
y_test_pred = reg.predict(X_test) # test set

# KDE train set
ax = sns.kdeplot(y_train, color='r', label='Actual Values') #actual values
sns.kdeplot(y_train_pred, color='b', label='Predicted Values', ax=ax) #predicted values

# Show title
plt.title('Actual vs Predicted values')
# Show legend
plt.legend()

KDE for the training set. Image by Author.

# KDE test set
ax = sns.kdeplot(y_test, color='r', label='Actual Values') #actual values
sns.kdeplot(y_test_pred, color='b', label='Predicted Values', ax=ax) #predicted values

# Show title
plt.title('Actual vs Predicted values')
# Show legend
plt.legend()

KDE for the test set. Image by Author.

Beyond the fact that we’ve obtained an R² of 0.73 on the test set, which is good (but remember: the higher, the better), this plot shows that the linear model is indeed a good model for this ML problem. This is why I love the KDE plot: it is a very powerful tool, as we can see.

Also, this shows why we shouldn’t rely on just one method to validate an ML model: a combination of one analytical method and one graphical method generally gives us the right insights to decide whether to change our ML model or not. In this case, the Linear Regression model is well suited to making predictions.

I hope you’ll find this article useful. I know it’s very long, but I wanted to give you all the knowledge you need on this topic, so that you can return to it whenever you need it most.

Some of the things we’ve discussed here are general topics, while others are specific to the Linear Regression model. Let’s summarize them:

  • The definition of regression is, of course, a general definition.
  • Correlation is generally tied to the Linear model. In fact, as we said before, correlation measures the tendency of two variables to be linearly related. There are also ways to define non-linear correlation, but we leave them for other articles (for now, just know that they exist).
  • We’ve discussed the Simple and the Multiple Linear Regression models with their assumptions (the assumptions apply to both models).
  • When talking about how to find the line that best fits the data, we’ve referred to the article “Mastering the Art of Regression Analysis: 5 Key Metrics Every Data Scientist Should Know”. There, you’ll find all the metrics you need to know to carry out a regression analysis. So, this is a general topic that applies to any regression model, including the Linear one, of course.
  • We’ve shown three methods to validate our ML models: 1) the residual analysis plot, which applies to Linear Regression models; 2) the actual vs. predicted values plot, which can be applied to Linear and Polynomial models; 3) the KDE plot, which can be applied to any ML model, even in the case of a classification problem.

Finally, I want to remind you that we’ve spent a couple of lines stressing the fact that we can avoid using p-values to test the hypotheses of our ML models. I’ll be writing an article on this topic very soon but, as you can see, the KDE plot has shown us that our Linear model is good for this ML problem, and we haven’t had to validate our hypotheses with p-values.

So far in this article, we’ve used several plots. You can clone the repo I’ve created so that you can import the code and use it to easily plot these graphs. If you have any difficulties, you’ll find usage examples in my projects on GitHub. If you have any other difficulties, you can contact me and I’ll help you.

  • Subscribe to my newsletter to get more on Python & Data Science.
  • Found it useful? Buy me a Ko-fi.
  • Liked the article? Join Medium through my referral link: unlock all the content on Medium for 5$/month (with no additional fee).
  • Find/contact me here.
