
Linear Regression In Depth (Part 2) | by Dr. Roi Yehoshua | Apr, 2023


In the first part of this article we formally defined the linear regression problem and showed how to solve simple linear regression problems, where the data set contains only one feature. In the second part of the article, we will discuss multiple linear regression problems, where the data set may contain any number of features.

We will first generalize the closed-form solution we have found for simple linear regression to any number of features. Then we will suggest an alternative approach for solving linear regression problems that is based on gradient descent, and discuss the pros and cons of this approach vs. using the closed-form solution. In addition, we will explore the classes in Scikit-Learn that implement both approaches, and demonstrate how to use them on a real-world data set.

Recall that in regression problems we are given a set of n labeled examples: D = {(x₁, y₁), (x₂, y₂), … , (xₙ, yₙ)}, where xᵢ is an m-dimensional vector containing the features of example i, and yᵢ is a real value that represents the label of that example.

In linear regression problems, we assume that there is a linear relationship between the feature vector x and the label y, so our model hypothesis takes the following form:

The linear regression model hypothesis:

h(x) = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ = wᵗx

where w = (w₀, w₁, …, wₘ)ᵗ is the vector of model parameters, and x is extended with a constant component x₀ = 1 that multiplies the intercept w₀.

Our goal is to find the parameters w of this model that will minimize the sum of squared residuals:

The cost function in ordinary least squares (OLS) regression:

J(w) = Σᵢ (wᵗxᵢ − yᵢ)²

where the sum runs over all n training examples.

In the previous part of the article, we showed how to find the optimal w for the case of m = 1 using the normal equations. We will now extend these equations to any number of features m.

To simplify the derivation of the normal equations for the general case, we first define a matrix X that contains the values of all the features in the data set, including the intercept terms:

The design matrix: the i-th row of X is (1, xᵢ₁, xᵢ₂, …, xᵢₘ), i.e., the feature values of sample i preceded by a constant 1 for the intercept term.

This matrix is called the design matrix. Each row in the design matrix represents an individual sample, and the columns represent the explanatory variables. The dimensions of the matrix are n × (m + 1), where n is the number of samples and m is the number of features.

In addition, we define the vector y as an n-dimensional vector that contains all the target values:

The target vector: y = (y₁, y₂, …, yₙ)ᵗ

These definitions allow us to write the ordinary least squares (OLS) cost function in the following matrix form:

The OLS cost function in matrix form:

J(w) = ‖Xw − y‖² = (Xw − y)ᵗ(Xw − y)

Proof:

We first note that the i-th component of the vector Xw is wᵗxᵢ, the model's prediction for sample i, hence the vector of the residuals is simply Xw − y.

The dot product of a vector u with itself, uᵗu, is just the sum of all its components squared, therefore we have:

(Xw − y)ᵗ(Xw − y) = Σᵢ (wᵗxᵢ − yᵢ)² = J(w)

As was the case in simple linear regression, the function J(w) is convex, hence it has a single local minimum, which is also the global minimum. In order to find this global minimum, we compute the gradient of J(w) with respect to w and set it to zero.

The gradient of J(w) with respect to w is:

∇J(w) = ∇[(Xw − y)ᵗ(Xw − y)] = ∇[wᵗXᵗXw − 2yᵗXw + yᵗy] = 2XᵗXw − 2Xᵗy

In the last two steps of the derivation above, we used basic rules of derivatives of functions of vectors (in particular, ∇(aᵗw) = a and ∇(wᵗAw) = 2Aw for a symmetric matrix A), which will be explained in more detail in a future post.

We now set this gradient to zero in order to obtain the normal equations:

XᵗXw = Xᵗy

Therefore, the optimal w* that minimizes the least squares cost function is:

w* = (XᵗX)⁻¹Xᵗy

The closed-form solution to ordinary least squares

Note that we assumed here that the columns of X are linearly independent (i.e., X has a full column rank), otherwise XᵗX is not invertible, and there is no unique solution for w*.

When the columns of X are linearly dependent, we call this phenomenon multicollinearity. Mathematically, a set of variables is perfectly multicollinear if there exist constants λ₀, λ₁, …, λₘ, not all zero, such that for all samples i:

λ₀ + λ₁xᵢ₁ + λ₂xᵢ₂ + … + λₘxᵢₘ = 0

Perfect multicollinearity

where xᵢₖ is the value of feature k in sample i.

In practice, perfect multicollinearity is rare (e.g., it can be caused by accidentally duplicating a variable in the data). However, even lesser degrees of multicollinearity, where two or more features are highly correlated with each other (but not perfectly correlated), can cause issues both when fitting the model (the coefficients become very sensitive to small changes in the data) and when interpreting the results (it becomes hard to identify which features have the most impact on the model’s predictions).

The interested reader can find more info about the multicollinearity problem and how to handle it in this Wikipedia article.
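
To see concretely why perfect multicollinearity breaks the closed-form solution, here is a small self-contained NumPy sketch (an illustrative aside, not part of the original article) in which one feature is an exact copy of another, so the columns of X are linearly dependent and XᵗX is singular:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1.copy()                      # x2 is an exact copy of x1 (perfect multicollinearity)
X = np.column_stack((np.ones(100), x1, x2))

print(np.linalg.matrix_rank(X))     # rank is 2, not 3: the columns are linearly dependent
print(np.linalg.det(X.T @ X))       # numerically close to 0, so X.T @ X cannot be reliably inverted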

To demonstrate the usage of the closed-form solution, let’s build a linear regression model for the California housing data set, available from the sklearn.datasets module. The goal in this data set is to predict the median house value of a given district (house block) in California, based on 8 different features of that district, such as the median income or the average number of rooms per household.

We first import the required libraries, and initialize the random seed in order to have reproducible results:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)

Then, we fetch the data set:

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names

To explore the data set, we merge the features (X) and the labels (y) into a Pandas DataFrame, and display the first rows from the table:

mat = np.column_stack((X, y))
df = pd.DataFrame(mat, columns=np.append(feature_names, 'MedianValue'))
df.head()
The first five rows from the California housing dataset

We can further investigate the data set by calling the DataFrame’s info() method, which provides information about the type of the columns and whether they contain any missing values:

df.info()

Luckily, this data set contains only numerical features and has no missing values. Therefore, no data preprocessing is required here (the closed-form solution does not require normalization of the data).

We now split the data into 80% training set and 20% test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Let’s now write a general function to find the optimal w* for any given data set, using the closed-form solution we have found earlier:

def closed_form_solution(X, y):
    w = np.linalg.inv(X.T @ X) @ X.T @ y
    return w

We can implement the closed-form solution in a single line of code!
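
As a side note (an addition, not from the original article): explicitly inverting XᵗX can be numerically unstable when the features are nearly collinear. The function np.linalg.lstsq solves the same least squares problem without forming the inverse, so a more robust variant of the function above could look like this (the function name is my own):

def closed_form_solution_lstsq(X, y):
    # Solves min_w ||Xw - y||^2 directly, without explicitly forming (X^T X)^-1
    w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return w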

Before we can use this function to find the optimal w, we need to append a column of 1s to the matrix X_train in order to represent the intercept term. This can be easily done with the function np.column_stack():

X_train_b = np.column_stack((np.ones(len(X_train)), X_train))

We are now ready to fit our model to the training set:

w = closed_form_solution(X_train_b, y_train)
w

The optimal w is:

array([-3.68585691e+01,  4.33333407e-01,  9.29324337e-03, -9.86433739e-02,
5.93215487e-01, -7.56192502e-06, -4.74516383e-03, -4.21449336e-01,
-4.34166041e-01])

This vector has 9 components: one for each of the 8 features in the data set, and an extra weight for the intercept (w₀).
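
To make this vector easier to read, we can pair each weight with its feature name (a small aside that reuses the w and feature_names variables defined above):

for name, weight in zip(['intercept'] + list(feature_names), w):
    print(f'{name}: {weight:.6f}')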

Let’s now evaluate the model on the training and the test sets. It is important to evaluate your model on both of them, because a large discrepancy between the training and the test scores may indicate that your model is overfitting.

We start by finding the R² score on the training set. To that end, we first get the predictions of the model on the training examples by multiplying the matrix X_train_b by the vector w:

y_train_pred = X_train_b @ w

We now use the r2_score() function from sklearn.metrics to find the R² score on the training set:

from sklearn.metrics import r2_score

train_score = r2_score(y_train, y_train_pred)
print(f'R2 score (train): {train_score:.5f}')

The score we get is:

R2 score (train): 0.60890

Let’s do the same on the test set. We need to append a column of 1s to X_test before we get the model’s predictions on it:

X_test_b = np.column_stack((np.ones(len(X_test)), X_test))
y_test_pred = X_test_b @ w

test_score = r2_score(y_test, y_test_pred)
print(f'R2 score (test): {test_score:.5f}')

The score we get is:

R2 score (test): 0.59432

The score is not high, which indicates that the relationship between the features and the label might not be linear. In our next article on polynomial regression, we will try to fit a non-linear regression model to this data set, and see if we can get better results.

Exercise

Let’s say we accidentally duplicated every point in the data set and then ran linear regression again. How will this affect the weights of the model?
Hint: Think about what will happen to the design matrix X and the label vector y, and how these changes will affect the normal equations.

The solution can be found at the end of this article.

Scikit-Learn provides a class named LinearRegression that also implements the closed-form solution to the ordinary least squares problem.

By default, this class automatically adds a column of 1s to the design matrix, so you do not need to add it manually as we did earlier (unless you set the parameter fit_intercept to False in the constructor).

Let’s use this class to fit a regression model to the same data set:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

The fitted parameter vector w is stored in two attributes of this class:

  • coef_ is an array that contains all the weights except for the intercept
  • intercept_ is the intercept term (w₀)

Let’s print them:

print(reg.intercept_)
print(reg.coef_)

The output is:

-36.858569106801234
[ 4.33333407e-01 9.29324337e-03 -9.86433739e-02 5.93215487e-01
-7.56192502e-06 -4.74516383e-03 -4.21449336e-01 -4.34166041e-01]

We get exactly the same coefficients as we did with the computation in NumPy.

The score() method of this class returns the R² score of the model. It only requires the matrix X and the vector y of the data set you want to get the score on (so no need to compute the model’s predictions). For example, we can get the R² score on the training and the test sets as follows:

train_score = reg.score(X_train, y_train)
print(f'R2 score (train): {train_score:.5f}')

test_score = reg.score(X_test, y_test)
print(f'R2 score (test): {test_score:.5f}')

R2 score (train): 0.60890
R2 score (test): 0.59432

As expected, we get the same R² scores as before.

In addition to evaluating the overall performance of the model, we often want to investigate the behavior of our regression errors. For example, are the errors normally distributed around 0 or are they skewed? Are there any inputs for which our model has particularly high prediction errors?

Answers to these questions will help us find the source of these errors. For example, if the errors are not normally distributed around 0, this might indicate that the linear regression model is not suitable for our data set and we need to try other regression models (e.g., polynomial regression). Or, if our model has particularly high prediction errors on some samples, they might be outliers and we need to investigate where they come from.

A plot that can help you answer these questions is a plot of the residuals vs. the predicted values of the model. This plot can show you the behavior of the errors across all the data samples.

We will use the following function to create this plot:

def plot_residuals(y_train_pred, y_train, y_test_pred, y_test):
    plt.scatter(y_train_pred, y_train_pred - y_train, s=2, marker='o', c='b', label='Training')
    plt.scatter(y_test_pred, y_test_pred - y_test, s=2, marker='s', c='m', label='Test')

    xmin = min(y_train_pred.min(), y_test_pred.min())
    xmax = max(y_train_pred.max(), y_test_pred.max())
    plt.hlines(y=0, xmin=xmin, xmax=xmax, color='black')

    plt.xlim(xmin, xmax)
    plt.xlabel('Predicted values')
    plt.ylabel('Residuals')
    plt.legend()

Let’s call this function to show the residuals both on the training and the test sets:

plot_residuals(y_train_pred, y_train, y_test_pred, y_test)

We can see that most of the errors are symmetrically distributed around 0, but there are some outliers on the far ends of the input range, which may require further investigation.
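
If we want to investigate those outliers further, a short snippet like the following (an aside, not part of the original article) lists the five test samples with the largest absolute residuals:

residuals = y_test_pred - y_test
worst = np.argsort(np.abs(residuals))[-5:]           # indices of the 5 largest absolute errors
for i in worst:
    print(f'predicted: {y_test_pred[i]:.3f}, actual: {y_test[i]:.3f}')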

Exercise

Download the Students Marks dataset from Kaggle. Build a linear regression model to predict a student’s mark based on their study time and the number of courses they took. Compute the RMSE and R² score of the model both on the training and the test sets. Plot the residuals vs. the predicted values. What can you learn from this plot about the data set?

Although the closed-form solution gives us a direct way to find the optimal parameters of the regression model, it suffers from a few drawbacks:

  1. The closed-form solution is computationally inefficient when we have a large number of features, since it requires computing the inverse of XᵗX, which is an m × m matrix (where m is the number of features). Computing the inverse of a matrix has a runtime complexity of O(m³) under most implementations.
  2. It requires having the entire design matrix X in memory, which is not always feasible if we have a very large data set.
  3. It does not support online (incremental) learning, since any change to the design matrix X requires re-computing the inverse of XᵗX.

Fortunately, there is an alternative approach for finding the optimal w: gradient descent. Gradient descent is an iterative method for finding a minimum of a function, in which we take small steps in the opposite direction of the gradient in order to get closer to the minimum:

Gradient descent

In order to use gradient descent to find the minimum of the least squares cost, we need to compute the partial derivatives of J(w) with respect to each one of the weights.

To simplify the mathematical derivation, we will assume that we have only one training example (x, y) in the data set. The extension to n training samples is straightforward, since the derivative of the sum of functions is just the sum of their derivatives.

The partial derivative of J(w) with respect to any of the weights wⱼ is:

∂J(w)/∂wⱼ = ∂/∂wⱼ (wᵗx − y)² = 2(wᵗx − y)xⱼ

Therefore, the gradient descent update rule is:

wⱼ ← wⱼ − α ∂J(w)/∂wⱼ = wⱼ − 2α(wᵗx − y)xⱼ

The gradient descent update rule

where α is a learning rate that controls the step size (0 < α < 1).

Instead of updating each component of w separately, we can update the entire vector w in one step:

w ← w − 2α(wᵗx − y)x

The gradient descent update rule in vector form
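
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression (illustrative only; the function name, the default values of alpha and n_iter, and the division by n, which keeps the step size independent of the data set size, are my own choices, not the article's):

def batch_gradient_descent(X, y, alpha=0.1, n_iter=1000):
    # X is the design matrix (including the column of 1s), y is the vector of targets.
    # The features are assumed to be normalized; otherwise gradient descent may diverge.
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        gradient = 2 / n * X.T @ (X @ w - y)   # gradient of the mean squared error
        w -= alpha * gradient
    return w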

Gradient descent can be applied in one of the following modes:

  1. Batch gradient descent — the weights are updated after we compute the error on the entire training set.
  2. Stochastic gradient descent (SGD) — a gradient descent step is performed after every training example. In this case, the gradient descent update rule takes the following form:

w ← w − 2α(wᵗxᵢ − yᵢ)xᵢ

The SGD update rule, where (xᵢ, yᵢ) is the training example processed at the current step

SGD typically converges faster than batch gradient descent as it makes progress after each example, and it also supports online learning since it can process new data points one at a time. On the other hand, SGD is less stable than batch gradient descent, and its convergence to the global optimum is not guaranteed (although in practice it gets very close to the optimum).
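
For comparison, here is a minimal sketch of the SGD variant (again illustrative, with assumed hyperparameters), which shuffles the examples at every epoch and performs one weight update per training example:

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_epochs):
        for i in rng.permutation(n):           # process the training examples in random order
            error = X[i] @ w - y[i]
            w -= 2 * alpha * error * X[i]      # one update per training example
    return w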

Note that whenever you use gradient descent, you must make sure that your data set is normalized (otherwise gradient descent may take steps of different sizes in different directions, which will make it unstable).

The class SGDRegressor in Scikit-Learn implements the SGD approach for fitting a linear regression model. The specific type of regression model is determined by its loss parameter. By default, its value is ‘squared_error’, which means fitting an ordinary least squares model. Other options include ‘huber’ and ‘epsilon_insensitive’ (the loss function used in Support Vector Regression), which will be discussed in future articles.

In addition to the loss function, the constructor of this class has a few other important parameters:

  • penalty — the type of regularization to use (defaults to ‘l2’).
  • alpha — the regularization coefficient (defaults to 0.0001).
  • max_iter — the maximum number of epochs over the training data (defaults to 1000).
  • learning_rate — learning rate schedule for the weight updates (defaults to ‘invscaling’).
  • eta0 — the initial learning rate used (defaults to 0.01).
  • early_stopping — whether to stop the training when the validation score is not improving (defaults to False).
  • validation_fraction — the proportion of the training set to set aside for validation (defaults to 0.1).

Since we need to normalize our data before using SGD, we will create a pipeline that consists of two steps:

  1. A StandardScaler that normalizes the features by removing the mean and scaling them to unit variance.
  2. An SGDRegressor with its default settings.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('reg', SGDRegressor())
])

Let’s fit the pipeline to our training set:

pipeline.fit(X_train, y_train)

And now let’s evaluate the model on the training and test sets:

train_score = pipeline.score(X_train, y_train)
print(f'R2 score (train): {train_score:.5f}')

test_score = pipeline.score(X_test, y_test)
print(f'R2 score (test): {test_score:.5f}')

The scores we get are:

R2 score (train): -485.79460
R2 score (test): -4485.30424

These are very bad scores! What has just happened?

When you get such bad scores with gradient descent, it usually means that your learning rate is too high, which causes the algorithm to oscillate between the two sides of the minimum:

Oscillations in gradient descent due to a high learning rate

Let’s reduce the learning rate from 0.01 to 0.001 by changing the parameter eta0 of the SGDRegressor:

pipeline.set_params(reg__eta0=0.001)

Let’s refit the pipeline to the training set and reevaluate it:

pipeline.fit(X_train, y_train)
train_score = pipeline.score(X_train, y_train)
print(f'R2 score (train): {train_score:.5f}')

test_score = pipeline.score(X_test, y_test)
print(f'R2 score (test): {test_score:.5f}')

The scores we get this time are:

R2 score (train): 0.60188
R2 score (test): 0.58393

These scores are similar to the R² scores we obtained with the closed-form solution (they are slightly lower, since SGD gets close to the global minimum, but does not reach the minimum itself).
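
As a small aside, if we want to inspect the weights that SGD has learned, we can pull the fitted SGDRegressor out of the pipeline through its named_steps attribute (note that these coefficients live in the standardized feature space, so they are not directly comparable to the ones found by the closed-form solution):

sgd_reg = pipeline.named_steps['reg']
print(sgd_reg.intercept_)
print(sgd_reg.coef_)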

