
Multiple Linear Regression: A Deep Dive



Multiple Regression with Two Features (x1 and x2) (Image By Author)

Motivation

We, human beings, have been trying to create intelligent systems for a long time, because if we can automate a system, it makes our lives easier and can act as an assistant. Machine learning makes this happen. There are many machine learning algorithms for solving different problems. This article introduces a machine learning algorithm that can solve regression problems (prediction of continuous values) with multiple variables. Suppose you are running a real estate business. As the owner, you need a good idea of the prices of buildings, land, and so on to keep the business profitable, but it is difficult for one person to track prices across a wide range of areas. An efficient machine learning regression model can help a lot. Just imagine entering inputs such as location, size, and other relevant information into a system and having it show you the price automatically. Multiple linear regression can do exactly that. Isn’t it interesting?

I will explain the process of multiple linear regression and show you the implementation from scratch.

[N.B. — If you don’t have a clear concept of simple linear regression, I suggest you go through that article before diving into multiple linear regression.]

What is a Multiple Linear Regression Problem?

In simple linear regression, there is only one feature (independent variable). In multiple linear regression, there is more than one feature. Both predict continuous values.

Simple Linear Regression Problem (Image By Author)

Look at the table. It shows the price of a product against its weight. With linear regression, if we can fit a line to the given values, we can easily predict the price by inserting the weight of a product into the model. The same process applies to multiple linear regression; instead of one feature, there will be multiple features (independent variables).

Multiple Linear Regression Problem (Image by Author)

The above example has two features, “Age” and “Income”. We have to predict the monthly “Expenditure” from these two features. It’s an example of multiple linear regression.

Multiple linear regression is not limited to only two features. It may have more than two features.

When do We Use Multiple Linear Regression?

Simple or univariate linear regression works only for predicting continuous values from one independent feature.

The process of simple linear regression doesn’t work for multiple features. We must apply multiple linear regression when we need to predict continuous values from more than one feature (variable). It is worth mentioning that the relationship between the features and the target must be approximately linear; non-linear data is not suitable for linear regression.

Multiple Linear Regression in Detail

Let’s try to represent multiple linear regression visually. I have kept the model simple with only two independent variables (features).

Multiple Linear Regression with Two Features (x1 and x2) (Image By Author)

x1 and x2 are the two features (independent variables). Suppose x1 = 4 and x2 = 5. We get point A if we project these values onto the x1-x2 plane. In the multiple regression model, we need to fit a regression plane to our dataset, as shown in the diagram. Drawing a vertical line from point A intersects the regression plane at a certain point. We get the predicted value by drawing a horizontal line from that intersection point to the y-axis; the predicted value is where this line meets the y-axis.

[N.B. — I have visualized multiple linear regression with only two features for demonstration purposes, because it is not possible to visualize more than two features at once. With more features, the process is the same.]

Let’s Try to Dig Deeper

In simple linear regression, we predict a dependent value based on an independent value. [Read the previous article for a more detailed explanation of simple linear regression.]

For example, a simple linear regression equation is yi = m·xi + c. Here, ‘m’ is the slope of the regression line, and ‘c’ is the y-intercept.

In the case of more than one independent variable, we need to extend our regression equation as follows.
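Written out with the notation defined below, the extended equation is:

y = m_0 + m_1 x_1 + m_2 x_2 + \dots + m_n x_n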

Where,

y indicates the dependent variable (predicted regression value).

x1, x2, …, xn are the different independent variables.

m1, m2, m3, …, mn symbolize the slope coefficients of the different independent variables.

m0 is the y-intercept (the constant term).

Now, we will take the independent features from the dataset. Our main challenge is to find the values of the coefficients (m0, m1, …, mn).

Let’s consider the multiple linear regression problem dataset shown in the first section.

It is easy to predict the ‘Expenditure’ of individuals if we have the optimum values of m0, m1 and m2. We can easily get the ‘Expenditure’ by plugging in the Age and Income values.

But there is no straightforward way to find the optimum value of the coefficients. To do so, we need to minimize the cost (loss) function with the help of Gradient Descent.

A Bit of Detail about Gradient Descent

Before diving into gradient descent, we should have a clear idea about the cost function. The cost function is nothing but an error function: it measures how far the model’s predictions are from the actual values. We will use the following error function as our cost function.
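With n training examples, the mean squared error cost is:

J = (1/n) \sum_{i=1}^{n} (\bar{y}_i - y_i)^2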

Here, y̅i is the predicted value, and yi is the actual value.

Gradient descent is an optimization algorithm. We use this algorithm to minimize the cost function by optimizing the coefficients of the regression equations.

Gradient Descent (Image By Author)

The red curve is the derivative of the cost function. To optimize a coefficient, we first assign it a random value. Then we calculate the derivative of the cost function. To keep things simple, we will work with the simple linear regression equation.

1. Let’s replace y̅i with (m·xi + c). This expresses the cost in terms of m and c, as follows.
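Keeping the 1/n factor from above, the substituted cost is:

J(m, c) = (1/n) \sum_i (m x_i + c - y_i)^2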

2. Now we take the partial derivatives of the cost with respect to m and c.
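Differentiating this cost with respect to each parameter gives:

\partial J/\partial m = (2/n) \sum_i (m x_i + c - y_i) \, x_i

\partial J/\partial c = (2/n) \sum_i (m x_i + c - y_i)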

[N.B. — You may find some cost functions that are multiplied by 1/2n instead of 1/n. It is not a big deal: if you use 1/2n, the 2 that comes out of the derivative cancels it, and the factor becomes 1/n instead of 2/n. In the implementation section, we also use 1/2n.]

3. Now, we will update the values of m and c iteratively with the following equations.
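Each iteration moves both parameters a small step against their gradients:

m := m - \alpha \, \partial J/\partial m

c := c - \alpha \, \partial J/\partial c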

α is the learning rate; it indicates how big a step we take in each iteration to minimize the cost function (shown in the figure). The iterations continue until the cost function is sufficiently minimized.

For multiple linear regression, the whole process is the same. Let’s again consider the equation for multiple linear regression.

If we calculate the derivative for each coefficient as we did for the simple linear equation (shown above), we get a common form.
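Summing over the training examples i, the derivative for a slope coefficient m_j is:

\partial J/\partial m_j = (2/n) \sum_i (\bar{y}_i - y_i) \, x_{ij}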

Where j takes the values 1,2,…..,n, representing the features.

For m0, the derivative will be —
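Because no feature multiplies the intercept m_0, its derivative is simply:

\partial J/\partial m_0 = (2/n) \sum_i (\bar{y}_i - y_i)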

We will update all the coefficients simultaneously with the following formula.
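That is, for each slope coefficient:

m_j := m_j - \alpha \, (2/n) \sum_i (\bar{y}_i - y_i) \, x_{ij}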

And for m0, we will use the equation below.
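For the intercept term:

m_0 := m_0 - \alpha \, (2/n) \sum_i (\bar{y}_i - y_i)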

We will keep updating all the coefficients to fit the model and keep calculating the cost. Once the cost is sufficiently low, we stop updating the coefficients.

But updating the coefficients one by one in loops is computationally expensive and time-consuming. Vectorization makes the implementation much easier.

Vectorized Method of Linear Regression

Let’s consider the multiple linear regression again.
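For a single instance i, with the j-th feature value of that instance written as x_{ij}, the equation reads:

\bar{y}_i = m_0 x_{i0} + m_1 x_{i1} + m_2 x_{i2} + \dots + m_n x_{in}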

We have added a constant xi0 = 1 for convenience of calculation. It doesn’t change the previous equation. Let’s see the vectorized representation of the equation.

Vectorized Implementation of Linear Regression Equation (Image by Author)

Here, i = 1, …, z, where z is the total number of instances in the dataset. X holds all the feature values of the z instances.

In short, the vectorized equation is —
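Stacking the z instances into a z-by-(n+1) feature matrix X (including the column of ones) and the coefficients into a vector m:

\bar{y} = X m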

Now, the derivative of the vectorized cost function will be as follows (detailed explanation).
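Using the 1/2z version of the cost (the convention followed in the implementation below), the gradient with respect to the coefficient vector m is:

\nabla J = (1/z) \, X^T (X m - y)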

It’s time to update the weights with the formula given below.
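All the coefficients are updated in a single vectorized step:

m := m - \alpha \nabla J = m - (\alpha/z) \, X^T (X m - y)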

Yeah! We have completed the theoretical part. It’s time to turn the whole process into code.

Python Implementation from Scratch

It’s time to get our hands dirty with some hands-on coding. I will walk through it step by step.

[N.B. — We use the Boston House Price dataset for demonstration purposes. It is in the public domain. Download it from here.]

Importing libraries
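A minimal set of imports for the steps below (assuming NumPy, pandas, Matplotlib and scikit-learn are installed):

```python
import numpy as np                 # numerical arrays and linear algebra
import pandas as pd                # tabular data handling
import matplotlib.pyplot as plt    # plotting the loss curve
```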

Reading the dataset
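A sketch of loading the raw file; the file name housing.csv is an assumption, so point it to wherever you saved the download. The raw Boston housing file is whitespace-separated and has no header row:

```python
# The file name is an assumption; adjust the path to your downloaded copy.
df = pd.read_csv("housing.csv", header=None, sep=r"\s+")
print(df.head())
```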

We see there are no column names in the main dataset. In the next step, we will set the column names according to the documentation.

Setting Column Names
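The names below follow the dataset documentation:

```python
# Column names taken from the dataset's documentation.
df.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
              "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
```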

We have successfully added the column names to our DataFrame.

Throughout the article, our main focus is understanding the internal process of multiple linear regression. So, we will focus mainly on the implementation rather than on the effectiveness of the model. To keep the model simple, we will consider only the highly correlated features.

Let’s find the correlation with the target column ‘MEDV’
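One way to inspect this, using pandas:

```python
# Correlation of every column with the target 'MEDV', sorted for readability.
print(df.corr()["MEDV"].sort_values(ascending=False))
```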

For our convenience, we have picked three features, ‘RM’, ‘DIS’, and ‘B’, along with the target ‘MEDV’.

Normalize the features

Normalization puts all the features on a similar scale, which makes gradient descent converge more easily. So, we will normalize our features.
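One common choice is min-max scaling; this particular scheme is an assumption, as any comparable scaling would work here:

```python
features = ["RM", "DIS", "B"]
target = "MEDV"

# Min-max scaling: map every feature into the [0, 1] range.
X = (df[features] - df[features].min()) / (df[features].max() - df[features].min())
y = df[target]
```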

Splitting the dataset into test and train sets

We have to split the dataset into train and test sets for evaluation purposes. We will train the model on the training set and evaluate it on the test set.
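A sketch using scikit-learn’s train_test_split (the random_state value is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```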

Here, we have used 75% of the data for training and kept 25% for testing.

Gradient descent optimization function with vectorized implementation
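Here is a minimal sketch of such a function; the exact structure and hyperparameters are open to variation. It prepends the constant column x0 = 1, uses the 1/2n cost, and applies the vectorized gradient derived above:

```python
def gradient_descent(X, y, alpha=0.1, epochs=2000):
    """Vectorized batch gradient descent for linear regression.

    Returns the coefficients w (intercept first) and the list of losses.
    """
    X = np.c_[np.ones(len(X)), np.asarray(X)]     # add the constant column x0 = 1
    y = np.asarray(y)
    n = len(y)
    w = np.zeros(X.shape[1])                      # initial coefficients (zeros or random both work)
    losses = []
    for _ in range(epochs):
        error = X @ w - y                         # vectorized residuals for all rows
        losses.append((error @ error) / (2 * n))  # 1/2n cost
        w -= alpha * (X.T @ error) / n            # update all coefficients at once
    return w, losses
```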

Let’s call the function to find the optimum values of the coefficients.
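The hyperparameters below are only illustrative; tune the learning rate and the number of iterations as needed:

```python
w, losses = gradient_descent(X_train, y_train, alpha=0.1, epochs=2000)
print(w)
```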

The function returns two values: the coefficients (w) and a list of loss values.

Visualization of the optimization over the iteration
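For example, plotting the recorded loss values against the iteration number:

```python
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.title("Cost over gradient descent iterations")
plt.show()
```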

Creating a prediction function for predicting new values
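A small helper that mirrors the training setup (it must add the same constant column):

```python
def predict(X, w):
    """Predict target values for feature matrix X using coefficients w."""
    X = np.c_[np.ones(len(X)), np.asarray(X)]   # same constant column as in training
    return X @ w
```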

Let’s predict the values for test features.
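Applying it to the held-out test features:

```python
y_pred_scratch = predict(X_test, w)
```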

Comparing our scratch model with the standard scikit-learn library

Before jumping to the comparison, we have to build a multiple linear regression prediction model with scikit-learn.

Linear regression with scikit-learn
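A standard fit-and-predict sketch with scikit-learn’s LinearRegression:

```python
from sklearn.linear_model import LinearRegression

sk_model = LinearRegression()
sk_model.fit(X_train, y_train)
y_pred_sklearn = sk_model.predict(X_test)
```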

Comparing the models in terms of MSE
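Computing the mean squared error of both models on the same test set:

```python
from sklearn.metrics import mean_squared_error

print("Scratch model MSE:", mean_squared_error(y_test, y_pred_scratch))
print("scikit-learn MSE :", mean_squared_error(y_test, y_pred_sklearn))
```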

It seems the two MSE values are very similar. Our model’s MSE is even slightly lower than the scikit-learn model’s.

How much do the two models’ predictions differ?
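One simple check is to look at the largest and the average absolute difference between the two sets of predictions:

```python
# Element-wise difference between the two models' predictions.
diff = np.abs(y_pred_scratch - y_pred_sklearn)
print("Max difference :", diff.max())
print("Mean difference:", diff.mean())
```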

The results show that our optimization works correctly and behaves much like the benchmark scikit-learn model.

