The Matrix Algebra of Linear Regression | by Rob Taylor, PhD | May, 2023


Photo by Mingwei Lim on Unsplash

Introduction

For most, simple linear regression is a common starting point for understanding model-based estimation and model comparison. Regardless of whether you’re taking an introductory statistics or data science course, you can bet your bottom dollar that linear regression will crop up at some point. And there’s a good reason for it.

Simple linear regression provides a natural extension of simple descriptive statistics by modeling the response variable as a linear combination of only a single predictor. This simplicity not only facilitates the interpretation of model parameters but also makes estimation via ordinary least squares (OLS) easier to grasp.

While most textbook introductions will provide a detailed mathematical treatment in addition to the broader conceptual elements, when it comes to actually implementing the model it’s almost never through first principles. No matter which language you’re using there’ll almost certainly be a convenience function that fits the model for you. And why wouldn’t you use it? You don’t have to do all those tedious calculations by hand! That’s definitely a plus; though I’m a firm believer that taking some time to familiarise yourself with a model’s statistical machinery is an essential part of becoming an effective analyst and data scientist.

In an earlier post, I provided a primer on linear algebra that touched on some fundamental principles and operations. In this post, we’re going to build on those concepts by looking under the hood and getting our hands dirty with the matrix operations that underpin linear regression.

Simple Linear Regression in Matrix Form

Most will be familiar with the standard regression formula that models a response variable Y as a linear combination of a single predictor X:

The linear regression equation (image by author).
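Written out for a single observation i, this is the standard form:

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) $$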

where here I’ve adopted the convention of assuming errors are normally distributed. From here we’ll build out the matrix representation by assigning the elements to vectors and matrices.

First, we’ll put all the responses in an n-dimensional vector called the response vector:

The response vector (image by author).

I’ve been quite explicit here by including the size of the vector — which is actually represented as a column matrix — just so we can keep track of things. However, it’s perfectly reasonable to use a lowercase boldface y if you wanted to.

Next, the predictor X is placed into an n × p matrix called the design matrix:

The design matrix (image by author).

where p denotes the number of columns and corresponds to the number of coefficients in the model. Note that the first column contains only ones — we’ll discuss this shortly — which accommodates the intercept, a constant. Therefore, the number of columns in the design matrix is always one greater than the number of predictors you have. In the example above, we only have a single predictor, meaning we need to estimate an intercept and a slope; so, p = 2 in this case.

The regression coefficients are also placed into a p × 1 vector called the parameter vector:

The parameter vector (image by author).

Again, p denotes the number of parameters, though here it gives the number of rows, whereas in the design matrix p is the column dimension. This arrangement is important because we will need to do some matrix multiplication with these two objects to compute the linear predictor.

However, before we do that, there is one last thing to do — place all the error terms into an n × 1 vector:

The error vector (image by author).

With all of these in place, we can now express the simple linear regression model using matrix notation like so:

The linear regression model in matrix form (image by author).
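For the single-predictor case (so p = 2), this compact statement is:

$$ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} $$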

The Linear Predictor

In words, the matrix formulation of the linear regression model is the product of two matrices X and β plus an error vector. The product of X and β is an n × 1 matrix called the linear predictor, which I’ll denote here:

The linear predictor vector (image by author).

Now, matrix multiplication works a little differently than you might expect. I covered this in my primer on linear algebra — so if you haven’t checked it out you can find it here — but I’ll quickly run through the details now.

If we have some m × q matrix A and another q × r matrix B, then the product is an m × r matrix C (note how the q dimension drops out from the result). The reason for this change in size is because the element located at row i and column j in C is the dot product of the ith row in A and the jth column in B:

The dot product (image by author).
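In symbols, with k running over the shared dimension q:

$$ c_{ij} = \sum_{k=1}^{q} a_{ik}\, b_{kj} $$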

So, because the dot product takes the sum across the q dimension, this drops out from the resulting matrix.

For the simple linear regression case, the product is between an n × p matrix X and a p × 1 matrix β, which therefore results in an n × 1 matrix η. Following on from above, the (i, j) element in η is computed using the following dot product:

The linear predictor (image by author).

Here the sum is taken over p, which we know is the number of coefficients in the model, and so p = 2.

If we then substitute this dot product into the linear predictor vector, along with the values from the first column of the design matrix, we get the following (because j = 1, I’m going to drop that subscript to declutter the notation a little):

The linear predictor in full (image by author).

All of this reduces to something very familiar indeed: each element in η is just our linear regression equation applied to each value of X! Hopefully, you can see why the column of ones is included in the design matrix. This ensures that the intercept is added to each observation.
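If you’d like to see this in action, here’s a minimal numpy sketch; the data and coefficients below are made up purely for illustration:

```python
import numpy as np

# Made-up data and coefficients, purely for illustration.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])      # single predictor, n = 5
X = np.column_stack([np.ones_like(x), x])    # design matrix: column of ones + predictor
beta = np.array([1.5, 0.8])                  # intercept and slope

eta = X @ beta                               # linear predictor, shape (5,)
print(eta)                                   # -> [1.9 2.3 2.7 3.1 3.5]
print(1.5 + 0.8 * x)                         # identical: the ones column carries the intercept
```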

Model Errors

There are three critical assumptions relating to the error terms — or perturbations — around the linear predictor that underpin the Gauss-Markov theorem. The first is that the conditional mean of the errors is zero, implying that the average of the errors should not depend on any particular value of X. This is called the zero conditional mean assumption:

The zero conditional mean assumption (image by author).

Related to this is the homoscedasticity assumption, which states that the variance of the errors should not be affected by the values of the independent variables. That is, the spread of the errors should not depend on any information contained within the design matrix:

The homoscedasticity assumption (image by author).

The final assumption is the no autocorrelation assumption, which requires that the error terms are uncorrelated. Knowledge about one error term confers no information about another; that is, the errors do not co-vary:

No autocorrelation assumption (image by author).

Under these assumptions, the covariance matrix for the error terms is a scalar matrix and the errors are considered to be spherical:

The covariance matrix under the Gauss-Markov assumptions (image by author).
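Taken together, the three assumptions say:

$$ \mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}, \qquad \operatorname{Var}(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n $$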

Just a note before moving on: while I adopted the convention of normally distributed errors, the Gauss-Markov theorem does not require the error terms to be normal, nor does it require that they are independent and identically distributed; only that they have zero conditional mean, are homoscedastic, and are uncorrelated. This means the errors can have dependencies, but so long as those dependencies are not linear — which is what correlation measures — parameter estimation can proceed safely.

Parameter Estimation via Ordinary Least Squares

When fitting a linear regression model to data, the goal is to estimate the unknown model coefficients contained in β. The usual approach to finding the best values for these unknowns is to satisfy the least squares criterion, which means minimizing the total squared error between the linear predictor and the response. I’m going to step through how this is implemented for the simple linear regression model, though I will move through this section fairly quickly.

In matrix form, the vector of errors, or residuals, is defined as follows:

Model error, or residuals (image by author).

where the hat on top of β denotes the estimated coefficients. The sum of the squared residuals can then be written as the dot product of the error vector with itself:

The sum squared error (image by author).

where T denotes the transpose operator¹. To derive the least squares solution it’s convenient to expand this product in full as follows:

Expanding the dot product of the error vector (image by author).
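Expanded, the sum of squared errors is:

$$ \mathbf{e}^\top\mathbf{e} = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{Y}^\top\mathbf{Y} - 2\hat{\boldsymbol{\beta}}^\top\mathbf{X}^\top\mathbf{Y} + \hat{\boldsymbol{\beta}}^\top\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} $$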

The idea, then, is to find the parameters that minimize this value. To do so, we need to take the derivative of the above with respect to the vector β and set this to zero:

The first derivative of the sum of squared error (image by author).
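Setting the gradient to zero gives:

$$ \frac{\partial\, \mathbf{e}^\top\mathbf{e}}{\partial \hat{\boldsymbol{\beta}}} = -2\mathbf{X}^\top\mathbf{Y} + 2\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0} $$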

from which the normal equation can be derived:

The normal equation (image by author).
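That is:

$$ \mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{Y} $$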

From here, we can find a solution for our unknown parameters by multiplying each side of the normal equation by the inverse of XᵀX. I cover matrix inversion in this post, but in brief: a matrix A is invertible if there exists a matrix B such that their product returns the identity matrix, I.

For a simple linear regression model, XᵀX is square with a size of 2 × 2, though more generally it will be a p × p matrix. We then need to find another 2 × 2 matrix that is the multiplicative inverse of XᵀX. If such a matrix does not exist then the equation cannot be solved, though if XᵀX is indeed invertible, then we can obtain the parameter vector b as follows:

The estimation equation for linear regression (image by author).
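In symbols:

$$ \mathbf{b} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} $$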

Here you can see why it is necessary that the design matrix is not rank deficient. If there are linear dependencies between the columns then XᵀX cannot be inverted, meaning a unique solution does not exist; this is exactly what happens under perfect multicollinearity.
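To make this concrete, here’s a short numpy sketch on simulated data (the true coefficients below are arbitrary). In practice you’d typically reach for np.linalg.lstsq or a packaged regression routine, not least because it copes with rank-deficient design matrices:

```python
import numpy as np

# Simulated data with arbitrary "true" coefficients, purely for illustration.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x])         # design matrix with intercept column

# Closed-form OLS via the normal equations: solve (X'X) b = X'Y rather than
# forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)                                     # roughly [2.0, 0.5]

# A more robust route that also handles rank-deficient X:
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_lstsq)
```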

A Closer Look at Parameter Estimation

It is interesting to further consider the elements contained in each matrix. First, let’s look at the cross-product of the design matrix, XᵀX:

Taking the cross product of the design matrix (image by author).
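For the simple model, this works out to:

$$ \mathbf{X}^\top\mathbf{X} = \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix} $$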

From here we can see that the matrix contains sums of products of the columns of the design matrix. But what we need is the inverse of this matrix. I won’t go through how to derive the inverse, but it looks like this:

The inverse of the design matrix cross product (image by author).
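In terms of the same sums, the 2 × 2 inverse is:

$$ (\mathbf{X}^\top\mathbf{X})^{-1} = \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2} \begin{bmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{bmatrix} $$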

Finally, we also need the cross-product of the design matrix and the response vector Y, namely XᵀY, which produces the following:

The cross product of the design matrix with the response vector (image by author).
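In terms of sums, that’s:

$$ \mathbf{X}^\top\mathbf{Y} = \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix} $$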

With the matrices written out in full, we can substitute them into the estimation equation and work through the calculation like so:

Derivation of the parameter vector (image by author).

To be fair, there is a bit going on in that derivation, but it’s really the last line that is most interesting. What this all boils down to is something quite convenient: we can estimate the slope coefficient using the sample covariance and variance like so:

Estimate for the slope coefficient (image by author).
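In symbols, using the sample covariance and variance:

$$ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)} $$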

Once we have that, we can use this estimate, along with the means of y and x, to derive the estimate for the intercept:

Estimate for the intercept coefficient (image by author).
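That is:

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$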

Fitted Values

We have now used some linear algebra to find the best-fitting parameters for a simple linear regression model. Now that we have these in hand the next step is to see how well the fitted values correspond with the values contained in the response vector.

All we need to obtain the fitted values are the design matrix and the parameter vector b. Multiplying these together, the fitted values are computed like so:
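$$ \hat{\mathbf{Y}} = \mathbf{X}\mathbf{b} $$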

Note that our fitted values have a hat placed on top of them to indicate that they’re estimated values. For this reason, an alternative way to express the fitted values is as a combination of the response vector and the hat matrix, which gets its name from the fact that it puts a hat on Y.

To see this, let’s write out the equation for the fitted values by substituting b with the full equation for the parameter vector:
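$$ \hat{\mathbf{Y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} $$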

Basically, every term that involves the design matrix X is lumped together to form the definition of the hat matrix:
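$$ \mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top, \qquad \hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y} $$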

Final Remarks

If you’ve made it to this point, thanks for sticking at it. There was a lot to unpack, for sure. However, I hope you can see how the principles of matrix algebra are used to build simple linear regression models. Now, while I have focussed on simple linear regression throughout to keep the examples concise, everything discussed here works the same way for multiple regression, too. All that changes is that the dimensionality of the matrices and vectors increases.

Footnotes

[1] The notation I have used here is different from what I used in my primer article, but they mean exactly the same thing. The reason for the change is to keep in step with how the matrix formulations are typically described.

