
Thorough examination of bias and variance in the linear regression | by Arnaud Capitaine | May, 2022



The connection between the probability definitions and the machine learning explanations

Are the mathematical definitions of an estimator's bias and variance reconcilable with the machine learning explanations? Source: Robot hand photo created by rawpixel.com — www.freepik.com

I had trouble clearly understanding the connection between the definitions of the variance and the bias of an estimator in my probability courses and the explanations from the machine learning courses I then followed. In the first case, the mathematical definitions are clear and leave no room for ambiguity. From the machine learning perspective, on the contrary, bias and variance were regularly explained with visual illustrations, such as the well-known figure from Scott Fortmann-Roe’s article Understanding the Bias-Variance Tradeoff [2].

The bias-variance tradeoff inspired by [2]. Source: [1]

They may also be explained with general rules like the following extracts: “Both bias and variance are connected to the model’s complexity. Low complexity means high bias and low variance. Increased complexity means low bias and high variance” [3] or “The bias-variance tradeoff is a tradeoff between a complicated and simple model, in which an intermediate complexity is likely best” [4]. In a nutshell, I depicted my concern in the following figure.

Are the mathematical definitions of an estimator's bias and variance reconcilable with the machine learning explanations? Source: [1]

In an effort to reconcile these two approaches (mathematical definitions and machine learning explanations), I performed an in-depth analysis of the linear regression.

In this post, I analyze how the bias and the variance change when adding explanatory variables (also called features) to a linear regression model, in both a Frequentist approach and a Bayesian one with L2 regularization. The latter assumes “β are […] random variables with a specified prior distribution” [5], here a multivariate Normal distribution with zero mean and a covariance matrix proportional to the identity matrix. From the general machine learning explanations, I expect the bias to decrease and the variance to increase. However, what do improving or worsening the bias and the variance mean when the estimator is not a single scalar? Can we compare the bias and the covariance of two vector estimators? For this reason, I have studied the bias and the variance of both a vector estimator and a single scalar estimator.
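For concreteness, the two point estimators being compared admit the standard closed forms below (this is my reading of the setup: σ² denotes the noise variance, τ² the prior variance, and λ = σ²/τ² the resulting L2 penalty; the notebook [1] may parameterize the regularization differently):

β̂_Frequentist = (XᵀX)⁻¹ XᵀY    (ordinary least squares)
β̂_Bayesian = (XᵀX + λI)⁻¹ XᵀY, with β ~ N(0, τ²I), ε ~ N(0, σ²I) and λ = σ²/τ²    (ridge / posterior mean)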

The mean squared error (MSE) of an estimator makes it possible to compare estimators. When the estimator is a scalar, the definition is clear. However, when the estimator is multidimensional, I found the following two definitions:
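In their standard form, for a vector estimator θ̂ of a parameter θ, these two definitions read:

MSE_matrix(θ̂) = E[(θ̂ − θ)(θ̂ − θ)ᵀ]
MSE_scalar(θ̂) = E[‖θ̂ − θ‖²]

The scalar MSE is simply the trace of the matrix MSE.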

The first one [6] is a matrix and the second one [7][8] a scalar. Since, at this point, I could not choose between the two definitions, I considered them both: first the matrix definition and then the scalar one. When considering the MSE matrix, two estimators can be compared by analyzing the sign of the difference of their MSE matrices (see the sketch after the list below). If the resulting matrix is:

  • positive (semi-definite), then the second vector estimator is better,
  • negative (semi-definite), then the first vector estimator is better,
  • neither positive nor negative, nothing can be concluded.
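A minimal numpy sketch of this comparison rule, checking semi-definiteness through eigenvalues (the matrices mse1 and mse2 are placeholders for the MSE matrices of the two estimators being compared):

import numpy as np

def compare_mse_matrices(mse1, mse2, tol=1e-10):
    """Compare two vector estimators through the difference of their matrix MSEs."""
    diff = mse1 - mse2                                   # MSE(estimator 1) - MSE(estimator 2)
    eigvals = np.linalg.eigvalsh((diff + diff.T) / 2)    # symmetrize for numerical stability
    if np.all(eigvals >= -tol):
        return "estimator 2 is better"                   # difference is positive semi-definite
    if np.all(eigvals <= tol):
        return "estimator 1 is better"                   # difference is negative semi-definite
    return "no conclusion"                               # indefinite matrix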

When plugging the linear regression solution into the two MSE definitions, the result splits into two parts: a bias-related term and a variance-related one.

MSE decomposition for the matrix MSE definition
MSE decomposition for the scalar MSE definition
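Written for a generic vector estimator θ̂ (the notebook [1] plugs the Frequentist and Bayesian solutions into these expressions), the two decompositions are:

MSE_matrix(θ̂) = Var(θ̂) + Bias(θ̂) Bias(θ̂)ᵀ
MSE_scalar(θ̂) = tr(Var(θ̂)) + ‖Bias(θ̂)‖²,   with Bias(θ̂) = E[θ̂] − θ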

All mathematical proofs are located in a notebook [1], along with a reproducible example where 7 of the 8 independent explanatory variables, X, have been generated from Normal and Gamma distributions (the 8th is a constant). The dependent variable, Y, is the linear combination of the explanatory variables with coefficients β plus a random multivariate Normal noise ε: Y = Xβ+ε. Some of the coefficients have been set to 0 in order to study the addition of ineffective explanatory variables to the linear regression. The bias and variance terms of the metrics have been analyzed for an increasing number of explanatory variables in the linear regression. For instance, the first model considers only one explanatory variable, the constant one; its estimations are then the same for all observations.
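As a minimal, self-contained sketch of such an experiment (the distribution parameters, coefficient values, regularization strength and the Monte Carlo estimation of the scalar-MSE bias and variance terms below are my own illustrative choices; the actual notebook [1] may differ):

import numpy as np

rng = np.random.default_rng(0)
n = 200                                        # number of observations

# 8 explanatory variables: the constant first (so that the first model only
# contains the constant), the other 7 drawn from Normal and Gamma distributions.
X = np.column_stack([
    np.ones(n),                    # 1st variable: the constant
    rng.normal(0.0, 1.0, n),       # 2nd
    rng.gamma(2.0, 1.0, n),        # 3rd
    rng.normal(1.0, 2.0, n),       # 4th
    rng.gamma(3.0, 0.5, n),        # 5th
    rng.normal(-1.0, 1.0, n),      # 6th
    rng.gamma(1.5, 2.0, n),        # 7th
    rng.normal(0.0, 3.0, n),       # 8th
])
# Coefficients of the 2nd, 3rd, 7th and 8th variables are 0: ineffective variables.
beta = np.array([2.0, 0.0, 0.0, 1.0, -1.5, 0.5, 0.0, 0.0])
sigma = 1.0                                    # noise standard deviation
target = X @ beta                              # E[Y], the quantity to estimate

def scalar_bias_variance(k, lam=0.0, n_sim=2000):
    """Monte Carlo estimate of the bias and variance terms of the scalar MSE of
    the fitted training values, using only the first k explanatory variables.
    lam = 0 gives the Frequentist (OLS) estimator, lam > 0 the L2-regularized one."""
    Xk = X[:, :k]
    fitted = np.empty((n_sim, n))
    for s in range(n_sim):
        y = target + rng.normal(0.0, sigma, n)                      # Y = X beta + eps
        beta_hat = np.linalg.solve(Xk.T @ Xk + lam * np.eye(k), Xk.T @ y)
        fitted[s] = Xk @ beta_hat
    bias_term = np.sum((fitted.mean(axis=0) - target) ** 2)         # squared norm of the bias
    variance_term = np.sum(fitted.var(axis=0))                      # trace of the covariance
    return bias_term, variance_term

for k in range(1, 9):
    print(k, scalar_bias_variance(k, lam=0.0), scalar_bias_variance(k, lam=10.0))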

I welcome any feedback, correction, further information…

Multidimensional estimator

When considering the matrix MSE, I noticed that the variance term increases when adding new variables to a linear regression without regularization. Unfortunately, I did not observe anything else: if the matrix MSE is the appropriate metric for comparing estimators, then the bias term does not systematically decrease when adding variables.

When considering the scalar MSE, I noticed more noteworthy behaviors. In the Frequentist approach, the bias term decreases when adding a new variable, whether it is a useful one (the associated coefficient is non-null) or not (the associated coefficient is null), conversely to the variance, which increases. The bias term in the Bayesian approach is greater than the Frequentist one, whereas the Bayesian variance is smaller than the Frequentist one. All these conclusions are illustrated in the figure below on the reproducible example. The 2nd, 3rd, 7th and 8th variables are useless ones, i.e. their associated coefficients are zero; I deliberately added such variables to illustrate the change of bias and variance when adding ineffective variables. As proven, the bias term in the Frequentist approach decreases when adding new variables, and the Bayesian bias is always greater than the Frequentist one. Similarly, the variance term in the Frequentist approach increases when adding new variables, and the Bayesian variance is always lower than the Frequentist one. As expected, once all the useful explanatory variables are included, i.e. from the 6th variable onwards, the bias term in the Frequentist approach is null.

Scalar MSE, bias and variance terms in both Bayesian and Frequentist approaches according to the number of explanatory variables. Source: [1]

Unidimensional estimator

When considering a single scalar estimator of the linear regression model, nothing can be concluded about the bias of the estimator. The variance of the estimator increases in the Frequentist approach when adding variables and is greater than the variance in the Bayesian approach, as illustrated below on the same reproducible example.
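Taking the Bayesian point estimate to be the ridge / posterior-mean estimator introduced earlier and assuming homoskedastic Normal noise, the sampling variance of a single prediction x₀ᵀβ̂ (with x₀ the feature vector of the point considered) makes the second observation explicit; this derivation is my own illustrative sketch, not taken from the notebook [1]:

Var(x₀ᵀβ̂_Frequentist) = σ² x₀ᵀ (XᵀX)⁻¹ x₀
Var(x₀ᵀβ̂_Bayesian) = σ² x₀ᵀ (XᵀX + λI)⁻¹ XᵀX (XᵀX + λI)⁻¹ x₀ ≤ Var(x₀ᵀβ̂_Frequentist)

The inequality holds for any x₀ and any λ ≥ 0, which is why the Bayesian variance curve stays below the Frequentist one.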

Bias and variance of a single estimator of the linear regression in both Bayesian and Frequentist approaches according to the number of explanatory variables. Source: [1]

I only display the bias and the variance of the single estimator. If you are interested in visualizing the shape of the distributions for a single prediction, I suggest you have a look at the “Bias and variance in linear models” post [9].

In order to better understand the connection between the bias and variance of an estimator and the bias and variance of a machine learning model, I analyzed the linear regression in both the Frequentist and Bayesian (with L2 regularization) approaches.

First, I used two metrics to evaluate the bias and the variance of the linear regression: the matrix and the scalar MSE of the estimator over all the training points, which make it possible to compare two estimators in terms of both bias and variance. In the Frequentist approach, with the scalar MSE, I obtained the expected results: the bias term decreases and the variance term increases when adding new variables. Besides, regularization reduces the variance to the detriment of the bias.

Secondly, I used the bias and variance of a scalar estimator to evaluate the bias and variance of a single prediction of the model. Even though the results for the variance are similar to the previous ones, adding a variable does not guarantee a reduction of the bias of a single estimator.

All my observations are summarized in the table below. I visually observed two behaviors that I was not able to prove; they are noted with a question mark in the table.

Change of the bias and variance terms when adding explanatory variables. Source: [1]
Reconciling the mathematical definitions of an estimator's variance and bias with the machine learning explanations. Source: Robot hand photo created by rawpixel.com — www.freepik.com

[1] Arnaud Capitaine, Thorough analysis of bias and variance in the linear regression, github
[2] Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff (2012)
[3] Ismael Araujo, How Bias and Variance Affect a Machine Learning Model (2020)
[4] Anthony Schams, Bias, Variance, and Regularization in Linear Regression: Lasso, Ridge, and Elastic Net — Differences and uses (2019)
[5] Linear regression, Wikipedia
[6] Jean-François Delmas, Introduction au calcul des probabilités et à la statistique (2010), Les Presses de l’ENSTA, VIII.4 p. 205
[7] Guy Lebanon, Bias, Variance, and MSE of Estimators (2010)
[8] Gersende Fort, Matthieu Lerasle, Eric Moulines, Statistique et Apprentissage (2020), I-4.2.1, p. 65
[9] Nischal M., Bias and variance in linear models (2019)


