
Avoid This Pitfall When Using LASSO and Ridge Regression | by Wouter van Heeswijk, PhD | Jul, 2022



In linear regression, LASSO and Ridge regularization are commonly applied to combat overfitting and generate more robust models. However, when applied to incorrectly scaled variables, the results can be disastrous.

Suppose we have a very simple linear model, in which we predict life expectancy y based on marital status x_1 ∈ {0,1} (1 means married) and annual salary x_2 ∈ R^+ (real nonnegative number). The model looks like this:

y=β_0+β_1x_1+β_2x_2+ϵ

If unfamiliar with the terminology: y is the outcome/dependent variable, β_0 the constant (intercept), each β_i x_i is a weighted explanatory/independent variable, and ϵ is the error term/random noise.

How do we express the salary x_2? We could express it in dollars ($), thousands of dollars ($k), or even cents ($c) if we like. There is no inherent right or wrong here.

Although the unit of this explanatory variable is user-determined, it does not really matter for the outcome. If we decide to change x_2 from $ to $k, we simply multiply β_2 by 1,000 to preserve the same outcome y. Similarly, if we work with cents, we divide β_2 by 100 to retain the outcome.
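
To make this concrete with made-up numbers: suppose β_2 = 0.0001 when salary is expressed in $. A $60,000 salary then contributes 0.0001 ⋅ 60,000 = 6 to y. Expressed in $k, the same salary becomes x_2 = 60 and β_2 becomes 0.1, and the contribution 0.1 ⋅ 60 = 6 is unchanged.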

No bumps on the road so far, but let’s move on.

Linear regression has a number of problems, such as overfitting. Overfitting implies that a model performs well on the training set, but poorly on out-of-sample data. Rather than recognizing true data patterns, an overfitted model tends to fit variables to noise.

More precisely, we deal with a bias-variance tradeoff, with high bias implying we do not capture relevant patterns, and high variance implying we fit to non-existent patterns. So-called regularization techniques aid in balancing bias and variance, with LASSO and Ridge regularization being the most commonly deployed techniques.

For a more in-depth discussion of LASSO and Ridge regularization, please check the following article:

To combat overfitting, LASSO adds the following penalty to the basic regression objective:
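
In the notation of our example, the standard LASSO objective is to minimize

Σ (y − β_0 − β_1x_1 − β_2x_2)² + λ (|β_1| + |β_2|)

where the sum runs over all observations and λ ≥ 0 sets the strength of the penalty (the last term).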

LASSO regularization puts a penalty on absolute β values, resulting in many weights being set to 0.

As absolute β values are penalized, LASSO has the tendency to set many weights to 0. The idea is that this procedure weeds out the less relevant variables, preserving only the most salient ones.

Ridge regularization applies a squared penalty to β:
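
In the same notation, Ridge replaces the absolute values by squares, minimizing

Σ (y − β_0 − β_1x_1 − β_2x_2)² + λ (β_1² + β_2²)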

Ridge regularization puts a penalty on squared β values, resulting in relatively small (but non-zero) weights.

As squaring a β < 1 makes it much smaller (e.g., 0.05² = 0.0025), large coefficients are penalized far more heavily than small ones. Ridge regression therefore tends to distribute weights more evenly among the β values, shrinking them rather than setting them to 0 completely. This should result in more robust predictions.

LASSO and ridge seem to be sensible techniques, but they only consider the values of β, disregarding the corresponding variable x. If we have the salary in cents, LASSO might perceive the corresponding beta as small and insignificant, whereas a salary in $k yields a β that is 100,000 times larger! Clearly, the chosen unit has a massive impact on the behavior. In one case the variable seems highly relevant, in the other case it appears negligible, despite β_i⋅x_i yielding the same outcome in both cases.

When applying LASSO or Ridge, we penalize large β values. As the unit of numerical variables is often arbitrary, the regularization impact can be substantial. Thus, absent appropriate transformations, LASSO and Ridge should not be applied.
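
To see the pitfall in action, here is a minimal sketch (made-up data, and scikit-learn's Lasso, which does not rescale inputs on its own) that fits the same model twice, changing nothing but the unit of salary:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Made-up data: life expectancy from marital status and salary.
rng = np.random.default_rng(0)
n = 200
married = rng.integers(0, 2, n)                 # dummy variable x_1
salary_k = rng.uniform(20, 150, n)              # annual salary x_2, in $k
y = 70 + 2 * married + 0.05 * salary_k + rng.normal(0, 3, n)

X_thousands = np.column_stack([married, salary_k])         # salary in $k
X_millions = np.column_stack([married, salary_k / 1000])   # same salary, in $M

for name, X in [("salary in $k", X_thousands), ("salary in $M", X_millions)]:
    print(name, Lasso(alpha=0.2).fit(X, y).coef_)

# With these made-up numbers, the salary coefficient typically survives when
# salary is measured in $k, but is shrunk to exactly 0 when measured in $M:
# the arbitrary unit alone decides whether LASSO "selects" the variable.
```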

So, having identified the problem, what can we do to address it? As is often the case, there is no one-size-fits-all solution, but there are fairly standard ones.

A common solution in Machine Learning is to rescale inputs to a [0,1] range. To achieve this, we can simply divide all data by the largest value in the set. If the data is somewhat reasonably distributed, this works pretty well.

However, suppose one lucky bastard in the data set is making $20 million per year. Being the highest value, this entry would be set equal to 1 ($20m/$20m). For the other people, we likely get very low x values, e.g., $60k/$20m = 0.003. A single outlier can completely throw off the scaling.

When dealing with such distributions, it is common to perform a log transformation. In this case, we would get log(20m) ≈ 7.30 and log(60k) ≈ 4.78. Still a substantial difference, but nowhere near the same magnitude as before. The variables would now scale to 1 and 0.65 respectively, and the multiplication with β has a more comparable impact.
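
A small numpy sketch of both scalings, using the (made-up) $20m and $60k salaries and base-10 logarithms to match the numbers above:

```python
import numpy as np

salaries = np.array([60_000.0, 20_000_000.0])

print(salaries / salaries.max())         # roughly [0.003, 1.0]: the outlier squashes everyone else
print(np.log10(salaries))                # roughly [4.78, 7.30]: the log tames the gap
print(np.log10(salaries) / np.log10(salaries).max())   # roughly [0.65, 1.0]
```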

Arguably the most reasonable solution is to transform variables in a way that sets the expected value of every x_i equal to 1 (or some other constant). Note this also means we would rescale dummy variables (which have a mean 0 ≤ E(x_i) ≤ 1). If necessary, rescaling to a common mean can be combined with a log transformation.

Once every variable has the same mean, we have directly comparable β values. After this transformation, LASSO and Ridge will target the right variables, no longer hindered by arbitrary settings of cents and dollars.
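
Continuing the earlier sketch, a minimal way to implement this (assuming we simply divide each column by its sample mean, so that every column averages to 1):

```python
# Reuses X_thousands, X_millions, y and Lasso from the sketch above.
for name, X in [("salary in $k", X_thousands), ("salary in $M", X_millions)]:
    X_scaled = X / X.mean(axis=0)    # every column now has sample mean 1
    print(name, Lasso(alpha=0.2).fit(X_scaled, y).coef_)

# Both unit choices now yield identical coefficients: dividing c·x by its mean
# gives the same column as dividing x by its mean, so the unit cancels out.
```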

  • In a regression model, the unit of a variable determines the size of the corresponding β value. To preserve the same relation between the explanatory variables x and the outcome y, β values shrink when the numerical values of x grow (e.g., when switching from $k to cents), and vice versa.
  • In linear regression, the chosen variable unit often has limited impact, as the magnitude of β offsets the chosen unit. The outcome of the term β⋅ x remains unchanged, contributing equally to output y.
  • Linear regression models are often prone to overfitting. Regularization techniques such as LASSO and Ridge introduce penalties that mitigate this problem.
  • As LASSO and Ridge penalize β values, the chosen unit for variables x is highly relevant. Large x yield small β and vice versa, thus the unit directly affects the penalty mechanism. The impact may be considerable.
  • The most appropriate solution is to scale all x variables to the same expected mean (if needed after a log transformation), such that all β are of the same magnitude and can be compared directly. In this case, LASSO and Ridge penalize appropriately.

