Multilevel Regression with R. Understanding the Hierarchical Linear… | by Gustavo Santos | May, 2023

Regression models have been around for a long time, much longer than Machine Learning has been a thing. Statisticians have been using these models to understand the relationship between variables since before the 1900s, when Sir Francis Galton (1885) introduced the idea.

Fortunately, the theory has developed a great deal since then, and so have computers and technology, to the point that we can say regression is an easy (if not the easiest) model to create these days.

However, do not be fooled by its ease of implementation. It is easy, but not that simple. There is more than one kind of regression model available, not just Ordinary Least Squares (OLS).

The one we will talk about in this post is the Hierarchical Linear Model (HLM), or, simply put, multilevel regression.

Before we dive into the content, let’s get familiar with the dataset to be used in this example and the libraries we will need to code along.

Load the following libraries. Remember that if you don't have any of them installed, just run install.packages("library_name").

library(nlme)
library(tidyverse)
library(ggridges)

The dataset Gasoline is available in many R libraries. For this example, it was accessed via library nlme from R, under the license GPL 2.0. It is originally from Prater, N. H. (1955), Estimate gasoline yields from crudes, Petroleum Refiner, 35.

This data frame has the variables:

  • yield: percentage of crude oil converted to gasoline
  • endpoint: temperature (degrees F) at which all the gasoline is vaporized
  • Sample: crude oil sample number
  • API: crude oil gravity (degrees API)
  • vapor: the vapor pressure of the crude oil (lbf/in²)
  • ASTM: the temperature at which 10% of the crude oil has become vapor

# Dataset
data("Gasoline")

Objective: our goal is to predict yield, that is, how much of the oil in a sample will be converted to gasoline.
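Before modeling, it helps to take a quick look at the data. A minimal sketch of that step (glimpse() comes from dplyr, which the tidyverse loads) could be:

# Inspect the structure of the data: 32 observations of 6 variables
glimpse(Gasoline)

# Look at the first rows
head(Gasoline)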

A small review about OLS

In an OLS regression model, we take one or more explanatory variables [X] and use them to explain the behavior of the response variable [y].

A useful statistical measure for feature selection in OLS is correlation. After all, one of the most used measures of how well the line fits the data is R², which, in simple linear regression, is nothing more than the squared correlation between X and y (side note: other measures of model performance should certainly be used together with R²).

Thus, we aim to draw a line that best fits the data. In other words, we want this line to follow the data points, tracing a sort of average value for the response variable, and we want the differences between the points on this line and the real data points to be as small as possible.

Being a little more technical, the response variable is determined by an equation in which we optimize the value of the intercept alpha (where the line crosses the y axis) and the slope of the line, given by the beta coefficient, as described below.

Equation of the Simple Linear Regression. Image by the author.
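In plain notation, the equation in the image above is the usual simple linear regression model:

y_i = \alpha + \beta x_i + \varepsilon_i

where alpha is the intercept, beta is the slope, and \varepsilon_i is the error term whose squared sum OLS minimizes.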

If we look at our dataset, this is what the OLS model will do:

OLS Regression model created out of the Gasoline Dataset. Image by the author.
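The figure above is the author's; a minimal sketch of how a similar plot could be reproduced with ggplot2 (loaded with the tidyverse) is:

# Scatter plot of yield vs. endpoint with a single OLS line fit over all points
ggplot(Gasoline, aes(x = endpoint, y = yield)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "OLS fit: yield ~ endpoint")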

It's a good fit, but not the best model. It explains some of the variance of yield, but we can do much better, as we will see next.

An HLM differs from the traditional OLS model because it takes into consideration the natural grouping of the data.

Multilevel regression is like creating a different regression model for each group in your data.

Looking at our data, what immediately calls my attention is that there are different values of Sample. This raises the questions:

  • Would the different samples impact the estimate of yield?
  • If we add multiple levels of regression by sample, do we have better results?
Extract of the Gasoline dataset. Image by the author.

Components

HLM models are composed of a fixed effects component and a random effects component. The fixed effects estimate the relationship between the X variables and y, while the random effects component determines different intercept and slope coefficients for each group, helping to create estimates that are more appropriate for each group.

See below what the multilevel regression model would look like.

Multilevel model HLM2, with level 1= observation and level 2= Sample. Image by the author.

In practice, a regression model is first created to calculate the fixed component, which is a y = a + bx applied to every observation of every group. Then the random component is calculated: an adjustment up or down, according to the best intercept and/or slope for that group.

The HLM models can have random components varying only the intercept, only the slope, or both. In the previous figure, the random component is for both intercept and slope, given that the regression lines by sample are not parallel and have different intercept points.
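Written out, a two-level model with random intercept and slope (the case drawn in the figure, with level 1 = observation and level 2 = Sample) takes the standard HLM form:

Level 1: y_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + \varepsilon_{ij}

Level 2: \beta_{0j} = \gamma_{00} + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + u_{1j}

where j indexes the Sample, the gamma terms are the fixed effects, and u_{0j} and u_{1j} are the random intercept and slope deviations of each group.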

Let’s now create an OLS model and an HLM with 2 levels and then compare the results.

First, we can run a correlation test to find the best variables to use in the regression. We'll see that endpoint is the variable with the highest correlation with yield.

# Check for the best correlations (column 3, Sample, is a grouping factor and is excluded)
cor(Gasoline[,-3])

             yield   endpoint        API      vapor       ASTM
yield    1.0000000  0.7115262  0.2463260  0.3840706 -0.3150243
endpoint 0.7115262  1.0000000 -0.3216782 -0.2979843  0.4122466
API      0.2463260 -0.3216782  1.0000000  0.6205867 -0.7001539
vapor    0.3840706 -0.2979843  0.6205867  1.0000000 -0.9062248
ASTM    -0.3150243  0.4122466 -0.7001539 -0.9062248  1.0000000

The OLS model is pretty straightforward to create in R: the native function lm() handles it easily. Let's see.

# OLS model
model_ols <- lm(yield ~ endpoint, data=Gasoline)
summary(model_ols)

[OUT]
Call:
lm(formula = yield ~ endpoint, data = Gasoline)

Residuals:
     Min       1Q   Median       3Q      Max
-14.7584  -6.2783   0.0525   5.1624  17.8481

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.66206    6.68721  -2.492   0.0185 *
endpoint      0.10937    0.01972   5.546 4.98e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.659 on 30 degrees of freedom
Multiple R-squared:  0.5063,  Adjusted R-squared:  0.4898
F-statistic: 30.76 on 1 and 30 DF,  p-value: 4.983e-06

# Calculate Log Likelihood
logLik(model_ols)
'log Lik.' -109.5206 (df=3)

Results: The model is statistically significant (p-value for the F Test <0.05), and the coefficients are also significant.

  • R²: 50%
  • Loglik: -109.52
  • MAE: 5.95 (yield mean = 19.66)
  • RMSE: 7.41 (standard deviation: 10.72)

Our mean absolute error is about 30% of the mean value of y. That's too much. As we saw in the figure with the single regression line, the line is well positioned, but it can't fully capture the variance, since many points sit too far from the average.
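The article doesn't show how the MAE and RMSE above were computed; a minimal sketch using the model's fitted values would be:

# MAE and RMSE of the OLS model (values as reported above)
mae_ols  <- mean(abs(Gasoline$yield - model_ols$fitted.values))
rmse_ols <- sqrt(mean((Gasoline$yield - model_ols$fitted.values)^2))
mae_ols   # ~ 5.95
rmse_ols  # ~ 7.41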

Now let’s try running an HLM.

# Multilevel Model with fixed slope and random intercept
model_multi <- lme(fixed = yield ~ endpoint,
                   random = ~ 1 | Sample,
                   data = Gasoline,
                   method = "REML")
summary(model_multi)

[OUT]
Linear mixed-effects model fit by REML
  Data: Gasoline
       AIC      BIC   logLik
  183.4306 189.0354 -87.7153

Random effects:
 Formula: ~1 | Sample
        (Intercept) Residual
StdDev:    8.387862  1.88046

Fixed effects:  yield ~ endpoint
                Value Std.Error DF   t-value p-value
(Intercept) -33.30626  3.287182 21 -10.13216       0
endpoint      0.15757  0.005703 21  27.62678       0
 Correlation:
         (Intr)
endpoint -0.582

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-1.6759908 -0.4857902 -0.0397993  0.5128182  1.8518205

Number of Observations: 32
Number of Groups: 10

Here's what's going on in this code. We ran a multilevel model using the lme function from the nlme library. The fixed component is the regression yield ~ endpoint. The random component applies only to the intercept: notice that random = ~1 | Sample. The ~1 means a fixed slope (parallel regression lines by group) and a random intercept calculated by Sample.

Instead of a simple y = a + bx, this model becomes:

  • Fixed: [intercept + b*endpoint] +
  • Random: [ intercept random effect by Sample + error]
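If you want to see the adjustment estimated for each Sample, nlme exposes the random effects directly:

# Random intercept estimated for each Sample (deviation from the fixed intercept)
ranef(model_multi)

# Combined coefficients: a different intercept per Sample, the same slope for endpoint
coef(model_multi)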

Comparing it to the OLS model, notice how it improved.

  • Loglik: -87.7153 (the higher the number, the better).
  • MAE: 1.16
  • RMSE: 1.52

Wow. It improved a lot already: around 20% improvement in the LogLik measure.

Can we improve it even more? We can try adding another variable to the model. We will choose vapor, as it has the next strongest correlation with our response variable.

This time, we will run an HLM with a random component for both slope and intercept, meaning we will allow the model to vary where each group's line crosses the y axis and how inclined the fitted line is, instead of keeping the lines parallel.

In the code snippet below, the fixed component models yield as a function of endpoint and vapor. The random component will calculate the slope of the regression based on the endpoint values and the intercept based on the Sample. This next model will be:

  • Fixed: [intercept + b1*endpoint + b2*vapor] +
  • Random: [ intercept random effect by Sample + slope random effect * endpoint + error]

# Multilevel Model with random slope and intercept
model_multi2 <- lme(fixed = yield ~ endpoint + vapor,
                    random = ~ endpoint | Sample,
                    data = Gasoline,
                    method = "REML")

# Look how the fitted values are presented
# They are divided into "fixed" (fixed effects only) and "Sample"
# "Sample" is the final prediction, with the random effects considered
model_multi2$fitted

       fixed     Sample
1  9.9034036  6.2878876
2 17.4490277 16.9089759
3  6.2864457  8.2047743
4 13.2051879 10.0773886
5 -0.3240537  6.1951732

Just so I won't be unfair to the OLS model, I also ran another OLS model including the vapor variable (see the sketch below). Next, we will compare all the models.
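The code for this second OLS model is not shown in the article, but since it adds vapor to the formula and is referenced later as model_ols2, it was presumably something along these lines:

# Second OLS model, with vapor added as a predictor
model_ols2 <- lm(yield ~ endpoint + vapor, data = Gasoline)
summary(model_ols2)
logLik(model_ols2)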

Comparison of the 4 models created. Image by the author.
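A minimal sketch of how a comparison table like the one in the image above (LogLik, MAE, RMSE per model) could be assembled, assuming a small helper function like metrics() below:

# Hypothetical helper to gather LogLik, MAE and RMSE for one model
metrics <- function(model, fitted, actual = Gasoline$yield) {
  data.frame(logLik = as.numeric(logLik(model)),
             MAE    = mean(abs(actual - fitted)),
             RMSE   = sqrt(mean((actual - fitted)^2)))
}

rbind(
  ols    = metrics(model_ols,    model_ols$fitted.values),
  ols2   = metrics(model_ols2,   model_ols2$fitted.values),
  multi  = metrics(model_multi,  model_multi$fitted[, "Sample"]),
  multi2 = metrics(model_multi2, model_multi2$fitted[, "Sample"])
)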

Although the OLS model with endpoint + vapor gave us an interesting LogLik value, let's look closer at the predictions to get a sense of which model is a better fit.

# Results table
results <- data.frame(
  yield       = Gasoline$yield,
  pred_ols    = model_ols$fitted.values,
  pred_ols2   = model_ols2$fitted.values,
  pred_multi  = model_multi$fitted[,'Sample'],
  pred_multi2 = model_multi2$fitted[,'Sample']
)
results

   yield  pred_ols  pred_ols2 pred_multi pred_multi2
1    6.9  9.040133 11.2681120  5.8234477   6.2878876
2   14.4 16.914846 17.8195910 17.5516334  16.9089759
3    7.4  6.524599  8.0634010  8.6490752   8.2047743
4    8.5 23.258365 13.5848551  9.9282663  10.0773886
5    8.0  7.180825  1.9380928  6.1502045   6.1951732
6    2.8  9.040133 -0.2448399 -0.2263362   0.6677289
7    5.0 14.508684  5.1154648  5.7778750   5.0504530
8   12.2  5.759002 13.7816308 12.3278463  11.9321042
9   10.0 12.540005 13.3171528 10.1678972  10.3006218
10  15.2 16.149249 20.3249040 16.0654477  16.3795097
11  26.8 23.477107 26.1797067 27.0057873  27.2203084
12  14.0 21.727171 17.5245089 13.0730720  13.5617889
13  14.7 24.789559 15.5355488 12.1342356  12.1958142
14   6.4 13.414973  5.3285706  6.0764330   6.4725490
15  17.6 23.258365 16.2622858 18.3834134  18.1474338
16  22.3 13.414973 23.5350992 23.3576925  23.3378940

The second OLS model did a very good job, but the multilevel regressions are more consistent. Look at the fitted values for rows 4, 5, 9, 10, and 12, for example, where both multilevel models give better estimates. Still, OLS2 is not a bad model at all for this data.

Comparison of the 4 models created. Image by the author.

This final visual gives us a nice comparative view of the models. No doubt the best fit is the last HLM model, in the bottom right corner.
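One possible way to build a comparison like this, reusing the results table and the tidyverse (the exact plot layout here is an assumption, not the author's original code), is:

# Actual vs. fitted values, one panel per model
results %>%
  pivot_longer(cols = starts_with("pred_"),
               names_to = "model", values_to = "fitted") %>%
  ggplot(aes(x = yield, y = fitted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  facet_wrap(~ model)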

In this tutorial, I wanted to give you a nice introduction to Hierarchical Linear Models, or multilevel regression models. They can be very handy when your data is naturally nested in groups and those groups affect how the estimates should be made.

With Multilevel models, you can allow the regression lines to vary their slope and/or intercept by group.

The package used here for HLM modeling in R is nlme.

Here is the code for this exercise, in GitHub.

If you liked this content, don’t forget to follow me.

Also, if you're considering joining Medium as a member so you can access my articles and thousands of other good pieces of content, here's my referral code.

You can find me on LinkedIn and you can book a quick chat with me via TopMate.

Fávero, L. & Belfiore, P. 2022. Manual de Análise de Dados. 1 ed. LTC, Rio de Janeiro, Brazil.


