Find the order of ARIMA models. Understand and find the best parameters… | by Betty LD | May, 2022

Understand and find the best parameters for your time-series basic modeling

Image by @m_____me

ARIMA is one of the best models with which to start a univariate time-series experiment. It delivers state-of-the-art performance, especially on small datasets, where deep learning models are not yet at their best.

ARIMA is a simple, interpretable model, but it is also parametric: specific parameters must be set before fitting. The autoregressive, moving average, and integration (differencing) parts of the model respectively need the parameters p, q, and d.

In this article, you'll understand what these parameters are and how to find optimal values with ACF and PACF plots and the BIC/AIC information criteria.

First, we need a basic understanding of these models.

The time-series equivalents of "linear regression" or a "classical model" are the AR and MA models. These fundamental models often perform well. They are explainable and lightweight and make a good baseline to start any data science project.

There is a serious constraint with the ARMA model: it assumes that the input data is stationary (the mean and variance of the time series do not depend on the time at which it is observed). If your data is non-stationary, you can use an ARIMA model, where I stands for "Integrated". The integration part accounts for "differencing", a way to remove trends and make the time series stationary.

AR of order p is a model that regresses on its own p past values. In other words, the current value of the series Vt can be explained as a linear combination of the p past values, together with a random error. We assume that the next values are related to the p prior values. The number of preceding inputs used to predict the next value is called “order”. We usually refer to it as “p”. And these prior values used to compute the next value are called “lagged variables”.

The equation of Vt as a linear combination of its lagged variables is:

AR(p): Vt = a0 + a1*V{t-1} + … + ap*V{t-p} + Nt

When we fit an AR model, we estimate the {ai} for i in [0, p]. Nt is a white noise process (a series of uncorrelated random values with mean 0 and constant variance), and a0 is the constant (intercept) term.

For example, when p=3, to predict the next timestamp we take the constant a0 and add the last three values, multiplied respectively by the coefficients a1, a2, and a3.

You can see how the AR model of order p can be simulated with the code snippet below:

The model takes as input:

  • a list V of realizations of a random variable, for example rand_v_list = np.random.normal(size=N)
  • a list of p+1 coefficients {a0, …, ap}.

The simulated AR model multiplies the p most recent values by the p lag coefficients, adds the intercept and a new noise term, and appends the result to the time series.
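Since the original snippet is not reproduced here, below is a minimal sketch of such a simulation, assuming only NumPy; the function name simulate_ar and the example coefficients are ours, not the author's.

import numpy as np

def simulate_ar(coeffs, noise):
    # coeffs: [a0, a1, ..., ap], the intercept followed by the p lag coefficients
    # noise:  white-noise innovations Nt, e.g. np.random.normal(size=N)
    a0, lags = coeffs[0], coeffs[1:]
    p = len(lags)
    v = list(noise[:p])                      # seed the first p values with noise
    for t in range(p, len(noise)):
        past = v[-p:][::-1]                  # V{t-1}, ..., V{t-p}
        v.append(a0 + np.dot(lags, past) + noise[t])
    return np.array(v)

# Example: an AR(3) series of length 500
N = 500
rand_v_list = np.random.normal(size=N)
ar_series = simulate_ar([0.0, 0.5, -0.3, 0.2], rand_v_list)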

Our goal: given a simulated AR time series, how can we find p, the order that was used to generate it?

The Moving Average model uses the dependency between an observation and the residual errors from a moving average applied to lagged observations. In other words, MA models the value to forecast as a linear combination of the past error terms, up to q steps back. The error or residual is defined as res = predicted value - true value.

It leverages previous forecast errors in a regression approach to forecast future observations: each future observation is a weighted average of the previous forecast errors.

MA(q): Vt = b0 + Nt + b1*N{t-1} + … + bq*N{t-q}

We estimate the {bi} for i in [0, q], and Nt is white noise (a collection of uncorrelated random variables with mean 0 and finite variance).

If there is still a correlation between the residuals and the lagged values, it means that there is still information we can extract from the errors; once that information is captured, the remaining error can be considered random. An MA process is just a weighted combination of error terms coming from a white noise process.

You can see how the model works with the code snippet below:

The model takes similar inputs to the AR process.
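Here again the original snippet is missing, so here is a comparable sketch under the same assumptions (NumPy only; the function name simulate_ma is ours):

import numpy as np

def simulate_ma(coeffs, noise):
    # coeffs: [b0, b1, ..., bq], the intercept followed by the q error coefficients
    # noise:  white-noise error terms Nt
    b0, lags = coeffs[0], coeffs[1:]
    q = len(lags)
    v = []
    for t in range(len(noise)):
        past_errors = noise[max(0, t - q):t][::-1]   # N{t-1}, ..., N{t-q}
        v.append(b0 + noise[t] + np.dot(lags[:len(past_errors)], past_errors))
    return np.array(v)

# Example: an MA(2) series of length 500
ma_series = simulate_ma([0.0, 0.6, 0.3], np.random.normal(size=500))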

The AutoRegressive Moving Average ARMA(p,q) combines both AR(p) and MA(q) processes, considering the lagged or past values and the error terms of the time series. It is the most efficient linear model of stationary time series [1].

The AutoRegressive Integrated Moving Average (ARIMA) replaces the ARMA model when the input data is not stationary.

In ARIMA, the integration part "stationarizes" the time series [2]. When the order of integration is d=0, ARIMA behaves like an ARMA model. When d=1, the model subtracts the value at t-1 from the value observed at t. You add differencing when there is a long-term trend in the data. The term integrated refers to how many times the modeled time series must be differenced to produce stationarity.
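As a quick illustration of what d=1 does (a sketch of the idea, not the library's internal code), first differencing removes a linear trend:

import numpy as np

t = np.arange(100)
trended = 0.5 * t + np.random.normal(size=100)   # series with a linear trend
diffed = np.diff(trended)                        # Vt - V{t-1}: the trend is gone and the
                                                 # differenced series hovers around 0.5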

When you use an ARIMA model, you don't need to perform the differencing manually: you can fit the ARIMA model directly on the non-stationary time series.

Image by @tezos

Now that we understand how these processes work and what their order means, let's see how to find the parameters in practice. To find the parameters p, d, and q for the AR, integrated, and MA parts of the model respectively, we can use:

  • ACF and PACF plots
  • Domain knowledge
  • Goodness-of-fit information criteria (BIC, AIC)

As a reminder, p is the number of lag observations in the model (also called the lag order), d is the number of times that the raw observations are differenced, and q is the size of the moving-average window.

We’ll present how to use the ACF, PACF plots, BIC, and AIC criteria in the next sections.

A crucial aspect of a time series process is autocorrelation. Autocorrelation is a statistical property that occurs when a time series is linearly related to a previous or lagged version of itself. It is also used to detect possible seasonality in a time series.

We use the autocorrelation function to assess the degree of dependence in the time series and select an appropriate model (MA, AR, or ARIMA).

The Auto Correlation Function (ACF) and Partial AutoCorrelation Function (PACF) can be computed for any time series (not only stationary).

In practice, we use the combination of both of these plots to determine the order of the ARMA process.

How do I know which process my time series follows?

Common processes:

  • ARIMA(0, 0, 0) is a white noise model (Vt = Nt)
  • ARIMA(0, 1, 0) is a random walk (Vt - V{t-1} = c + Nt, where c is a constant drift)
  • ARIMA(0, 1, 1) is simple exponential smoothing (Vt - V{t-1} = Nt + b1*N{t-1})

Beyond these common patterns, we use the PACF and ACF plots to recognize which process we are dealing with:

Process identification table. Image by the author

The process identification table shows how the ACF and PACF behave for AR, MA, and ARMA processes of order 1. You can use it to recognize, for any order, whether your time series is AR, MA, or ARMA. For AR or MA processes, you can count the number of peaks in the graph before the sudden drop. In both the ACF and the PACF, the lag of order 0 is always equal to 1 (a value is completely correlated with itself), so we start counting after that first peak.

We use the ACF, the Auto Correlation Function, to understand how much the time series is correlated with its lagged values.

We use the ACF to identify the MA(q) part of the ARMA model.

Indeed, we identify which lagged variable has a strong correlation with the output. These lagged variables will have a strong predictive power of the next value of a time series and should be used for forecasting.

In practice, we count the number of lags that have a significant autocorrelation with the output value to find q for a MA model.

Let's simulate four MA processes with orders q=1, q=3, q=5, and q=7, each with 500 simulated observations.

And plot these four simulated series, as in the sketch below:
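The simulation code is not included in this copy; one possible way to generate and plot the four series, assuming statsmodels' ArmaProcess and arbitrary MA coefficients of our choosing, is sketched here.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess

np.random.seed(42)
orders = [1, 3, 5, 7]
simulated = {}
fig, axes = plt.subplots(len(orders), 1, figsize=(10, 8), sharex=True)
for ax, q in zip(axes, orders):
    ma_coeffs = np.r_[1, np.random.uniform(0.3, 0.9, size=q)]   # [1, b1, ..., bq]
    process = ArmaProcess(ar=[1], ma=ma_coeffs)                  # pure MA: no AR part
    simulated[q] = process.generate_sample(nsample=500)
    ax.plot(simulated[q])
    ax.set_title(f"MA(q={q})")
plt.tight_layout()
plt.show()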

Different MA processes with increasing order. Image by the author

Now, let's plot the ACF and PACF for the MA(q=5) simulated data.
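Assuming the statsmodels plotting helpers plot_acf and plot_pacf, and the simulated dictionary from the sketch above, the two plots can be produced like this:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

series_q5 = simulated[5]                 # the MA(q=5) series simulated above
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(series_q5, lags=20, ax=ax1)     # left: expect about 5 significant lags after lag 0
plot_pacf(series_q5, lags=20, ax=ax2)    # right: expect a damped, oscillating decay
plt.show()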

We can see a damped, oscillating decay (both positive and negative) on the PACF plot in the right figure: lags continue to be significant even after lag 10. On the ACF plot on the left, after lag 0 and the 5 significant lags, all the following ones are non-significant. We can recognize an MA process.

To find q, we count the number of significant lags (outside the blue band) on the ACF graph. Besides lag 0, which is the correlation of the series with itself, we can see a correlation with the next 5 lags of the series.

ACF and PACF for MA(q=5). We can read 5 significant or “high” peaks in the ACF, left figure. Image by the author

Intuition

The Partial AutoCorrelation Function (PACF) represents the correlation between two variables once the effect of some other set of variables has been accounted for. In regression, this partial correlation can be found by correlating the residuals from two different regressions.

We use the PACF to find the correlation of the residuals with the next lags. If the residuals still carry hidden information, that is, a reasonable correlation with the next lag, we use that lag as a feature for modeling.

In other words, the autocorrelation between a target timestamp and a previous timestamp consists of both the direct correlation between the two values and indirect correlations that pass through the intermediate lags. The PACF removes these indirect correlations, which are assumed to be linear functions of the intermediate values.

We can use the PACF to count the number of significant lags for an AR process.

Experiment

First, let's simulate AR (autoregressive) processes with p=1, p=3, p=5, and p=7, as in the sketch below.
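As before, the original snippet is missing; a possible sketch with statsmodels' ArmaProcess follows (the AR coefficients are arbitrary small positive values of our choosing so that the processes stay stationary).

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess

np.random.seed(0)
orders = [1, 3, 5, 7]
ar_simulated = {}
fig, axes = plt.subplots(len(orders), 1, figsize=(10, 8), sharex=True)
for ax, p in zip(axes, orders):
    # statsmodels expects the AR lag polynomial [1, -a1, ..., -ap]
    ar_coeffs = np.r_[1, -np.random.uniform(0.05, 0.12, size=p)]
    process = ArmaProcess(ar=ar_coeffs, ma=[1])                  # pure AR: no MA part
    ar_simulated[p] = process.generate_sample(nsample=500)
    ax.plot(ar_simulated[p])
    ax.set_title(f"AR(p={p})")
plt.tight_layout()
plt.show()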

Different AR processes with variable order. Image by the author

We can see on the plot above that as we increase p, the time series becomes more oscillatory. When the series drifts too far up, the autoregressive component naturally pulls it back down toward its average value. Here it oscillates around a mean of 0, but the mean could be any value. Because the time series takes both negative and positive values, the ACF and PACF plots take positive and negative values too.

Now we can plot the ACF and PACF functions for these AR processes and read the number of significant lags as the value of p.
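Reusing the same plotting helpers on the AR(p=7) series from the sketch above:

series_p7 = ar_simulated[7]              # the AR(p=7) series simulated above
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(series_p7, lags=20, ax=ax1)     # left: slow, damped decay
plot_pacf(series_p7, lags=20, ax=ax2)    # right: expect about 7 significant lags, then a cut-off
plt.show()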

ACF and PACF for AR(p=7). We can read seven significant peaks on the PACF plot on the right. Image by the author.

Plotting the ACF/PACF is effective for identifying AR and MA processes. But for ARIMA processes, it is more common to use the auto_arima function. auto_arima is a brute-force method that tries different values of p and q while minimizing an information criterion such as AIC or BIC.

The most common metrics to assess regularized goodness of fit are:

  • Bayesian information criterion (BIC)
  • Akaike information criterion (AIC).

These metrics provide measures of model performance that account for model complexity. AIC and BIC combine a term reflecting how well the model fits the data with a term that penalizes the model in proportion to its number of parameters [3].
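For reference, with k the number of estimated parameters, n the number of observations, and L the maximized likelihood of the model, the two criteria are (lower is better):

AIC = 2*k - 2*ln(L)
BIC = k*ln(n) - 2*ln(L)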

As a regularization technique, we want to penalize based on the number of parameters in the model. Indeed, the larger p and q, the more lags you use to predict the next value, and the more likely you are to overfit your data.

The auto-ARIMA process seeks to identify the most optimal parameters for an ARIMA model, settling on a single fitted ARIMA model. […]

In order to find the best model, auto-ARIMA optimizes for a given information_criterion, one of (‘aic’, ‘aicc’, ‘bic’, ‘hqic’, ‘oob’) (Akaike Information Criterion, Corrected Akaike Information Criterion, Bayesian Information Criterion, Hannan-Quinn Information Criterion, or “out of bag”–for validation scoring–respectively) and returns the ARIMA which minimizes the value.

In practice, we find the order of a time series automatically with the off-the-shelf tool auto_arima from the package pmdarima.

Let's try auto_arima to find the order of each of our simulated MA processes:
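The call itself is not shown in this copy; assuming the pmdarima package and the simulated MA series from the earlier sketch, it could look like the snippet below (trace=True prints the stepwise search, which is the kind of output shown in the logs that follow).

import pmdarima as pm

for q, series in simulated.items():              # the simulated MA(q) series
    model = pm.auto_arima(
        series,
        start_p=0, start_q=0, max_p=5, max_q=8,
        d=0,                                      # the simulated data is already stationary
        seasonal=False,
        information_criterion="aic",
        trace=True,
    )
    print(f"q={q} -> best order: {model.order}")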

q=1
Performing stepwise search to minimize aic
ARIMA(0,0,0)(0,0,0)[0] intercept : AIC=1605.796, Time=0.01 sec
ARIMA(1,0,0)(0,0,0)[0] intercept : AIC=1493.552, Time=0.03 sec
ARIMA(0,0,1)(0,0,0)[0] intercept : AIC=1461.981, Time=0.03 sec
ARIMA(0,0,0)(0,0,0)[0] : AIC=1604.553, Time=0.01 sec
ARIMA(1,0,1)(0,0,0)[0] intercept : AIC=1463.723, Time=0.05 sec
ARIMA(0,0,2)(0,0,0)[0] intercept : AIC=1463.755, Time=0.05 sec
ARIMA(1,0,2)(0,0,0)[0] intercept : AIC=1465.600, Time=0.13 sec
ARIMA(0,0,1)(0,0,0)[0] : AIC=1460.398, Time=0.02 sec
ARIMA(1,0,1)(0,0,0)[0] : AIC=1462.121, Time=0.03 sec
ARIMA(0,0,2)(0,0,0)[0] : AIC=1462.155, Time=0.02 sec
ARIMA(1,0,0)(0,0,0)[0] : AIC=1491.861, Time=0.01 sec
ARIMA(1,0,2)(0,0,0)[0] : AIC=1463.988, Time=0.06 sec

Best model: ARIMA(0,0,1)(0,0,0)[0]
Total fit time: 0.468 seconds
Optimal order for is: (0, 0, 1)

q=3
Performing stepwise search to minimize aic
ARIMA(0,0,0)(0,0,0)[0] intercept : AIC=1702.731, Time=0.01 sec
ARIMA(1,0,0)(0,0,0)[0] intercept : AIC=1570.816, Time=0.03 sec
ARIMA(0,0,1)(0,0,0)[0] intercept : AIC=1628.147, Time=0.04 sec
ARIMA(0,0,0)(0,0,0)[0] : AIC=1701.862, Time=0.01 sec
ARIMA(2,0,0)(0,0,0)[0] intercept : AIC=1528.848, Time=0.04 sec
ARIMA(3,0,0)(0,0,0)[0] intercept : AIC=1519.618, Time=0.06 sec
ARIMA(4,0,0)(0,0,0)[0] intercept : AIC=1485.096, Time=0.06 sec
ARIMA(4,0,1)(0,0,0)[0] intercept : AIC=1484.876, Time=0.11 sec
ARIMA(3,0,1)(0,0,0)[0] intercept : AIC=1509.277, Time=0.13 sec
ARIMA(4,0,2)(0,0,0)[0] intercept : AIC=1464.510, Time=0.17 sec
ARIMA(3,0,2)(0,0,0)[0] intercept : AIC=1465.074, Time=0.15 sec
ARIMA(4,0,3)(0,0,0)[0] intercept : AIC=1465.187, Time=0.28 sec
ARIMA(3,0,3)(0,0,0)[0] intercept : AIC=1464.135, Time=0.20 sec
ARIMA(2,0,3)(0,0,0)[0] intercept : AIC=1462.726, Time=0.23 sec
ARIMA(1,0,3)(0,0,0)[0] intercept : AIC=1462.045, Time=0.17 sec
ARIMA(0,0,3)(0,0,0)[0] intercept : AIC=1460.299, Time=0.09 sec
ARIMA(0,0,2)(0,0,0)[0] intercept : AIC=1507.915, Time=0.06 sec
ARIMA(0,0,4)(0,0,0)[0] intercept : AIC=1462.121, Time=0.09 sec
ARIMA(1,0,2)(0,0,0)[0] intercept : AIC=1467.963, Time=0.08 sec
ARIMA(1,0,4)(0,0,0)[0] intercept : AIC=1463.941, Time=0.23 sec
ARIMA(0,0,3)(0,0,0)[0] : AIC=1458.689, Time=0.12 sec
ARIMA(0,0,2)(0,0,0)[0] : AIC=1506.487, Time=0.03 sec
ARIMA(1,0,3)(0,0,0)[0] : AIC=1460.415, Time=0.11 sec
ARIMA(0,0,4)(0,0,0)[0] : AIC=1460.498, Time=0.04 sec
ARIMA(1,0,2)(0,0,0)[0] : AIC=1466.278, Time=0.07 sec
ARIMA(1,0,4)(0,0,0)[0] : AIC=1462.305, Time=0.11 sec

Best model: ARIMA(0,0,3)(0,0,0)[0]
Total fit time: 2.717 seconds
Optimal order for is: (0, 0, 3)

q=5
Performing stepwise search to minimize aic
ARIMA(0,0,0)(0,0,0)[0] intercept : AIC=1659.497, Time=0.01 sec
ARIMA(1,0,0)(0,0,0)[0] intercept : AIC=1570.804, Time=0.03 sec
ARIMA(0,0,1)(0,0,0)[0] intercept : AIC=1613.884, Time=0.03 sec
ARIMA(0,0,0)(0,0,0)[0] : AIC=1658.949, Time=0.01 sec
ARIMA(2,0,0)(0,0,0)[0] intercept : AIC=1495.855, Time=0.04 sec
ARIMA(3,0,0)(0,0,0)[0] intercept : AIC=1482.804, Time=0.05 sec
ARIMA(4,0,0)(0,0,0)[0] intercept : AIC=1484.509, Time=0.07 sec
ARIMA(3,0,1)(0,0,0)[0] intercept : AIC=1484.564, Time=0.11 sec
ARIMA(2,0,1)(0,0,0)[0] intercept : AIC=1484.926, Time=0.07 sec
ARIMA(4,0,1)(0,0,0)[0] intercept : AIC=1486.509, Time=0.15 sec
ARIMA(3,0,0)(0,0,0)[0] : AIC=1481.204, Time=0.03 sec
ARIMA(2,0,0)(0,0,0)[0] : AIC=1494.160, Time=0.02 sec
ARIMA(4,0,0)(0,0,0)[0] : AIC=1482.892, Time=0.03 sec
ARIMA(3,0,1)(0,0,0)[0] : AIC=1482.953, Time=0.05 sec
ARIMA(2,0,1)(0,0,0)[0] : AIC=1483.270, Time=0.03 sec
ARIMA(4,0,1)(0,0,0)[0] : AIC=1484.892, Time=0.08 sec

Best model: ARIMA(3,0,0)(0,0,0)[0]
Total fit time: 0.824 seconds
Optimal order for is: (3, 0, 0)

q=7
Performing stepwise search to minimize aic
ARIMA(0,0,0)(0,0,0)[0] intercept : AIC=2171.867, Time=0.01 sec
ARIMA(1,0,0)(0,0,0)[0] intercept : AIC=1789.289, Time=0.03 sec
ARIMA(0,0,1)(0,0,0)[0] intercept : AIC=1931.174, Time=0.04 sec
ARIMA(0,0,0)(0,0,0)[0] : AIC=2172.420, Time=0.01 sec
ARIMA(2,0,0)(0,0,0)[0] intercept : AIC=1788.083, Time=0.04 sec
ARIMA(3,0,0)(0,0,0)[0] intercept : AIC=1779.499, Time=0.07 sec
ARIMA(4,0,0)(0,0,0)[0] intercept : AIC=1778.438, Time=0.07 sec
ARIMA(4,0,1)(0,0,0)[0] intercept : AIC=1773.792, Time=0.26 sec
ARIMA(3,0,1)(0,0,0)[0] intercept : AIC=1780.497, Time=0.10 sec
ARIMA(4,0,2)(0,0,0)[0] intercept : AIC=1695.057, Time=0.32 sec
ARIMA(3,0,2)(0,0,0)[0] intercept : AIC=1738.073, Time=0.35 sec
ARIMA(4,0,3)(0,0,0)[0] intercept : AIC=1691.378, Time=0.51 sec
ARIMA(3,0,3)(0,0,0)[0] intercept : AIC=1711.992, Time=0.25 sec
ARIMA(4,0,4)(0,0,0)[0] intercept : AIC=1694.119, Time=0.72 sec
ARIMA(3,0,4)(0,0,0)[0] intercept : AIC=1701.593, Time=0.27 sec
ARIMA(4,0,3)(0,0,0)[0] : AIC=1689.749, Time=0.27 sec
ARIMA(3,0,3)(0,0,0)[0] : AIC=1710.347, Time=0.19 sec
ARIMA(4,0,2)(0,0,0)[0] : AIC=1693.396, Time=0.15 sec
ARIMA(4,0,4)(0,0,0)[0] : AIC=1692.698, Time=0.49 sec
ARIMA(3,0,2)(0,0,0)[0] : AIC=1736.557, Time=0.16 sec
ARIMA(3,0,4)(0,0,0)[0] : AIC=1699.989, Time=0.16 sec

Best model: ARIMA(4,0,3)(0,0,0)[0]
Total fit time: 4.481 seconds
Optimal order for is: (4, 0, 3)

We ran auto_arima on MA processes of orders 1, 3, 5, and 7. auto_arima recognizes the MA process and its order accurately for the small orders q=1 and q=3, but it mixes AR and MA terms for orders q=5 and q=7.

When you start a time series analysis, it is good practice to begin with simple models that may satisfy the use-case requirements. ARIMA models are simple and transparent, and you can derive rigorous statistical properties for them. They perform well on small datasets and are cheap to build and retrain.

If you need to use them, you need to understand how they work and set their parameters explicitly. This article gave you the techniques to tune your model order with confidence.

The notebook for this article is available here.

[1] Hands-on Time Series Analysis with Python: From Basics to Bleeding Edge Techniques https://www.amazon.com/Hands-Time-Analysis-Python-Techniques/dp/1484259912 (non-affiliated)

[2] https://people.cs.pitt.edu/~milos/courses/cs3750/lectures/class16.pdf

[3] https://www.sciencedirect.com/topics/psychology/bayesian-information-criterion#:~:text=The%20Akaike%20information%20criterion%20(AIC,to%20its%20number%20of%20parameters.

