Amazon’s autoregressive deep network
A few years ago, time-series models worked on a single sequence only.
Hence, if we had multiple time-series, one option was to create one model per sequence. Or, if we could “tabularize” our data, we could apply the gradient-boosted tree models — which work very well even today.
The first model that could natively work on multiple time-series was DeepAR[2], an autoregressive recurrent network developed by Amazon.
In this article, we will see how DeepAR works in-depth and why it is a milestone for the time-series community.
DeepAR is the first successful model to combine Deep Learning with traditional Probabilistic Forecasting.
Let’s see why DeepAR stands out:
- Multiple time-series support: The model is trained on multiple time-series, learning global characteristics that further enhance forecasting accuracy.
- Extra covariates: DeepAR allows extra features (covariates). For instance, if your task is temperature forecasting, you can include humidity-level, air-pressure, etc.
- Probabilistic output: Instead of making a single prediction, the model outputs a full predictive distribution, from which prediction intervals (quantiles) are obtained by sampling.
- “Cold” forecasting: By learning from thousands of time-series that potentially share a few similarities, DeepAR can provide forecasts for time-series that have little or no history at all.
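To illustrate the probabilistic output mentioned in the list above, here is a minimal Python sketch of how a prediction interval can be read off a set of sampled forecasts for one time step. The helper names `empirical_quantile` and `prediction_interval`, the nearest-rank rule, and the toy sample values are assumptions made for illustration; they are not taken from the DeepAR paper or from any library.

```python
import math

def empirical_quantile(samples, q):
    """Nearest-rank empirical quantile of a list of sampled values."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def prediction_interval(samples, lower=0.1, upper=0.9):
    """Return the (lower, upper) quantiles of the sampled forecasts."""
    return empirical_quantile(samples, lower), empirical_quantile(samples, upper)

# Toy stand-in for forecasts sampled from the model at a single time step.
samples = [float(v) for v in range(1, 11)]
low, high = prediction_interval(samples)
median = empirical_quantile(samples, 0.5)
```

With more samples the interval becomes smoother; the number of samples is a trade-off between runtime and the stability of the estimated quantiles.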
DeepAR uses LSTM networks to create probabilistic outputs.
Long Short-Term Memory Networks (LSTMs) are used in numerous time-series forecasting architectures. For example, we can use:
- Plain LSTMs
- Multi-stacked LSTMs
- LSTMs with CNNs
- LSTMs with Time2Vec
- LSTMs in encoder-decoder topology
- LSTMs in encoder-decoder topology with attention [3] (Figure 1)
Moreover, while it is true that Transformers dominate the NLP field, they don’t decisively outperform LSTMs in time-series-related tasks. The main reason is that LSTMs are more adept at handling local temporal data.
For more information regarding Recurrent networks vs Transformers, check this article.
Contrary to the previous models, DeepAR uses LSTMs a bit differently:
Instead of using LSTMs to calculate predictions directly, DeepAR leverages LSTMs to parameterize a Gaussian likelihood function. That is, to estimate the θ = (μ, σ) parameters (mean and standard deviation) of the Gaussian function.
Figure 2 and Figure 3 show the architecture overview of DeepAR in training and inference modes:
Let’s start with training. Suppose we are at the time step t of the time-series i:
- First, the LSTM cell takes as input the covariates x_i,t of the current time step t and the target variable z_i,t-1 of the previous time step t-1. Also, the LSTM receives the hidden state h_i,t-1 of the previous time step.
- Then, the LSTM cell outputs its hidden state h_i,t, which is fed to the next step.
- The μ and σ values are indirectly computed from h_i,t and ‘become’ the parameters of a Gaussian likelihood function p(y_i|θ_i) = ℓ(z_i,t|θ_i,t). The paper denotes those parameters with the Greek letter theta, θ = (μ, σ). Don’t worry if you don’t understand this part; we will explain it later in more detail.
- In other words, the model tries to answer this: what are the best parameters μ and σ that construct a Gaussian distribution which outputs predictions as close to the target variable z_i,t as possible?
- This concludes the training step t. The current target value z_i,t and hidden state h_i,t are passed to the next time step and the training process continues. Since each step takes the previous step’s target value as input and DeepAR trains (and predicts) a single data point each time, the model is called autoregressive.
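The training steps above can be sketched in a few lines of Python. This is only a sketch of the data flow, under loud assumptions: `toy_cell` is a one-dimensional tanh recurrence standing in for the real LSTM cell, `project` is a hypothetical fixed mapping from the hidden state to (μ, σ), and all numeric weights are made up. It is not DeepAR’s actual implementation.

```python
import math

def toy_cell(x_t, z_prev, h_prev):
    """Stand-in for the LSTM cell (assumption): mixes the covariate,
    the previous target and the previous hidden state."""
    return math.tanh(0.5 * x_t + 0.3 * z_prev + 0.2 * h_prev)

def project(h_t):
    """Hypothetical mapping from the hidden state to (mu, sigma)."""
    mu = 2.0 * h_t
    sigma = math.log1p(math.exp(h_t))  # softplus keeps sigma positive
    return mu, sigma

def gaussian_nll(z, mu, sigma):
    """Negative log-likelihood of observation z under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (z - mu) ** 2 / (2 * sigma ** 2)

def training_pass(covariates, targets, h0=0.0, z0=0.0):
    """Walk through one series step by step and accumulate the loss."""
    h, z_prev, total = h0, z0, 0.0
    for x_t, z_t in zip(covariates, targets):
        h = toy_cell(x_t, z_prev, h)      # new hidden state h_i,t
        mu, sigma = project(h)            # parameters theta = (mu, sigma)
        total += gaussian_nll(z_t, mu, sigma)
        z_prev = z_t                      # the observed target feeds the next step
    return total

loss = training_pass([0.1, 0.2, 0.3], [1.0, 1.2, 0.9])
```

In a real model, the weights inside the cell and the projection would be updated by backpropagating this loss; here they are fixed constants so the flow of x_i,t, z_i,t-1 and h_i,t-1 stays visible.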
The steps for inference are pretty much the same.
One thing changes though: Now, at each inference step t, we use the predicted variable ž_i,t-1 that was sampled in the previous time step t-1 to calculate the new prediction ž_i,t.
Remember, the ž_i,t are now sampled from the Gaussian distribution that our model has learned during training. However, our model does not learn the parameters μ and σ directly.
We will see how those parameters are calculated in the next section.
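The inference loop described above, where each sampled value is fed back as the next input, can be sketched as follows. As before, `toy_step` is an assumed one-dimensional stand-in for the trained LSTM plus its output layers, and the weights, covariates and starting values are invented for illustration.

```python
import math
import random

def toy_step(x_t, z_prev, h_prev):
    """Stand-in for the trained network (assumption): returns the new
    hidden state and the Gaussian parameters for the current step."""
    h_t = math.tanh(0.5 * x_t + 0.3 * z_prev + 0.2 * h_prev)
    mu = 2.0 * h_t
    sigma = math.log1p(math.exp(h_t))  # softplus, always positive
    return h_t, mu, sigma

def sample_path(future_covariates, h_last, z_last, rng):
    """Generate one forecast trajectory by feeding each sample back in."""
    h, z_prev, path = h_last, z_last, []
    for x_t in future_covariates:
        h, mu, sigma = toy_step(x_t, z_prev, h)
        z_prev = rng.gauss(mu, sigma)  # sampled value replaces the unknown target
        path.append(z_prev)
    return path

rng = random.Random(0)
paths = [sample_path([0.1, 0.2, 0.3], 0.0, 1.0, rng) for _ in range(200)]
```

Repeating the loop many times yields a collection of trajectories; quantiles taken across them at each time step give the prediction intervals.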
Before delving into how DeepAR’s autoregressive nature works, it is important to understand how the likelihood function works. If you are familiar with this concept, you can skip this section.
The goal of maximum likelihood estimation is to find the parameters of a distribution that best explain our sample data.
Let’s assume our data follow the Gaussian (normal) distribution. Each Gaussian distribution is parameterized by the mean μ and standard deviation σ, that is θ = (μ, σ). Hence, the Gaussian likelihood ℓ, given θ = (μ, σ), is defined as:
$$\ell(z \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)$$
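As a quick numerical illustration, the Gaussian likelihood, exp(-(z - μ)² / (2σ²)) / sqrt(2πσ²), can be written as a small Python function. The function name `gaussian_likelihood` is chosen here for illustration only.

```python
import math

def gaussian_likelihood(z, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at the observation z."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# An observation at the mean is more likely than one two standard deviations away.
at_mean = gaussian_likelihood(0.0, mu=0.0, sigma=1.0)
far_away = gaussian_likelihood(2.0, mu=0.0, sigma=1.0)
```

The closer an observation lies to μ (relative to σ), the higher the value this function returns, which is exactly what the training objective rewards.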
Now, take a look at Figure 4:
We have the green and orange data points, each following a different Gaussian distribution. Let’s assume you are given those data points and your goal is to estimate their two gaussian distributions.
More formally, the task is to find the best μ and σ of the two distributions that optimally fit those data (DeepAR assumes only one distribution). In statistics, this task is also called maximizing the Gaussian log-likelihood function:
$$\mathcal{L} = \sum_{i=1}^{N} \sum_{t=t_0}^{T} \log \ell\left(z_{i,t} \mid \theta(h_{i,t})\right)$$
The function is maximized over all time steps t ∈ [t_0 … T] and all series i ∈ [1 … N], with N being the total number of time-series in our dataset.
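The log-likelihood objective, a sum of log ℓ(z_i,t | θ_i,t) over every series i and every time step t, can be sketched like this. The data layout (nested lists, one (μ, σ) pair per data point) and the toy numbers are assumptions made to keep the example self-contained.

```python
import math

def log_likelihood(z, mu, sigma):
    """Log of the Gaussian density N(mu, sigma^2) at z."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def total_log_likelihood(series, params):
    """series[i][t] is z_i,t; params[i][t] is the (mu, sigma) pair for that point."""
    return sum(
        log_likelihood(z, mu, sigma)
        for zs, ps in zip(series, params)
        for z, (mu, sigma) in zip(zs, ps)
    )

# Two toy series with very different levels.
series = [[1.0, 2.0, 3.0], [10.0, 12.0, 11.0]]
good = [[(2.0, 1.0)] * 3, [(11.0, 1.0)] * 3]  # parameters centred on each series
bad = [[(0.0, 1.0)] * 3, [(0.0, 1.0)] * 3]    # parameters that ignore the data
better = total_log_likelihood(series, good) > total_log_likelihood(series, bad)
```

Parameters that sit close to the observations give a higher total, so maximizing this sum pushes the model toward distributions that explain the data.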
In statistics, the parameters μ and σ are normally estimated using the MLE formulas (maximum likelihood estimators), which are derived by differentiating the log-likelihood function and setting the derivatives to zero.
We don’t do that here.
Instead, we let the LSTM and 2 Dense layers derive those parameters based on the model’s input. This process is shown in Figure 5:
The process of estimating μ and σ is straightforward:
- First, the LSTM calculates its hidden state h_i,t.
- Then, h_i,t passes through a dense layer W_μ to calculate the mean μ.
- Likewise, the same h_i,t passes through a second dense layer W_σ to calculate the standard deviation σ. In the paper, a softplus activation is applied here so that σ stays positive.
- Now we have μ and σ. The model creates a Gaussian distribution with those parameters and evaluates how likely the actual observation z_i,t is under it. The negative log-likelihood serves as the training loss.
- That concludes training for the time step t. The LSTM weights and the 2 Dense layers W_μ and W_σ are trained during backpropagation.
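The two output layers from the steps above can be sketched in plain Python. The softplus on the σ branch follows the DeepAR paper; the helper names, the two-dimensional hidden state and the weight values are assumptions chosen only to make the example runnable.

```python
import math

def softplus(a):
    """Numerically stable log(1 + exp(a)); the result is always positive."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def dense(h, w, b):
    """A single-output dense layer: dot product plus bias."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def gaussian_head(h, w_mu, b_mu, w_sigma, b_sigma):
    """Map the hidden state h_i,t to the Gaussian parameters (mu, sigma)."""
    mu = dense(h, w_mu, b_mu)
    sigma = softplus(dense(h, w_sigma, b_sigma))
    return mu, sigma

def negative_log_likelihood(z, mu, sigma):
    """Training loss for one observation z_i,t."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (z - mu) ** 2 / (2 * sigma ** 2)

h = [1.0, 2.0]  # hypothetical hidden state
mu, sigma = gaussian_head(h, w_mu=[0.5, -0.25], b_mu=0.1, w_sigma=[0.0, 0.0], b_sigma=0.0)
loss = negative_log_likelihood(0.3, mu, sigma)
```

Because the loss is a differentiable function of the weights in both layers (and of the hidden state), gradients can flow back through μ and σ into the recurrent network.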
In other words, DeepAR computes μ and σ indirectly through h_i,t, W_μ and W_σ. This is done to make their calculation possible through backpropagation.
During inference, we do not have a target variable z_i,t to compare against. DeepAR has already learned all neural network weights and uses them to create the prediction ž_i,t.
That’s it! We have now seen how DeepAR works end-to-end.
In the following sections, we will explain a few more mechanisms of DeepAR.
Note: The estimated mean and standard deviation parameters are formally symbolized in statistics with μ̂ (μ hat) and σ̂ (σ hat).
Dealing with multiple heterogeneous time-series is tricky.
Imagine a product sales forecasting scenario: One product may have sales in the order of hundreds, while a different product can have sales in the order of millions.
This tremendous difference among time-series with different magnitudes could potentially confuse the model. To overcome this, DeepAR introduces an auto-scaling mechanism. More specifically, the model calculates an item-dependent scaling factor ν_i to rescale the autoregressive inputs z_i,t. This is given by the following formula:
$$\nu_i = 1 + \frac{1}{t_0} \sum_{t=1}^{t_0} z_{i,t}$$
Hence, at each time step t, the autoregressive inputs from the previous step are first divided by this factor.
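A minimal Python sketch of this scaling step, assuming the factor is one plus the average of the series over its conditioning range (as in the DeepAR paper) and that the inputs are divided by it. The function names and the two toy series are illustrative assumptions.

```python
def scale_factor(history):
    """Item-dependent factor: 1 + mean of the series over the conditioning range."""
    if not history:
        raise ValueError("history must not be empty")
    return 1.0 + sum(history) / len(history)

def scale_inputs(history):
    """Divide the autoregressive inputs by the series' own scale factor."""
    nu = scale_factor(history)
    return [z / nu for z in history], nu

# Two products whose sales differ by four orders of magnitude.
small_scaled, small_nu = scale_inputs([100.0, 200.0, 300.0])
large_scaled, large_nu = scale_inputs([1_000_000.0, 2_000_000.0, 3_000_000.0])
```

After scaling, both series live in roughly the same numeric range, so the network sees comparable inputs regardless of each product’s sales volume.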
Note: The auto-scaling mechanism of DeepAR works very well. However, in practice, it is often preferable to manually normalize our time-series first, as this tends to improve the model’s performance.
In this section, we discuss how DeepAR competes with other models as well as its limitations.
Statistical models
The authors showed that DeepAR outperformed traditional statistical methods such as ARIMA. Also, the great advantage of DeepAR over those models is that it does not require extra feature preprocessing (e.g., making the time-series stationary first).
Amazon later released an updated version, called DeepVAR[4], which significantly improves performance. We will describe this model in a future article.
Deep Learning models
Since DeepAR was released, the research community has published numerous deep-learning models for time-series forecasting.
Not all of them can be directly compared to DeepAR because they work differently. To the best of my knowledge, the closest one that I can think of is Temporal Fusion Transformer (TFT) [5].
Let’s discuss two notable differences between DeepAR and TFT:
1. Multiple Time-Series
DeepAR calculates a separate embedding for each time-series. This embedding is then used as a feature for the LSTM and helps DeepAR to distinguish the different time-series.
TFT also utilizes LSTMs and works similarly. However, TFT uses those embeddings to configure the initial hidden state h_0 of the LSTM. This approach is much better because TFT properly conditions the LSTM cell on each time-series without altering the temporal dynamics.
2. Type of Forecasting
TFT is not an autoregressive model — it is classified as a multi-horizon forecasting model. Both types of models can output multi-step predictions. However, multi-horizon forecasting models produce predictions in one go, instead of providing them one by one like autoregressive models do.
The advantage of this approach is that multi-horizon forecasting models can create predictions for time steps for which their covariates don’t have any values. TFT excels in this category, as it is one of the most versatile models in terms of feature variety.
DeepAR is a remarkable Deep Learning model that constitutes a milestone for the time-series community.
Also, this model is prevalent in production: It is part of Amazon’s GluonTS [6] toolkit for time-series forecasting and can be trained on Amazon SageMaker.
In the next article, we will use DeepAR to create an end-to-end project.
Stay tuned!