DeepAR: Mastering Time-Series Forecasting with Deep Learning | by Nikos Kafritsas | Nov, 2022


Amazon’s autoregressive deep network

Created with Stable Diffusion [1]

A few years ago, time-series models worked on a single sequence only.

Hence, if we had multiple time-series, one option was to create one model per sequence. Or, if we could “tabularize” our data, we could apply the gradient-boosted tree models — which work very well even today.

The first model that could natively work on multiple time-series was DeepAR [2], an autoregressive recurrent network developed by Amazon.

In this article, we will see how DeepAR works in-depth and why it is a milestone for the time-series community.

DeepAR is the first successful model to combine Deep Learning with traditional Probabilistic Forecasting.

Let’s see why DeepAR stands out:

  • Multiple time-series support: The model is trained on multiple time-series, learning global characteristics that further enhance forecasting accuracy.
  • Extra covariates: DeepAR allows extra features (covariates). For instance, if your task is temperature forecasting, you can include humidity level, air pressure, etc.
  • Probabilistic output: Instead of making a single prediction, the model outputs the parameters of a probability distribution, and prediction intervals are obtained from samples drawn from it.
  • “Cold” forecasting: By learning from thousands of time-series that potentially share a few similarities, DeepAR can provide forecasts for time-series that have little or no history at all.

DeepAR uses LSTM networks to create probabilistic outputs.

Long Short-Term Memory Networks (LSTMs) are used in numerous time-series forecasting architectures. For example, we can use:

  • Plain LSTMs
  • Multi-stacked LSTMs
  • LSTMs with CNNs
  • LSTMs with Time2Vec
  • LSTMs in encoder-decoder topology
  • LSTMs in encoder-decoder topology with attention [3] (Figure 1)
Figure 1: The Google Neural Machine Translation — GNMT architecture (Source)

Moreover, while it is true that Transformers dominate the NLP field, they don’t decisively outperform LSTMs in time-series-related tasks. The main reason is that LSTMs are more adept at handling local temporal data.

For more information regarding Recurrent networks vs Transformers, check this article.

Contrary to the previous models, DeepAR uses LSTMs a bit differently:

Instead of using LSTMs to calculate predictions directly, DeepAR leverages LSTMs to parameterize a Gaussian likelihood function. That is, the network estimates the parameters θ = (μ, σ) (mean and standard deviation) of a Gaussian distribution.

Figure 2 and Figure 3 show the architecture overview of DeepAR in training and inference modes:

Figure 2: Mathematical operations in DeepAR during training (Source)

Let’s start with training. Suppose we are at time step t of time-series i:

  1. First, the LSTM cell takes as input the covariates x_i,t of the current time step t and the target variable z_i,t-1 of the previous time step t-1. The LSTM also receives the hidden state h_i,t-1 of the previous time step.
  2. Then, the LSTM cell outputs its hidden state h_i,t, which is fed to the next step.
  3. The μ and σ values are indirectly computed from h_i,t and become the parameters of a Gaussian likelihood function p(y_i|θ_i) = ℓ(z_i,t|θ_i,t). The paper denotes those parameters with the Greek letter theta, θ = (μ, σ). Don’t worry if you don’t understand this part; we will explain it later in more detail.
  4. In other words, the model tries to answer this: which parameters μ and σ define the Gaussian distribution that makes the observed target z_i,t as likely as possible?
  5. This concludes training step t. The current target value z_i,t and hidden state h_i,t are passed to the next time step and the training process continues. Since DeepAR processes one data point at a time and feeds each value back as input to the next step, the model is called autoregressive.
Figure 3: Mathematical operations in DeepAR during inference (Source)

The steps for inference are pretty much the same.

One thing changes, though: at each inference step t, we use the predicted value ž_i,t-1 that was sampled in the previous time step t-1 to calculate the new prediction ž_i,t.

Remember, the ž_i,t values are now sampled from the Gaussian distribution whose parameters the trained model outputs. However, our model does not learn the parameters μ and σ directly.
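The inference loop described above can be sketched in a few lines. This is only an illustration: `toy_step` is a hypothetical stand-in for the trained LSTM and its output layers, and its arithmetic is arbitrary. What matters is the pattern, where each sampled value becomes the input of the next step and many sampled paths are summarized as quantiles.

```python
import math
import random

def toy_step(z_prev, x_t, h_prev):
    """Hypothetical stand-in for the trained LSTM plus output layers.

    Returns (mu, sigma, h) for one time step. A real DeepAR model would
    compute these with learned weights; the arithmetic here is arbitrary.
    """
    h = math.tanh(0.5 * h_prev + 0.3 * z_prev + 0.2 * x_t)
    mu = 2.0 * h
    sigma = 0.1 + 0.05 * abs(h)  # kept strictly positive
    return mu, sigma, h

def sample_path(z_last, h_last, future_covariates, rng):
    """Draw one forecast trajectory by feeding each sample back as input."""
    path = []
    z_prev, h = z_last, h_last
    for x_t in future_covariates:
        mu, sigma, h = toy_step(z_prev, x_t, h)
        z_prev = rng.gauss(mu, sigma)  # the sampled value becomes the next input
        path.append(z_prev)
    return path

def quantile(sorted_vals, q):
    """Simple empirical quantile on an already sorted list."""
    return sorted_vals[int(q * (len(sorted_vals) - 1))]

rng = random.Random(0)
covs = [0.1, 0.2, 0.3, 0.4]  # made-up future covariates
paths = [sample_path(1.0, 0.0, covs, rng) for _ in range(200)]

# Empirical 10%, 50% and 90% quantiles for each step of the horizon.
for step in range(len(covs)):
    vals = sorted(p[step] for p in paths)
    print(step, quantile(vals, 0.1), quantile(vals, 0.5), quantile(vals, 0.9))
```

Repeating the loop many times is what turns a one-step Gaussian into a multi-step prediction interval: the spread of the sampled paths at each step gives the interval.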

We will see how those parameters are calculated in the next section.

Before delving into how DeepAR’s autoregressive nature works, it is important to understand how the likelihood function works. If you are familiar with this concept, you can skip this section.

The goal of maximum likelihood estimation is to find the parameters of a distribution that best explain our sample data.

Let’s assume our data follow the Gaussian (normal) distribution. Each Gaussian distribution is parameterized by the mean μ and standard deviation σ, that is, θ = (μ, σ). Hence, the Gaussian likelihood ℓ, given θ = (μ, σ), is defined as:
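Written out in its standard form, which is also the form the DeepAR paper uses for real-valued data, the Gaussian likelihood of a single observation z is:

```latex
\ell_G(z \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
  \exp\!\left(-\frac{(z-\mu)^{2}}{2\sigma^{2}}\right)
```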

Now, take a look at Figure 4:

Figure 4: Maximum likelihood estimation of 2 Gaussian distributions (Image by author)

We have the green and orange data points, each following a different Gaussian distribution. Let’s assume you are given those data points and your goal is to estimate their two Gaussian distributions.

More formally, the task is to find the μ and σ of each distribution that best fit those data (DeepAR assumes only one distribution). In statistics, this task is called maximizing the Gaussian log-likelihood function:
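In the notation of the training steps above, with the parameters θ computed from the hidden state h_i,t, the objective in the DeepAR paper can be written as follows. The limits t_0 and T, the first and last time steps included in the sum, follow the paper's notation and are an assumption relative to this article.

```latex
\mathcal{L} = \sum_{i=1}^{N} \sum_{t=t_0}^{T}
  \log \ell\bigl(z_{i,t} \mid \theta(h_{i,t})\bigr)
```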

The function is maximized over all time steps t in the training range (up to τmax) and over all series i ∈ [1 … N], with N being the total number of time-series in our dataset.

In statistics, the parameters μ and σ are normally estimated using the MLE formulas (maximum likelihood estimators), which are derived by differentiating the log-likelihood function and setting the derivatives to zero.
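For a Gaussian, those closed-form estimators are simply the sample mean and the population standard deviation (the one that divides by n rather than n - 1). A minimal sketch with made-up data:

```python
import statistics

# Made-up sample, chosen so the estimates come out as round numbers.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu_hat = statistics.fmean(data)      # MLE of the mean
sigma_hat = statistics.pstdev(data)  # MLE of the standard deviation (divides by n)

print(mu_hat, sigma_hat)
```

For this sample the estimates are a mean of 5 and a standard deviation of 2.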

We don’t do that here.

Instead, we let the LSTM and two dense layers derive those parameters based on the model’s input. This process is shown in Figure 5:

Figure 5: Parameter calculation of μ and σ (Image by author)

The process of estimating μ and σ is straightforward:

  • First, the LSTM calculates its hidden state h_i,t.
  • Then, h_i,t passes through a dense layer W_μ to calculate the mean μ.
  • Likewise, the same h_i,t passes through a second dense layer W_σ to calculate the standard deviation σ.
  • Now we have μ and σ. The model builds a Gaussian distribution with those parameters and evaluates how likely the actual observation z_i,t is under it. No sample needs to be drawn during training.
  • That concludes training for time step t. The LSTM weights and the two dense layers W_μ and W_σ are updated during backpropagation.
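The steps above can be sketched without any deep learning library. The hidden state, weights and biases below are made-up numbers, not learned values. The softplus on the σ output follows the DeepAR paper, which uses it to keep the standard deviation positive. The loss is the negative Gaussian log-likelihood, so minimizing it maximizes the likelihood.

```python
import math

def softplus(a):
    """Smooth map from any real number to a positive one (numerically stable)."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def gaussian_heads(h, w_mu, b_mu, w_sigma, b_sigma):
    """Project a hidden state onto (mu, sigma) with two dense layers."""
    mu = dot(w_mu, h) + b_mu
    sigma = softplus(dot(w_sigma, h) + b_sigma)  # softplus keeps sigma > 0
    return mu, sigma

def gaussian_nll(z, mu, sigma):
    """Negative log-likelihood of observation z under N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (z - mu) ** 2 / (2.0 * sigma ** 2)

h_t = [0.2, -0.1, 0.4]  # hypothetical LSTM hidden state at time step t
mu, sigma = gaussian_heads(h_t, [1.0, 0.5, -0.3], 0.1, [0.2, 0.2, 0.2], 0.0)
loss = gaussian_nll(z=0.3, mu=mu, sigma=sigma)  # z is the observed target
print(mu, sigma, loss)
```

In a real model, an automatic differentiation framework would backpropagate this loss through both dense layers and the LSTM.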

In other words, DeepAR computes μ and σ indirectly through h_i,t, W_μ and W_σ. This is done so that they can be learned through backpropagation.

During inference, we do not have a target variable z_i,t to compare. DeepAR has already learned all neural network weights and uses them to create the prediction ž_i,t.

That’s it! We have now seen how DeepAR works end-to-end.

In the following sections, we will explain a few more mechanisms of DeepAR.

Note: In statistics, the estimated mean and standard deviation are formally denoted μ̂ and σ̂ (μ hat and σ hat).

Dealing with multiple heterogeneous time-series is tricky.

Imagine a product sales forecasting scenario: one product may have sales on the order of hundreds, while a different product can have sales on the order of millions.

This tremendous difference in magnitude among time-series could confuse the model. To overcome this, DeepAR introduces an auto-scaling mechanism. More specifically, the model calculates an item-dependent scale factor ν_i to rescale the autoregressive inputs z_i,t. This is given by the following formula:
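In the DeepAR paper, the scale factor is one plus the average of the series over its conditioning range, where t_0 denotes the length of that range:

```latex
\nu_i = 1 + \frac{1}{t_0} \sum_{t=1}^{t_0} z_{i,t}
```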

Hence, at each time step t, the autoregressive inputs z_i,t from the previous step are first divided by this factor.
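A minimal sketch of this rescaling, assuming the scale factor is one plus the mean of the historical values, as in the DeepAR paper. The two series are made up to mirror the sales example: one in the hundreds, one in the millions.

```python
def scale_factor(history):
    """nu_i = 1 + mean of the historical (conditioning-range) values."""
    return 1.0 + sum(history) / len(history)

small = [120.0, 80.0, 100.0]   # sales in the hundreds
large = [1.2e6, 0.8e6, 1.0e6]  # sales in the millions

nu_small = scale_factor(small)
nu_large = scale_factor(large)

# Dividing by nu_i brings both series to a comparable range.
scaled_small = [z / nu_small for z in small]
scaled_large = [z / nu_large for z in large]

print(nu_small, scaled_small)
print(nu_large, scaled_large)
```

After scaling, both series fluctuate around 1, so the network sees inputs of similar magnitude regardless of the original units.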

Note: The auto-scaling mechanism of DeepAR works very well. However, in practice, it is preferable to manually normalize our time-series first, as this often improves the model’s performance.

In this section, we discuss how DeepAR competes with other models as well as its limitations.

Statistical models

The authors showed that DeepAR outperformed traditional statistical methods such as ARIMA. Also, the great advantage of DeepAR over those models is that it does not require extra feature preprocessing (e.g., making the time-series stationary first).

Amazon later released an updated version, called DeepVAR [4], which significantly improves performance. We will describe this model in a future article.

Deep Learning models

Since DeepAR was released, the research community has published numerous deep-learning models for time-series forecasting.

Not all of them can be directly compared to DeepAR because they work differently. To the best of my knowledge, the closest one is the Temporal Fusion Transformer (TFT) [5].

Let’s discuss two notable differences between DeepAR and TFT:

1. Multiple Time-Series
DeepAR calculates a separate embedding for each time-series. This embedding is then used as a feature for the LSTM and helps DeepAR to distinguish the different time-series.

TFT also utilizes LSTMs and works similarly. However, TFT uses those embeddings to configure the initial hidden state h_0 of the LSTM. This approach is much better because TFT properly conditions the LSTM cell on each time-series without altering the temporal dynamics.

2. Type of Forecasting
TFT is not an autoregressive model — it is classified as a multi-horizon forecasting model. Both types of models can output multi-step predictions. However, multi-horizon forecasting models produce predictions in one go, instead of providing them one by one like autoregressive models do.

The advantage of this approach is that multi-horizon forecasting models can create predictions for time steps for which their covariates don’t have any values. TFT excels in this category, as it is one of the most versatile models in terms of feature variety.

DeepAR is a remarkable Deep Learning model that constitutes a milestone for the time-series community.

Also, this model is prevalent in production: It is part of Amazon’s GluonTS [6] toolkit for time-series forecasting and can be trained on Amazon SageMaker.

In the next article, we will use DeepAR to create an end-to-end project.
Stay tuned!


