Deep GPVAR: Upgrading DeepAR For Multi-Dimensional Forecasting | by Nikos Kafritsas | Mar, 2023

By Jessie Hobb On Mar 24, 2023

Amazon’s new Time-Series Forecasting model

What is the most enjoyable thing when you read a new paper? For me, this is the following:

Imagine a popular model suddenly getting upgraded — with just a few elegant tweaks.

Three years after DeepAR [1], Amazon engineers published its revamped version, known as Deep GPVAR [2] (Deep Gaussian-Process Vector Auto-regressive).

This is a much-improved model of the original version. Plus, it is open-source. In this article, we discuss:

How Deep GPVAR works in depth.
How DeepAR and Deep GPVAR are different.
What problems does Deep GPVAR solve and why it’s better than DeepAR?
A hands-on tutorial on energy demand forecasting.

If you are interested in Time-Series Forecasting, check my list of the Best Deep Learning Forecasting Models.

Deep GPVAR is an autoregressive DL model that leverages low-rank Gaussian Processes to model thousands of time-series jointly, by considering their interdependencies.

Let’s briefly review the advantages of using Deep GPVAR :

Multiple time-series support: The model uses multiple time-series data to learn global characteristics, improving its ability to accurately forecast.
Extra covariates: Deep GPVAR allows extra features (covariates).
Scalability: The model leverages low-rank gaussian distribution to scale training to multiple time series simultaneously.
Multi-dimensional modeling: Compared to other global forecasting models, Deep GPVAR models time series together, rather than individually. This allows for improved forecasting by considering their interdependencies.

The last part is what differentiates Deep GPVAR from DeepAR. We will discuss this more in the next section.

A global model trained on multiple time series is not a new concept. But why the need for a global model?

At my previous company, where clients were interested in time-series forecasting projects, the main request was something like this:

“We have 10,000 time-series, and we would like to create a single model, instead of 10,000 individual models.”

The time series could represent product sales, option prices, atmospheric pollution, etc. — it doesn’t matter. What’s important here is that a company needs a lot of resources to train, evaluate, deploy, and monitor (for concept drift) 10,000 time series in production.

So, that is a good reason. Also, at that time, there was no N-BEATS or Temporal Fusion Transformer.

However, if we are to create a global model, what should it learn? Should the model just learn a clever mapping that conditions each time series based on the input? But, this assumes that time series are independent.

Or should the model learn global temporal patterns that apply to all time series in the dataset?

Deep GPVAR builds upon DeepAR by seeking a more advanced way to utilize the dependencies between multiple time series for improved forecasting.

For many tasks, this makes sense.

A model that considers time series of a global dataset as independent loses the ability to effectively utilize their relationships in applications such as finance and retail. For instance, risk-minimizing portfolios require a forecast of the covariance of assets, and a probabilistic forecast for different sellers must consider competition and cannibalization effects.

Therefore, a robust global forecasting model cannot assume the underlying time series are independent.

Deep GPVAR is differentiated from DeepAR in two things:

High-dimensional estimation: Deep GPVAR models time series together, factoring in their relationships. For this purpose, the model estimates their covariance matrix using a low-rank Gaussian approximation.
Scaling: Deep GPVAR does not simply normalize each time series, like its predecessor. Instead, the model learns how to scale each time series by transforming them first using Gaussian Copulas.

The following sections describe how these two concepts work in detail.

As we said earlier, one of the best ways to study the relationships of multiple time series is to estimate the covariate matrix.

However, scaling this task for thousands of time series is not easily accomplished — due to memory and numerical stability limitations. Plus, covariance matrix estimation is a time-consuming process — the covariance should be estimated for every time window during training.

To address this issue, the authors simplify the covariance estimation using low-rank approximation.

Let’s start with the basics. Below is the matrix form of a Multivariate normalN∼(μ,Σ) with mean μ ∈ (k,1) and covariance Σ ∈ (k,k)

**Equation 1:** Multivariate Gaussian distribution in matrix form

The problem here is the size of the covariance matrix Σ that is quadratic with respect to N, the number of time series in the dataset.

We can address this challenge using an approximated form, called the low-rank Gaussian approximation. This method has its roots in factor analysis and is closely related to SVD (Singular Value Decomposition).

Instead of computing the full covariance matrix of size (N,N), we can approximate by computing instead:

Equation 2: Low-rank covariance matrix formula

where D ∈ R(N,N) is a diagonal matrix and V ∈ R(N,r).

But why do we represent the covariance matrix using the low-rank format? Because sincer<<N, it is proved that the Gaussian likelihood can be computed using O(Nr² + r³) operations instead of O(N³) (the proof can be found in the paper’s Appendix).

The low-rank normal distribution is part of PyTorch’s distributions module. Feel free to experiment and see how it works:

# multivariate low-rank normal of mean=[0,0], cov_diag=[1,1], cov_factor=[[1],[0]]# covariance_matrix = cov_diag + cov_factor @ cov_factor.T
m = LowRankMultivariateNormal(torch.zeros(2), torch.tensor([[1.], [0.]]), 
torch.ones(2))
m.sample()  
#tensor([-0.2102, -0.5429])

Notation: From now on, all variables in bold are considered either vectors or matrices.

Now that we have seen how low-rank normal approximation works, we delve deeper into Deep GPVAR’s architecture.

First, Deep GPVAR is similar to DeepAR — the model also uses an LSTM network. Let’s assume our dataset contains N time series, indexed from i= [1…N]

At each time step t we have:

An LSTM cell takes as input the target variable z_t-1 of the previous time step t-1 for a subset of time series. Also, the LSTM receives the hidden state h_t-1 of the previous time step.
The model uses the LSTM to compute its hidden vector h_t.
The hidden vector h_t will now be used to compute theμ, Σ parameters of a multivariate Gaussian distribution N∼(μ,Σ). This is a special kind of normal distribution called Gaussian copula (More to that later).

This process is shown in Figure 1:

**Figure 1:** Two training steps of Deep GPVAR. The covariance matrix Σ is expressed as Σ=D+V*V^T (Source)

At each time step t, Deep GPVAR randomly choosesB << N time-series to implement the low-rank parameterization: On the left, the model chooses from (1,2 and 4) time series and on the right, the model chooses from (1,3 and 4).

The authors in their experiments have configured B = 20. With a dataset containing potentially over N = 1000 time series, the benefit of this approach becomes clear.

There are 3 parameters that our model estimates:

The means μ of the normal distribution.
The covariance matrix parameters are d and v according to Equation 2.

They are all calculated from the h_t LSTM hidden vector. Figure 2 shows the low-rank covariance parameters:

**Figure 2:** Parameterization of the low-rank covariance matrix, as expressed in **Equation 2**

Careful with notation: μ_i, d_i, v_i refer to thei-th time series in our dataset, where i ∈ [1..N].

For each time-series i, we create the y_i vector, which concatenates h_i with e_i — the e_i vector contains the features/covariates of the i-th time series. Hence we have:

Figure 3 displays a training snapshot for a time step t:

**Figure 3:** A detailed architecture of low-rank parameterization

Notice that:

μ and d are scalars, while v is vector. For example, μ = w_μ^T * y with dimensions (1,p)*(p,1), therefore μ is a scalar.
The same LSTM is used for all time series, including the dense layer projections w_μ ,w_d ,w_u.

Hence, the neural network parameters w_μ ,w_d ,w_uandy are used to compute the μ and Σ parameters which are shown in Figure 3.

Question: What does the μ and Σ parameterize?

They parameterize a special kind of multivariate Gaussian distribution, called Gaussian Copula.

But why does Deep GPVAR needs a Gaussian Copula?

Remember, Deep GPVAR does multi-dimensional forecasting, so we cannot use a simple univariate Gaussian, like in DeepAR.

Ok. So why not use our familiar multivariate Gaussian distribution instead of a Copula function — like the one shown in Equation 1?

2 reasons:

1) Because a multivariate Gaussian distribution requires gaussian random variables as marginals. We could also use mixtures, but they are too complex and not applicable in every situation. Conversely, Gaussian Copulas are easier to use and can work with any input distribution — and by input distribution, we mean an individual time series from our dataset.

Hence, the copula learns to estimate the underlying data distributions without making assumptions about the data.

2) The Gaussian Copula can model the dependency structure among these different distributions by controlling the parameterization of the covariance matrix Σ. That’s how Deep GPVAR learns to consider the interdependencies among input time series, something other global forecasting models don’t do.

Remember: time series can have unpredictable interdependencies. For example, in retail, we have product cannibalization: a successful product pulls demand away from similar items in its category. So, we should also factor in this phenomenon when we forecast product sales. With copulas, Deep GPVAR learns those interdependencies automatically.

What are copulas?

A copula is a mathematical function that describes the correlation between multiple random variables.

Note: If you are completely unfamiliar with copulas, this article explains in-depth what copulas are and how we construct them.

Copulas are heavily used in quantitative finance for portfolio risk assessment. Their misuse also played a significant role in the 2008 recession.

More formally, a copula function C is a CDF of N random variables, where each random variable (marginal) is uniformly distributed:

Figure 4 below shows the plot of a bivariate copula, consisting of 2 marginal distributions. The copula is defined in the [0–1]² domain (x, y-axis) and outputs values in [0–1] (z-axis):

**Figure 4:** A Gaussian copula CDF function consisting of 2 beta distributions as marginals

A popular choice for C is the Gaussian Copula — the copula of Figure 4 is also Gaussian.

How we construct copulas

We won’t delve into much detail here, but let’s give a brief overview.

Initially, we have a random vector — a collection of random variables. In our case, each random variable z_i represents the observation of a time series i at a time step t:

Then, we make our variables uniformly distributed using the probability integral transform: The CDF output of any continuous random variable is uniformly distributed,F(z)=U :

And finally, we apply our Gaussian Copula:

where Φ^-1 is the inverse standard gaussian CDF N∼(0,1), and φ (lowercase letter) is a gaussian distribution parameterized with μ and Σ . Note that Φ^-1[F(z)] = x, where x~(0,1) because we use the standard inverse CDF.

So, what happens here?

We can take any continuous random variable, marginalize it as uniform, and then transform it into a gaussian. The chain of operations is the following:

There are 2 transformations here:

F(z) = U, known as probability integral transform. Simply put, this transformation converts any continuous random variable to uniform.
Φ^-1(U)=x, known as inverse sampling. Simply put, this transformation converts any uniform random variable to the distribution of our choice — here Φ is gaussian, so x also becomes a gaussian random variable.

In our case, z are the past observations of a time series in our dataset. Because our model makes no assumptions about how the past observations are distributed, we use the empirical CDF — a special function that calculates the CDF of any distribution non-parametrically (empirically).

Note: If you are not familiar with the empirical CDF, refer to my article for a detailed explanation.

In other words, at the F(z) = U transformation, F is the empirical CDF and not the actual gaussian CDF of the variable z . The authors use m=100 past observations throughout their experiments to calculate the empirical CDF.

Recap of copulas

To sum up, the Gaussian copula is a multivariate function that uses μ and Σ to directly paremeterize the correlation of two or more random variables.

But how a Gaussian copula differs from a Gaussian multivariate probability distribution(PDF)? Besides, a Gaussian Copula is just a multivariate CDF.

The Gaussian Copula can use any random variable as a marginal, not just a Gaussian.
The original distribution of the data does not matter — using the probability integral transform and the empirical CDF, we can transform the original data to gaussian, no matter how they are distributed.

Now that we have seen how copulas work, it’s time to see how Deep GPVAR uses them.

Let’s go back to Figure 3. Using the LSTM, we have computed the μ and Σ parameters. What we do is the following:

Step 1: Transform our observations to Gaussian

Using the copula function, we transform our observed time-series datapoints z to gaussian x using the copula function. The transformation is expressed as:

where f(z_i,t) is actually the marginal transformation Φ^-1(F(z_i)) of the time-series i.

In practice, our model makes no assumptions about how our past observations z are distributed. Therefore, no matter how the original observations are distributed, our model can effectively learn their behavior, including their correlation — all these thanks to the power of Gaussian Copulas.

Step 2: Use the computed parameters for the Gaussian.

I mentioned that we should transform our observations to Gaussian, but what are the parameters of the Gaussian? In other words, when I said that f(z_i) = Φ^-1(F(z_i)), what are the parameters of Φ ?

The answer is the μ and Σ parameters — these are calculated from the dense layers and the LSTM shown in Figure 3.

Step 3: Calculate the loss and update our network parameters

To recap, we transformed our observations to Gaussian and we assume those observations are parameterized by a low-rank normal Gaussian. Thus, we have:

where f1(z1) is the transformed observed prediction for the first time series, f2(z2) refers to the second one and f_n(z_n) refers to the N-th time series of our dataset.

Finally, we train our model by maximizing the multivariate gaussian log-likelihood function. The paper uses the convention of minimizing the loss function — the gaussian log-likelihood preceded with a minus:

Using the gaussian log-likelihood loss, Deep GPVAR updates its LSTM and the shared dense layer weights displayed in Figure 3.

Also, notice:

The z is not a single observation, but the vector of observations from all N time-series at time t. The summation loops until T, the maximum lookup window — upon which the gaussian log-likelihood is evaluated.
And since z is a vector of observations, the gaussian log-likelihood is actually multivariate. In contrast, DeepAR uses a univariate gaussian likelihood.

I recommend going over the article several times to grasp how the model works.

Generally speaking, you can view Deep GPVAR as a Gaussian stochastic process.

For those unfamiliar with stochastic processes, you can think of a stochastic process as a collection of random variables — each variable represents the outcome of a random event at a given time.

The random variables are indexed by some set, usually time, and the values of the random variables depend on each other. Hence the outcome of one event can influence the outcome of future events. This interdependence also explains the temporal nature of a stochastic process.

In our case, each datapoint y_i of a time-series i can be viewed as a random variable. Hence, Deep GPVAR can be considered a Gaussian process GP, evaluated at every data point y as:

Now, the parameters of a gaussian process are functions, not variables. In the equation above, the μ, d, v (with tilde) are time functions, indexed by y — where y is an observation of vectors at a certain time step.

At each time step, the functions μ, d, v (with tilde) which essentially are the LSTM and the Dense layers have a realization μ, d, v (without tilde) . This realization parameterizes the μ and Σ of the copula function at a time step t. Hence during training, at each time-step, μ and Σ change, such that they optimally explain our observations.

Consequently, the notion that Deep GPVAR works as a Gaussian stochastic process is entirely justified.

From the current paper, Amazon created 2 models, Deep GPVAR (which we describe in this article) and DeepVAR.

DeepVAR is similar to Deep GPVAR. The difference is that DeepVAR uses a global multivariate LSTM that receives and predicts all time series at once. On the other hand, Deep GPVAR unrolls the LSTM on each time series separately.

In their experiments, the authors refer to the DeepVAR as Vec-LSTM and Deep GPVAR as GP.

The Vec-LSTM and GP terms are mentioned in Table 1 of the original paper.
The Deep GPVAR and DeepVAR terms are mentioned in Amazon’s Forecasting library Gluon TS.

This article describes the Deep GPVAR variant, which is better on average and has fewer parameters than DeepVAR. Feel free to read the original paper and learn more about the experimental process.

This tutorial uses the ElectricityLoadDiagrams20112014 [4] dataset from UCI. The notebook for this example can be downloaded here:

Note: Currently, the PyTorch Forecasting library provides the DeepVAR version. That’s the variant we show in this tutorial. You can also try all DeepAR variants, including GPVAR from Amazon’s open-souce Gluon TS library.

Download Data

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00321/LD2011_2014.txt.zip
!unzip LD2011_2014.txt.zip

Data Preprocessing

data = pd.read_csv('LD2011_2014.txt', index_col=0, sep=';', decimal=',')
data.index = pd.to_datetime(data.index)
data.sort_index(inplace=True)
data.head(5)

Each column represents a consumer. The values in each cell are the electrical power usages in each quarter-of-an-hour.

Also, we aggregate into hourly data. Due to the model’s size and complexity, we train our model using 5 consumers only (considering only non-zero values).

data = data.resample('1h').mean().replace(0., np.nan)
earliest_time = data.index.min()
df=data[['MT_002', 'MT_004', 'MT_005', 'MT_006', 'MT_008' ]]

Next, we convert our dataset to a special format that PyTorch Forecasting understands — called TimeSeriesDataset. This format is handy because it lets us specify the nature of our features (e.g. if they are time-varying, or static).

Note: If you want to learn more about the TimeSeriesDataset format, check my Temporal Fusion Transformer article that explains in detail how this format works.

To modify our dataset for the TimeSeriesDataset format, we stack all time series vertically instead of horizontally. In other words, we convert our dataframe from “wide” to “long” format.

Also, we create new features from the date column such day and month -those will help our model capture the seasonality dynamics better

df_list = []for label in df:
ts = df[label]
start_date = min(ts.fillna(method='ffill').dropna().index)
end_date = max(ts.fillna(method='bfill').dropna().index)
active_range = (ts.index >= start_date) & (ts.index <= end_date)
ts = ts[active_range].fillna(0.)
tmp = pd.DataFrame({'power_usage': ts})
date = tmp.index
tmp['hours_from_start'] = (date - earliest_time).seconds / 60 / 60 + (date - earliest_time).days * 24
tmp['hours_from_start'] = tmp['hours_from_start'].astype('int')
tmp['days_from_start'] = (date - earliest_time).days
tmp['date'] = date
tmp['consumer_id'] = label
tmp['hour'] = date.hour
tmp['day'] = date.day
tmp['day_of_week'] = date.dayofweek
tmp['month'] = date.month
#stack all time series vertically
df_list.append(tmp)
time_df = pd.concat(df_list).reset_index(drop=True)
# match results in the original paper
time_df = time_df[(time_df['days_from_start'] >= 1096)
& (time_df['days_from_start'] < 1346)].copy()

The final preprocessed dataframe is called time_df. Let’s print its contents:

The time_df is now in the proper format for the TimeSeriesDataset. Moreover, the TimeSeriesDataset format requires a time index for our data. In this case, we use the hours_from_start variable. Also, the power_usage is the target variable our model will try to predict.

Create DataLoaders

In this step, we pass our time_df to the TimeSeriesDataSet format. We do this because:

It spares us from writing our own Dataloader.
We can specify how our model will handle the features.
We can normalize our dataset with ease. In our case, normalization is mandatory because all time sequences differ in magnitude. Thus, we use the GroupNormalizer to normalize each time series individually.

Our model uses a lookback window of one week (7*24) to predict the power usage of the next 24 hours.

Also, notice that the hours_from_start is both the time index and a time-varying feature. For the sake of demonstration, our validation set is the last day:

max_prediction_length = 24
max_encoder_length = 7*24
training_cutoff = time_df["hours_from_start"].max() - max_prediction_lengthtraining = TimeSeriesDataSet(
time_df[lambda x: x.hours_from_start <= training_cutoff],
time_idx="hours_from_start",
target="power_usage",
group_ids=["consumer_id"],
max_encoder_length=max_encoder_length,
max_prediction_length=max_prediction_length,
static_categoricals=["consumer_id"],
time_varying_known_reals=["hours_from_start","day","day_of_week", "month", 'hour'],
time_varying_unknown_reals=['power_usage'],
target_normalizer=GroupNormalizer(
groups=["consumer_id"]
),  # we normalize by group
add_relative_time_idx=True,
add_target_scales=True,
add_encoder_length=True,
)
validation = TimeSeriesDataSet.from_dataset(training, time_df, predict=True, stop_randomization=True, min_prediction_idx=training_cutoff + 1)
# create dataloaders for  our model
batch_size = 64 
# if you have a strong GPU, feel free to increase the number of workers  
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=2, batch_sampler="synchronized")
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=2, batch_sampler="synchronized")

Baseline Model

Also, remember to create a baseline model.

We create a naive baseline that predicts the power usage curve of the previous day:

actuals = torch.cat([y for x, (y, weight) in iter(val_dataloader)])
baseline_predictions = Baseline().predict(val_dataloader)
(actuals - baseline_predictions).abs().mean().item()# ➢25.139617919921875

Building and Training our Model

We are now ready to train our model. We will use the Trainer interface from the PyTorch Lightning library.

First, we have to configure hyperparameters and callbacks:

The EarlyStopping callback to monitor the validation loss.
Tensorboard to log our training and validation metrics.
The hidden_size and rnn_layers refer to the number of LSTN cells and the number of LSTM layers respectively.
We also use gradient_clip_val (gradient clipping) to prevent overfitting. Also, the model has a default dropout of 0.1 — we leave the default value.

Our loss function is theMultivariateNormalDistributionLoss(rank : int) . The function takes the rank parameter as an argument — this is the R value we explained in the beginning.

Remember, the lower the value, the biggest the speedup during training. The authors in their experiments use rank=10 — we do the same here:

pl.seed_everything(42)early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=4, verbose=True, mode="min")
lr_logger = LearningRateMonitor(logging_interval='step', log_momentum=True)  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard
trainer = pl.Trainer(
max_epochs=50,
accelerator='auto', 
devices=1,
enable_model_summary=True,
gradient_clip_val=0.1,
callbacks=[lr_logger, early_stop_callback],
logger=logger,
)
net = DeepAR.from_dataset(
training, 
learning_rate=0.001, 
hidden_size=40, 
rnn_layers=3, 
loss=MultivariateNormalDistributionLoss(rank=10)
)
trainer.fit(
net,
train_dataloaders=train_dataloader,
val_dataloaders=val_dataloader
)

After 6 epochs, EarlyStopping kicks in the training finishes.

Note: If you want to run DeepAR instead of DeepVAR, use the NormalDistributionLoss instead.

Load and Save the Best Model

best_model_path = trainer.checkpoint_callback.best_model_path
print(best_model_path)
best_deepvar = DeepAR.load_from_checkpoint(best_model_path)#path:
#lightning_logs/lightning_logs/version_0/checkpoints/epoch=6-step=40495.ckpt

Don’t forget to download your model:

#zip and download the model
!zip  -r model.zip lightning_logs/lightning_logs/version_0/*

This is how you load the model again — you only have to remember the best model path:

!unzip model.zip
best_model_path='lightning_logs/lightning_logs/version_0/checkpoints/epoch=6-step=40495.ckpt'
best_deepvar = DeepAR.load_from_checkpoint(best_model_path)

Tensorboard Logs

Take a closer look at training and validation curves with Tensorboard. You can start it by executing:

# Start tensorboard
%load_ext tensorboard
%tensorboard --logdir lightning_logs

Model Evaluation

Get predictions on the validation set and calculate the average P50 (quantile median) loss:

actuals = torch.cat([y[0] for x, y in iter(val_dataloader)])
predictions = best_deepvar.predict(val_dataloader)#average p50 loss overall
print((actuals - predictions).abs().mean().item())
#average p50 loss per time series
print((actuals - predictions).abs().mean(axis=1))
# 6.78986
# tensor([ 1.1948,  6.9811,  2.7990,  8.3856, 14.5888])

Notice that we got a slightly worse score compared to the TFT implementation. The loss function we used is probabilistic — thus, each time you will get a slightly different score.

The last two time series have a slightly higher loss because their relative magnitude is high compared to the others.

Plot Predictions on Validation Data

raw_predictions, x = best_deepvar.predict(val_dataloader, mode="raw", return_x=True)for idx in range(5):  # plot all 5 consumers
fig, ax = plt.subplots(figsize=(10, 4))
best_deepvar.plot_prediction(x, raw_predictions, idx=idx, add_loss_to_title=True,ax=ax);

**Figure 5:** Predictions on validation data for MT_002

**Figure 6:** Predictions on validation data for MT_004

**Figure 7:** Predictions on validation data for MT_005

**Figure 8:** Predictions on validation data for MT_006

**Figure 9:** Predictions on validation data for MT_008

The predictions on the validation set are quite impressive.

Also, notice that we did not perform any hyperparameter tuning, meaning we could get even better results.

Plot Predictions For A Specific Time Series

Previously, we plotted predictions on the validation data using the idx argument, which iterates over all time series in our dataset. We can be more specific and output predictions of a specific time series:

fig, ax = plt.subplots(figsize=(10, 5))raw_prediction, x = best_deepvar.predict(
training.filter(lambda x: (x.consumer_id == "MT_004") & (x.time_idx_first_prediction == 26512)),
mode="raw",
return_x=True,
)
best_deepvar.plot_prediction(x, raw_prediction, idx=0, ax=ax);

**Figure 10:** Day ahead predictions for MT_004 on the training set

In Figure 10, we plot the day-ahead of the MT_004 consumer for time index=26512.

Remember, our time-indexing column hours_from_start starts from 26304 and we can get predictions from 26388 onwards (because we set earlier min_encoder_length=max_encoder_length // 2 which equals 26304 + 168//2=26388

Plotting the Covariance Matrix

The biggest advantage of Deep (GP)VAR is the estimation of the covariance matrix — which provides some insight about the interdepencies of the time series in our dataset.

In our case, it makes no sense to do this calculation because we only have 5 time series in our dataset. Nevertheless, here is the code that shows how to do it if you have a dataset with many time series:

cov_matrix = best_deepvar.loss.map_x_to_distribution(
best_deepvar.predict(val_dataloader, mode=("raw", "prediction"), n_samples=None)
).base_dist.covariance_matrix.mean(0)# normalize the covariance matrix diagnoal to 1.0
correlation_matrix = cov_matrix / torch.sqrt(torch.diag(cov_matrix)[None] * torch.diag(cov_matrix)[None].T)
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.imshow(correlation_matrix, cmap="bwr");

Finally, the PyTorch Forecasting library provides additional features such as out-of-sample forecasts and hyperparameter tuning with Optuna. You can learn more about this in the TFT tutorial, where they are explained in-depth. Also, you can find the notebook for this tutorial here.

Deep GPVAR and its variants are a powerful family of TS forecasting, bearing the expertise of Amazon’s research.

The model addresses multivariate forecasting by considering the interdependencies among thousands of time series.

Also, if you want to learn more about the initial architecture of DeepAR, feel free to check this companion article.