
A Simple Approach to Hierarchical Time Series Forecasting with Machine Learning
by Leonie Monigatti, March 2023



How to “boost” your cyclical sales data forecast with LightGBM and Python

Hierarchical time series forecasting (Image drawn by the author)

Welcome to another edition of “The Kaggle Blueprints,” where we will analyze Kaggle competitions’ winning solutions for lessons we can apply to our own data science projects.

This edition reviews the techniques and approaches from the “M5 Forecasting — Accuracy” competition, which concluded in June 2020.

The objective of the “M5 Forecasting — Accuracy” competition was to forecast the next 28 days of 42,840 hierarchical time series of sales data.

Hierarchical time series — Unlike common multivariate time series problems, hierarchical time series can be aggregated on different levels: e.g., item level, store level, and state level. In this competition, the competitors were given over 40,000 time series of 3,000 individual products from 3 different categories, sold in 10 stores across 3 states.

Hierarchical time series (Image by the author)

Cyclical — Sales data is typically cyclical: it contains repeating, time-dependent patterns, such as increasing sales toward the end of the week (weekly cycle), at the beginning of a month (monthly cycle), or during the holidays (annual cycle).

Multistep — The task is to forecast the sales data 28 days into the future (28 steps).

To follow along in this article, your dataset should look something like this:

Insert your data here: How your hierarchical time series data should be formatted (Image by the author)

A popular approach among competitors was formulating the time series forecasting problem as a regression problem and modeling using Machine Learning (ML) [6].

  1. A time series forecasting problem can be formulated as a regression problem by splitting the predictions into single steps — keeping the gap between the historical data and the prediction constant across data points (see the sketch below).
  2. Instead of feeding the sequence of past values to the ML model, you can aggregate the historical data points into historical features.
Time Series Forecasting as a regression problem (Image by the author)

Thus, the main steps to approach a hierarchical time series forecasting problem with ML are:

  1. Building a Simple Baseline
  2. Feature Engineering from Historical Data
  3. Modeling and Validating a Time Series Forecasting Problem with Machine Learning

As with any good ol’ ML problem, we will start by building a simple baseline. With time series forecasting problems, a good starting point is to take the value from the last timestamp as the prediction — the naive approach.

You can improve the naive approach by referencing the last cycle if you have a cyclical time series. For example, if your time series depends on the weekday, you can take the last month, group by the weekday, and take the average [2].
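As a rough sketch of this weekday-average baseline, assuming a dataframe df with a datetime 'date' column and a 'sales' column (both names are illustrative):

import pandas as pd

# Take the last four weeks of history ...
last_month = df[df['date'] > df['date'].max() - pd.Timedelta(days=28)]

# ... and average sales per weekday over that period
weekday_avg = last_month.groupby(last_month['date'].dt.dayofweek)['sales'].mean()

# Use the weekday average as the prediction for the matching weekday
# over the 28-day forecast horizon
forecast_dates = pd.date_range(df['date'].max() + pd.Timedelta(days=1), periods=28)
baseline = pd.Series(forecast_dates.dayofweek, index=forecast_dates).map(weekday_avg)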

Baseline for time series forecasting: naive approach (Image by the author)

In contrast to classical statistical approaches, feature engineering is an essential step when developing an ML model. Thus, instead of feeding the historical data directly to the ML model, you will aggregate the historical data into historical features [4].

Timestamp features

A time series has at least two features: a timestamp and a value. The timestamp alone can be used to create multiple new features.

First, you can extract features from the timestamp by simply decomposing it into its components, e.g., day, week, month, year, etc. [4].

import pandas as pd

# Convert to DateTime
df['date'] = pd.to_datetime(df['date'])

# Make some features from the date
df['day'] = df['date'].dt.day
df['week'] = df['date'].dt.isocalendar().week  # .dt.week is deprecated
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
# etc.

Second, you can create new features based on the date [1, 3]: Is it a weekday or weekend? Is it a holiday? Is a special event happening (e.g., a sports event)?

df['dayofweek'] = df['date'].dt.dayofweek
df['weekend'] = (df['dayofweek'] >= 5).astype(int)  # Saturday=5, Sunday=6
# etc.
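A holiday flag could be sketched like this, assuming a hand-curated list of dates (in practice, a library such as holidays can supply them; the dates below are purely illustrative):

# Hypothetical holiday dates for illustration
holiday_dates = pd.to_datetime(['2023-01-01', '2023-12-25'])
df['holiday'] = df['date'].isin(holiday_dates).astype(int)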

Aggregation features

Next, you can create new features by aggregating the historical data and creating statistical features like the maximum, minimum, standard deviation, and mean [1, 3, 4, 8, 10].

Because we are working with a hierarchical time series, we will group the time series by the levels of the hierarchy (e.g., store_id and item_id).

FEATURE = 'price'
LEVEL_1 = 'store_id'
LEVEL_N = 'item_id'

# Basic aggregations
df[f'{FEATURE}_max'] = df.groupby([LEVEL_1, LEVEL_N])[FEATURE].transform('max')
df[f'{FEATURE}_min'] = df.groupby([LEVEL_1, LEVEL_N])[FEATURE].transform('min')
df[f'{FEATURE}_std'] = df.groupby([LEVEL_1, LEVEL_N])[FEATURE].transform('std')
df[f'{FEATURE}_mean'] = df.groupby([LEVEL_1, LEVEL_N])[FEATURE].transform('mean')

# Normalization (scaling by the max)
df[f'{FEATURE}_norm'] = df[FEATURE]/df[f'{FEATURE}_max']

# Some prices change often (e.g., inflation-dependent items) while others are very "stable"
df[f'{FEATURE}_nunique'] = df.groupby([LEVEL_1, LEVEL_N])[FEATURE].transform('nunique')

# Feature "momentum": ratio of the current value to the previous one
df[f'{FEATURE}_momentum'] = df[FEATURE]/df.groupby([LEVEL_1, LEVEL_N])[FEATURE].transform(lambda x: x.shift(1))

Lag features

A popular feature engineering technique for time series data is to create lagged features [4, 5, 10]. To be able to use this feature on the test data, the lag should be larger than the time gap between the training and test data (i.e., at least as large as the forecast horizon).

Lag of 7 days (Image by the author)
LEVEL = 'store_id'
TARGET = 'sales'
lag = 7

# Shift the target within each group; rows without enough history get 0
df[f"lag_{lag}"] = df.groupby(LEVEL)[TARGET].shift(lag).fillna(0)

Rolling features

Another popular feature engineering technique for time series data is to create features based on a rolling window (e.g., mean or standard deviation) [1, 3, 10].

You can apply this feature engineering technique to the FEATURE directly or even to the lagged version of it.

Mean of a rolling window of 28 days (Image by the author)
window = 28

# Rolling mean within each group; rows without a full window are filled with 0
df[f"rolling_mean_{window}"] = df.groupby(LEVEL)[FEATURE].transform(lambda x: x.rolling(window).mean()).fillna(0)

Hierarchy as categorical features

When working with hierarchical time series, you can also include the node identifiers of the different levels of the hierarchy (e.g., store_id, item_id) as categorical variables [1, 3].
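One way to do this with LightGBM (a sketch; the level names are those used above) is to store the identifiers as pandas category columns, which LightGBM can consume directly:

# Encode hierarchy identifiers as categoricals
for col in ['store_id', 'item_id']:
    df[col] = df[col].astype('category')

# LightGBM picks up pandas categoricals automatically; you can also
# name them explicitly when building the Dataset:
# lgb.Dataset(X_train, label=y_train, categorical_feature=['store_id', 'item_id'])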

Your resulting dataframe should look something like this before we feed it to the ML model:

Training data structure for training an ML (GBDT) model for time series forecasting (Image by the author)

A few differences exist between modeling and validating a regular ML problem (e.g., regression or classification) and a hierarchical time series forecasting problem with ML.

Modeling multivariate and hierarchical time series

Modeling a hierarchical time series problem is similar to modeling a multivariate one.

Modeling multivariate time series — Autoregressive and sequence-to-sequence models can usually only model one time series at a time (a univariate time series problem). Thus, when encountering a multivariate time series problem (like hierarchical time series), you would have to build multiple forecasting models — one model for each time series.

Many competitors used LightGBM, a gradient-boosting framework, for modeling [1, 3, 5, 7, 8, 10]. With LightGBM, you can model multiple time series with a single model instead of building a separate forecasting model for each time series.

Modeling strategies for multivariate time series (Image by the author)

Since the time series data is hierarchical, many competitors grouped similar time series by hierarchy level (e.g., by store) and modeled them together [3, 8, 10].

Modeling strategy for hierarchical time series forecasting with Machine Learning (Image by the author)

Validating forecasting models

When validating a time series forecasting model, it is crucial to keep the temporal order of the time series in mind [6]. If you used the popular KFold cross-validation strategy, you would use future data to predict past events. When forecasting, you must avoid leaking future information into predictions about the past.

Avoid leaking future information to make predictions about the past in time series forecasting validation (Image by the author)

Instead, you should define a few cross-validation periods and then train a model on all the data before each period [3, 8, 10], e.g., one fold for each week (VALIDATION_PERIOD = 7) of the last month (N_FOLDS = 4).

Cross Validation for Time Series Forecasting (Image by the author)

To put everything together, you can use the following code snippet for reference:

from datetime import timedelta

import lightgbm as lgb

N_FOLDS = 4
VALIDATION_PERIOD = 7

for store_id in STORES_IDS:
    # In practice, you would restrict train_df to the current store here,
    # e.g., store_df = train_df[train_df['store_id'] == store_id]
    for fold in range(N_FOLDS):
        # Chronological split: train on everything up to the cutoff date,
        # validate on the following week
        training_date = train_df['timestamp'].max() - timedelta(VALIDATION_PERIOD) * (N_FOLDS - fold)
        valid_date = training_date + timedelta(VALIDATION_PERIOD)
        print(f"\nFold {fold}:"
              f"\ntraining data from {train_df['timestamp'].min()} to {training_date}"
              f"\nvalidation data from {training_date + timedelta(1)} to {valid_date}")

        train = train_df[train_df['timestamp'] <= training_date]
        val = train_df[(train_df['timestamp'] > training_date) & (train_df['timestamp'] <= valid_date)]

        X_train = train[features]
        y_train = train[target]

        X_val = val[features]
        y_val = val[target]

        train_data = lgb.Dataset(X_train, label=y_train)
        valid_data = lgb.Dataset(X_val, label=y_val)

        estimator = lgb.train(lgb_params,
                              train_data,
                              valid_sets=[valid_data],
                              callbacks=[lgb.log_evaluation(100)],  # replaces the deprecated verbose_eval
                              )

When evaluating a hierarchical time series forecasting model, it might make sense to create a simple dashboard [9] to analyze the model’s performance on each level.
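As a minimal sketch of such a per-level analysis, assuming a validation dataframe val with 'sales' and 'prediction' columns (both names are illustrative):

import numpy as np

val['error'] = val['sales'] - val['prediction']

# Root-mean-square error aggregated per hierarchy level
for level in ['store_id', 'item_id']:
    rmse = (val.groupby(level)['error']
               .apply(lambda e: np.sqrt(np.mean(e ** 2)))
               .sort_values(ascending=False))
    print(f"RMSE per {level}:\n{rmse.head()}\n")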

There are many more lessons to be learned from reviewing the learning resources Kagglers have created during the course of the “M5 Forecasting — Accuracy” competition. There are also many different solutions for this type of problem statement.

In this article, we focused on the general approach that was popular among many competitors: Formulating the time series forecasting problem as a regression problem, engineering features from historical data, and then applying an ML model to it.

This article uses synthetic data since the original competition dataset is only available for non-commercial use. The time series used in this article are generated as the sum of a sine wave, a linear function, and a white noise signal.
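Such a series can be generated in a few lines; a sketch with illustrative amplitudes:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t = np.arange(365)

sales = (10 * np.sin(2 * np.pi * t / 7)  # weekly cycle (sine wave)
         + 0.05 * t                      # linear trend
         + rng.normal(0, 2, len(t)))     # white noise

df = pd.DataFrame({'date': pd.date_range('2023-01-01', periods=len(t)),
                   'sales': sales})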

Subscribe for free to get notified when I publish a new story.

Become a Medium member to read more stories from other writers and me. You can support me by using my referral link when you sign up. I’ll receive a commission at no extra cost to you.

Find me on LinkedIn, Twitter, and Kaggle!

[1] Alan Lahoud (2020). 5th place solution in Kaggle Discussions (accessed March 7th, 2023)

[2] Chris Miles (2020). Simple model: avg last 28 days grouped by weekday in Kaggle Notebooks (accessed March 6th, 2023)

[3] Eugene Tang (2020). 7th place solution in Kaggle Discussions (accessed March 7th, 2023)

[4] Konstantin Yakovlev (2020). M5 — Simple FE in Kaggle Notebooks (accessed March 7th, 2023)

[5] Konstantin Yakovlev (2020). M5 — Three shades of Dark: Darker magic in Kaggle Notebooks (accessed March 7th, 2023)

[6] LogicAI (2023). Kaggle Days Paris 2022, Jean Francois Puget: Sales forecasting and fraud detection on YouTube (accessed February 21st, 2023)

[7] Matthias (2020). 2nd place solution in Kaggle Discussions (accessed March 7th, 2023)

[8] monsaraida (2020). 4th place solution in Kaggle Discussions (accessed March 7th, 2023)

[9] Tomonori Masui (2020). M5 — WRMSSE Evaluation Dashboard in Kaggle Notebooks (accessed March 7th, 2023)

[10] Yeonjun In (2020). 1st place solution in Kaggle Discussions (accessed March 7th, 2023)




