Mastering Time Series Forecasting with Machine Learning

How your choice of loss function can make or break your time series forecasts

Photo by Dan Asaki on Unsplash

In this post I’ll demonstrate the importance of something I believe is often overlooked in machine learning: the choice of loss function. I’ll do this by walking you through my approach to the Dengue Fever competition hosted by Driven Data.

I built a Ridge regressor as my baseline model and several “flavours” of the XGBoost regressor, each with a different loss function.

Competitors are asked to predict the total cases of dengue fever over weekly intervals in Iquitos and San Juan. Each competitor is ranked by the mean absolute error (MAE) their model(s) score against the test data set. To learn more about the challenge or dengue fever, or to enter the competition yourself, visit the challenge homepage.

Notebooks & Repositories

I have made my working Jupyter notebook and GitHub repo available via the links below. The notebook may take a few minutes to load; please be patient.

📒 Jupyter Notebook

📁 GitHub Repo

Please feel free to experiment with your own loss function within the notebook.

The data comes in CSV format with training features, labels, and test features (similar to the Kaggle competition format). Features include weather forecasts, climate forecasts, and vegetation indices.

The label (total cases of dengue) is long-tailed, with a small number of extreme values. This holds for both cities. The shape of these distributions should give you a clue about which loss functions might be appropriate for the model.

The data used in this project is curated and provided by Driven Data and is freely available for use outside of the competition¹. The original source is the National Oceanic and Atmospheric Administration (NOAA); the data is publicly available and may be used without charge according to the NOAA’s terms and conditions.

Image by Author: Distribution of dengue fever total cases Iquitos
Image by Author: Time series Iquitos
Image by Author: Distribution of dengue fever total cases San Juan
Image by Author: Time series San Juan

Outside of splitting the data, I performed three pre-processing steps. The first was imputing missing values: I simply imputed each feature’s mean, a quick and dirty fix that avoids causing too much distributional shift.

The second was standardising the data. Standardisation is a form of feature scaling that sets your numeric features to have a variance of one and a mean of zero. Scaling isn’t necessary for XGBoost; however, it is useful before fitting the Ridge regression model.

The third operation was one-hot encoding, which converts categorical variables to numeric representations so they can be used to train a model.

Note: to prevent data leakage, all preprocessing was done on the train, validation, and test sets separately.

Feature Engineering

I engineered some lagged features. These can be particularly useful for improving the predictive performance of time series models due to autocorrelation.

There were some limitations in the data that constrained the lags I could create. First off, the test data was 3 years ahead of the training data with no labels, so the shortest lag I could create was 3 years, which feels quite long. Next, the number of cases was not uniformly recorded at the week level. I overcame this by aggregating the number of cases at the month level and building my lagged features from those aggregations, engineered as the mean, minimum, and maximum.
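
As a rough sketch of that aggregate-then-lag step (treat the column names `date` and `total_cases` as assumed placeholders, and note the shift assumes contiguous months; the notebook holds the authoritative version):

```python
import pandas as pd

def add_lagged_case_features(df: pd.DataFrame, lag_months: int = 36) -> pd.DataFrame:
    """Aggregate weekly case counts to month level, lag them, and merge back."""
    out = df.copy()
    out["month"] = out["date"].dt.to_period("M")  # 'date'/'total_cases' are assumed names
    monthly = (
        out.groupby("month")["total_cases"]
           .agg(["mean", "min", "max"])
           .shift(lag_months)  # a 36-month lag stays clear of the unlabelled test window
           .add_prefix(f"cases_lag{lag_months}m_")
    )
    return out.merge(monthly, left_on="month", right_index=True, how="left")
```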

I find it easier to express my data processing steps in code, so I apologise in advance for the lengthy data pre-processing script.

Class for pre-processing the data.
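
The full class lives in the notebook; a condensed scikit-learn sketch of the same three steps (mean imputation, optional standardisation, one-hot encoding) might look like this:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_preprocessor(numeric_cols, categorical_cols, scale: bool = True):
    """Mean-impute and optionally standardise numerics; one-hot encode categoricals."""
    numeric_steps = [("impute", SimpleImputer(strategy="mean"))]
    if scale:  # scaling matters for Ridge, not for tree-based XGBoost
        numeric_steps.append(("scale", StandardScaler()))
    return ColumnTransformer([
        ("num", Pipeline(numeric_steps), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
```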

Below, I applied the data pre-processing prior to fitting the ridge regression. Note that for tree-based methods data scaling is not required. See the notebook for further details.

Data pre-processing applied before ridge regression
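
A sketch of how that might be applied ahead of the Ridge fit, using the `make_preprocessor` helper sketched above (the feature names here are illustrative placeholders):

```python
from sklearn.linear_model import Ridge

# Build the transformer with scaling on, since Ridge benefits from it.
preprocessor = make_preprocessor(
    numeric_cols=["ndvi_ne", "reanalysis_air_temp_k"],  # placeholder feature names
    categorical_cols=["city"],
    scale=True,
)
X_train_prep = preprocessor.fit_transform(X_train)
X_valid_prep = preprocessor.transform(X_valid)
ridge = Ridge(alpha=1.0).fit(X_train_prep, y_train)
```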

There are two separate time series tracking total cases of dengue fever at weekly intervals: one for Iquitos (Iq) and one for San Juan (Sj). I trained separate models for each.

It helps me to conceptualise machine learning as an optimisation problem in which you are trying to minimise an objective function.

Objective Function = Loss Function + Regularisation Function

The loss function dictates how to ‘score’ the model’s performance in predicting the label, which in this case is the total number of dengue cases. The regularisation function penalises model complexity, helping to mitigate overfitting.
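
In XGBoost, both halves of this objective are exposed directly: the loss through the `objective` argument and the L2 penalty through `reg_lambda`. A sketch of the three variants compared in this post (the parameter values are illustrative, not the tuned ones):

```python
from xgboost import XGBRegressor

# One model per loss; the objective string selects the loss XGBoost minimises.
losses = {
    "poisson": "count:poisson",      # count data, log-link
    "mae": "reg:absoluteerror",      # available in XGBoost >= 1.7
    "mse": "reg:squarederror",       # the library default
}
models = {
    name: XGBRegressor(objective=obj, reg_lambda=1.0, n_estimators=500)
    for name, obj in losses.items()
}
```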

Data Splitting

Before I go into the details of hyperparameter tuning, I should briefly talk about data splitting. A random split isn’t appropriate for time series problems: it would let future data points leak across the split, giving false (and likely overoptimistic) performance results. I split the data by date to prevent any leakage and to set up accurate validation.

  • For Iquitos the training set is all observations prior to 30-09-2008. For San Juan the split was placed at 30-07-2004.
  • To further mitigate overfitting, I used time series cross validation with 5 folds while training the model (a sketch follows below). Read this for more detail on time series cross validation.
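
Here is a minimal sketch of that cross-validation scheme using scikit-learn’s `TimeSeriesSplit`, assuming `X_train` is the date-sorted training matrix:

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(tscv.split(X_train)):
    # Each fold trains on an expanding window and validates on the period
    # directly after it, so no future observations leak into training.
    print(f"fold {fold}: train ends at row {train_idx[-1]}, "
          f"validate rows {valid_idx[0]} to {valid_idx[-1]}")
```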

Hyperparameter Tuning

I used random search over a uniform distribution to tune my regularisation parameter. I find this to be a prudent approach to quickly explore a large hyperparameter space. Read this for more details on randomised search.

Note that I tuned L2 regularisation only; this is alpha for Ridge and lambda for XGBoost.
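
As a concrete sketch, the random search over a uniform distribution for Ridge’s `alpha` could be set up like this (the search range and iteration count are assumptions, not the values used in the notebook):

```python
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": uniform(loc=0.0, scale=100.0)},  # U(0, 100), assumed range
    n_iter=50,
    scoring="neg_mean_absolute_error",  # score with MAE regardless of the loss
    cv=TimeSeriesSplit(n_splits=5),
    random_state=42,
)
search.fit(X_train_prep, y_train)
print(search.best_params_, -search.best_score_)
```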

If you would like to experiment with regularisation parameters to see how they impact model fit, check out this Streamlit app I made earlier.

Image by Author: An example of the hyperparameter tuning process on the Ridge model for Iquitos. Mean test score is MAE

The loss functions I chose were Poisson, mean absolute error (MAE), and mean squared error (MSE).

Poisson

A Poisson loss function is used when the target variable is count data that follows a Poisson distribution. This type of distribution assumes events occur independently and at a constant rate.

For dengue fever transmission, the independence assumption makes sense because cases are not transferred from person to person. However, the constant-rate assumption is probably not right for this case study; I would imagine the rate of occurrence of dengue cases varies significantly with multidimensional factors. Poisson distributions do, however, take an average rate as a model parameter.
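
For intuition, the Poisson loss is the negative Poisson log-likelihood; dropping the log(y!) term, which is constant in the prediction, it reduces to ŷ − y · log(ŷ). A toy implementation:

```python
import numpy as np

def poisson_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean negative Poisson log-likelihood, omitting the log(y!) constant."""
    return float(np.mean(y_pred - y_true * np.log(y_pred)))
```

XGBoost’s `count:poisson` objective optimises this with a log-link, which keeps predictions positive and makes it a natural fit for counts.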

Mean Absolute Error

The mean absolute error (MAE) measures the mean of the absolute differences between predicted and actual values. MAE is less sensitive to outliers than some other loss functions, particularly MSE.

I think many people confuse the loss function with the scoring metric and presume that the best loss function is MAE because that is what the competition is scored on; you’ll see that this isn’t necessarily the case.

Mean Squared Error

Mean squared error (MSE) calculates the squared difference between the modelled output and the expected output. MSE is sensitive to outliers because of the squaring of the differences: it penalises large differences between predicted and actual values much more than small ones.
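
A quick numerical check makes that sensitivity concrete. With an invented series where the model predicts 10 cases every week and one week spikes to 100:

```python
import numpy as np

y_true = np.array([8, 12, 9, 100])  # one outbreak week among typical weeks
y_pred = np.full(4, 10.0)           # a model that always predicts 10

mae = np.mean(np.abs(y_true - y_pred))       # 23.75: the spike adds linearly
mse = np.mean((y_true - y_pred) ** 2)        # 2027.25: the spike dominates
poisson = np.mean(y_pred - y_true * np.log(y_pred))  # spike weighted via the log-likelihood
print(mae, mse, poisson)
```

The single spike contributes linearly to MAE but dominates MSE, which is why MSE-trained models chase outbreak weeks harder than MAE-trained ones.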

Model performance is assessed on the validation set. Scoring (not to be confused with the loss function) is MAE, chosen to match the competition’s scoring and for ease of interpretability (it is measured directly in dengue cases). You will see from the charts below how drastically the choice of loss function impacts how each model fits the data. As you analyse these charts, think about how the loss function has caused each model to fit the way it does.

Ridge Regressor

Iquitos

Iquitos Model MAE: 7

The model captures some seasonality but doesn’t effectively predict the outliers.

Image by Author: Ridge regression model forecasts vs actuals for Iquitos

San Juan

San Juan Model MAE: 23.7

The model captures some seasonality but doesn’t effectively predict the outliers. There is also an offset, with the Ridge predictions sitting slightly higher than actual cases.

Image by Author: Ridge regression model forecasts vs actuals for San Juan

XGBoost Regressor: Poisson Loss

Iquitos

Iquitos Model MAE: 7.8

The model captures some seasonality but doesn’t effectively predict the outliers. Where the model does predict spikes, they are underpredicted.

Image by Author: Poisson XGBoost model forecasts vs actuals for Iquitos

San Juan

San Juan Model MAE: 18.9

The model appears to predict spikes well on the training data. It attempts this for the validation data but can overshoot. Training with a Poisson loss has allowed the model to try to anticipate spikes.

Image by Author: Poisson XGBoost model forecasts vs actuals for San Juan

XGBoost Regressor: MAE Loss

Iquitos

Iquitos Model MAE: 7

Although the MAE is low, the model appears to just draw a straight line through the data. It is not sensitive at all to outliers or seasonality.

Image by Author: MAE XGBoost model forecasts vs actuals for Iquitos

San Juan

San Juan Model MAE: 16.6

The model appears effective at predicting seasonality but fails to predict the spikes, although it does make attempts at them.

Image by Author: MAE XGBoost model forecasts vs actuals for San Juan

XGBoost Regressor: MSE Loss

Iquitos

Iquitos Model MAE: 7.3

The model is able to capture some seasonality; however, it fails to capture the outliers.

Image by Author: MSE XGBoost model forecasts vs actuals for Iquitos

San Juan

San Juan Model MAE: 18.48

The model captures seasonality but does not capture the outliers particularly well.

Image by Author: MSE XGBoost model forecasts vs actuals for San Juan

The Poisson model scored best overall on the test data with an MAE of 27.6; the MSE model was next at 27.8, and the MAE model last at 29.

The Poisson-loss XGBoost placed in the top 27% of competition entrants. Not bad for minimal hyperparameter tuning and feature engineering.

Choosing the best model depends entirely on the objectives of the forecast. Strictly speaking, if we only cared about minimising MAE on the validation set, the best model would be the XGBoost regressor with MAE loss. However, the model that best captures the underlying phenomena appears to be the Poisson-loss variant.

Thanks for reading.

[1] Bull, P., Slavitt, I. and Lipstein, G. (2016). Harnessing the Power of the Crowd to Increase Capacity for Data Science in the Social Sector. [online] Available at: https://arxiv.org/abs/1606.07781 [Accessed 13 Mar. 2023].

