
Forecast Time Series with Missing Values: Beyond Linear Interpolation | by Marco Cerliani | Oct, 2022



Comparing Alternatives to Handle Missing Values in Time Series

Photo by Kiryl Sharkouski on Unsplash

Having clean and easily understandable data at our disposal is a dream for every data scientist. Unfortunately, reality is not so sweet. We have to spend a good part of our time on data exploration and cleaning. However, a good exploratory analysis is the key to extracting the most useful insights and producing better outcomes.

In the context of a predictive application, a detailed overview of the dynamics present in the data is a good starting point for making the best decisions. From the choice of the predictive architecture to the preprocessing techniques, there are many available alternatives. One of the most important, and at the same time most undervalued, choices is the method used to handle missing values.

Missing observations do not all have the same meaning. Values may be missing because the information was never recorded, or because of problems in the ingestion process. In most cases, there is no golden rule for filling in missing values that is valid in every situation. What we can do is understand the field of analysis. With time series, we have to take into account the correlation dynamics of the system and the temporal dependencies present in the data.

In this post, we tackle a time series forecasting task in the presence of missing values. We investigate different strategies to handle missing observations in time series, comparing standard linear interpolation with more sophisticated techniques. The exciting part is that we carry out our experiments using only scikit-learn. Forecasting time series with plain scikit-learn is possible with tspiral, a package I released to leverage the completeness and accessibility of the scikit-learn ecosystem for time series forecasting tasks.

Our goal is to test how different imputation strategies affect the performance of time series forecasting. For this purpose, we first generate some hourly synthetic time series with daily and weekly seasonalities.
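As an illustrative sketch of this step (the article does not show its exact generator, so the amplitudes, noise level, and seed here are arbitrary choices), an hourly series with daily and weekly sinusoidal seasonalities could be simulated like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility
n = 24 * 7 * 8  # eight weeks of hourly observations
hours = np.arange(n)

# daily (24-hour) and weekly (168-hour) sine components plus Gaussian noise
daily = np.sin(2 * np.pi * hours / 24)
weekly = np.sin(2 * np.pi * hours / (24 * 7))
y = pd.Series(10 + 3 * daily + 5 * weekly + rng.normal(0, 0.5, n))
```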

Seasonality patterns of simulated time series (image by the author)

Secondly, we artificially generate some missing intervals and insert them into our time series.
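A minimal sketch of this step, assuming gaps of random position and length (the gap count and length bounds are illustrative, not the article's exact settings):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed

# a clean toy series standing in for the simulated data
y = pd.Series(np.sin(2 * np.pi * np.arange(1344) / 24))

# punch a few gaps of random position and length into a copy
y_missing = y.copy()
for _ in range(5):
    start = int(rng.integers(0, len(y) - 48))
    length = int(rng.integers(6, 48))  # gaps between 6 and 48 hours
    y_missing.iloc[start:start + length] = np.nan
```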

Example of time series with missing values (image by the author)

At this point, we are ready to start modeling. We want to test how forecasting accuracy changes according to the methodology used to fill in the missing values. Apart from the well-known linear interpolation, we would like to test how techniques usually applied to tabular datasets behave with time series. Specifically, we test k-nearest neighbors (knn) and iterative imputation.

With the knn methodology, we fill in missing values using the k-nearest neighbors approach. Each missing feature is imputed using values from the nearest neighbor features. For a sample with more than one feature missing, the neighbors may differ depending on which feature is being imputed. Another interesting approach is iterative imputation. Each feature with missing values is modeled as a function of the other features: we fit a model to predict each feature, treating the others as inputs, and the resulting predictions are used to estimate the missing observations. The estimation procedure can be repeated several times, until the reconstruction error is low enough, to improve robustness.
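Both techniques are available in scikit-learn as KNNImputer and IterativeImputer (the latter still requires the experimental enable_iterative_imputer import). A minimal sketch applying them to a lagged-feature matrix built from a toy series; the series, gap, and lag count are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.linear_model import Ridge

# toy seasonal series with a gap of missing values
y = pd.Series(np.sin(2 * np.pi * np.arange(500) / 24))
y.iloc[200:220] = np.nan

# lagged-feature matrix: each row contains the previous 24 observations
lags = 24
X = np.column_stack([y.shift(l).to_numpy() for l in range(1, lags + 1)])[lags:]

# both imputers fill every NaN in the matrix
knn_filled = KNNImputer(n_neighbors=5).fit_transform(X)
iter_filled = IterativeImputer(Ridge(), max_iter=10).fit_transform(X)
```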

In a time series context, we apply the same techniques directly to the lagged target features, keeping the underlying algorithm and the forecasting strategy unchanged.

Reconstruction comparison of different imputation strategies (image by the author)

Focusing on the reconstruction ability, iterative and knn imputations look very promising. With a simple interpolation, we are limited to connecting the closest observations without taking the nature of the system into account. Using a knn or an iterative imputation, we can replicate the seasonality patterns and the underlying dynamics present in the data. By adopting these preprocessing techniques in our forecasting pipeline, we may improve the imputation quality, resulting in better forecasts.

Forecast comparison of different imputation strategies (image by the author)

Incorporating missing value filling in a machine learning forecasting pipeline is straightforward. We simply have to choose the desired imputation algorithm and stack it on top of the selected prediction algorithm. The imputation is carried out on the lagged target, which makes it possible to fit the predictive algorithm on a complete feature set without missing values. Below is an example of iterative imputation (made with linear models) with a recursive forecasting approach.

import numpy as np

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from tsprial.forecasting import ForecastingCascade

model = ForecastingCascade(
    make_pipeline(
        IterativeImputer(Ridge(), max_iter=10),  # fill NaNs in the lagged features
        Ridge()  # final forecaster
    ),
    lags=range(1, 169),  # one week of hourly lags
    use_exog=False,
    accept_nan=True
)
model.fit(None, y)
pred = model.predict(np.arange(168))  # forecast the next 168 hours
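For readers without tspiral installed, the same stacking idea can be sketched with scikit-learn alone by building the lagged design matrix by hand. This sketch drops the recursive forecasting logic and uses a knn imputer instead of the iterative one, so it is only illustrative of how the imputer and the forecaster are chained:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# toy hourly series with a NaN gap
y = np.sin(2 * np.pi * np.arange(600) / 24)
y[250:270] = np.nan

# lagged design matrix: row t holds y[t..t+23], the target is y[t+24]
lags = 24
X = np.column_stack([y[i:len(y) - lags + i] for i in range(lags)])
target = y[lags:]

# keep NaNs in the features (the imputer handles them),
# but drop rows whose target itself is missing
mask = ~np.isnan(target)
pipe = make_pipeline(KNNImputer(n_neighbors=5), Ridge())
pipe.fit(X[mask], target[mask])
pred = pipe.predict(X[mask])
```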

We compare the three mentioned filling techniques (linear interpolation, knn, and iterative imputation) on our synthetic time series. The missing intervals have random lengths and are inserted at random positions in the final parts of our time series. This choice lets us test simultaneously the reconstruction ability and the impact of imputation on forecasting future values.

Performance comparison of different imputation strategies (image by the author)

As expected, the results show a positive impact of the knn and iterative imputers on the performance computed on the test data. They can capture the seasonal behavior of the data and provide a better reconstruction, which results in better forecasting.

In this post, we introduced valid alternatives to linear interpolation for dealing with missing values in a time series scenario. We discovered how to incorporate a custom imputation strategy into a forecasting pipeline using only scikit-learn and tspiral. In the end, we found that the proposed techniques can produce performance boosts if properly adopted and validated for the case at hand.

