Applied Time Series Analysis of Bike Accidents in Madrid | by Antonio Ramos | Nov, 2022

By Jessie Hobb On Nov 25, 2022

You will be safe cycling home

Bike lane — Photo by Markus Spiske on Unsplash

I recently came across a dataset containing registered accidents involving a bicycle, and as a data scientist who enjoys an occasional ride around my city this seemed like a good opportunity to have fun with this data.

To be more precise, the data forms a time series ranging from 2010 to 2018, with accidents registered chronologically by the Police, and is openly supplied by the City of Madrid in the link below. Several other variables are also available, such as location, weather or road conditions, but we will leave those for other types of analysis and focus on the time series.

By analyzing this dataset we can provide insights to the City of Madrid to understand the dangers for bicycle users and eventually make it a safer city for them.

We are facing a particular type of time series called an intermittent time series, where a significant proportion of the values are zero. This often happens when, as in this case, the data comes from counts.

Last year of series, values ranging from 0 to 8

These series present additional complexity, since many of the most commonly used time series analysis methods assume a continuous or non-zero variable.

For this reason we will be using Croston’s method. It was devised by J. D. Croston in 1972 to forecast intermittent demand for stock and is still heavily used to this day. We will dive deeper into it as the blog moves on.

To improve our understanding of the data we will start off with a time series decomposition.

Based on the nature of the data, there are two types of seasonality that we would be interested in:

Yearly seasonality. We can expect a cyclic pattern repeating on a yearly basis due to the influence of good and bad weather.
Weekly seasonality. We might observe a different pattern on working days and weekends, which could affect the number of bicycle users, their routes and other behavior. This could lead to a better understanding of cyclists’ behavior and spark further analysis. For example, do we need to pay more attention to cyclists during workdays because most accidents take place on the way to work?

Yearly decomposition

To reduce the complexity of dealing with an intermittent series, we aggregate the series to a monthly frequency. This also gives us a smoother and less noisy overview.

We use an STL decomposition (Seasonal and Trend decomposition using LOESS). This is a robust method for time series decomposition that also allows the seasonal component to evolve over time. Underneath, it uses LOESS, a non-parametric method that fits a smooth curve to a target variable using local polynomial regressions.

STL decomposition into Trend, Seasonal and Residuals

Zoomed-in view of seasonal component [May-October]

We can observe a growing trend in accidents that has flattened over the last three years, which can be a positive sign if our goal is to make the city safer for cyclists.
The seasonal component shows a similar pattern year after year with an interesting shape, peaking in June/July and September and dropping in August. We can hypothesize that good summer weather favors bicycle usage, and perhaps the explanation for the trough in August is that many locals leave Madrid on holiday during that month. This insight can help the City of Madrid focus its budget for road safety campaigns on this time of the year.

Weekly decomposition

In this case we cannot use the same aggregation trick as before, since aggregating would hide any day-of-week pattern, so we will decompose the series using Croston’s method. We take a sample of the last 8 weeks of the series, which should be enough to reveal any clear pattern.

Croston’s method creates two new series:

q, called the non-zero demand: in this case, the number of accidents in each time period where accidents happened.
a, called the inter-arrival time: the interval between two consecutive time periods where accidents happened.
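The split into q and a can be sketched in a few lines of plain Python. The function name `croston_decompose` and the toy input are hypothetical, and counting the first interval from the start of the series is a convention I am assuming here.

```python
def croston_decompose(series):
    """Split a count series into non-zero demands (q) and inter-arrival times (a)."""
    demands, intervals = [], []
    periods_since_last = 0
    for value in series:
        periods_since_last += 1
        if value > 0:
            # Record the size of the demand and how many periods passed
            # since the previous non-zero period (or since the series start).
            demands.append(value)
            intervals.append(periods_since_last)
            periods_since_last = 0
    return demands, intervals


daily_accidents = [0, 2, 0, 0, 1, 3, 0, 1]  # toy data, not the Madrid series
q, a = croston_decompose(daily_accidents)
print(q)  # [2, 1, 3, 1]
print(a)  # [2, 3, 1, 2]
```

An interval of 1 means accidents happened on consecutive days, which is why a series with accidents on most days produces a nearly flat interval plot.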

Croston’s decomposition of daily series using the last 8 weeks

In the first plot, showing the days with accidents, we do not see the pattern of working days against weekends that we anticipated, or any other obvious weekly pattern.
The second plot, showing the intervals in days between accidents, looks rather flat, since accidents take place on most days, and the peaks denoting a longer interval are sparse and irregular.

Based on these insights we will not assume a weekly seasonality, but it would be interesting to drill down into other variables in the data to make the analysis more insightful.

The analysis of past outliers can be useful to highlight an extraordinary spike in bike accidents and, for example, reveal a combination of a special event with a problematic location and/or weather conditions. This information can be used by the authorities to put exceptional mitigation actions in place and prevent future accidents.

Using the previous monthly decomposition, we search for outliers by looking into the distribution of the residuals.

We declare as an outlier any residual that is above 1.5 times the IQR, that is, 1.5 times the range between the 25th and 75th percentiles of the distribution.
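One common reading of this rule is Tukey's fences, where a value is flagged if it falls more than 1.5 times the IQR below the first quartile or above the third. The sketch below assumes that reading; the function name `iqr_outliers` and the residual values are made up for illustration.

```python
from statistics import quantiles


def iqr_outliers(residuals, k=1.5):
    """Return (index, value) pairs lying outside Tukey's fences (assumed rule)."""
    q1, _, q3 = quantiles(residuals, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, r) for i, r in enumerate(residuals) if r < lower or r > upper]


# Made-up residuals with one obvious spike at the end.
residuals = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 12.0]
print(iqr_outliers(residuals))
```

Raising `k` makes the rule stricter, which can be useful when the residuals are heavy-tailed and too many months get flagged.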

This results in a single outlier in June 2021. However, in this case I cannot find, through other variables in the data or other news sources, anything abnormal such as a spell of bad weather or a special event held in Madrid during that period that could shed more light on this outlier.

Forecasting can be used by the authorities to plan sufficient resources for the forecasted volume of accidents. Operationally, this could be even more effective when targeting smaller portions of the city, for example by having a forecast for each of the areas served by a Police station.

We will continue using Croston’s method. Rather than predicting exactly when an accident will happen, Croston’s method estimates the demand and the intervals without demand, and from them predicts an average demand per period. In our case we do not need to know precisely whether there will be zero, one or more accidents tomorrow. Police staffing support is probably decided on a weekly or monthly basis at most, so working with an average demand is fine.
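To make the mechanics concrete, here is a minimal sketch of the classic form of Croston's method: the non-zero demands and the inter-arrival times are each smoothed with simple exponential smoothing, and the forecast is their ratio. The function name `croston_forecast`, the single shared smoothing parameter `alpha` and the initialization with the first observed demand and interval are my assumptions; library implementations may differ in these details.

```python
def croston_forecast(series, alpha=0.1):
    """Croston's average demand per period, forecast for the next period."""
    q_hat = a_hat = None  # smoothed demand size and smoothed interval
    periods_since_last = 0
    for value in series:
        periods_since_last += 1
        if value > 0:
            if q_hat is None:
                # Initialize with the first non-zero demand and its interval.
                q_hat, a_hat = float(value), float(periods_since_last)
            else:
                # Estimates are updated only in periods with demand.
                q_hat += alpha * (value - q_hat)
                a_hat += alpha * (periods_since_last - a_hat)
            periods_since_last = 0
    if q_hat is None:
        return 0.0  # no accidents observed at all
    return q_hat / a_hat


daily_accidents = [0, 2, 0, 0, 1, 3, 0, 1]  # toy data, not the Madrid series
print(croston_forecast(daily_accidents, alpha=0.5))
```

The result is a flat rate: the same value is used for every period in the forecast horizon, which matches the idea of planning around an average demand.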

We pick a test dataset consisting of two weeks and obtain the results below.

Since we have intermittent data that includes zero values, we have to be especially careful when selecting the metric to measure model performance.

We could use the mean absolute error, MAE. However, if we also wanted to develop, for example, a weekly model, we would not be able to fairly compare the two models, because the metric depends on the scale of the data.
Mean absolute percentage error, MAPE, offers a relative metric. However, with intermittent data we would run into a division by zero.
Mean absolute scaled error, MASE, circumvents these issues. It is a robust metric for intermittent series and gives us a scale-free metric based on the ratio between the prediction error and the error of a naive model that predicts the value at the previous timestamp.
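Both metrics are short enough to write by hand. In the sketch below the naive error used for scaling is computed on the training data, which is the usual definition of MASE and an assumption about how it was computed here; the function names and the toy numbers are illustrative only.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mase(train, actual, predicted):
    """Forecast MAE scaled by the in-sample MAE of a one-step naive forecast."""
    naive_errors = [abs(train[i] - train[i - 1]) for i in range(1, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / scale


train = [0, 2, 0, 1, 0, 3]   # toy training counts
actual = [1, 0, 2, 1]        # toy test counts
predicted = [1.0] * 4        # flat forecast, as Croston's method produces

print(mae(actual, predicted))
print(mase(train, actual, predicted))
```

A MASE below 1 means the forecast beats the naive baseline on average. Note that the scale is zero for a constant training series, so that case would need separate handling.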

Our prediction is 1.33 for the two weeks, and the model obtains an MAE of 0.92 and a MASE of 0.65. In other words, on average the model is off by almost 1 accident, through over- or underestimation, and the MASE, being below 1, tells us that the model outperforms the baseline of a naive prediction.

I encourage the reader to explore other variants of Croston’s method, like SBA or TSB, to improve these results.

In this article we have looked into bike accident data with the goal of making Madrid a safer city for cyclists. We have identified some use cases where time series analysis can be helpful: focusing resources on key seasonal times of the year, monitoring outliers to adopt exceptional measures, and forecasting the volume of accidents to ensure adequate emergency support.

As for next steps, I believe this dataset has more potential if we analyze other variables such as location. Geospatial analysis can reveal problematic locations and expose issues such as a lack of traffic signs or insufficient bike lanes and infrastructure. This can be combined with time series analysis to monitor the effectiveness of the adopted measures over time. Additionally, I would suggest targeting different behaviors separately, such as commuters vs leisure cyclists.

Thanks for reading. You can find all the code behind this article in the repository below.