How to Handle Missing Data in Medical Time Series Studies
by Eileen Pangu | Jun, 2022
Simple and effective methods — designed for recurrent neural networks — that have stood the test of extensive academic evaluations

Image source: pixabay.com

Background

A great deal of medical data is, by nature, time series data — cardiograms, temperature monitoring, blood pressure monitoring, regular nurse checkups, and much more. There is a sea of valuable information embedded in the trends, patterns, spikes, and dips of those medical charts, waiting to be uncovered. The healthcare industry calls for effective analysis of medical time series data, which is believed to hold the key to improving care quality, optimizing resource utilization, and reducing overall healthcare costs.

One promising form of medical time series analysis is via recurrent neural networks (RNNs). RNNs have become popular with medical researchers in recent years due to their modeling power and their ability to consume variable-length input sequences. Researchers typically divide the time series data into even time-steps, such as 1 hour or 1 day per time-step. All the data points within a time-step are aggregated via averaging or another aggregation scheme. This has two advantages. First, it reduces the length of the time series data sequence. Second, it normalizes the temporal context, given that the original raw data points are usually irregularly spaced in time. After this preprocessing step, the data is almost in good shape for RNN consumption. But there is an unanswered question: what if there is no data in a given time-step?
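To make the binning step concrete, here is a minimal sketch using pandas. The metric name, timestamps, and the 1-hour bucket size are illustrative choices, not prescribed by any of the papers discussed below:

```python
import pandas as pd

# Hypothetical raw measurements: one metric, irregularly spaced in time.
raw = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-06-01 00:12", "2022-06-01 00:47",
        "2022-06-01 02:05", "2022-06-01 05:30",
    ]),
    "heart_rate": [82.0, 85.0, 90.0, 78.0],
})

# Divide the series into even 1-hour time-steps and aggregate by averaging.
# Time-steps with no measurements come out as NaN, which is exactly the
# missing data problem discussed next.
hourly = raw.set_index("time").resample("1H").mean()
print(hourly)
```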

The above question is significant in medical settings because missing medical data is often not missing at random. The missingness of the data itself carries clinical meaning. For example, the hospital staff may stop taking the temperature of a patient who is believed to have stabilized. Or perhaps the patient’s situation warrants a different kind of measurement that supersedes the previous measuring method. Therefore, the usual zero-filling or imputation approaches tend to yield suboptimal performance.

In this blog post, we’ll review 3 simple approaches to handling missing medical data in time series studies for use with RNNs. Each approach builds on the previous one with an increased level of sophistication, so it’s highly recommended to read them in the order in which they are presented.

Simple Missingness Encoding

Let’s assume the input variable at each time-step t is x, denoted with subscript t. The variable has d dimensions, denoted by superscript d, so x with subscript t and superscript d is the d-th dimension at time-step t. An illustration of the input is shown in Figure 1 (a) with the simplification that d=1. The dark shaded elements are the absent data. We apply forward imputation to fill them with their most recent observed values. Forward imputation makes sense because hospital staff often stop taking further measurements of a metric once they believe it has stabilized, in which case the most recent observed values can be carried forward as a proxy for the unobserved actual values.

The simple missingness encoding approach, as proposed by this research paper, suggests that we explicitly encode the fact that a given data point is imputed rather than actually observed. This explicit encoding signals to the RNN that it should take the absence of data into account. An illustration is shown in Figure 1 (b), where m denotes the missingness of x: 1 means the value is observed and 0 means it is missing (as defined in Formula 1). The input is the concatenation of x and m.

Figure 1: encoding missingness. Darker colors are where the values are missing and thus have to be imputed.
Formula 1: missingness.

$$m_t^d = \begin{cases} 1, & \text{if } x_t^d \text{ is observed} \\ 0, & \text{otherwise} \end{cases}$$
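As a minimal sketch of this encoding, assuming a univariate NumPy series where NaN marks the missing time-steps (the values and variable names are illustrative):

```python
import numpy as np

# Toy temperature series over 6 time-steps; NaN marks missing values.
x = np.array([36.5, np.nan, np.nan, 38.2, np.nan, 37.0])

# Formula 1: m_t = 1 if x_t is observed, 0 otherwise.
m = (~np.isnan(x)).astype(float)

# Forward imputation: carry the most recent observed value forward.
x_filled = x.copy()
for t in range(1, len(x_filled)):
    if np.isnan(x_filled[t]):
        x_filled[t] = x_filled[t - 1]

# The RNN input at each time-step is the concatenation [x_t, m_t].
rnn_input = np.stack([x_filled, m], axis=-1)  # shape (T, 2)
print(rnn_input)
```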

This approach has yielded meaningful improvement in the experiments presented in that research paper. Your mileage may vary depending on your dataset. But this is a very straightforward and intuitive idea that’s worth trying out.

Temporal Distance Encoding

To build on the above approach, this paper proposes to explicitly encode the temporal distance of each value from the most recent actual observation, in addition to the explicit encoding of missingness. The input now is the concatenation of three values: the input x, the missingness signal m, and the temporal distance δ. See Formula 2 and Figure 2 for an illustration.

Formula 2: temporal distance, where $s_t$ denotes the timestamp of time-step $t$.

$$\delta_t^d = \begin{cases} s_t - s_{t-1} + \delta_{t-1}^d, & t > 1,\ m_{t-1}^d = 0 \\ s_t - s_{t-1}, & t > 1,\ m_{t-1}^d = 1 \\ 0, & t = 1 \end{cases}$$
Figure 2: encoding missingness and temporal distance. Darker colors are where the values are missing and thus have to be imputed.
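Here is a sketch of the temporal distance computation, under the assumption of even time-steps of unit length (so the gap between consecutive timestamps is constant), continuing the toy mask from the previous snippet:

```python
import numpy as np

m = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # missingness mask (Formula 1)
step = 1.0                                     # even time-steps, e.g. 1 hour

# Formula 2: delta_t is the time elapsed since the last actual observation.
delta = np.zeros_like(m)
for t in range(1, len(m)):
    if m[t - 1] == 1.0:
        delta[t] = step                 # previous step was observed
    else:
        delta[t] = step + delta[t - 1]  # keep accumulating distance

print(delta)  # [0. 1. 2. 3. 1. 2.]
# The RNN input is now the concatenation [x_t, m_t, delta_t] per time-step.
```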

According to the experiments presented in the paper, this approach brought a further improvement on top of the explicit encoding of missingness alone.

Introducing Decay

Building once more on the above approaches, the same paper proposes applying a decay mechanism to the imputed values. Recall that we applied forward imputation to carry over the most recent observed values. But what if the missing period is extended? Should we carry those old observed values forward indefinitely? Consider the real-world scenario for a moment: the hospital staff stop tracking a metric because they believe it has stabilized. The metric value may still be toward the far end of the normal range, but it is expected to eventually return to a healthy median. This means that, in the absence of observed data, we have good reason to believe that the current metric value will linger for some time but eventually “decay” back to a sensible medical default.

The decay factor γ is determined by a weight matrix W and a bias b applied to the temporal distance δ, then fed through a negated exponential capped at 1 (see Formula 3), so that γ shrinks from 1 toward 0 as δ grows. W and b are shared across time-steps and are learned jointly during training.

Formula 3: decay factor.

$$\gamma_t = \exp\{-\max(0,\ W_\gamma \delta_t + b_\gamma)\}$$
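A minimal sketch of Formula 3 for a single input dimension, with W and b as fixed toy scalars; in the paper they are learned jointly with the RNN, and W is a matrix in the multivariate case:

```python
import numpy as np

def decay_factor(delta, W=0.5, b=0.0):
    """Formula 3: gamma = exp(-max(0, W * delta + b)), so gamma is in (0, 1]."""
    return np.exp(-np.maximum(0.0, W * delta + b))

delta = np.array([0.0, 1.0, 2.0, 3.0])
print(decay_factor(delta))  # [1.0, 0.61, 0.37, 0.22]: shrinks as delta grows
```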

Formally, at any given time-step t, if x is observed, we use x. Otherwise, we use the value from the last observation at t’, decayed toward an empirical mean of x. See Formula 4 for the final input to the RNN.

Formula 4: final input to RNN, where $t'$ is the last time-step at which dimension $d$ was observed and $\tilde{x}^d$ is the empirical mean of dimension $d$.

$$\hat{x}_t^d = m_t^d\, x_t^d + (1 - m_t^d)\left(\gamma_t^d\, x_{t'}^d + (1 - \gamma_t^d)\, \tilde{x}^d\right)$$
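And a sketch of Formula 4 at a single missing time-step, with toy values for the last observation, the empirical mean, and the decay factor:

```python
import numpy as np

x_last = 38.2   # last observed value x at t'
x_mean = 37.0   # empirical mean of x over the training data
gamma = 0.6     # decay factor from Formula 3 at this time-step
m_t = 0.0       # this time-step is missing
x_t = np.nan    # so there is no observed value

# Formula 4: observed values pass through unchanged; missing values are
# interpolated between the last observation and the empirical mean,
# sliding toward the mean as gamma shrinks with growing temporal distance.
x_hat = (m_t * np.nan_to_num(x_t)
         + (1.0 - m_t) * (gamma * x_last + (1.0 - gamma) * x_mean))
print(x_hat)  # 0.6 * 38.2 + 0.4 * 37.0 = 37.72
```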

The paper also applies a similar decay mechanism to the hidden state of its RNN models, which produces their best result. But since we’re focused on preprocessing the raw data for RNN input, we won’t dive into that here.

Conclusion

In this blog post, we introduced the background of medical time series studies. We presented 3 simple approaches, designed for RNNs, that explicitly encode the clinical significance of missing data in the model input and, according to the proposing papers, can produce superior results. Hopefully this was a quick read that sparked some good ideas for your own analysis projects.


