
Bi-LSTM+Attention for Modeling EHR Data | by Satyam Kumar | Jan, 2023



Image by Gerd Altmann from Pixabay

Predicting future health information or disease from the Electronic Health Record (EHR) is a key use case for research in the healthcare domain. EHR data consists of diagnosis codes, pharmacy codes, and procedure codes. Modeling EHR data, and interpreting the resulting models, is a tedious task due to the high dimensionality of the data.

In this article, we will discuss a popular research paper, DIPOLE, published in June 2017, which uses a Bi-LSTM+Attention network.

Contents:
1) Limitations of Linear/Tree-based Models
2) Why RNN models?
3) Essential guide to LSTM & Bi-LSTM network
4) Essential guide to Attention
5) Implementation

The previous implementation was a Random Forest model with a fixed set of hyperparameters to model the aggregate member-level claims/pharmacy/demographics features.

  1. In the case of disease prediction, the output depends on the sequence of events over time. This time-sequence information gets lost in the RF model, so the idea is to try time-series-based event prediction. Candidates include statistical time-series models like ARIMA and Holt-Winters, neural-network-based models like RNNs/LSTMs, or even transformer-based architectures.
  2. However, long-term dependencies between events, and the information carried by the (irregular) time intervals between events, are difficult to capture in an RF model or even in classical time-series models.
  3. Further, the Random Forest was not able to capture the non-linear associations & complex relationships between time-ordered events. This is also the case with classical TS models. We can introduce non-linearity by including interaction terms (quadratic, multiplicative, etc.) or using kernels (as in SVMs); however, that requires knowing the actual non-linear dependencies in advance, which is very difficult to find out with real-world data.

As such, we move ahead with exploring neural-network-based time-series models like RNNs/LSTMs first, and later transformer architectures. The above hypothesis about the limitations of RF & classical TS models will also be verified later by comparing their evaluation metrics with those of the RNN/LSTM models.

Claims data includes information related to diagnoses, procedures, and utilization for each member at the claim level. The claim information is time-based and the existing RF model does not utilize the time information of the visits.

The idea is to update the RF model with something more suitable for time-series event prediction like RNN.

(Source), LSTM Unit

The input of each RNN unit depends on the output of the previous unit as well as the input at time ‘t’. An RNN unit repeats itself for each event in the sequence.

Limitation of the RNN model:

RNNs have been found to work well in practice for short-term dependencies in data. For example, consider a model that predicts the next word of an incomplete sentence using the existing words. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context — it’s pretty obvious the next word is going to be “sky.” In such cases, where the gap between the relevant information and the place where it’s needed is small, RNNs can learn to use past information.

But for long sequences, the RNN model suffers from the vanishing/exploding gradient problem, which hampers long-term learning of the events. During backpropagation, the gradient becomes smaller and smaller, and the parameter updates for early events become insignificant, which means no real learning happens for them.
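A toy calculation makes this concrete. Backpropagating through time multiplies the gradient by a Jacobian factor once per step; the sketch below (a deliberate simplification, collapsing that factor into a single scalar) shows how the gradient shrinks over a long sequence:

```python
def gradient_magnitude(factor, steps):
    """Gradient after backpropagating through `steps` identical
    recurrent steps, each multiplying by the same scalar factor."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor  # one step of the chain rule through time
    return grad

short = gradient_magnitude(0.9, 5)    # few steps: gradient survives
long = gradient_magnitude(0.9, 100)   # many steps: gradient vanishes
```

With a factor of 0.9, five steps leave a usable gradient while a hundred steps leave almost nothing; with a factor above 1 the gradient explodes instead. Either way, early events stop contributing to learning.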

RNNs also become slow to train over such long sequence data.

Alternatives:

LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks are alternatives or updated versions of the RNN that are capable of capturing long-term dependencies between sequential events without, in most cases, suffering from the vanishing/exploding gradient problem. They overcome this problem of RNNs by using a selective retention mechanism with multiple sets of weights & biases instead of one.

An LSTM unit has 3 gates (input, output, and forget) to protect and control the cell state and add necessary information to the current state. There are 3 inputs to an LSTM unit, i.e. the previous cell state (C_(t-1)), the previous unit output (h_(t-1)), and the input event at time ‘t’ (x_t). It has two outputs, i.e. the current cell state (C_t) and the current output (h_t).
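The gate arithmetic can be sketched in a few lines of NumPy. This is a minimal single-step illustration (the weight dictionaries and dimensions are made up for the demo, not taken from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold parameters for the forget (f),
    input (i), output (o) gates and the candidate state (g)."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate state
    c_t = f * c_prev + i * g   # new cell state C_t: keep some old, add some new
    h_t = o * np.tanh(c_t)     # new output h_t
    return h_t, c_t

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3
W = {k: rng.normal(size=(dim_h, dim_x)) for k in "fiog"}
U = {k: rng.normal(size=(dim_h, dim_h)) for k in "fiog"}
b = {k: np.zeros(dim_h) for k in "fiog"}
h, c = lstm_step(rng.normal(size=dim_x), np.zeros(dim_h), np.zeros(dim_h), W, U, b)
```

The forget gate f decides what fraction of C_(t-1) to keep, which is the "selective retention" that lets gradients flow over long spans.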

Please visit colah’s blog for an understanding of how an LSTM network works under the hood.

Bi-LSTM Network:

Bi-LSTM is a variation of LSTM that processes the input in both directions to preserve both past and future information.

(Source), Bi-LSTM Network

The forward LSTM reads the input visit sequence from x_1 to x_t and calculates a sequence of forward hidden states. The backward LSTM reads the visit sequence in the reverse order, i.e., from x_t to x_1, resulting in a sequence of backward hidden states. By concatenating the forward hidden state and the backward one, we can obtain the final latent vector representation as h_i.
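The forward/backward reading and the concatenation can be sketched as follows. To keep the sketch short, a plain tanh recurrent cell stands in for the LSTM; the idea is identical:

```python
import numpy as np

def rnn_states(xs, W, U):
    """Hidden states of a simple tanh recurrent cell over sequence xs
    (a stand-in for an LSTM, to keep the sketch short)."""
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

rng = np.random.default_rng(1)
dim_x, dim_h = 4, 3
W = rng.normal(size=(dim_h, dim_x))
U = rng.normal(size=(dim_h, dim_h))
visits = [rng.normal(size=dim_x) for _ in range(5)]  # x_1 .. x_t

forward = rnn_states(visits, W, U)               # reads x_1 -> x_t
backward = rnn_states(visits[::-1], W, U)[::-1]  # reads x_t -> x_1, re-aligned
# final latent representation h_i: forward and backward states concatenated
h = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```

Note that each concatenated h_i has twice the hidden dimension, which is why a Bi-LSTM layer's output width is 2× the unit count.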

A Bi-LSTM layer is used instead of an LSTM layer:

  • to capture the full context of the sequence
  • because oftentimes things only become clear in hindsight, when looking at future events

Drawbacks of Bi-LSTM:

  • they learn the left-to-right and right-to-left contexts separately and concatenate them, so the true joint context is, in some sense, lost.
  • they assume each event is equally spaced in time.
  • they are sequential in nature, so training may become slow with big data.

Transformers can overcome the above drawbacks.

The attention mechanism was introduced in the Bahdanau et al. (2014) paper to address the bottleneck that arises when the decoder has limited access to the information provided by the input, since the encoder compresses the entire input into a single fixed-length vector.

Attention models enable the network to focus on a few particular aspects/events at a time and ignore the rest.

In our implementation, we add an attention layer to capture the importance of each visit vector to the prediction. The larger the attention score corresponding to a visit vector, the more significant that visit is when making a prediction.
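Mechanically, this amounts to scoring each visit vector, softmax-normalizing the scores, and taking a weighted sum. A minimal NumPy sketch (the scoring vector w is a hypothetical learned parameter):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(visit_vectors, w):
    """Score each visit with a (hypothetical) learned vector w,
    normalize with softmax, return weights and the context vector."""
    scores = np.array([v @ w for v in visit_vectors])  # one score per visit
    alphas = softmax(scores)                           # attention weights, sum to 1
    context = sum(a * v for a, v in zip(alphas, visit_vectors))
    return alphas, context

rng = np.random.default_rng(2)
visits = [rng.normal(size=4) for _ in range(6)]
w = rng.normal(size=4)
alphas, context = attend(visits, w)
# the visit with the largest alpha contributed most to the prediction
most_important = int(np.argmax(alphas))
```

Because the alphas sum to 1, they can be read directly as a per-visit importance distribution, which is what makes the model interpretable.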

Our current Bi-LSTM implementation is inspired by the DIPOLE paper (June 2017) by Fenglong Ma et al. The paper employs a Bi-LSTM network to model EHR data and utilizes a simple attention mechanism to interpret the results.

Below is the step-by-step approach of our current implementation:

a) Data:

We are using Electronic Health Record (EHR) data for Bi-LSTM+Attention modeling.

Note: This is the same data we were using for Linear/Tree modeling.

As of now, we restrict the model to diagnosis medical codes only.

(Image by Author), Snapshot of raw EHR data

b) Feature Engineering (EHR Data):

  • The claims data is at the line level; we select the first record of each claim to bring it to the visit level.
  • Prepare a diagnosis label encoder for all the unique diagnosis codes in the training data.
  • One-hot encode the diagnosis codes for each visit.
  • Select the last x visits/claims (a hyperparameter) for each member. If a member has fewer visits than this threshold, we pad the remaining visit slots with zero vectors.
  • Format the data for LSTM network input: (members, visits, unique medical codes).

For a dataset with 1000 members, 5000 unique medical codes, and visits padded to 10, the final training data shape will be (1000, 10, 5000).
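The steps above can be sketched as follows; the diagnosis codes and member records are made-up toy data, and the encoder is reduced to a plain dictionary:

```python
import numpy as np

def build_tensor(member_visits, code_index, max_visits):
    """member_visits: per-member list of visits; each visit is a list of
    diagnosis codes. Returns a (members, max_visits, n_codes) array with
    the last `max_visits` visits kept and earlier slots zero-padded."""
    X = np.zeros((len(member_visits), max_visits, len(code_index)))
    for m, visits in enumerate(member_visits):
        kept = visits[-max_visits:]        # last x visits per member
        offset = max_visits - len(kept)    # zero-pad at the front
        for v, codes in enumerate(kept):
            for code in codes:
                X[m, offset + v, code_index[code]] = 1.0  # one-hot per code
    return X

# toy data with hypothetical ICD-style codes
code_index = {"E11": 0, "I10": 1, "J45": 2}
members = [
    [["E11"], ["E11", "I10"]],            # 2 visits -> padded to 3
    [["J45"], ["I10"], ["E11"], ["J45"]], # 4 visits -> last 3 kept
]
X = build_tensor(members, code_index, max_visits=3)
```

Scaled up to 1000 members, 5000 codes, and 10 visits, the same routine yields the (1000, 10, 5000) shape mentioned above.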

LSTM data input format

c) Bi-LSTM + Attention in Keras

We will use the tf.keras framework to implement the Bi-LSTM + Attention network for disease prediction.

(Code by Author)
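The author's code is embedded as an image; below is a minimal sketch of what such a model could look like in tf.keras, assuming a simple location-based attention over the visit-level hidden states (layer choices and hyperparameter values are illustrative, not the author's exact code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(max_visits=10, n_codes=5000, lstm_units=64):
    """Bi-LSTM over the visit sequence plus a simple attention layer
    that weights each visit's hidden state before the prediction."""
    inputs = layers.Input(shape=(max_visits, n_codes))
    # forward + backward hidden states, concatenated: (batch, visits, 2*units)
    h = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(inputs)
    scores = layers.Dense(1)(h)              # one score per visit: (batch, visits, 1)
    alphas = layers.Softmax(axis=1)(scores)  # attention weights across visits
    # context vector: attention-weighted sum of the visit hidden states
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, alphas])
    outputs = layers.Dense(1, activation="sigmoid")(context)  # disease probability
    return tf.keras.Model(inputs, outputs)

model = build_model(max_visits=10, n_codes=50)  # small vocabulary for the demo
model.compile(optimizer="adam", loss="binary_crossentropy")
```

For interpretation, a sub-model ending at the alphas tensor can be used to read off the per-visit attention weights for any input.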

d) Interpretation

In healthcare, the interpretability of the learned representations of medical codes and visits is important. We need to understand the clinical meaning of each dimension of medical code representations and analyze which visits are crucial to the prediction.

Since the proposed model is based on attention mechanisms, we can find the importance of each visit for prediction by analyzing the attention scores.

e) Model Summary

(Image by Author), Model Summary

The DIPOLE implementation uses a Bi-LSTM architecture to capture the long-term and short-term dependencies in historical EHR data. The attention mechanism can be used to interpret the prediction results, the learned medical codes, and visit-level information.

According to the authors of the paper, DIPOLE can significantly improve performance compared to the traditional state-of-the-art diagnosis prediction approaches.


