
Healthcare Predictive Analytics with GANs | by Sadrach Pierre, Ph.D. | Nov, 2022



Augmenting Imbalanced Healthcare Readmission Data with GANs

Image by Pixabay on Pexels

Generative adversarial networks (GANs) are a class of deep learning models developed by Ian Goodfellow and colleagues in 2014. At a high level, a GAN is made up of two competing neural networks playing a zero-sum game: the gains of one network correspond to losses for the other. Specifically, the GAN algorithm trains a discriminator to distinguish between real and synthetic data while simultaneously training a generator to produce synthetic data instances that can fool the discriminator.

GANs have a wide variety of applications including image generation, image-to-image translation, tabular data augmentation and much more. The data augmentation use case is interesting since it can be used to augment imbalanced data sets for outlier detection, which has a wide variety of industry applications. For example, in the healthcare space, data augmentation with GANs can be used to improve machine learning models that predict patient readmission.

Patient hospital readmission corresponds to an event where a patient who was admitted to a hospital returns to the same or another hospital after a specific time interval. Patient readmission rates serve as a good metric for measuring quality of treatment and care. High readmission rates translate to poor patient outcomes and higher operating costs. In fact, the Affordable Care Act (ACA) imposes a penalty on healthcare providers with high readmission rates. Healthcare providers are therefore incentivized to reduce readmission rates by providing higher quality care.

There are many factors that contribute to high readmission rates, including age, weight, geographic location, comorbidities, polypharmacy and much more. For example, older patients are more likely to be readmitted since they have more chronic illnesses. Weight, particularly obesity, also plays a large role since people who are overweight or obese are more likely to have heart disease, cancer and significant comorbidities in general.

Most current attempts at reducing readmission rates involve understanding ACA policy, identifying high-risk patients, medication reconciliation, preventing healthcare-acquired infections, and good hand-off communication. For example, the ACA established the Hospital Readmissions Reduction Program, which incentivizes health providers to improve quality of care. Further, it is important to be able to identify high-risk populations based on medical history and demographics. Medication reconciliation involves ensuring that a patient’s medication list is as up to date as possible, which can help reduce the risk of adverse drug-related events caused by taking multiple medications.

Patients who are on several medications (polypharmacy) also have a high likelihood of readmission, largely due to the comorbidities necessitating these medications. Polypharmacy also raises the risk of drug-drug interactions, which may contribute to adverse events after discharge and increase the likelihood of readmission. Healthcare-acquired infections are another contributor, and taking measures to prevent them can reduce the likelihood of readmission. Finally, hand-off communication is important as it encompasses clear documentation of the patient’s condition and plan for treatment. If information about contributing factors such as polypharmacy or comorbidities is left out during hand-off communication, it can increase the likelihood of readmission.

In addition to monitoring these readmission factors manually, healthcare providers can utilize machine learning to help identify and target high-risk populations. In the past, logistic regression models trained on a variety of risk factors have been used to predict the probability of readmission. These models can be used in concert with human inspection to take preventive measures that reduce readmission rates.

Given that healthcare data is highly sensitive and confidential, accessing data sources for modeling can be difficult. Despite these challenges, there are many open-source tools available for creating synthetic data sets that realistically represent the factors needed for predictive modeling. Synthetic data has the benefit of being anonymous and free to use and share publicly. And although the data is synthetic, with enough domain expertise a highly representative data set can be generated and applied in a variety of use cases.

Faker is an open-source Python library that can be used for generating synthetic data across a wide variety of industry verticals including finance, retail and healthcare. We will look at how to use Faker to generate synthetic healthcare readmission data and use it for predictive modeling. We will include many of the factors known to contribute to readmission risk. We will then define, in a rule-based manner, readmission targets for our machine learning model based on high-risk factors.

Given that readmission isn’t common, we will model this data set as an imbalanced data set where most of the data is made up of patients who aren’t readmitted within the defined time frame. We will build a baseline CatBoost model which we will use to compare performance before and after data augmentation. We will then use tabular generative adversarial networks (the tabgan package) to augment our imbalanced data. Finally, we will build a second CatBoost classification model for predicting patient readmission on our augmented data.

For this work, I will be writing code in Deepnote, which is a collaborative data science notebook that makes running reproducible experiments very easy.

Generating Synthetic Data with Faker

To start, let’s navigate to Deepnote and create a new project (you can sign up for free if you don’t already have an account).

Let’s create a project called ‘synthetic_data’ and a notebook within this project called ‘tabgan_experiment’:

Screenshot taken by Author

Let’s install the packages we will be using:

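The embedded install cell isn’t reproduced here; a minimal equivalent, assuming the PyPI package names Faker, tabgan and catboost (pandas, numpy and scikit-learn are typically preinstalled in Deepnote), would be:

%pip install Faker tabgan catboost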

Next let’s import Faker and define a variable for the size of the synthetic data we will be generating. We will be generating a data set with 5000 rows:

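The embedded cell isn’t shown; a sketch consistent with the text (the DATA_SIZE name matches the loops below):

from faker import Faker

# number of synthetic rows to generate
DATA_SIZE = 5000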

Next let’s define a Faker object and store it in a variable called ‘fake’:

fake = Faker()

Let’s set a seed so that our results are reproducible and initialize a list called names which we will use to store our synthetic names:

Faker.seed(42)
names = []

Now let’s populate our list with synthetic names in a for-loop. To do this we call the name() method on our faker object to generate synthetic names:

for i in range(0, DATA_SIZE):
    names.append(fake.name())


We can do something similar for US states by calling the state() method on the faker object in a for-loop:

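Following the same pattern as the names loop, a sketch of the missing cell:

states = []
for i in range(0, DATA_SIZE):
    states.append(fake.state())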

For our example, we will consider the scenario of a patient being readmitted to an emergency department.


We will also specify sex:

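The original cell isn’t shown; one simple way to generate this field (the ‘M’/‘F’ coding and the use of the standard library’s random module are assumptions):

import random

random.seed(42)
# randomly assign a sex to each synthetic patient
sex = [random.choice(['M', 'F']) for _ in range(0, DATA_SIZE)]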

Next let’s import numpy and use random.normal to generate normally distributed ages with a mean of 50 years and a standard deviation of 20 years:

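A sketch matching the description in the text:

import numpy as np

np.random.seed(42)
# ages ~ Normal(mean=50 years, std=20 years)
ages = np.random.normal(loc=50, scale=20, size=DATA_SIZE)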

Let’s also generate normally distributed values for weight with a mean of 180 lb and standard deviation of 70 lb:

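A sketch of the corresponding cell:

# weights ~ Normal(mean=180 lb, std=70 lb)
weights = np.random.normal(loc=180, scale=70, size=DATA_SIZE)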

And let’s do the same for height in inches:

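The post doesn’t state the height parameters; the mean of 66 inches and standard deviation of 4 inches below are assumptions:

# heights in inches; mean and std are assumed values
heights = np.random.normal(loc=66, scale=4, size=DATA_SIZE)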

We can also define a field that specifies whether a patient smokes:

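A sketch; the ‘Yes’/‘No’ values match the labeling rule applied later:

# randomly flag each patient as a smoker or non-smoker
smoker = list(np.random.choice(['Yes', 'No'], size=DATA_SIZE))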

And another that specifies how many days a patient stayed in the emergency room:

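The original distribution isn’t shown; the 1-10 day range below is an assumption:

# length of stay in the emergency room, in whole days
length_of_stay = list(np.random.randint(1, 11, size=DATA_SIZE))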

Next we will use Faker’s DynamicProvider to randomly select elements from a custom list or source. Let’s import it:

from faker.providers import DynamicProvider

Let’s define a dynamic provider object with a list of health insurances:

health_insurance_provider = DynamicProvider(
    provider_name="health_insurance",
    elements=["UnitedHealth Group", "Anthem", "Aetna", "Cigna", "Humana", "Medicare"],
)

Now we will set a random seed for reproducibility and add our health insurance provider object to our faker object:

Faker.seed(42)
fake.add_provider(health_insurance_provider)

We can now append randomly selected insurances to our list of insurances:

insurance = []
for i in range(0, DATA_SIZE):
    insurance.append(fake.health_insurance())


Next let’s use our lists to create a pandas dataframe:

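A sketch assembling the generated lists into a data frame (the column names are inferred from the modeling code later in the post):

import pandas as pd

df = pd.DataFrame({
    'name': names,
    'state': states,
    'sex': sex,
    'age': ages,
    'weight': weights,
    'height': heights,
    'smoker': smoker,
    'length_of_stay': length_of_stay,
    'insurance': insurance,
})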

Let’s also calculate BMI from our weight and height values:

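For weight in pounds and height in inches, BMI uses the standard conversion factor of 703:

# bmi = 703 * weight (lb) / height (in)^2
df['bmi'] = 703 * df['weight'] / df['height'] ** 2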

Our goal is to create a data set that we can use to train a readmission classification model. The next thing we need to do is generate our target labels for readmission. Let’s start by generating some randomly assigned labels. These will serve as the noise we typically find in real data:

df['readmission'] = [np.random.randint(0,2) for x in range(0, DATA_SIZE)]

The names generated by the name() method of the faker object are not necessarily unique, so let’s drop duplicate names:

df.drop_duplicates('name', inplace=True)

Next, let’s sample 20% of our resulting data and store it in a new data frame called df_sample1. This data frame will make up our noise values for negative and positive readmission outcomes:

df_sample1 = df.sample(frac=0.2, replace=True, random_state=1)

We will store the remaining 80% of our data in another data frame called df_sample2:

df_sample2 = df[~df['name'].isin(list(set(df_sample1['name'])))]

We will use df_sample2 to define our ground-truth labels. Readmission values will be 0 or 1. A readmission value of 0 means that the patient has not been readmitted within a specified time frame; a value of 1 means the patient has been readmitted. The window is usually 30 or 60 days, but for our purposes we don’t have to worry about the exact value since this is all synthetic data. Let’s initialize all readmission values in df_sample2 to 0:

df_sample2['readmission'] = 0

Next let’s label all patients who have a BMI >= 30, smoke, have Medicare and stayed in the emergency room for 5 or more days as having a readmission value of 1:

df_sample2.loc[(df_sample2.bmi >= 30) & (df_sample2.smoker == 'Yes') & (df_sample2.insurance == 'Medicare') & (df_sample2.length_of_stay >= 5), 'readmission'] = 1

Finally let’s append df_sample1 to df_sample2 and store the result in a new variable df:

# DataFrame.append is deprecated in recent pandas; concat preserves the same order
df = pd.concat([df_sample2, df_sample1])


Next, let’s use collections.Counter to look at the number of 0 and 1 readmission outcomes. As we can see, the data is imbalanced, with 517 readmission values equal to 1 and 4405 readmission values equal to 0:

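A sketch using collections.Counter:

from collections import Counter

# counts of each readmission label, e.g. Counter({0: 4405, 1: 517})
print(Counter(df['readmission']))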

Building a CatBoost Classifier

Now we can start preparing our data for our classification model. Let’s convert our categorical columns to machine-readable codes. While CatBoost can handle categorical variables directly, this will make using tabgan easier later on:

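One way to do this with pandas category codes (the exact encoding in the original cell isn’t shown):

# replace each categorical column with integer codes
for col in ['insurance', 'sex', 'smoker', 'state']:
    df[col] = df[col].astype('category').cat.codes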

We can now define our input and output for modeling. We will use ‘insurance’, ‘sex’, ‘smoker’, ‘state’, ‘height’, ‘weight’, ‘bmi’, ‘length_of_stay’ to predict ‘readmission’:

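A sketch mirroring the column list that appears again in the tabgan section below:

cols = ['insurance', 'sex', 'smoker', 'state', 'height', 'weight', 'bmi', 'length_of_stay', 'readmission']
X = df[cols[:-1]]  # model inputs
y = df['readmission']  # target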

Next let’s split our data for training and testing:

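Using the same split parameters that appear later in the post:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)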

Now let’s import the catboost package:

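The import is simply:

from catboost import CatBoostClassifier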

We can now define our model object and fit it to our training data. For simplicity, let’s use 10 iterations for our CatBoost model (iterations is the same as n_estimators):

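A sketch of the model cell described in the text:

# 10 boosting iterations keeps training fast for this demo
model = CatBoostClassifier(iterations=10, random_state=42, verbose=False)
model.fit(X_train, y_train)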

Now let’s evaluate the performance of our model. Since our data is imbalanced, precision is a good metric for measuring how well the model predicts the underrepresented class. A precision value of 1.0 means the model has perfect precision:

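A sketch of the evaluation cell (zero_division=0 makes scikit-learn return 0 rather than warn when no positives are predicted):

from sklearn.metrics import precision_score

y_pred = model.predict(X_test)
print(precision_score(y_test, y_pred, zero_division=0))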

We see that the model predicts readmission values of 0 for all of the samples in the test set, even though there are 146 ground-truth values of 1 in the test set. This corresponds to a precision of 0. This is a typical issue when building classification models on imbalanced data. Ideally, tabular GANs can help us increase model precision through data augmentation.

Augmenting Data with Tabgan

Let’s start by importing tabgan:

from tabgan.sampler import GANGenerator

Let’s define our columns, our input and output, and reset indices:

cols = ['insurance', 'sex', 'smoker', 'state', 'height', 'weight', 'bmi', 'length_of_stay', 'readmission']
X = df[cols[:-1]]
y = pd.DataFrame(list(df['readmission']), columns=['readmission'])
X.reset_index(inplace=True, drop=True)
y.reset_index(inplace=True, drop=True)

Next we will split our data for training and testing:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)

And finally, let’s generate our synthetic data with the GANGenerator. This code will produce a new data frame of inputs (new_train2) and outputs (new_target2), where synthetic input data are appended to X_train and stored in new_train2, and synthetic target values are appended to y_train and stored in new_target2. If you are curious about the parameters of GANGenerator and generate_data_pipe, please see the tabgan documentation.

new_train2, new_target2 = GANGenerator(cat_cols=cols, epochs=2, is_post_process=False).generate_data_pipe(X_train, y_train, X_test, use_adversarial=False, only_generated_data=False)


We can now look at the distribution of 1s and 0s for readmission using Counter:

We see that significantly more data has been generated. In our original data set we had 4405 readmission values equal to 0 and 517 equal to 1. In the augmented data we now have 8102 values equal to 0 and 2921 values equal to 1:

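A sketch, assuming new_target2 comes back as a data frame with a 'readmission' column (matching the y we passed in):

print(Counter(new_target2['readmission']))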

We can also display the new input data, new_train2:

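A quick way to inspect it:

new_train2.head()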

Now let’s train a new catboost model on our augmented data:

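A sketch mirroring the baseline model:

model_aug = CatBoostClassifier(iterations=10, random_state=42, verbose=False)
model_aug.fit(new_train2, new_target2)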

And evaluate the performance:

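Evaluating on the same held-out test set as before:

y_pred_aug = model_aug.predict(X_test)
print(precision_score(y_test, y_pred_aug, zero_division=0))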

We see a slight improvement in precision. This can be improved further by tuning the GANGenerator hyperparameters. For example, you can try increasing the number of epochs from 2 to a larger number like 50 or 100.

The code used in this post can be found on GitHub.

Conclusions

Here we looked at how to use open-source Python packages to tackle the real healthcare use case of predicting patient readmission. First, we used the Faker package to generate patient attributes that we subsequently used to train a readmission classification model. We then saw how to use the tabgan package to augment the training data we generated with synthetic values produced by the GANGenerator. Through this method of data augmentation, we were able to improve the precision of our classification model. This method may also be useful in healthcare prediction problems not covered here, including rare disease detection, adverse drug-drug interaction events, patient population risk classification for value-based care and much more.

