
Advanced Missing Data Imputation Methods with Sklearn
by Bex T. | Jun, 2022



A comprehensive tutorial for learning to leverage powerful model-based imputation techniques.

Photo by Anni Roenkae

Introduction

Despite the massive number of MOOCs and other online resources, there are still skill gaps in dealing with certain data problems. One example is properly handling missing data in real-world datasets. Beginners often take this problem lightly, and they are not to blame. Even though it is such a pressing issue, the complexity of missing-data problems is easy to underestimate because so much practice happens on small, easy-to-work-with toy datasets.

As a result, many beginner data scientists don’t go beyond simple mean, median, or mode imputation. Though these methods may suffice for simple datasets, they are not a competent solution to handling missing data in large datasets.

Like any other stage of the data science workflow, missing data imputation is an iterative process. You should be able to apply multiple methods and compare their results effectively. The basic techniques sometimes perform well, but rarely, so you need a few backup strategies.

This tutorial will introduce two more robust, model-based imputation algorithms in Sklearn: KNNImputer and IterativeImputer. You will learn their basic usage, tune their parameters, and finally, see how to assess their effectiveness visually.

Identifying the Type of Missingness

The first step to implementing an effective imputation strategy is identifying why the values are missing. Even though each case is unique, missingness can be grouped into three broad categories:

  • Missing Completely At Random (MCAR): this is a genuine case of data missing randomly. Examples are sudden mistakes in data entry, temporary sensor failures, or generally missing data that is not associated with any outside factor. The amount of missingness is low.
  • Missing At Random (MAR): this is a broader case of MCAR. Even though missing data may seem random at first glance, it will have some systematic relationship with the other observed features — for example — data missing from observational equipment during scheduled maintenance breaks. The number of null values may vary.
  • Missing Not At Random (MNAR): missing values may exist in large amounts, and the reason for the missingness is associated with factors beyond our control or knowledge.

Identifying which category your problem falls into can help narrow down the set of solutions you can apply.

Let’s further explore these missingness types using the Diabetes dataset:
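As a rough sketch, a DataFrame like the one used below can be prepared as follows, assuming a local diabetes.csv copy of the Pima Indians Diabetes data in which the physiologically impossible zeros are re-encoded as NaN (both the file name and that preprocessing step are assumptions on my part):

import numpy as np
import pandas as pd

# Assumed file name; the Pima Indians Diabetes dataset is widely available as a CSV
diabetes = pd.read_csv("diabetes.csv")

# In this dataset, missing measurements are usually recorded as zeros,
# so we re-encode them as NaN before exploring missingness
cols_with_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
diabetes[cols_with_missing] = diabetes[cols_with_missing].replace(0, np.nan)

# Number of missing values per feature
diabetes.isnull().sum()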

[Figure: missing values in the Diabetes dataset. Image by author]

There are five features with different proportions of missing values. A first step in identifying the missingness type is to plot a missingness matrix. This special plot is available through the missingno package, which we can import as msno:
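A minimal sketch of that step, assuming the diabetes DataFrame from above:

import missingno as msno

# Nullity matrix: white gaps mark missing values in each column
msno.matrix(diabetes);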

[Figure: missingness matrix of the Diabetes dataset. Image by author]

This matrix shows how nulls are scattered across the dataset. White segments or lines represent where missing values lie. Glucose, BMI, and blood pressure columns can be considered MCAR because of two reasons:

  • The proportion of missing values is small.
  • Missing values are scattered entirely randomly in the dataset.

But the Insulin and SkinThickness columns have an unusually large number of missing data points. So, is there some relationship between their missingness?

To answer this, MSNO provides a missingness heatmap that shows the missingness correlation:

>>> msno.heatmap(diabetes);
[Figure: missingness heatmap of the Diabetes dataset. Image by author]

We can see a strong correlation between SkinThickness and Insulin in the plot. We can confirm this by sorting the matrix by either of the columns:

>>> msno.matrix(diabetes.sort_values("Insulin"));
[Figure: missingness matrix sorted by Insulin. Image by author]

The plot shows that if a data point is missing in SkinThickness, we can guess that it is also missing from the Insulin column, and vice versa. Because of this connection, we can safely say the missing data in both columns is not missing at random (MNAR).

We can also see a weak correlation between BloodPressure and SkinThickness, which would indicate that BloodPressure is not missing completely at random (MCAR) but has some relationship with the missing values in SkinThickness. In other words, it is missing at random (MAR).
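If you prefer numbers over colors, the nullity correlations behind the heatmap can also be computed directly with pandas (a small sketch, assuming the diabetes DataFrame from earlier):

# Correlation between the boolean "is missing" indicators of each column
nullity_corr = diabetes.isnull().astype(int).corr()
print(nullity_corr.loc["SkinThickness", "Insulin"])
print(nullity_corr.loc["BloodPressure", "SkinThickness"])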

It might take a while to wrap your mind around these missingness types. For deeper insight, you can refer to the separate article I wrote specifically on missingness types and the MSNO package:

Imputing With KNNImputer

Now, let’s move on to some imputation methods.

Apart from the basic SimpleImputer, Sklearn provides the KNNImputer class, which uses the K-Nearest-Neighbors algorithm to impute numeric values. If you are not familiar with KNN, I recommend reading my separate article on it:

For reference, here is an excerpt from that article briefly touching on how the KNN algorithm works:

“Imagine you have a variable with two categories which are visualized here:

Image by Wikipedia

Given a new, unknown sample, how do you tell which group it belongs to? Well, naturally, you would look at the surrounding points. But the result really depends on how far you look. If you look at the closest three data points (inside the solid circle), the green dot would belong to the red triangles. If you look further (inside the dashed circle), the dot would be classified as a blue square.

KNN works the same way. Depending on the value of k, the algorithm classifies new samples by the majority vote of the nearest k neighbors in classification. For regression, which predicts the actual numerical value of a new sample, the algorithm takes the mean of the nearest k neighbors.”

KNNImputer applies this idea to imputation: each missing numeric value is filled with the average of the values its k nearest neighbors have for that feature, where the neighbors are found using distances computed on the features that are present (and can optionally be weighted by distance). For folks who have been using Sklearn for a while, its usage should feel familiar:
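A minimal sketch of the basic usage, assuming the diabetes DataFrame built earlier (the choice of n_neighbors=5 here is simply the default, not a recommendation):

import pandas as pd
from sklearn.impute import KNNImputer

# n_neighbors is the k of the underlying KNN algorithm
imputer = KNNImputer(n_neighbors=5)

# fit_transform returns a NumPy array with every NaN filled in
diabetes_imputed = pd.DataFrame(
    imputer.fit_transform(diabetes), columns=diabetes.columns
)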

With this imputer, the main problem is choosing the correct value for k. Since imputation has no ground truth to score against, you cannot tune it with GridSearch, so we take a visual approach to compare candidates:
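A sketch of that comparison (the particular k values and plot settings below are my own choices): plot the distribution of the original column, then overlay the distribution produced by each candidate imputer.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import KNNImputer

fig, ax = plt.subplots(figsize=(10, 6))

# Original SkinThickness distribution (the KDE simply ignores NaNs)
diabetes["SkinThickness"].plot(kind="kde", ax=ax, linewidth=3, label="original")

# Impute with several candidate values of k and overlay each result
for k in [2, 3, 5, 7]:
    imputed = pd.DataFrame(
        KNNImputer(n_neighbors=k).fit_transform(diabetes), columns=diabetes.columns
    )
    imputed["SkinThickness"].plot(kind="kde", ax=ax, label=f"k = {k}")

ax.legend()
plt.show()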

[Figure: original vs. imputed SkinThickness distributions for different values of k. Image by author]

First, we plot the distribution of the original SkinThickness column, which still contains missing values. Then, we impute the same column with different values of k and plot each result on top of the original. The closer an imputed distribution comes to the original, the better the imputation. Here, it seems k=2 is the best choice.

Imputing With Iterative Imputer

Another more robust but more computationally expensive technique is IterativeImputer. It takes an arbitrary Sklearn estimator and tries to impute missing values by modeling each feature that has missing values as a function of the other features. Here is a more granular, step-by-step explanation of how it works:

  1. A regressor is passed to the transformer.
  2. The first feature with missing values (feature_1) is chosen.
  3. The data is split into train/test sets, where the train set contains all the rows with known values for feature_1 and the test set contains the rows where it is missing.
  4. The regressor is fit with all the other features as inputs and feature_1 as the output.
  5. The regressor predicts the missing values of feature_1.
  6. The transformer moves on to the next feature with missing values and repeats steps 2–5 until every feature has been imputed.
  7. Steps 2–6 make up a single iteration round, and the rounds are repeated as many times as the max_iter parameter of the transformer specifies.

This means that IterativeImputer (II) produces not one but max_iter candidate values for each missing sample. This has the benefit of treating each missing data point as a random variable and accounting for the inherent uncertainty that comes with missingness. This approach is also called multiple imputation, and it is the basis for many other imputation techniques out there (yes, there are many others).

When all iterations are done, II returns only the result of the final round, because the predictions improve with each iteration. The algorithm also has an early-stopping feature that terminates the iterations when the results stop changing noticeably between rounds.
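If you want several plausible completions of the dataset in the strict multiple-imputation sense, one option mentioned in the Sklearn user guide is to run the imputer repeatedly with sample_posterior=True and different random seeds; a rough sketch (the number of runs is an arbitrary choice):

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Each run draws imputed values from the posterior predictive distribution
# instead of using point predictions, so different seeds give different
# plausible versions of the completed dataset
imputations = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(diabetes),
        columns=diabetes.columns,
    )
    for seed in range(5)
]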

According to Sklearn, this implementation of IterativeImputer was inspired by the popular R MICE package (Multivariate Imputation by Chained Equations). Let’s see it in action:
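A minimal sketch of the basic usage (the max_iter and random_state values here are arbitrary choices):

import pandas as pd
# This import is required because IterativeImputer is still experimental
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
diabetes_imputed = pd.DataFrame(
    imputer.fit_transform(diabetes), columns=diabetes.columns
)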

IterativeImputer is still an experimental feature, so don’t forget to include the enable_iterative_imputer import shown in the snippet above.

When estimator is left as None, the transformer defaults to BayesianRidge. But, after reading the official Sklearn guide on IterativeImputer, I learned that BayesianRidge and ExtraTreesRegressor yield the best results.

Performance Comparison of the Different Techniques

It is time to test how well the imputations work. To do this, we will predict whether a patient has diabetes (the Outcome column), so it is a binary classification task. Let’s build the feature/target arrays:
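A small sketch of that step, assuming the target column is named Outcome as in the standard version of this dataset:

# Features are everything except the target column
X = diabetes.drop("Outcome", axis=1)
y = diabetes["Outcome"].values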

We will test both KNNImputer and IterativeImputer using cross-validation. For the IterativeImputer estimators, we will use the recommended BayesianRidge and ExtraTreesRegressor:
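A sketch of such a comparison, chaining each imputer with a RandomForestClassifier in a pipeline and scoring with 5-fold cross-validation (the specific neighbor counts, estimator settings, and scoring metric are my assumptions):

from sklearn.ensemble import ExtraTreesRegressor, RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Candidate imputation strategies to compare
imputers = {
    "knn_2": KNNImputer(n_neighbors=2),
    "knn_7": KNNImputer(n_neighbors=7),
    "iterative_bayesian_ridge": IterativeImputer(
        estimator=BayesianRidge(), random_state=42
    ),
    "iterative_extra_trees": IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=50, random_state=42), random_state=42
    ),
}

# Score each strategy with the same downstream classifier
for name, imputer in imputers.items():
    pipeline = make_pipeline(imputer, RandomForestClassifier(random_state=42))
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")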

[Figure: cross-validated scores for each imputation strategy. Image by author]

We can see from the final results that KNNImputer with seven neighbors is the best choice for this dataset when the downstream model is a random forest. Even though I mentioned that IterativeImputer would be more robust, you can never be sure. Maybe we could have achieved better performance by tuning its parameters further.

Summary

Missing data is a problem that should be taken seriously. There is no point in spending hours learning complex ML algorithms if you can’t even fix the prerequisite data problems. Remember that a model is only as good as the data it was trained on. This means you have to do your best in dealing with missing data points as they are ubiquitous in real-world datasets.

In this article, you learned how to deal with missingness using two model-based techniques: KNNImputer and IterativeImputer. Below are links to their documentation, an official Sklearn guide on their usage, and related sources that will help your understanding:

Thanks for reading!

