How to Ace the missing value replacement problem | by Pranay Dave | May, 2022

By Jessie Hobb On Jun 1, 2022

Top 5 Pro-tips for solving the most common problem in data science

Missing value replacement may sound trivial, but it may be the most important step in the machine learning process. The way in which you do it can have a big impact on your machine learning model.

In addition, as you will be creating some new data, you have some essential responsibilities towards the data departments in the organization. Data is one of the most important assets. So as a data scientist, if you create new data through missing value replacement, you will have to justify the output to business folks.

In this article, I will show you the top 5 tips to ace missing value replacement, so you look like a pro in such an important subject. I will be using the K-Nearest Neighbor (KNN) algorithm to replace missing values. However, the tips also apply to other missing value replacement methods.

Let me start with the dataset I will be using in this story. I found a very interesting dataset on Kaggle called Spaceship Titanic (the data source citation is available at the end of the article). The data is based on a fictional story. The Spaceship Titanic was an interstellar passenger liner that got hit by a spacetime anomaly. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

Photo by Guillermo Ferla on Unsplash + data table by author

It sounds similar to the Titanic problem, but it is more futuristic! The data contains passenger details such as home planet, destination, age, and various services. Many of the columns have missing values. So now let us get down to the top tips which will help you ace missing value replacement.

The best way to identify and define a missing value is to make a visualization, because it avoids relying on preconceptions. Here is a bar chart of the Home planet field. The fact that something is missing is glaringly obvious. For this field, we can say that the missing value is a blank. As every passenger has a home planet, we need to replace each blank with some value.

Bar chart on the number of passengers by home planet (image by author)
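For readers who prefer code to charts, the same check can be sketched in a few lines of pandas. This is only a minimal sketch: the toy rows and the column name HomePlanet are assumptions for illustration, not the actual Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", np.nan, "Mars", "Earth", "", "Earth"],
})

# Treat empty strings as missing too: mask() turns them into NaN.
home = df["HomePlanet"].mask(df["HomePlanet"] == "")

# dropna=False keeps the missing bucket visible, like the blank bar in the chart.
counts = home.value_counts(dropna=False)
print(counts)

n_missing = int(home.isna().sum())
print(f"Missing HomePlanet values: {n_missing} of {len(df)}")
```

Plotting `counts` as a bar chart gives the same picture as above, with the blanks as their own bar.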

Similar histograms for the numerical columns Room Service and Age are shown below.

For room service, zero values are very common. This is normal as not all passengers take room service. However, for Age, there are zero values and they are not the most common. Even though we are in the future, I am sure that unborn people are not considered passengers! So zero value for age is definitely a missing value.
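The rule described above (zero is valid for room service but not for age) can be written down explicitly. Here is a small sketch, assuming hypothetical column names Age and RoomService and made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the Kaggle data.
df = pd.DataFrame({
    "Age": [34.0, 0.0, 27.0, 0.0, 51.0],
    "RoomService": [0.0, 0.0, 120.0, 0.0, 15.0],
})

# Zero is a legitimate amount for RoomService, so that column is left alone.
# Zero is not a plausible passenger age, so it is recoded as missing (NaN).
df["Age"] = df["Age"].replace(0, np.nan)

print(df.isna().sum())
```

Recoding the zeros as NaN makes the definition of "missing" explicit, so later steps treat both columns correctly.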

Generally, missing value replacement is done within the context of machine learning. So it is important to check the missing values with respect to the target class. Missing value replacement generally works well if the distribution of the data with respect to the target class remains balanced.

The target variable in the dataset is whether the passenger was lost by getting transported to an alternate dimension. The visualization below shows the distribution of the Home planet with respect to the target variable. We can observe that the proportion of blank values is similar for both target classes. Additionally, the proportion of missing values is very small. This means that replacing the missing values will not drastically alter the shape of the data. This is good news, and we can be confident about communicating missing value ideas to data and business folks. It will not cause any panic with the data gods within the company!

Home planet vs. target class (image by author)
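The same comparison can be expressed as a number per target class. The sketch below assumes a hypothetical boolean target column named Transported and uses toy rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample with a boolean target column.
df = pd.DataFrame({
    "HomePlanet": ["Earth", np.nan, "Mars", "Europa",
                   "Earth", np.nan, "Earth", "Mars"],
    "Transported": [True, True, True, True, False, False, False, False],
})

# Share of missing HomePlanet values within each target class.
missing_rate = df["HomePlanet"].isna().groupby(df["Transported"]).mean()
print(missing_rate)
```

If the two rates are close to each other, as they are in this toy sample, the missing values are spread evenly across the target classes.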

We make a similar analysis for the numeric column Age, using a box plot as shown below. You will observe that the box plots for both target classes are more or less similar. Also, the zero value, which is the value to be replaced, is far from the median. This implies that there are not many zero values. So once again, we will not drastically alter the shape of the data, thus keeping the data gods in the company happy!

We are now all set to replace the missing values for the fields Home planet and Age. One of the very efficient algorithms for this job is the K-Nearest Neighbor (KNN) algorithm. The algorithm will try to find the nearest neighbors for records with missing values. For numeric values, it will replace the missing value with the average of the nearest neighbors. For categorical values, it will replace it with the most common value among them.
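As a rough illustration of the idea, here is a sketch using scikit-learn. It is not necessarily what the tool behind this article does: the numeric fill uses `KNNImputer`, while the categorical fill is approximated with a `KNeighborsClassifier` majority vote, which is a technique I am swapping in for illustration. The toy data and column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy sample: one missing Age and one missing HomePlanet.
df = pd.DataFrame({
    "Age":         [20.0, 22.0, np.nan, 60.0, 62.0, 61.0],
    "RoomService": [0.0, 10.0, 5.0, 300.0, 310.0, 305.0],
    "HomePlanet":  ["Earth", "Earth", "Earth", "Europa", None, "Europa"],
})
num_cols = ["Age", "RoomService"]

# Numeric column: fill Age with the mean Age of the k nearest rows.
imputer = KNNImputer(n_neighbors=2)
df[num_cols] = imputer.fit_transform(df[num_cols])

# Categorical column: majority vote among the k nearest rows with a known planet.
known = df["HomePlanet"].notna()
clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(df.loc[known, num_cols], df.loc[known, "HomePlanet"])
df.loc[~known, "HomePlanet"] = clf.predict(df.loc[~known, num_cols])

print(df)
```

On real data the numeric features would normally be scaled first, since KNN distances are sensitive to the units of each column.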

Once the algorithm has been executed, it is very useful to make a heat-map analysis of how many values got replaced.

KNN Before and After heatmap (image by author)

We can observe that all missing values for Age and Home Planet have been replaced. For Room service and other fields, we have not chosen to replace missing values, so they remain the same before and after.
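A numeric counterpart to the heatmap is a simple before/after count of missing values per column. In this sketch the "after" table is a stand-in built with placeholder fill values (not real KNN output), and the column names are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical table before replacement.
before = pd.DataFrame({
    "Age": [30.0, np.nan, 45.0, np.nan],
    "HomePlanet": ["Earth", None, "Mars", "Earth"],
    "RoomService": [0.0, np.nan, 20.0, 5.0],
})

# Stand-in for the imputed table: Age and HomePlanet filled with
# placeholder values, RoomService deliberately left untouched.
after = before.copy()
after["Age"] = after["Age"].fillna(37.5)
after["HomePlanet"] = after["HomePlanet"].fillna("Earth")

summary = pd.DataFrame({
    "missing_before": before.isna().sum(),
    "missing_after": after.isna().sum(),
})
print(summary)
```

Columns that were not selected for replacement should show the same count in both columns of the summary.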

Great, the KNN algorithm has done the magic to replace the missing values. But with what? We can have confidence in the replacement algorithm only if we understand what the values are after replacement.

This step depends upon what kind of algorithm you have used. In this article, I have used KNN, so I will specify the approach for this algorithm. As KNN is based on finding the nearest neighbors, it will be useful to have an idea of what the nearest neighbors of the replaced values are. One of the algorithms which can help visualize “nearness” is the dimensionality reduction algorithm TSNE (t-distributed stochastic neighbor embedding).

Shown here are the results of TSNE plotted on a 2D scatter plot.

The green points are data points whose missing values have been replaced using the KNN algorithm. The purple points are data points with no missing values. As most of the green points are surrounded by purple points, we can have confidence in the KNN approach: the nearest neighbors used to fill the missing values are not too far away.
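A plot like this can be produced with scikit-learn's `TSNE`. The sketch below uses random numbers as a stand-in for the imputed feature matrix, and the `was_imputed` flag is a hypothetical helper marking the rows that would be drawn in green.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical numeric feature matrix after imputation: 60 rows, 4 features.
X = rng.normal(size=(60, 4))

# Hypothetical flag marking which rows had a value filled in by KNN.
was_imputed = rng.random(60) < 0.1

# Project to 2D; perplexity must be smaller than the number of rows.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(embedding.shape)

# To draw the scatter plot, colour the points by the flag, for example:
# plt.scatter(embedding[:, 0], embedding[:, 1], c=was_imputed)
```

Keep in mind that TSNE preserves local neighborhoods rather than exact distances, so the plot is a visual sanity check rather than a proof.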

There is an isolated cluster of green points, which is zoomed in the figure below. A closer inspection shows that even though it is an isolated cluster, there are a few neighbors which helped fill the missing values.

Zoom in to an isolated cluster (image by author)

We can also verify the proportion based on the target class and compare it with earlier results. Shown below is an analysis of the Home planet. We can observe that the blank values (4%) got replaced by Earth (+2%), Europa (+1%), and Mars (+1%). This makes sense, as most of the home planets correspond to Earth, followed by Europa and Mars.

Verify the proportion for the home planet before and after (image by author)
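The before/after shift in proportions can also be tabulated directly. This is a minimal sketch with made-up values, so the percentages it prints are not the ones quoted above:

```python
import pandas as pd

# Hypothetical column before and after replacement.
before = pd.Series(["Earth", "Earth", "Europa", "Mars", None,
                    "Earth", "Europa", None])
after = pd.Series(["Earth", "Earth", "Europa", "Mars", "Earth",
                   "Earth", "Europa", "Mars"])

# Share of each category, with blanks shown as their own row.
comparison = pd.DataFrame({
    "before": before.fillna("(blank)").value_counts(normalize=True),
    "after": after.value_counts(normalize=True),
}).fillna(0.0)
comparison["change"] = comparison["after"] - comparison["before"]

print(comparison.round(3))
```

The "change" column shows where the blank share went, which is the kind of evidence you can hand to data and business folks.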

Similarly, we can observe that most of the zero-age values have been replaced by ages 12–14 and 24–25. The overall histogram has not changed much, which is good news.

Verify the proportion for the age before and after (image by author)

So now is the moment we have been waiting for. We need to demonstrate whether all the effort to manage the missing values was worth it. The best way to show this is to train and test the machine learning model twice: before and after missing value replacement.

Confusion matrix before and after replacement (image by author)

The ROC score goes up, which means that all the effort of missing value replacement was worth it! Even a 1% increase in score can help you rise on the Kaggle leaderboard and do justice to the passengers in the galaxy!
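Here is one way such a before/after comparison could be set up in code. It is only a sketch under stated assumptions: the data is synthetic, the model is a plain logistic regression rather than whatever model produced the figure above, and "before" means leaving the zero ages in place. Whether the score improves depends on the data, so the sketch only prints both numbers.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-in data: two correlated features and a binary target.
age = rng.normal(30, 10, n)
spend = age * 5 + rng.normal(0, 10, n)
y = (age + rng.normal(0, 5, n) > 30).astype(int)

# Knock out 15% of the ages.
missing = rng.random(n) < 0.15

# "Before": missing ages are recorded as zero, as in the raw data.
X_before = np.column_stack([np.where(missing, 0.0, age), spend])

# "After": missing ages are NaN and get filled by KNN inside the pipeline,
# so the imputer is fitted on the training folds only.
X_nan = np.column_stack([np.where(missing, np.nan, age), spend])
pipe_after = make_pipeline(KNNImputer(n_neighbors=5),
                           LogisticRegression(max_iter=1000))

model_before = LogisticRegression(max_iter=1000)
auc_before = cross_val_score(model_before, X_before, y,
                             cv=5, scoring="roc_auc").mean()
auc_after = cross_val_score(pipe_after, X_nan, y,
                            cv=5, scoring="roc_auc").mean()

print(f"ROC AUC before: {auc_before:.3f}, after: {auc_after:.3f}")
```

Putting the imputer inside the pipeline keeps the comparison fair, because the test folds never influence how the missing values are filled.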

In summary, missing value replacement changes the data. So it is important to do it with enough proof points on why and what is changed. The top 5 tips described here will help you ace the missing value replacement problem.

You can visit my website to do missing value replacements as well as other analytics with no coding: https://experiencedatascience.com

Here is a step-by-step tutorial and demo on my YouTube channel. You will be able to customize the demo to your data with zero coding.

The data is available here: https://www.kaggle.com/competitions/spaceship-titanic/overview

As specified in the rules (https://www.kaggle.com/competitions/spaceship-titanic/rules), in section 7 A, data can be used for any purpose.

A. Data Access and Use. You may access and use the Competition Data for any purpose, whether commercial or non-commercial, including for participating in the Competition and on Kaggle.com forums, and for academic research and education.
