
What Is Data Leakage, and How Can It Be Avoided in Machine Learning?

by Suhas Maddali

While machine learning metrics can show impressive results on the test set, those results can be misleading unless you understand how they were produced.

Photo by Izzy Gerosa on Unsplash

After performing all the tasks in the machine learning workflow, such as data collection, data visualization, data processing, data manipulation, and training, one interesting task remains: analyzing your models and evaluating their performance. To do this, you divide the overall data into two parts. The first part, which usually contains the majority of the samples, is used to train the machine learning models, while the remaining samples are used to test how well the models perform on data they have never seen before.
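As a quick illustration, here is a minimal sketch of such a split using scikit-learn's train_test_split; the arrays X and y are hypothetical placeholders rather than data from any particular problem:

```python
# A minimal sketch of the split described above, using scikit-learn.
# X and y are hypothetical stand-ins for your own features and labels.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((1000, 5))                # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)        # hypothetical binary labels

# Hold out 20% of the samples as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)       # (800, 5) (200, 5)
```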

After training, you wait for the ML models to produce good results on various metrics: accuracy, precision, recall, and F1 score in the case of classification, or mean squared error, mean absolute error, root mean squared error, and R-squared in the case of regression. You then decide to deploy the model that performs best on the test set. Before deploying in real time, however, there is an important mechanism to understand. Even when a model performs impressively on the test data, deploying it can be detrimental to the value the algorithm creates if the phenomenon of data leakage has not been understood and checked beforehand.
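For completeness, here is a hedged sketch of computing the classification metrics mentioned above; the synthetic dataset and logistic regression model are stand-ins chosen purely for illustration:

```python
# A sketch of computing common classification metrics on a test set,
# using a synthetic dataset and logistic regression as stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
```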

What is data leakage?

Photo by dan carlson on Unsplash

When we are about to train a machine learning model, we implicitly divide the data into three parts: the training set, the cross-validation (validation) set, and the test set. We train the model on the training set, letting the algorithm learn its parameters. We then use the validation set to tune the hyperparameters and squeeze out better performance. Finally, we take the test set to measure how well the tuned model performs on unseen data.
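One common way to obtain the three sets is two successive calls to train_test_split, as sketched below; the 60/20/20 proportions and the placeholder arrays are illustrative assumptions, not a prescription:

```python
# Two successive calls to train_test_split yield roughly
# 60% train, 20% validation, 20% test.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                # hypothetical features
y = rng.integers(0, 2, size=1000)        # hypothetical labels

# First split: 60% train, 40% held back for validation + test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42
)
# Second split: divide the held-back 40% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```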

Before doing these steps, people sometimes apply one-hot encoding or feature standardization to the whole dataset instead of fitting these transformations on the training set alone. When the transformations are fit on the entire dataset, information leaks: statistics from the test set, such as the feature means used for standardization, influence the training data, so the model has indirectly seen the test set. We then get a good score on the test set, which leads us to believe we are doing well and gives us the green light to deploy the model in real time. In reality, the model may perform far worse once deployed. It is therefore also important to monitor the model's predictions after deployment to catch this phenomenon.
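The sketch below contrasts the leaky and leak-free versions of standardization, assuming scikit-learn's StandardScaler; the data is a hypothetical placeholder:

```python
# Contrasting leaky and leak-free standardization. Fitting the scaler on
# the full dataset lets test-set statistics (mean, std) leak into training.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((1000, 5))                      # hypothetical features
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: statistics are computed over train AND test together.
# X_scaled = StandardScaler().fit_transform(X)

# Leak-free: fit on the training set only, then apply to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```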

When the same model is deployed in real time, we tend to see a degradation in performance: live data carries a lot of uncertainty, and the statistics of the entire dataset, such as its mean, that leaked in during the testing phase are no longer available. It is therefore important to consider and understand this phenomenon before deploying a model in real time.

Ways to overcome data leakage

Photo by Gary Bendig on Unsplash

There are various ways in which data leakage can be prevented; we will go through them in the next few paragraphs.

K-Fold Cross-Validation

One useful way to guard against data leakage is k-fold cross-validation, where the overall data is divided into k parts. Each part takes a turn as the validation data while the remaining k - 1 parts are used for training; crucially, any preprocessing is re-fit on the training folds each time rather than on the full dataset. After measuring performance on each of the k folds, we take the average to report the overall performance of the model.
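Here is a minimal sketch of 5-fold cross-validation with scikit-learn; wrapping the scaler and model in a Pipeline is the design choice that keeps preprocessing statistics from leaking across folds, and the dataset and model are illustrative stand-ins:

```python
# 5-fold cross-validation with a Pipeline, so the scaler is re-fit on
# each fold's training portion instead of on the full dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```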

Dropping duplicates

The data used for training and cross-validation can also contain duplicate values, meaning identical rows, and this can give us an inflated picture of the model's performance. Consider, for example, dividing the overall data into a training set and a test set. With duplicate rows, one copy may land in the training data and the other in the test set. Since the model was already trained on that exact row, it scores well on it in the test set, and the measured performance is inflated. It is therefore a good idea to check whether the dataset contains duplicate rows before splitting.
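A minimal sketch with pandas, using a toy frame with one duplicated row; drop_duplicates runs before the split so the duplicate cannot straddle the two sets:

```python
# Dropping duplicate rows *before* the split, so the same record cannot
# land in both the training and test sets. The frame is a toy example.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feature_a": [1, 2, 2, 3, 4, 5, 6, 7],
    "feature_b": [10, 20, 20, 30, 40, 50, 60, 70],
    "label":     [0, 1, 1, 0, 1, 0, 1, 0],
})

df = df.drop_duplicates().reset_index(drop=True)   # row (2, 20, 1) kept once
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
print(len(df), len(train_df), len(test_df))        # 7 5 2
```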

Removing features that are highly correlated with the output or target variable

When performing machine learning analysis, there is often an assumption that the more strongly a feature is correlated with the target or output label, the better the predictions from the ML models. While this can be true and should be considered when developing a model, there are often cases where such a feature is simply not available in real time.

Consider a bank transaction example: an ML model is deployed to check whether a given customer is going to exceed their daily transaction limit. A feature such as the customer’s overall daily expenses has a direct correlation with the output. But when the model is deployed in real time, we do not yet know the customer’s daily expenses at the moment of prediction, so the model lacks the very information it relied on most to decide whether the customer will exceed the limit that day. The feature “daily expenses” essentially restates whether the customer has exceeded the limit. It causes data leakage, and the fix is to drop it.
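As a sketch of this audit, assuming a pandas DataFrame whose column names and values are entirely hypothetical, invented for illustration:

```python
# Auditing correlations with the target and dropping a feature that
# will not be available at prediction time.
import pandas as pd

df = pd.DataFrame({
    "daily_expenses":   [120.0, 80.0, 310.0, 40.0, 275.0, 60.0],
    "account_age_days": [400, 120, 900, 60, 350, 210],
    "exceeded_limit":   [0, 0, 1, 0, 1, 0],
})

# Features suspiciously correlated with the target deserve scrutiny.
print(df.corr()["exceeded_limit"].sort_values(ascending=False))

# "daily_expenses" is unknown at the moment of prediction, so drop it.
df = df.drop(columns=["daily_expenses"])
```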

Preserving Temporal Ordering in Time Series Forecasting

With random splitting, we randomly permute the rows before dividing the data into training and test sets. In time series forecasting, this must not be done, because the target variable depends on the current inputs and on previous time steps. If we split randomly, future information is already present for the ML model at training time, and we get a very good score on the metric under consideration (accuracy, for example) that the model cannot reproduce in practice. In time series forecasting, therefore, take care to split the data temporally rather than randomly to avoid data leakage.
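A minimal sketch of a temporal split, along with scikit-learn's TimeSeriesSplit for the cross-validation variant; the time-ordered arrays are hypothetical placeholders:

```python
# Splitting time-ordered data without shuffling: the model trains on
# the past and is tested on the future.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # hypothetical time-ordered feature
y = np.arange(100)                  # hypothetical target

# Simple holdout: the last 20% of observations form the test set.
cut = int(len(X) * 0.8)
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

# Cross-validation variant: every fold trains only on earlier data.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
```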

Conclusion

All in all, we have seen how leaky data can lead us to believe that a model performs well on the test set when that is far from true. To prevent data leakage, regularly check whether the mean used for standardization is computed from the entire dataset or from the training data alone, and apply the same discipline to one-hot encoding. We have also looked at other ways to reduce data leakage in this article. Thanks for taking the time to read it.

Below are ways you can contact me or take a look at my work. Thanks.

GitHub: suhasmaddali (Suhas Maddali) (github.com)

LinkedIn: Suhas Maddali, Northeastern University, Data Science | LinkedIn

Medium: Suhas Maddali — Medium

