How Data Leakage affects model performance claims | by Georgia Deaconu | Jan, 2023

By Jessie Hobb On Jan 3, 2023

This year has seen several important scientific advances enabled by machine-learning-driven research. Along with the enthusiasm has come some concern about the reproducibility issues encountered in ML-based science. Several methodological problems have been identified, of which data leakage seems to be the most widespread. In general, data leakage can skew results and lead to overly optimistic conclusions.

Data leakage can occur in several different ways. The objective of this post is to present some of the most commonly encountered types, along with a few tips on how to identify and mitigate them.

Image generated by the author using dreamstudio.ai

Data leakage can be defined as an artificial relationship between the target variable and its predictors that is unintentionally introduced through the data collection method or the pre-processing strategy.

The main sources of data leakage I will illustrate are:

Improper separation between the training and test datasets
The use of features that are not legitimate (proxy variables)
A test set that is not drawn from the distribution of interest

Data scientists know that they need to divide their input data into training and test sets, train their model only on the training set, and compute evaluation metrics only on the test set. Mixing the two is a textbook error that most people know to avoid. However, the initial exploratory analysis is often performed on the complete data set. If this initial analysis also involves pre-processing and data cleaning steps, it can become a source of data leakage.

Pre-processing steps that can introduce data leakage:

Performing missing-value imputation or scaling before splitting the data into training and test sets. When the complete data set is used to compute the imputation or scaling parameters (mean, standard deviation, etc.), information that should not be available to the model during training, namely the statistics of the test rows, is introduced into the training set.
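
The safe ordering for this pre-processing step can be sketched as follows. This is a minimal illustration using only the Python standard library, not code from the original post: the one-feature data set, the 80/20 split and the helper names `fit_preprocessor` and `transform` are all assumptions made up for the example. The point it tries to show is simply that the imputation mean and the scaling parameters are learned from the training rows only and then reused unchanged on the test rows.

```python
import random
import statistics

random.seed(0)

# Hypothetical one-feature data set with some missing values (None).
values = [random.gauss(50.0, 10.0) for _ in range(200)]
for i in random.sample(range(200), 20):
    values[i] = None

# 1. Split FIRST, before any statistic is computed.
indices = list(range(len(values)))
random.shuffle(indices)
cut = int(0.8 * len(indices))
train = [values[i] for i in indices[:cut]]
test = [values[i] for i in indices[cut:]]


def fit_preprocessor(column):
    """Learn the imputation mean and scaling std from one column."""
    observed = [v for v in column if v is not None]
    return statistics.fmean(observed), statistics.pstdev(observed)


def transform(column, mean, std):
    """Impute missing values with `mean`, then standardize."""
    return [((mean if v is None else v) - mean) / std for v in column]


# 2. Fit the parameters on the training set only.
train_mean, train_std = fit_preprocessor(train)

# 3. Apply those same training parameters to both sets.
train_scaled = transform(train, train_mean, train_std)
test_scaled = transform(test, train_mean, train_std)

# Leaky variant, shown only for comparison: parameters from ALL rows,
# so test-set information influences what the model is trained on.
leaky_mean, leaky_std = fit_preprocessor(values)
print(f"train-only mean = {train_mean:.3f}, full-data mean = {leaky_mean:.3f}")
```

In practice the same rule is usually enforced with a library rather than by hand. For example, a scikit-learn `Pipeline` that chains `SimpleImputer` and `StandardScaler` with an estimator learns its parameters only from the data passed to `fit`, so fitting it on the training split (or inside cross-validation) keeps the test rows out of the pre-processing statistics.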

Correlation between some features and the target variables (Image by the author)
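
One rough way to look for the proxy variables mentioned above is to screen each feature's correlation with the target and manually review anything that looks too good to be true. The sketch below is only an assumption about how such a check might be done, not the author's method: the synthetic data, the feature names and the `THRESHOLD` value of 0.95 are hypothetical, and a high correlation is a prompt for investigation, not proof of leakage.

```python
import math
import random

random.seed(1)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical data: one target and three candidate features.
n = 500
target = [random.gauss(0.0, 1.0) for _ in range(n)]
features = {
    # Genuinely informative but noisy feature.
    "legit_signal": [t + random.gauss(0.0, 1.0) for t in target],
    # Feature unrelated to the target.
    "pure_noise": [random.gauss(0.0, 1.0) for _ in range(n)],
    # Feature that is almost a copy of the target (a likely proxy).
    "suspected_proxy": [2.0 * t + random.gauss(0.0, 0.05) for t in target],
}

THRESHOLD = 0.95  # hypothetical cut-off; tune it for the problem at hand

correlations = {name: pearson(col, target) for name, col in features.items()}
flagged = [name for name, r in correlations.items() if abs(r) > THRESHOLD]

for name, r in correlations.items():
    print(f"{name}: r = {r:+.3f}")
print("features to review:", flagged)
```

A flagged feature still needs a domain check: the question is whether its value would really be known at prediction time, or whether it is recorded after (or because of) the outcome.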
