Five Hidden Causes of Data Leakage You Should Be Aware of | by Donato Riccio | Apr, 2023

By Jessie Hobb On Apr 11, 2023

And How They Sabotage Machine Learning Models

Data leakage is a sneaky issue that often plagues machine learning models. The term refers to test data leaking into the training set: the model is trained on information it shouldn’t have access to during training, which leads to overfitting and poor performance on unseen data. It’s like preparing a student for a test using the test answers: they’ll do great on that specific test, but not so well on others. The goal of machine learning is to create models that generalize and make accurate predictions on new, unseen data. Data leakage undermines this goal, so it’s important to be aware of it and guard against it. In this article, we’ll take a closer look at what data leakage is, its potential causes, and ways to prevent it, with practical examples using Python and scikit-learn and cases from research.

Overfitting. One of the most significant consequences of data leakage is overfitting. Overfitting occurs when a model fits the training data so closely that it cannot generalize to new data. When data leakage occurs, the model will show high accuracy on both the training and test sets you used while developing it. However, once the model is deployed, it will not perform as well, because it cannot generalize its classification rules to unseen data.
Misleading Performance Metrics. Data leakage can also result in misleading performance metrics. The model may appear to have high accuracy only because it has seen some of the test data during training. This makes it very difficult to evaluate the model and understand its real performance.

The first case is the simplest one, but probably the most common: preprocessing performed before the train/test split.

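You want to use a StandardScaler to standardize your data, so you load your dataset, standardize it, create a train and test set, and run the model. Right? Wrong. Fitting the scaler on the full dataset means its mean and standard deviation are computed from rows that later end up in the test set, so information about the test data leaks into training.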
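
As a minimal sketch of the difference, the snippet below fits one scaler on the whole dataset (the leaky order) and another on the training split only. The synthetic data from `make_classification`, the split size, and all variable names are assumptions made for illustration; they are not taken from the original article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Leaky order: the scaler is fitted on every row, including rows that
# will later be placed in the test set.
leaky_scaler = StandardScaler().fit(X)

# Safer order: split first...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# ...then fit the scaler on the training rows only and reuse its
# statistics to transform both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The two scalers learn different means, because the leaky one also
# used the test rows.
mean_gap = np.abs(leaky_scaler.mean_ - scaler.mean_).max()
print(f"Largest difference between learned means: {mean_gap:.4f}")
```

On a dataset this simple the gap is small, but the same ordering mistake applies to any fitted preprocessing step, such as imputation or feature selection.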

Figure: Comparison of different split strategies.

The solution: Pipelines
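
The body of this section is missing from the text, so the following is only a sketch of how a scikit-learn pipeline can keep preprocessing inside the training step. The choice of `LogisticRegression`, the synthetic data, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline bundles the scaler with the estimator, so the scaler is
# only ever fitted on the data passed to fit().
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# During cross-validation the whole pipeline is refitted on each training
# fold, so the held-out fold never influences the scaler statistics.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on the training set; the test set is only transformed and scored.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("CV accuracy per fold:", cv_scores.round(3))
print("Held-out test accuracy:", round(test_accuracy, 3))
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform` call that could accidentally be applied to the full dataset.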
