
Five Hidden Causes of Data Leakage You Should Be Aware of | by Donato Riccio | Apr, 2023



Photo by Linh Pham on Unsplash
  • Overfitting. One of the most significant consequences of data leakage is overfitting. Overfitting occurs when a model fits the training data so closely that it cannot generalize to new data. When data leakage occurs, the model shows high accuracy on the train and test sets that you used while developing it. Once the model is deployed, however, it performs worse because it cannot generalize its classification rules to unseen data.
  • Misleading Performance Metrics. Data leakage can also result in misleading performance metrics. The model may appear to have high accuracy because it has seen some of the test data during training. This makes it very difficult to evaluate the model and understand its true performance.
0.745

The solution: Pipelines

0.73
ROC AUC score (baseline): 0.75 +/- 0.01
1    700
0    700
Name: target, dtype: int64

ROC AUC score (with data leakage): 0.84 +/- 0.07

ROC AUC score: 0.67 +/- 0.00
Comparison of different split strategies. Source.
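The figure is not reproduced here. As a rough sketch of two split strategies that are commonly contrasted with a plain random split, the snippet below uses scikit-learn's `TimeSeriesSplit` and `GroupKFold` on toy data. The array sizes and the `subject_ids` variable are hypothetical and chosen only for illustration; they do not come from the article.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Toy data, for illustration only: 12 observations in temporal order.
n_samples = 12
X = np.arange(n_samples).reshape(-1, 1)

# Hypothetical subject labels: 4 subjects with 3 records each.
subject_ids = np.repeat(np.arange(4), 3)

# Time-based split: every training index precedes every test index.
ts_folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in ts_folds:
    print("time-based  train:", train_idx, "test:", test_idx)

# Per-subject split: all records of a subject land on the same side.
group_folds = list(GroupKFold(n_splits=4).split(X, groups=subject_ids))
for train_idx, test_idx in group_folds:
    print("per-subject test subjects:", np.unique(subject_ids[test_idx]))
```

With a random split, by contrast, records from the same subject or from the future can end up in the training set, which is one way leakage can occur.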
Matthews Correlation Coefficient. Image by author.
ChestXNet original paper. First version. Source.
ChestXNet original paper. Most recent version. Source.
  • If your model suddenly starts performing too well after making some changes, it’s always a good idea to check for any data leakage.
  • Avoid preprocessing the entire dataset before splitting it into training and test sets. Instead, use pipelines to encapsulate preprocessing steps.
  • When using cross-validation, be cautious with techniques like oversampling or any other transformation. Apply them only to the training set in each fold to prevent leakage.
  • For time series data, maintain the temporal order of observations and use techniques like time-based splits and time-series cross-validation.
  • In image data or datasets with multiple records from the same subject, use per-subject splits to avoid leakage.
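The advice about pipelines can be sketched as follows. This is a minimal example on synthetic data, not the article's original code: the dataset, the choice of `StandardScaler` and `LogisticRegression`, and the fold settings are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only (not the article's dataset).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# The scaler is a pipeline step, so within each fold it is fitted on the
# training portion only and then applied to the held-out portion.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC score: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Fitting the scaler on the full dataset before splitting would let statistics from the held-out fold influence training. For resampling steps such as oversampling, scikit-learn's `Pipeline` does not accept samplers; the separate imbalanced-learn package provides its own `Pipeline` class intended for that case.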


