
How Data Leakage affects model performance claims | by Georgia Deaconu | Jan, 2023



Image generated by the author using dreamstudio.ai
  1. The improper separation between training and test datasets
  2. The use of features that are not legitimate (proxy variables)
  3. The test set is not drawn from the distribution of interest
  • performing missing-value imputation or scaling before splitting the data into training and test sets. Computing the imputation or scaling parameters (mean, standard deviation, etc.) on the complete data set introduces into the training set information that should not be available to the model during training
  • performing under/oversampling before splitting, which also leads to an improper separation between the training and test sets (oversampled data from the training set would be present in the test set, leading to optimistic conclusions)
  • not removing duplicates from the data set before splitting. In this case, the same records could end up in both the training and test sets after splitting, leading to optimistic evaluation metrics
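The ordering issue described in the bullets above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function names (`dedupe_and_split`, `fit_imputer_scaler`, `transform`) and the toy rows are assumptions made for the example, and only the Python standard library is used. Duplicates are removed first, the data is split next, and the imputation mean and scaling parameters are computed from the training rows only.

```python
import random
import statistics


def dedupe_and_split(rows, test_fraction=0.25, seed=0):
    """Drop exact duplicate rows, then split into (train, test)."""
    # dict.fromkeys keeps the first occurrence of each row and preserves order,
    # so the same record cannot end up on both sides of the split.
    unique_rows = list(dict.fromkeys(rows))
    rng = random.Random(seed)
    rng.shuffle(unique_rows)
    n_test = max(1, int(len(unique_rows) * test_fraction))
    return unique_rows[n_test:], unique_rows[:n_test]


def fit_imputer_scaler(train_rows):
    """Compute the imputation mean and the scaling std from training rows only."""
    observed = [x for x, _ in train_rows if x is not None]
    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed) or 1.0  # avoid dividing by zero
    return mean, std


def transform(rows, mean, std):
    """Impute missing values with the training mean, then standardise."""
    return [(((mean if x is None else x) - mean) / std, y) for x, y in rows]


# Hypothetical toy data: (feature, label) pairs, None marks a missing value.
rows = [(1.0, 0), (2.0, 0), (2.0, 0), (None, 1), (4.0, 1), (5.0, 1), (3.0, 0)]
train, test = dedupe_and_split(rows)
mean, std = fit_imputer_scaler(train)   # parameters come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # test reuses the training parameters
```

The same ordering applies to under/oversampling: resampling would be done on `train` after the split, never on the full data set.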
Correlation between some features and the target variables (Image by the author)
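One rough way to look for illegitimate (proxy) features is to check how strongly each feature correlates with the target, as the figure caption above suggests. The sketch below is an assumption-laden illustration: the feature names, the toy values, and the 0.95 threshold are invented for the example, and a near-perfect correlation is only a hint that a feature deserves a closer look, not proof of leakage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (norm_x * norm_y)


def flag_suspicious_features(features, target, threshold=0.95):
    """Return names of features whose |correlation| with the target is very high."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]


# Hypothetical data: "treatment_code" mirrors the target, so it gets flagged.
features = {
    "age": [30, 45, 50, 62, 38, 55],
    "treatment_code": [0, 1, 1, 1, 0, 0],
}
target = [0, 1, 1, 1, 0, 0]
suspicious = flag_suspicious_features(features, target)
```

Whether a flagged feature is actually illegitimate still has to be decided with domain knowledge, for instance by asking whether its value would be known at prediction time.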
  • Temporal leakage: if a model is used to make predictions about the future, the test set should not contain any data that pre-dates the training set (otherwise the model would be built on data from the future)
  • Non-independence between train and test samples: this problem arises more often in the medical domain, where several samples are collected from the same patients over some period of time. This issue can be handled with specific methods such as block cross-validation, but it is a difficult problem in the general case, since not all the underlying dependencies in the data might be known
  • Sampling bias: choosing a non-representative subset of the dataset for evaluation. An example of such bias would be choosing only cases of extreme depression to evaluate an antidepressant drug and then making claims about the drug’s effectiveness for treating depression in general
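The first two points above translate into two simple splitting rules, sketched here under stated assumptions: the record fields (`time`, `patient`) and the function names are hypothetical, and the group-wise hold-out shown is a simpler stand-in for block cross-validation rather than an implementation of it.

```python
import random


def temporal_split(records, cutoff):
    """Train on records strictly before `cutoff`, test on records at or after it."""
    train = [r for r in records if r["time"] < cutoff]
    test = [r for r in records if r["time"] >= cutoff]
    return train, test


def group_split(records, test_fraction=0.3, seed=0):
    """Hold out whole patients so no patient appears in both train and test."""
    patients = sorted({r["patient"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [r for r in records if r["patient"] not in test_patients]
    test = [r for r in records if r["patient"] in test_patients]
    return train, test


# Hypothetical records: several measurements per patient over time.
records = [
    {"patient": "A", "time": 1, "value": 0.2},
    {"patient": "A", "time": 4, "value": 0.3},
    {"patient": "B", "time": 2, "value": 0.9},
    {"patient": "B", "time": 5, "value": 0.8},
    {"patient": "C", "time": 3, "value": 0.5},
    {"patient": "C", "time": 6, "value": 0.4},
]
past, future = temporal_split(records, cutoff=4)
train, test = group_split(records)
```

Neither rule addresses sampling bias: the held-out records still have to be representative of the population the claims are about.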


