Data Leakage

Data leakage occurs when information from outside the training dataset is used to create the model. The model then scores well on the training set, and possibly on the validation set too, but performs poorly in production.

Example

Data that should be held out (such as the test dataset) is used in training as well as in model evaluation. This inflates the prediction score and gives a false sense of the model's ability.


Types:

  • Target leakage: occurs when the predictors (X) include data that will not be available at prediction time.
  • Train-test contamination: occurs when training data and validation data are mixed.
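
Train-test contamination can be illustrated with a small sketch. It is a toy setup, not a recipe: the data is synthetic, the labels are random noise by construction, and the 1-nearest-neighbor "model" and the helper names (nearest_neighbor_predict, accuracy) are assumptions made for this example. Since the labels carry no signal, an honest evaluation should land near chance, while letting the test rows into the training set makes the score look perfect.

```python
import random


def nearest_neighbor_predict(train_rows, x):
    """Return the label of the training row whose feature is closest to x."""
    nearest = min(train_rows, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train_rows, eval_rows):
    """Fraction of eval rows whose label the 1-NN model predicts correctly."""
    hits = sum(nearest_neighbor_predict(train_rows, x) == y for x, y in eval_rows)
    return hits / len(eval_rows)


rng = random.Random(0)

# Synthetic rows of (feature, label). The labels are random, so there is
# nothing real for any model to learn from the feature.
rows = [(rng.random(), rng.randint(0, 1)) for _ in range(300)]
train_rows, test_rows = rows[:200], rows[200:]

# Clean evaluation: the model never sees the test rows.
clean_acc = accuracy(train_rows, test_rows)

# Contaminated evaluation: the test rows are mixed into the training data,
# so each test row finds itself as its own nearest neighbor.
leaky_acc = accuracy(train_rows + test_rows, test_rows)

print(f"clean accuracy: {clean_acc:.2f}")
print(f"leaky accuracy: {leaky_acc:.2f}")
```

The contaminated score is perfect only because the model has memorized the rows it is being tested on. The clean score, which stays close to chance, is the one that reflects what would happen in production.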