Avoiding data leakage in preprocessing

Question

I'm a data science newbie and a bit confused about the following:
I usually do the preprocessing on all predictors of a dataset, meaning I create X by concatenating X_train and X_test. (Imagine a competition where you download test and training data separately.)

After the preprocessing, I use scikit-learn's train_test_split to split the data into train and test data.

I was wondering if doing the preprocessing on X altogether can lead to train-test contamination or target leakage.

I saw that you shouldn't do, for example,

X_valid = imputer.fit_transform(X_valid)

and that

X_train = imputer.fit_transform(X_train)
X_valid = imputer.transform(X_valid)

is a better option.

score 3 · Accepted Answer · answered Apr 23 '21 at 22:23

Yes, it will. If you do something like feature selection on the whole dataset, you will be overfitting to the test set, and your results will look better than they really are.

If you do some preprocessing that is not informed by the target variable, e.g., scaling of variables, then you are not really leaking information about the target. However, you are creating a dependence between the training set and the test set, which also creates problems that are often unintuitive and hard to explain, so it's better to avoid it. In any case, strictly speaking, you are then not estimating how well your model would perform on unseen data, since truly unseen data would not be included in your whole-set preprocessing.
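
As a rough illustration of the above, here is a minimal sketch of the split-first approach, assuming scikit-learn's SimpleImputer and StandardScaler as the preprocessing steps. The toy data and all variable names are made up for the example and are not from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 100 rows, 3 features, roughly 10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=100)

# Split first, so nothing is learned from the held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()

# fit_transform: learn the parameters on the training set and apply them.
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))

# transform only: reuse the training-set parameters on the test set.
X_test_prep = scaler.transform(imputer.transform(X_test))
```

The imputation means and the scaling parameters come from X_train alone, so X_test plays the role of truly unseen data.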

rep_ho


Therefore one would have to preprocess both separately? Let's say I go through all the steps manually for the training data... would it be possible then, before doing the train_test_split, to apply the same code for preprocessing to the test data? Thank you for your help! – LeLuc Apr 24 '21 at 08:16
You would just use .fit on the training set and then .transform on the test set (I think that's how it works in sklearn, I am not sure). So the parameters used for the preprocessing are learned on the training set and only applied to the test set. Then, if you just want to apply your model to a completely new data point, you will also just apply .transform. – rep_ho Apr 24 '21 at 08:50
Sure, but I mean if you don't use a pipeline and do it manually instead, with maybe 80 lines of code or so... then you would apply all that code to the test set, I guess?
As of now, I'm not too familiar with the sklearn pipeline and the .fit / .fit_transform methods, so I do it manually.
– LeLuc Apr 24 '21 at 09:53
So, for example, say you want to center and scale your variables as part of preprocessing, i.e., subtract the mean and divide by the standard deviation. The correct way would be to learn the mean and standard deviation from the training set, and then use those same values to center and scale both the training set data and the test set data. You don't want to subtract a different mean from the training and the test set, or to learn the mean from the whole dataset. – rep_ho Apr 24 '21 at 11:06
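
For the manual route discussed in the comments, the centering and scaling example could look roughly like the sketch below. It uses plain NumPy with made-up toy arrays; the variable names are illustrative assumptions, not code from the thread.

```python
import numpy as np

# Hypothetical toy data standing in for an existing train/test split.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(80, 2))
X_test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

# Learn the parameters from the training set only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# Apply the same training-set parameters to both sets.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std
```

The scaled test set will generally not have mean exactly 0 or standard deviation exactly 1, which is expected: it is transformed with the training-set values, the same way any new data point would be.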
