I'm a data science newbie and a bit confused about the following:
I usually do the preprocessing on all predictors of a dataset, meaning
I create X by concatenating X_train and X_test.
(Imagine a competition where you download test and training data separately.)
After the preprocessing, I use scikit-learn's train_test_split to split the data into train and test data.
I was wondering if doing the preprocessing on X altogether can lead to train-test contamination or target leakage.
I saw that you shouldn't do
X_valid = imputer.fit_transform(X_valid)
for example, and that
X_train = imputer.fit_transform(X_train)
X_valid = imputer.transform(X_valid)
is a better option.
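To make the difference concrete, here is a minimal sketch of the split-first approach. The toy array, the `SimpleImputer` with a mean strategy, and the variable names `X_train_imp`, `X_valid_imp`, `train_mean` and `pooled_mean` are assumptions for illustration only; my real data and imputer may differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: a single feature with two missing values.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])
y = np.array([0, 0, 0, 1, 1, 1])

# Split first, so the validation rows are not seen while fitting.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.33, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # learns the mean from the training rows only
X_valid_imp = imputer.transform(X_valid)      # reuses the training mean, no refitting

# For comparison: the mean learned when fitting on all rows at once.
# Depending on the split, it can differ from the training-only mean,
# which is the kind of contamination I am asking about.
train_mean = imputer.statistics_[0]
pooled_mean = SimpleImputer(strategy="mean").fit(X).statistics_[0]
```

In this sketch the only thing learned from the data is the column mean, so the leak would be small, but the same pattern applies to scalers, encoders and feature selection.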
train_test_split, to apply the same code for preprocessing to the test data? Thank you for your help! – LeLuc Apr 24 '21 at 08:16
As of now, I'm not too familiar with the sklearn pipeline and .fit/.fit_transform methods, so I do it manually. – LeLuc Apr 24 '21 at 09:53
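For reference, a minimal sketch of what the sklearn pipeline route might look like, as opposed to doing it manually. The random data, the choice of `SimpleImputer`, `StandardScaler` and `LogisticRegression`, and the names `model` and `valid_pred` are all assumptions for illustration, not a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 40 rows, 3 numeric features, roughly 10% missing entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The pipeline calls fit_transform on each preprocessing step during fit,
# and only transform during predict, so nothing is refit on validation data.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X_train, y_train)
valid_pred = model.predict(X_valid)
```

The same fitted `model` could then be applied to a separately downloaded test set with `model.predict`, which reuses the statistics learned from the training rows.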