I have a fairly common situation: I'd like to split my data into folds and cross-validate across them, but I have missing values throughout my data. I know I can create a Pipeline containing a SimpleImputer followed by my estimator, and that this will impute missing values before training.

But what about validation? Does the Pipeline impute the missing values in the validation set using the values it determined earlier from the training set? I've been combing the code and documentation for a while, and I'm honestly still not sure it does the right thing.

According to Brownlee (https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/), "Running the example correctly applies data imputation to each fold of the cross-validation procedure", but I'm not sure what that means in practice.
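For concreteness, here is a minimal sketch of the kind of setup I mean. The data is synthetic, and the choice of Ridge, the mean strategy, and 5 folds are placeholders of mine, not anything taken from Brownlee's example. My understanding is that cross_val_score clones the Pipeline and refits it on the training portion of each split, so each fold should get its own imputer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Knock out roughly 10% of the feature values to simulate missing data.
X[rng.random(X.shape) < 0.1] = np.nan

# Imputer and estimator live in one Pipeline, so they are fit together.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", Ridge()),
])

# The whole Pipeline is passed to cross-validation; the raw X with NaNs goes in.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print(scores)
```

The point of passing the Pipeline itself, rather than imputing X up front, is that the imputation statistics would otherwise be computed from all rows, including the ones later used for validation.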

My instinct is that during training I should call fit on the Pipeline, which should make the SimpleImputer compute and store the imputation values I need. Then, when I call predict on the Pipeline, the transform method of every step except the final estimator gets called, which fills the data with those stored values. But it frustrates me that this intuition isn't immediately confirmed by the docs and code.
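One way to test this intuition empirically is a small experiment like the sketch below. The toy arrays and the step names "impute" and "model" are made up for illustration. The validation set is chosen so that its own mean (100) is far from the training mean (3), which makes it obvious which value the imputer fills in.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data: a single feature with one missing training value.
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
y_train = np.array([1.0, 3.0, 3.0, 5.0])

# Validation data whose own mean (100) differs sharply from the training mean.
X_val = np.array([[np.nan], [100.0]])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

# statistics_ holds the per-column fill values learned during fit.
imputer = pipe.named_steps["impute"]
train_mean = imputer.statistics_[0]

# transform reuses the stored value; it does not refit on X_val.
X_val_imputed = imputer.transform(X_val)

# predict on the Pipeline should match predicting on the imputed data directly.
pred_pipeline = pipe.predict(X_val)
pred_manual = pipe.named_steps["model"].predict(X_val_imputed)

print(train_mean, X_val_imputed.ravel(), pred_pipeline, pred_manual)
```

If the missing validation entry comes back as the training mean and the two sets of predictions agree, that supports the reading that predict only calls transform on the earlier steps and never refits them.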

Pavel Komarov