With reference to this post on feature scaling, and many tutorials out there, it is mentioned that we should avoid data snooping by fitting the feature scaling on the train set only (i.e. computing the mean and std from the train set) and then using those statistics to transform the test set.
I understand this idea, but when extending it to, say, K-fold cross validation (K=5), how do we then determine the mean and std to use for our final test set?
My thinking was as follows:
- Split X, y into two sets, X_train and X_test (note there is no separate validation set here, since we will be splitting X_train into 5 folds).
- X_train is further split into X1, X2, X3, X4, X5 (for simplicity, I dropped the train suffix).
- We will train the model five times: for example, use X2-X5 as the training folds and evaluate on X1; then use X1-X4 as the training folds and evaluate on X5; and so on.
- The confusion arises here: during training for each fold, we should fit the feature scaling on only the 4 training folds, and use the mean and std from those 4 folds to transform the remaining validation fold (see the sketch after this list). But this means we end up with 5 such mean/std pairs, since K=5.
- How do we decide which mean and std to use for the final prediction on the test set? Logically, would we just choose the mean and std of the best-performing fold (out of the 5 folds)? This leads to the final question.
- Don't people typically average the 5 folds' predictions at inference time, so do we still take the best-performing fold's mean and std?
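To make the per-fold scaling concrete, here is a minimal sketch using scikit-learn's StandardScaler and KFold. The logistic regression model and the synthetic data are placeholders I am assuming for illustration; the question does not specify a model:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for X_train, y_train from the initial split.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = rng.integers(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, val_idx in kf.split(X_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]

    # Fit the scaler on the 4 training folds only...
    scaler = StandardScaler().fit(X_tr)
    # ...and reuse that fold's mean/std to transform the held-out fold.
    model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
    fold_scores.append(model.score(scaler.transform(X_val), y_val))

print("mean CV accuracy:", np.mean(fold_scores))
```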
Here you are using cross validation for hyperparameter tuning; after you choose the hyperparameters, you use this set of "ideal" hyperparameters to retrain on your whole train set (and this time you use the mean and std of the whole train set to transform the test set).
– nan Sep 24 '21 at 14:29
Thanks a lot, I understand the idea now.
That’s why people like me should never jump the gun and learn deep learning without a solid foundation in classical ML.
– nan Sep 24 '21 at 14:48
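For completeness, a minimal sketch of what the comment describes: once cross validation has selected the hyperparameters, the per-fold scalers are discarded, and both the scaler and the model are refit on the entire training set before scoring the test set. Here best_C and the synthetic data are hypothetical placeholders, not values from the thread:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder data and split; in practice X_train/X_test come from your own split.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 3)), rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

best_C = 1.0  # hypothetical value chosen by the 5-fold CV above

# The per-fold scalers are discarded; refit on the whole training set.
final_scaler = StandardScaler().fit(X_train)  # mean/std of the full train set
final_model = LogisticRegression(C=best_C).fit(
    final_scaler.transform(X_train), y_train
)

# The test set is transformed with the training set's statistics only.
print("test accuracy:", final_model.score(final_scaler.transform(X_test), y_test))
```

In scikit-learn, wrapping the scaler and model in a Pipeline and passing it to GridSearchCV gives the same behaviour automatically: each fold refits the scaler on its training folds only, and the final refit uses the whole training set.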