With reference to this post on feature scaling, and many tutorials out there, it is mentioned that we should avoid data snooping by fitting the feature scaling on the train set only (i.e. computing the mean and std from the train set) and then using those statistics to transform the test set.
I understand this idea, but when extending it to, say, K-fold cross validation (K=5), how do we then determine the mean and std to use for our final test set?
My thinking was as follows:
- Split X, y into two sets, X_train and X_test (note there is no separate validation set here, since we will be splitting X_train into 5 folds).
- X_train is further split into X1, X2, X3, X4, X5 (for simplicity, I dropped the train suffix).
- We will train the model five times: for example, use X2-X5 as the training folds and evaluate on X1; then use X1-X4 as the training folds and evaluate on X5; and so on.
- The confusion arises here: during training for each fold, we should fit the feature scaling on only the 4 training folds, and use the mean and std from those 4 folds to transform the remaining validation fold (see the sketch after this list). But this means we end up with 5 such mean/std pairs, since K=5.
- How do we decide which mean and std to use for the final prediction on the test set? Logically, would we just choose the mean and std of the best-performing fold (out of the 5 folds)? This leads to the final question.
- Don't people typically average the 5 folds' predictions at inference time, so do we still take the best-performing fold's mean and std?
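To make the per-fold scaling concrete, here is a minimal sketch using scikit-learn's StandardScaler and KFold. The logistic regression model and the synthetic data are placeholders I am assuming for illustration; the question does not specify a model:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for X_train, y_train from the initial split.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = rng.integers(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, val_idx in kf.split(X_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]

    # Fit the scaler on the 4 training folds only...
    scaler = StandardScaler().fit(X_tr)
    # ...and reuse that fold's mean/std to transform the held-out fold.
    model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
    fold_scores.append(model.score(scaler.transform(X_val), y_val))

print("mean CV accuracy:", np.mean(fold_scores))
```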
Here you are using cross validation for hyperparameter tuning; after you choose the hyperparameters, you use this set of "ideal" hyperparameters to retrain on your whole train set (and this time you use the mean and std of the whole train set to transform the test set).
– nan Sep 24 '21 at 14:29
Thanks a lot, I understand the idea now.
That’s why people like me should never jump the gun and learn deep learning without a solid foundation in classical ML.
– nan Sep 24 '21 at 14:48
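For completeness, a minimal sketch of what the comment describes: once cross validation has selected the hyperparameters, the per-fold scalers are discarded, and both the scaler and the model are refit on the entire training set before scoring the test set. Here best_C and the synthetic data are hypothetical placeholders, not values from the thread:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder data and split; in practice X_train/X_test come from your own split.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 3)), rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

best_C = 1.0  # hypothetical value chosen by the 5-fold CV above

# The per-fold scalers are discarded; refit on the whole training set.
final_scaler = StandardScaler().fit(X_train)  # mean/std of the full train set
final_model = LogisticRegression(C=best_C).fit(
    final_scaler.transform(X_train), y_train
)

# The test set is transformed with the training set's statistics only.
print("test accuracy:", final_model.score(final_scaler.transform(X_test), y_test))
```

In scikit-learn, wrapping the scaler and model in a Pipeline and passing it to GridSearchCV gives the same behaviour automatically: each fold refits the scaler on its training folds only, and the final refit uses the whole training set.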