My textbook states that k-fold cross-validation is a resampling technique that is useful for estimating generalization error in a data-poor setting.
Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. Since data are often scarce, this is usually not possible. To finesse the problem, K-fold cross-validation uses part of the available data to fit the model, and a different part to test it. (Hastie, The Elements of Statistical Learning, Section 7.10, page 241.)
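To make sure I'm reading this right, here is a small self-contained illustration of plain K-fold CV; the digits dataset and the random forest here are just stand-ins for illustration, not my actual project:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Each of the 5 folds takes a turn as the held-out "test" part while the
# remaining 4 folds are used to fit the model, exactly as the textbook says.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores)          # one accuracy estimate per fold
print(scores.mean())   # averaged estimate of generalization performance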
For a machine learning project, I used scikit-learn's GridSearchCV to find good hyper-parameters for my random forest. The scikit-learn docs recommend exactly this:
It is possible and recommended to search the hyper-parameter space for the best cross validation score. ... Two generic approaches to parameter search are provided in scikit-learn: for given values, GridSearchCV exhaustively considers all parameter combinations.
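Concretely, my setup looked roughly like the sketch below. The synthetic data and the parameter grid are placeholders (in the real project the data were Fashion-MNIST and the grid was different), but the structure is the same:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data; in my project this was Fashion-MNIST.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=20,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Illustrative grid; my real grid covered different hyper-parameters.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validation is run on the training set for every combination.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("CV score:   ", search.best_score_)
print("test score: ", search.score(X_test, y_test))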
However, my professor disagreed: he asked why I used cross-validation in a data-rich setting (Fashion-MNIST has 60,000 training images and 10,000 test images).
Was grid-search cross-validation inappropriate for a data-rich problem? I don't see how to reconcile my textbook and the scikit-learn documentation with my professor's objection, and his feedback was not particularly instructive.