Let's say I'm comparing 60 different combinations of model hyperparameter values using 10-fold cross-validation. It's tempting to simply select the combination whose mean accuracy across the folds is highest. However, should one also make use of the standard deviation of the accuracies when deciding on the best combination? If so, is there a particular rule of thumb (e.g. go with the combination that has the highest mean accuracy among the better half in terms of standard deviation)?
2 Answers
Sort of. There is the so-called "one standard error rule," which does use the standard deviation of the prediction error estimates, although not in quite the way you mentioned: instead you divide the standard deviation by the square root of the number of estimates to form the standard error of the mean estimate.
The one standard error rule says: pick the simplest model whose mean estimated prediction error is within 1 standard error of the best-performing model's estimated prediction error. In practice, the "simplest model" usually means "the most strongly regularized model." And of course, the "best-performing model" is the one with the lowest mean estimated prediction error of all models tested.
Stated a bit more plainly, the rule says that we want to pick the simplest model that still does essentially as good a job as the best-looking model; the best-looking model could be far more complicated while offering only a marginal increase in performance.
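As a rough illustration, the rule can be sketched in a few lines of Python. This is only a sketch under some assumptions: the function name `one_standard_error_rule` and the `complexity` scores are hypothetical, the scores are accuracies (higher is better, so "within 1 standard error" means no more than one standard error below the best mean), and the standard error used is that of the best-performing model.

```python
from math import sqrt
from statistics import mean, stdev


def one_standard_error_rule(fold_scores, complexity):
    """Return the index of the simplest model within 1 SE of the best.

    fold_scores: one list of per-fold accuracies per model (higher is better).
    complexity:  one number per model; smaller means simpler (hypothetical
                 scoring, e.g. the inverse of the regularization strength).
    """
    means = [mean(scores) for scores in fold_scores]

    # Best-performing model: highest mean accuracy across folds.
    best = max(range(len(means)), key=means.__getitem__)

    # Standard error of the best model's mean: sample sd / sqrt(n folds).
    n_folds = len(fold_scores[best])
    se_best = stdev(fold_scores[best]) / sqrt(n_folds)

    # Every model whose mean is within one standard error of the best mean.
    threshold = means[best] - se_best
    candidates = [i for i, m in enumerate(means) if m >= threshold]

    # Among those, take the simplest one.
    return min(candidates, key=lambda i: complexity[i])
```

If you work with prediction error instead of accuracy, flip the comparison: take the model with the lowest mean error as the best one and keep every model whose mean error is at most the best mean plus one standard error.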
In addition to @jake-westfall's answer, which gives you an easily applied rule of thumb:
The variance you observe between the accuracies/errors of the different folds is composed of (at least) two parts: some variance due to the limited number of cases tested with each surrogate model, and some variance due to differences between the surrogate models themselves (instability). The latter is the variance you want to trade off against bias with your regularization, while the former hampers your ability to detect improvements. So, for a second look, I recommend checking that the variance due to the number of tested cases is low enough to sensibly allow the comparison.
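One way to get a feeling for the first component is a binomial approximation. The sketch below assumes that accuracy is a simple proportion of independently tested cases; `binomial_se` is a hypothetical helper name, not something from a library.

```python
from math import sqrt


def binomial_se(accuracy, n_test):
    """Approximate standard deviation of an accuracy estimate that is
    based on n_test independent test cases (binomial assumption)."""
    return sqrt(accuracy * (1.0 - accuracy) / n_test)


# Example: a fold with 50 test cases and a true accuracy around 0.9.
# The result is roughly 0.042, i.e. about 4 percentage points of spread
# from the finite test set alone.
print(binomial_se(0.9, 50))
```

If the differences in mean accuracy between your hyperparameter combinations are small compared to this number, the finite test sample alone may be enough to explain them.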
glmnet is using? – amoeba Feb 02 '18 at 10:54
sd(x)/sqrt(length(x)) in R. – Jake Westfall Feb 02 '18 at 14:23