
Say we have a model and a hyper-parameter with L candidate values, and our goal is model selection. K-fold CV outputs L accuracies (each accuracy is an average over the K folds). The best model corresponds to the highest accuracy.
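For concreteness, here is a minimal sketch of the procedure I have in mind (the estimator, data and hyper-parameter grid are just placeholders; I'm using scikit-learn and an SVM's C as the hyper-parameter purely for illustration):

```python
# Minimal sketch of the selection procedure described above (placeholders only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

C_grid = [0.01, 0.1, 1, 10, 100]   # the L candidate hyper-parameter values
K = 5                              # number of CV folds

# One mean accuracy per candidate value (each mean is over the K folds)
mean_acc = [cross_val_score(SVC(C=C), X, y, cv=K, scoring="accuracy").mean()
            for C in C_grid]

# "Best" model = highest mean accuracy, with no significance test at all
best_C = C_grid[int(np.argmax(mean_acc))]
```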

When comparing the L accuracies in order to select the best model, why don't we do any significance testing?

My guess is that there is no cost in wrongly rejecting the null hypothesis (i.e. saying one model is better than the other when they are actually equivalent), therefore there is no harm in selecting the model with the highest accuracy. The worst that can happen is that all L models are equivalent, but even in that case we still want to select one of them, so there is no harm in picking the one with the highest accuracy (it is as arbitrary a choice as any other).

usual me
    It makes sense to do significance testing but it isn't trivial, see for example https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/ – Christian Hennig Sep 25 '22 at 10:17

2 Answers


There isn't usually a genuine statistical distinction between hyper-parameters and parameters; it is normally just a matter of computational convenience. There is a computationally efficient method for optimising the parameters for fixed values of the hyper-parameters, so it is sensible to use it. We call the parameters for which there is no computationally efficient algorithm "hyper-parameters" and tune them by cross-validation instead, which is often computationally expensive. So the main reason we don't need statistical hypothesis tests for hyper-parameters is that we don't usually need them for parameters either.

However, it is possible to over-fit the cross-validation error when tuning the hyper-parameters, just as you can over-fit the training data when optimising the parameters. This can result in a model that under-fits the training data as well as one that over-fits it. There are a number of practical solutions: one is to regularise the hyper-parameters; another is to tune the hyper-parameters and then backtrack to the point where the cross-validation performance is no longer statistically distinguishable from that of the best set of hyper-parameters (a sort of "early stopping"), so you can use NHSTs in model selection if you want to.
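A minimal sketch of that backtracking idea, in the spirit of the common "one-standard-error" heuristic rather than any exact procedure from this answer (scikit-learn, an SVM's C and the grid below are assumptions made purely for illustration):

```python
# Hedged sketch: instead of the hyper-parameter value with the highest mean CV
# accuracy, pick the most regularised value whose mean accuracy is within one
# standard error of the best (a crude stand-in for "not statistically
# distinguishable from the best").
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
C_grid = [0.01, 0.1, 1, 10, 100]   # ordered from most to least regularised
K = 5

scores = [cross_val_score(SVC(C=C), X, y, cv=K, scoring="accuracy") for C in C_grid]
means = np.array([s.mean() for s in scores])
sems = np.array([s.std(ddof=1) / np.sqrt(K) for s in scores])

best = int(np.argmax(means))
threshold = means[best] - sems[best]

# Backtrack: earliest (most regularised) candidate not clearly worse than the best
chosen_C = C_grid[next(i for i, m in enumerate(means) if m >= threshold)]
```

Note the fold scores are not independent (the training sets overlap), so this is only a rough proxy for a proper significance test.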

Dikran Marsupial
    Not sure I completely follow the argument: in classical statistics, when we decide e.g. whether to drop or include another factor, we effectively also choose between models. We do employ hypothesis testing there, and for good reasons. I'd say we have two extremes: your view, that hyper-parameter fitting is like parameter fitting, and the other, model selection that accounts for random uncertainty. Hyper-parameter fitting is somewhere in between. Will try to add an answer later – cbeleites unhappy with SX Sep 26 '22 at 08:54
  • @cbeleitesunhappywithSX yes, that is a good point: when performing feature selection, we are usually doing it in order to understand the data rather than just to tune the performance of the model (choosing features often makes that worse rather than better), so it depends on why you are performing model selection. The question was posed as model selection involving hyper-parameters, rather than architecture selection, which perhaps guided my answer to be overly specific to that. Looking forward to your answer! – Dikran Marsupial Sep 26 '22 at 09:26

Summary: IMHO, it would be good to do significance testing for model selection. (But as @Christian Hennig already pointed out, it's actually not so easy to do)

Or rather: we'd need to get a far more realistic idea of the random uncertainty our hyperparameter tuning is subject to.
Whether that takes the shape of significance testing, Bayesian model averaging, more sophisticated cross validation, any other form, or a combination thereof, I don't care.
But I do think we have, overall, a serious problem with overfitting (which I think also has strong links to the reproducibility crisis in the life sciences).
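One rough way to see that uncertainty (an illustration of the point, not a full solution; the scikit-learn estimator and grid below are placeholder assumptions): repeat the entire CV-based selection with different fold splits and check how often each candidate actually "wins".

```python
# Repeat the whole hyper-parameter selection over many different CV splits and
# count the winners. If the winner changes a lot between repetitions, the
# selection is largely driven by random uncertainty.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
C_grid = [0.01, 0.1, 1, 10, 100]

winners = Counter()
for rep in range(20):                                  # 20 different 5-fold splits
    cv = KFold(n_splits=5, shuffle=True, random_state=rep)
    means = [cross_val_score(SVC(C=C), X, y, cv=cv).mean() for C in C_grid]
    winners[C_grid[int(np.argmax(means))]] += 1

print(winners)   # an unstable "best" C is a warning sign
```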


@Dikran Marsupial points out that hyperparameters are not inherently different from other parameters of a model. The distinction is mainly a question of implementation convenience, leaving parameters that are difficult to fit correctly in the general case to the care of the user.

This is also basically the approach that has been used for many decades in classical statistics: e.g. for a linear model, we know how to calculate a least-squares fit on given data. It is the genuine task of the person modelling the data to specify the model terms, including model complexity, feature selection (which variables to include/drop) and feature construction (e.g. polynomial terms).
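A toy example of that division of labour (my own illustration, not taken from any particular textbook): the analyst decides that the model contains an intercept, a linear and a quadratic term, and the fitting itself is then plain least squares.

```python
# The analyst specifies the model terms; the parameter fit is routine.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.3, size=50)

# Model specification (the modeller's job): intercept, x and x^2
X_design = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares fit (the routine part)
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)   # estimated intercept, linear and quadratic coefficients
```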

to be continued