Overfitting can set in during model selection itself: when too many candidate models are tried, or when they grow too complex for the small dataset at hand, the models begin to fit the noise rather than the real pattern. That is the warning sign to watch for.
It can be hard to know how many candidate models to try. One practical answer is early stopping combined with monitoring validation scores: keep trying new models, and halt when the validation scores stop improving.
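To make that concrete, here is a minimal sketch of the idea, assuming scikit-learn, a synthetic dataset, decision trees of increasing depth as the candidate models, and a patience of three attempts. All of these choices are illustrative, not prescriptions:

```python
# Early stopping applied to model selection: try progressively more
# complex models and stop once the validation score has not improved
# for `patience` consecutive attempts.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real (small) labeled dataset.
X, y = make_classification(n_samples=800, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

best_score, best_model = 0.0, None
patience, stalled = 3, 0

for depth in range(1, 30):  # each step is a more complex candidate model
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation accuracy
    if score > best_score:
        best_score, best_model, stalled = score, model, 0
    else:
        stalled += 1
        if stalled >= patience:  # validation score stopped improving
            break

print(f"Selected depth={best_model.get_depth()}, val accuracy={best_score:.3f}")
```

Here a "model" is just a tree of a given depth; in practice each iteration might be a different architecture or hyperparameter setting, but the stopping logic is the same.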
Always aim for the simplest solution that works. Overfitting is especially likely on datasets with fewer than about 1,000 labeled examples, so simpler models are recommended there. Occam's razor is a dependable default, even if it is no guarantee.
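As a rough illustration of that heuristic, the sketch below compares a simple linear model against an unconstrained decision tree on a small synthetic dataset using cross-validation. The dataset size, the two model choices, and the five-fold split are assumptions for demonstration only:

```python
# Small-data heuristic: with a few hundred labels, compare a simple
# model against a high-capacity one and let cross-validation decide.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 300 examples, 50 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)

candidates = [
    ("logistic regression (simple)", LogisticRegression(max_iter=1000)),
    ("unconstrained tree (complex)", DecisionTreeClassifier(random_state=0)),
]

for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

On data this small, the simpler model will typically hold up as well as or better than the high-capacity one, which is exactly the point of reaching for it first.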
None of this is a perfect science, after all: you try things, gain insight, and adjust as you go.