There is a theoretical justification for selecting a model with the best test set performance while also guaranteeing something about the selection procedure's performance. You just need a lot of test data and/or a higher tolerance for poorly estimating error. As a reference, see the relevant slides (mainly slide 52) in this deck. If you're familiar w/ statistics, this analysis is similar to applying the Bonferroni correction to confidence intervals.
Selection procedure
- $m$ models are trained and tuned on the training set.
- Each of these models is evaluated on the test set.
- The model with the lowest test set error (i.e., the best test set performance) is selected.
Definitions
- $n$ is the number of test set examples.
- $m$ is the number of models trained on the training set.
- $Y_i$ is the response/label/outcome r.v. for test set example $i$.
- $\hat{Y}_{ij}$ is model $j$'s prediction r.v. for test set example $i$. Assume, as is usual, that model $j$ was trained, tuned, etc. on data independent of all $Y_i$'s.
- $\text{err}$ is an error function. For classification problems, $\text{err}(Y, \hat{Y}) = I(Y \neq \hat{Y})$ is the 0-1 (misclassification) error, and $\text{err}(Y, \hat{Y}) = (Y - \hat{Y})^2$ is squared error (which also works for classification). For this analysis, let's assume $\text{err}$'s range is $[0,1]$.
- $Z_{ij} = \text{err}(Y_i, \hat{Y}_{ij})$ is an r.v. for the error of model $j$ on test set example $i$. A consequence of training on data independent of the test data (whose examples are themselves drawn independently) is that, for each $j$ and all $i \neq i'$, $Z_{ij} \perp Z_{i'j}$.
- $\bar{Z}_j = \frac{1}{n} \sum_{i=1}^{n} Z_{ij}$ is an estimator of model $j$'s error.
- $\mu_j = \text{E}(\bar{Z}_j)$ is the expected or "true" error of model $j$.
- $\epsilon$ is the largest acceptable difference (the tolerance) between a model's estimated error and its true error. This quantity must be specified in advance for the analysis to have any value.
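To make the notation concrete, here's a minimal R sketch of how the $\bar{Z}_j$'s would be computed and the model selected. The objects y_test (the $n$ test labels) and preds (an $n \times m$ matrix whose column $j$ holds model $j$'s test set predictions) are hypothetical placeholders:
# y_test and preds are hypothetical placeholders (see above).
# 0/1 error: Z[i, j] = 1 if y_test[i] != preds[i, j], else 0.
Z = (y_test != preds) * 1
z_bar = colMeans(Z)        # test set error estimate for each model j
j_star = which.min(z_bar)  # selected model: lowest estimated error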
Performance guarantee
The procedure above is run once. The model $j^* = \text{argmin}_j \bar{Z}_j$ is selected. A quantity amenable to analysis is the probability that $j^*$'s error estimator misses $j^*$'s true error by more than $\epsilon$, i.e., $p := \Pr(|\bar{Z}_{j^*} - \mu_{j^*}| > \epsilon)$.
To break $p$ down, first note that the event $|\bar{Z}_{j^*} - \mu_{j^*}| > \epsilon$ is a subset of the event $\bigcup_{j=1}^{m} |\bar{Z}_{j} - \mu_{j}| > \epsilon$ (since $j^*$ is one of the $j$'s). Using this fact, we can bound $p$ by:
$$
\begin{align}
p &= \Pr(|\bar{Z}_{j^*} - \mu_{j^*} | > \epsilon) \\
&\leq \Pr \Bigg( \bigcup_{j=1}^{m} |\bar{Z}_{j} - \mu_{j} | > \epsilon \Bigg) && \text{monotonicity} \\
&\leq \sum_{j=1}^{m} \Pr(|\bar{Z}_{j} - \mu_{j} | > \epsilon) && \text{union bound} \\
&\leq \sum_{j=1}^{m} 2\exp(-2 \epsilon^2 n) && \text{Hoeffding's inequality} \\
&= 2m\exp(-2 \epsilon^2 n).
\end{align}
$$
From now on, I'll refer to this upper bound on $p$ as $p'$.
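In code, the bound is a one-liner (the function name hoeffding_bound is mine):
# p' = 2*m*exp(-2*eps^2*n): upper bound on the probability that the selected
# model's test set error estimate misses its true error by more than eps.
hoeffding_bound = function(m, eps, n) {
  2 * m * exp(-2 * eps^2 * n)
}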
Interpretation
Wrong: I have a dataset. I ran the procedure and computed the minimum test error estimate (0.69 in your case). The probability that this estimate misses my selected model's expected error by more than $\epsilon$ is (at most) $p'$.
Right: I ran this procedure on $100$ randomly sampled, independent datasets. Across these datasets, I expect the test set error estimate to miss the selected model's expected error by more than $\epsilon$ in (at most) $p' \cdot 100$ of them. That expectation is the only thing I can back up; I don't know for which datasets the estimate missed, or exactly how many times it missed. And I guarantee nothing about the individual performance of any of the 100 selected models.
Here are things that this procedure does not allow for:
- An unbiased estimator of model $j^*$'s performance. If you want that, get more independent data and evaluate model $j^*$ on it.
- (As with almost all model evaluation procedures) observing the test set error estimate, training a new/corrected model, and then repeating this procedure while hoping for the same guarantee. That's b/c once a model's predictions become dependent on the test labels, Hoeffding's inequality no longer applies. It may go w/o saying, but in general, avoid inducing any dependence between predictions and labels when evaluating a model or a selection procedure. So don't re-fit or re-tune models and then re-evaluate on the test set.
- Unbounded $\text{err}$ functions. We need $\text{err}$ to be bounded to apply Hoeffding's inequality, so regression problems with unbounded errors are excluded.
Example
I've decided I'm going to train at most $m = 10$ models. I've split my data so that $n = 10,000$. For my problem, I'll accept $\epsilon = 0.02$. (Note that the performance guarantee is 2-sided. There's a 1-sided version of this analysis if you're unconcerned with overestimating error, which could be ok in some applications.) If I ran the procedure 1000 times, then I'd expect the test set error estimate to be more than $\epsilon$ away from the selected model's true error in at most 7 cases. That's because $p' \approx 0.0067$.
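As a sanity check on that arithmetic, plug the example's settings into the hoeffding_bound helper sketched above:
hoeffding_bound(m = 10, eps = 0.02, n = 10000)  # 2*10*exp(-8), about 0.0067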
You can also invert the bound to calculate the test set's sample size, assuming you've pre-specified $m$, $\epsilon$, and the target $p'$. Here's some R code for that:
# Smallest (possibly fractional) test set size n such that
# 2*m*exp(-2*eps^2*n) <= p; round up to an integer in practice.
hoeffding_sample_size = function(m, eps, p) {
  (log(2) + log(m) - log(p)) / (2*eps^2)
}
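For example, feeding back the target $p'$ from the example above roughly recovers its test set size:
hoeffding_sample_size(m = 10, eps = 0.02, p = 0.0067)  # about 10,002 examples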
Simulation
Here's a link to a simulation which checks that the performance guarantee is valid for $\epsilon = 0.01, m = 2$.

Note that the bound is useless (for $\epsilon = 0.01, m = 2$) until the test set size gets past ~7k examples.
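I'm not reproducing the linked simulation here, but a minimal sketch of that kind of check might look like the following. The true model errors (0.30 and 0.32), the number of simulated datasets, and the function name are illustrative assumptions; each test example's 0/1 error is simulated directly as a Bernoulli draw.
# Estimate Pr(|Z_bar_{j*} - mu_{j*}| > eps) by simulation and compare to p'.
set.seed(1)
check_guarantee = function(n, eps = 0.01, mu = c(0.30, 0.32), n_sims = 10000) {
  misses = replicate(n_sims, {
    z_bar = sapply(mu, function(p) mean(rbinom(n, size = 1, prob = p)))
    j_star = which.min(z_bar)
    abs(z_bar[j_star] - mu[j_star]) > eps
  })
  c(empirical = mean(misses), bound = 2 * length(mu) * exp(-2 * eps^2 * n))
}
check_guarantee(n = 10000)  # the empirical miss rate should sit below the bound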
Should this procedure be used?
If you'd prefer a performance guarantee about a procedure b/c you're solving a whole bunch of prediction problems, and you have a lot of test data for each, then maybe. But practitioners will almost always want an unbiased estimate of the selected model's out-of-sample error rather than a guarantee about the procedure used to select it. That's why we prefer reserving an independent test set for a single evaluation of a single model.
Adapting the bound for adaptive procedures
Everything above only applies to this procedure: pre-specify $m$ models, train them, evaluate them, and pick the best one. This procedure is less realistic than an adaptive one: train a model, observe its test set error, train an adjusted model, observe its test set error, and repeat until you're satisfied or tired. A performance guarantee can still be made in that case, as shown in:
Blum, Avrim, and Moritz Hardt. "The ladder: A reliable leaderboard for machine learning competitions." International Conference on Machine Learning. PMLR, 2015.
A core idea behind the Ladder Mechanism, in contrast to the plain bound $p'$, is that a model must significantly outperform the previous best before you trust (and release) its test set error estimate.
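For flavor, here's a rough R sketch of the Ladder's update rule as I understand it from the paper (the function name and the default step size eta are my choices; see the paper for the exact algorithm and its guarantee):
# Release a new (rounded) test set error only when a submission improves on the
# best released score by more than eta; otherwise re-release the previous score.
ladder_update = function(best_so_far, new_loss, eta = 0.01) {
  if (new_loss < best_so_far - eta) {
    round(new_loss / eta) * eta  # round the released estimate to a multiple of eta
  } else {
    best_so_far
  }
}
Starting from best_so_far = Inf and applying ladder_update after every submission yields the sequence of released scores; the released estimate only changes when a submission clears the previous best by more than eta.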