My question is a specific instance of what was discussed, in part, in the answers to "Bootstrapping test set?".
Suppose I train a model for which I cannot derive a confidence interval for the error analytically (think random forest or neural network), and further suppose that the training and test sets have roughly the same distribution of the target variable.
Would repeatedly resampling the test set with replacement and evaluating the model on each resample give a valid confidence interval for the model's error? Specifically, could I be X% confident that a fresh random draw of the same size as the test set from the unsampled population would have an error falling within that interval?
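For concreteness, here is a minimal Python sketch of the procedure I have in mind; the function name, the choice of mean squared error as the metric, and the synthetic data are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_error_interval(y_true, y_pred, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap interval for the test-set error.
    The model is never refit; only the (label, prediction) pairs
    of the fixed test set are resampled with replacement."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    errors = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                       # one bootstrap resample
        errors[b] = np.mean((y_true[idx] - y_pred[idx]) ** 2)  # MSE on that resample
    return np.quantile(errors, [alpha / 2, 1 - alpha / 2])

# Toy usage with synthetic labels and predictions (placeholder data):
y_test = rng.normal(size=500)
preds = y_test + rng.normal(scale=0.5, size=500)
lo, hi = bootstrap_error_interval(y_test, preds)
print(f"95% bootstrap interval for test MSE: [{lo:.3f}, {hi:.3f}]")
```

That is, the interval comes from the empirical quantiles of the resampled error distribution, with the model's predictions held fixed throughout.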