Suppose you perform cross-validation to obtain an optimal value for some vector of hyperparameters $\lambda$.
You ultimately want to predict some new observations $y_\mathrm{query}|X_\mathrm{query}$.
It seems that you have at least three choices for how to proceed:
- Estimate the model parameters $\hat\theta_i$ on each cross-validation training sample $i=1,\dots,n$, with the optimal $\lambda$, then average these values to obtain a final estimate, $\hat{\bar\theta}:=\frac{1}{n}\sum_{i=1}^n{\hat\theta_i}$. Use these averaged estimates $\hat{\bar\theta}$ to perform the required prediction, $\hat y_{\mathrm{query},\hat{\bar\theta}}:=\mathbb{E}[y_\mathrm{query}|X_\mathrm{query},\theta=\hat{\bar\theta}]$.
- Estimate the required predictions $\hat y_{\mathrm{query},i}$ on each cross-validation training sample $i=1,\dots,n$, with the optimal $\lambda$, then average these values to obtain a final prediction, $\hat {\bar y}_\mathrm{query}:=\frac{1}{n}\sum_{i=1}^n{\hat y_{\mathrm{query},i}}$.
- Using the optimal $\lambda$, re-estimate the model on the entire sample, to obtain $\hat\theta_*$. Use these parameters to perform the required prediction, $\hat y_{\mathrm{query},\hat\theta_*}:=\mathbb{E}[y_\mathrm{query}|X_\mathrm{query},\theta=\hat\theta_*]$.
Which of these methods is most common? What are their advantages and disadvantages?
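For concreteness, here is a minimal sketch of the three options using ridge regression as the model, with `numpy` only. The data, the value of $\lambda$, and the `fit_ridge` helper are all illustrative assumptions, not part of the question itself; $\lambda$ is taken as already chosen by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data; lambda is assumed to have been chosen by cross-validation already.
n, p, lam = 100, 5, 1.0
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)
X_query = rng.normal(size=(3, p))

def fit_ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# K-fold split: each fold's complement is one cross-validation training sample.
K = 5
folds = np.array_split(rng.permutation(n), K)
thetas = []
for k in range(K):
    train = np.setdiff1d(np.arange(n), folds[k])
    thetas.append(fit_ridge(X[train], y[train], lam))
thetas = np.stack(thetas)

# Option 1: average the per-fold parameter estimates, then predict once.
theta_bar = thetas.mean(axis=0)
yhat_1 = X_query @ theta_bar

# Option 2: predict with each per-fold fit, then average the predictions.
yhat_2 = np.stack([X_query @ th for th in thetas]).mean(axis=0)

# Option 3: refit on the entire sample with the chosen lambda, then predict.
theta_star = fit_ridge(X, y, lam)
yhat_3 = X_query @ theta_star

# For a model whose predictions are linear in theta, options 1 and 2 coincide.
assert np.allclose(yhat_1, yhat_2)
```

Note that for any model whose predictions are linear in $\theta$, options 1 and 2 give identical answers; for nonlinear models they generally differ, and option 3 differs from both because it is fit on all $n$ observations rather than on training subsets.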