I'm building a predictive model with potentially multiple predictors. To that end, I fit a sequence of nested models, each with one more predictor than the previous one, and compare their AICs. The AIC falls with each new predictor, but very slowly after the second one. Since the AIC is itself a random variable, I worry that a formally better model, whose AIC is less than 0.5% lower than the previous one's, is not truly better, but only better by chance.
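Roughly, this is what I am doing (a minimal Python sketch, assuming a binary outcome and a logistic model since I use log-loss later; the data frame `df`, the outcome name and the predictor names are placeholders):

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder predictor names, in the order they enter the nested models.
predictors = ["x1", "x2", "x3", "x4", "x5"]

def nested_aics(df: pd.DataFrame, outcome: str, predictors: list) -> pd.Series:
    """Fit one logistic model per nested predictor set and collect the AICs."""
    aics = {}
    for k in range(1, len(predictors) + 1):
        X = sm.add_constant(df[predictors[:k]])
        fit = sm.Logit(df[outcome], X).fit(disp=0)
        aics[" + ".join(predictors[:k])] = fit.aic
    return pd.Series(aics)
```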
So I thought I'd compare the models by bootstrapping. There are at least two ways I can think of:
- For each set of predictors, generate 1000 (or whatever) different bootstrap datasets, fit a model on each dataset and record its AIC. Plot the distribution of AICs over the different sets of predictors ('Full model' corresponds to 'AWFST' in the boxplot); a code sketch of this approach follows after the list:
Or:
- For each set of predictors, train the model once on the full dataset. Generate 1000 different bootstrap datasets, use each fitted model to make predictions on each bootstrap dataset and record its log-loss. Plot the distribution of log-losses over the different sets of predictors; a sketch of this approach is also given after the list:
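A sketch of the first approach (refit the model on every bootstrap sample and record its AIC), under the same placeholder setup as above:

```python
import numpy as np
import statsmodels.api as sm

def bootstrap_aics(df, outcome, predictors, n_boot=1000, seed=42):
    """Refit the model on each bootstrap sample and record its AIC."""
    rng = np.random.default_rng(seed)
    aics = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(df), size=len(df))  # resample rows with replacement
        boot = df.iloc[idx]
        X = sm.add_constant(boot[predictors], has_constant="add")
        aics[b] = sm.Logit(boot[outcome], X).fit(disp=0).aic
    return aics  # one such distribution per predictor set goes into the boxplot
```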
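And a sketch of the second approach (fit once on the full data, then score each bootstrap sample with log-loss); calling both functions with the same `seed` is how I keep the random draws comparable:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import log_loss

def bootstrap_log_losses(df, outcome, predictors, n_boot=1000, seed=42):
    """Fit once on the full data, then score each bootstrap sample with log-loss."""
    rng = np.random.default_rng(seed)
    X_full = sm.add_constant(df[predictors], has_constant="add")
    fit = sm.Logit(df[outcome], X_full).fit(disp=0)
    losses = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(df), size=len(df))  # same resampling scheme as above
        boot = df.iloc[idx]
        X_boot = sm.add_constant(boot[predictors], has_constant="add")
        losses[b] = log_loss(boot[outcome], fit.predict(X_boot))
    return losses
```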
For better comparison, the same random seed was used in both approaches. As you can see, the results are quite similar, but not identical. Does either of these approaches make sense and, if so, is one 'better' than the other? If not, where am I making a mistake?



