I'm building a predictive model with potentially multiple predictors. To that end, I fit a sequence of nested models, each with one more predictor than the previous one, and compare their AICs. The AIC falls with each new predictor, but very slowly after the second one. Since the AIC is itself a random variable, I worry that a formally better model, whose AIC is less than 0.5% lower than the previous one's, is not truly better, but only better by chance.
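Roughly, this is what I am doing (a minimal Python sketch, assuming a binary outcome and a logistic model since I use log-loss later; the data frame `df`, the outcome name and the predictor names are placeholders):

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder predictor names, in the order they enter the nested models.
predictors = ["x1", "x2", "x3", "x4", "x5"]

def nested_aics(df: pd.DataFrame, outcome: str, predictors: list) -> pd.Series:
    """Fit one logistic model per nested predictor set and collect the AICs."""
    aics = {}
    for k in range(1, len(predictors) + 1):
        X = sm.add_constant(df[predictors[:k]])
        fit = sm.Logit(df[outcome], X).fit(disp=0)
        aics[" + ".join(predictors[:k])] = fit.aic
    return pd.Series(aics)
```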
So I thought I'd compare the models by bootstrapping. There are at least two ways I can think of:
- For each set of predictors, generate 1000 (or whatever) different bootstrap datasets, fit a model on each dataset and record its AIC. Plot the distribution of AICs over the different sets of predictors ('Full model' corresponds to 'AWFST' in the boxplot); a code sketch of this approach follows after the list:
Or:
- For each set of predictors, train the model once on the full dataset. Generate 1000 different bootstrap datasets, use each fitted model to make predictions on each bootstrap dataset and record its log-loss. Plot the distribution of log-losses over the different sets of predictors; a sketch of this approach is also given after the list:
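A sketch of the first approach (refit the model on every bootstrap sample and record its AIC), under the same placeholder setup as above:

```python
import numpy as np
import statsmodels.api as sm

def bootstrap_aics(df, outcome, predictors, n_boot=1000, seed=42):
    """Refit the model on each bootstrap sample and record its AIC."""
    rng = np.random.default_rng(seed)
    aics = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(df), size=len(df))  # resample rows with replacement
        boot = df.iloc[idx]
        X = sm.add_constant(boot[predictors], has_constant="add")
        aics[b] = sm.Logit(boot[outcome], X).fit(disp=0).aic
    return aics  # one such distribution per predictor set goes into the boxplot
```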
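And a sketch of the second approach (fit once on the full data, then score each bootstrap sample with log-loss); calling both functions with the same `seed` is how I keep the random draws comparable:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import log_loss

def bootstrap_log_losses(df, outcome, predictors, n_boot=1000, seed=42):
    """Fit once on the full data, then score each bootstrap sample with log-loss."""
    rng = np.random.default_rng(seed)
    X_full = sm.add_constant(df[predictors], has_constant="add")
    fit = sm.Logit(df[outcome], X_full).fit(disp=0)
    losses = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(df), size=len(df))  # same resampling scheme as above
        boot = df.iloc[idx]
        X_boot = sm.add_constant(boot[predictors], has_constant="add")
        losses[b] = log_loss(boot[outcome], fit.predict(X_boot))
    return losses
```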
For better comparison, the same random seed was used in both approaches. As you can see, the results are quite similar, but not identical. Does either of these approaches make sense and, if so, is one 'better' than the other? If not, where am I making a mistake?



