
This answer (currently 89 upvotes) states:

  • AIC is best for prediction as it is asymptotically equivalent to cross-validation.
  • BIC is best for explanation as it allows consistent estimation of the underlying data-generating process.

However, Model Selection and Multimodel Inference by Burnham and Anderson seems to me to contradict that:

The assumed purpose of the BIC-selected model was often simple prediction; as opposed to scientific understanding of the process or system under study.

Is there any conceptual reason to prefer one of AIC, BIC to the other for a) prediction and b) explanation? Are the two quotes actually inconsistent, or is there some way to reconcile them?

Nick Cox
Mohan
  • I would be quite surprised if, in spite of these theoretical properties, any 'explanation' or 'scientific understanding' has been hindered by a specific choice of information criteria. – Kuku Nov 23 '23 at 10:50
  • The problem is due to fixed penalties in fitting procedures with respect to sample size. You may look at this paper for variable learning: https://arxiv.org/pdf/0807.1005.pdf – Cagdas Ozgenc Nov 24 '23 at 14:47
  • I doubt that "explanation" and particularly "quality of explanation" are uniquely defined. So what is "best for explanation" would be relative to what exactly is required, and the word "explanation" isn't precise enough to define it. – Christian Hennig Nov 24 '23 at 23:27
  • There is no contradiction whatsoever between the two statements; the first one is comparative and the second one isn't. (That said both of these can be controversial, but not because they'd contradict each other.) – Christian Hennig Nov 24 '23 at 23:29
  • Also note that particularly the "BIC is best for explanation" statement assumes that the assumed model is true, which in reality is never the case. (Note that my comments may be seen as implying that the AIC is better as I'm criticising a specific argument against the BIC, but I have seen many situations in which I have preferred the BIC because the AIC delivered a model clearly too big for the purpose of the study. So I do not in general prefer the AIC, although in situations where only prediction is of interest, I do.) – Christian Hennig Nov 24 '23 at 23:30
  • I gained a much better understanding into this after watching 'Model Selection and the Cult of AIC' by Mark Brewer. https://www.youtube.com/watch?v=lEDpZmq5rBw – Mohan Nov 29 '23 at 13:32

2 Answers


In this case, B&A are saying that people have used BIC to select a prediction model, even though this isn't what BIC actually optimises for.

Importantly, though, AIC and BIC will agree in the overwhelming majority of cases. This makes sense, because the model you should use for prediction is often (but not always, see the comments below) the model you believe is most likely to be the true one. This means that using BIC to select a prediction model isn't a huge problem.

In the cases where AIC and BIC disagree, it's generally because you have a lot of data. This can produce a situation where, even though the simpler model is probably the true one (according to BIC), the margin is small, and with a large dataset overfitting won't be a problem for the more flexible model, so you would expect it to make better predictions.
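A rough simulation sketch of this (the data-generating process, the nested candidate models, and the sample sizes below are illustrative assumptions, not anything canonical), checking how often the two criteria pick the same model at a small and a large $n$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def agreement_rate(n, n_sims=500):
    """Fraction of simulations in which AIC and BIC select the same nested OLS model."""
    agree = 0
    for _ in range(n_sims):
        X = rng.normal(size=(n, 5))
        # True model uses only the first two predictors.
        y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
        ics = []
        for k in (2, 5):  # candidate models: first k predictors
            fit = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
            ics.append((fit.aic, fit.bic))
        aic_pick = int(np.argmin([ic[0] for ic in ics]))
        bic_pick = int(np.argmin([ic[1] for ic in ics]))
        agree += (aic_pick == bic_pick)
    return agree / n_sims

for n in (50, 5000):
    print(f"n={n}: AIC and BIC agree in {agreement_rate(n):.0%} of simulated selections")
```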

Eoin
  • This is good overall, but I disagree that the true (or truer) model will necessarily be better for prediction. – gung - Reinstate Monica Nov 23 '23 at 20:31
  • @gung-ReinstateMonica, we may want a simpler model than the true one to reduce variance (at the expense of bias), but AIC would yield a larger one than the BIC. And AIC is supposed to be asymptotically optimal for prediction while BIC is a consistent selection criterion. This apparent paradox caused me some confusion (as in this question) years ago that I never managed to figure out completely. – Richard Hardy Nov 24 '23 at 07:50
  • @RichardHardy, what I mean is just that prediction is a different activity, w/ different goals. An easy example is that your best route to predict a variable is from its effects, rather than its cause, even though that's 'backwards' in reality. There can be lots of ways that the true model will not be best for prediction. – gung - Reinstate Monica Nov 24 '23 at 12:59
  • @gung-ReinstateMonica, I think that in the context of information criteria, the concept of the true model does not involve the notion of causality in any way. It is a probabilistic model where the dependent variable is a random variable the distribution of which is determined by some explanatory variables. – Richard Hardy Nov 24 '23 at 13:26
  • @RichardHardy There is no paradox, AIC risk is higher on small samples (which you can see in my answer to the question you are referring to). In literature they are comparing asymptotic risk, which is kind of stupid because we actually care about small sample risk. I would always use a corrected version of AIC, or variable learning rate as Grunwald suggests. – Cagdas Ozgenc Nov 24 '23 at 14:40
  • @gung-ReinstateMonica This is a terrible misunderstanding, because definition of "model" in literature is a composite hypothesis (i.e. values not assigned to parameters). When you would define a model as a point hypothesis (i.e. values assigned to parameters), then in fact the true model always yields the best prediction. – Cagdas Ozgenc Nov 24 '23 at 14:42
  • This is not an issue to be debated in comments. I'll leave the discussion here. – gung - Reinstate Monica Nov 24 '23 at 20:30
  • I've clarified my response, and don't think there's any real disagreement here. I maybe should have stopped after my first paragraph, since there's already plenty on this site on this topic. – Eoin Nov 24 '23 at 21:44
  • "Importantly, though, AIC and BIC will agree in the overwhelming majority of cases." Do you have any evidence for this claim? I doubt it. I have seen them disagreeing in lots of situations, and for sure to back up a claim like this you'd need to define the set of cases to be "counted" properly, and then one would need to look at a random or representative of such cases to see whether an "overwhelming majority" can be nailed down with some confidence. – Christian Hennig Nov 24 '23 at 23:19
  • I meant to write "at a random or representative sample of such cases". – Christian Hennig Nov 24 '23 at 23:35
  • "the model you should use for prediction is often (...) the model you believe is most likely to be the true one" - I advise against believing any model to be true. – Christian Hennig Nov 24 '23 at 23:38
  • @Christian Hennig, these are all totally valid points, but I think your bar for how theoretically nuanced responses need to be on the stats section of stackoverflow is unreasonably high. – Eoin Nov 25 '23 at 08:48
  • Well, I didn't downvote, so I'm not saying your response overall is bad. Anyway, I think it's really relevant whether we should normally expect AIC and BIC to agree (you say "importantly" yourself), and as I wrote, my experience doesn't really agree with what you say there. Chances are both of us have seen "samples" of problems that cannot really be said to be "representative". Anyway, I'd be honestly curious regarding your experiential basis for saying this because it will surely depend on many characteristics of problem and data. – Christian Hennig Nov 25 '23 at 10:39

Is there any conceptual reason to prefer one of AIC, BIC to the other for a) prediction and b) explanation?

Both AIC and BIC describe the model's fit to the training data, more in a goodness-of-fit manner than as actual prediction accuracy. As a model is always bound to overfit on its training data, I find these metrics irrelevant for the question of prediction. That is, I would not use either of them when choosing a model based on its predictive performance. For that we can compare out-of-sample metrics (MSE in regression, Brier or log loss in classification). See Section 2.4 here for a discussion of comparing Brier scores in logistic regression models.
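A minimal sketch of that kind of comparison (the toy datasets and candidate models below are arbitrary illustrative choices), using cross-validated MSE for regression and log loss for classification instead of AIC/BIC:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Regression: pick the candidate with the lower cross-validated MSE.
Xr, yr = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    mse = -cross_val_score(model, Xr, yr, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{type(model).__name__}: CV MSE = {mse:.1f}")

# Classification: compare cross-validated log loss (the Brier score works the
# same way via scoring="neg_brier_score").
Xc, yc = make_classification(n_samples=300, n_features=10, random_state=0)
for C in (0.1, 1.0):
    ll = -cross_val_score(LogisticRegression(C=C, max_iter=1000), Xc, yc,
                          cv=5, scoring="neg_log_loss").mean()
    print(f"LogisticRegression(C={C}): CV log loss = {ll:.3f}")
```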

So we're left with explanation. I'm not a bigshot Bayesian, but I have had my courses. From a Bayesian perspective, both metrics (they are very similar, differing only in the penalty on the number of features) are insufficient. There are three main reasons why:

  1. The number of features isn't necessarily a true reflection of model complexity
  2. No consideration of the prior
  3. No optimality

For a dataset $D=\{x_i,y_i\}$ with $k$ features and $n$ samples we have $$BIC=\ln(n)\cdot k-2\ln P(D|\theta^{MLE})$$ (in AIC it's $2k$ instead of $\ln(n)\cdot k$). Assuming normality ($y_i=x_i^T\theta+\eta_i$ with $\eta_i\sim\mathcal{N}(0,\sigma^2)$ i.i.d.) and denoting the model prediction for point $x_i$ as $f_\theta(x_i)$, if we expand $P(D|\theta^{MLE})$ we get

$$P(D|\theta^{MLE})=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left( -\frac{1}{2\sigma^2}(y_i-f_\theta(x_i))^2 \right)$$

that is, taking $-2\ln$ of the likelihood and absorbing the constant $n\ln(2\pi\sigma^2)$ into $C$,

$$BIC=\ln(n)\cdot k+\frac{1}{\sigma^2}\sum_i(y_i-f_\theta(x_i))^2+C$$

Note that if a simple model is nested inside a more complex one, the sum of squared residuals will always be at least as small for the more complex model, which reflects the likelihood term's inherent preference for complex models; the $\ln(n)\cdot k$ penalty is what counteracts it.
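A small numerical sketch of this decomposition (assuming a known $\sigma^2=1$ and an arbitrary simulated dataset): for nested models the residual term can only shrink as features are added, and the $\ln(n)\cdot k$ penalty is what pushes back:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 200, 1.0
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=np.sqrt(sigma2), size=n)  # only feature 0 matters

for k in (1, 3, 5):  # nested models using the first k features
    Xk = X[:, :k]
    theta = np.linalg.lstsq(Xk, y, rcond=None)[0]      # MLE under known sigma^2
    rss_term = np.sum((y - Xk @ theta) ** 2) / sigma2  # never increases with k
    penalty = np.log(n) * k
    print(f"k={k}: RSS/sigma^2 = {rss_term:7.1f}, penalty = {penalty:5.1f}, "
          f"BIC (up to the constant C) = {rss_term + penalty:7.1f}")
```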

In the Bayesian way of life, we consider the evidence function $P(D)$ for comparing "explanations" (in quotes, since explanation is also a whole branch of AI model explainability). Under the same settings as before and considering a normal prior $\theta\sim\mathcal{N}(\mu_\theta,\Sigma_\theta)$, the evidence function would be

$$P(D)=P(\theta^{MAP})P(D|\theta^{MAP})|\Sigma_{\theta|D}|^\frac{1}{2}(2\pi)^\frac{k}{2}$$

where $$\Sigma_{\theta|D}=\left(\frac{1}{\sigma^2}X^TX+\Sigma_\theta^{-1}\right)^{-1}, \quad \theta^{MAP}=\left(\frac{1}{\sigma^2}X^TX+\Sigma_\theta^{-1}\right)^{-1}\left(\frac{1}{\sigma^2}X^Ty+\Sigma_\theta^{-1}\mu_\theta\right)$$

$P(D|\theta^{MAP})$ is the likelihood defined as above and $P(\theta^{MAP})$ is the prior density at $\theta^{MAP}$.

This product can be decomposed into two parts:

  • The $P(D|\theta^{MAP})$ term which prefers complex models (as explained before)
  • The product $P(\theta^{MAP})|\Sigma_{\theta|D}|^\frac{1}{2}(2\pi)^\frac{k}{2}$, which is known as the Occam factor or Bayesian Occam's razor and prefers simple models
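A minimal sketch of this comparison (assuming a zero-mean isotropic prior, a known noise variance, and a simulated dataset). For a Gaussian prior and likelihood the evidence is also available exactly as $y\sim\mathcal{N}(X\mu_\theta,\ \sigma^2 I+X\Sigma_\theta X^T)$, which is what the code below evaluates instead of the MAP decomposition:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, sigma2, prior_var = 100, 1.0, 1.0
X = rng.normal(size=(n, 5))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=np.sqrt(sigma2), size=n)

def log_evidence(Xk):
    """Exact log P(D) for Bayesian linear regression with prior N(0, prior_var * I)."""
    cov = sigma2 * np.eye(len(y)) + prior_var * Xk @ Xk.T
    return multivariate_normal(mean=np.zeros(len(y)), cov=cov).logpdf(y)

for k in (1, 2, 5):  # nested models: first k features
    print(f"k={k}: log evidence = {log_evidence(X[:, :k]):.1f}")
# The 2-feature model (the data-generating one) typically gets the highest
# evidence: the extra flexibility of k=5 is penalised by the Occam factor.
```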

Although I identify as a frequentist (never as a Bayesian), the evidence function offers a more comprehensive view than AIC/BIC.

Spätzle
  • "As a model is always bound to overfit on its training data, I find these metrics irrelevant for the question of prediction." AIC is specifically derived as the expected value of twice the negative log-likelihood on a new data point from the same population. The penalty term in AIC takes care of overfitting precisely. If your evaluation loss function is negative log-likelihood, this is all you need. (Otherwise, things may get more complicated; see this question.) Thus, I find AIC quite helpful in model selection for prediction. – Richard Hardy Nov 26 '23 at 15:40
  • I wonder if a similar result could be derived for BIC (under different assumptions), given its asymptotic equivalence to $k$-fold cross-validation for a certain fold size (again, under some assumptions). – Richard Hardy Nov 26 '23 at 15:43