Is there any conceptual reason to prefer one of AIC or BIC over the other
for a) prediction and b) explanation?
Both AIC and BIC describe how well the model fits the training data, in a goodness-of-fit sense rather than as a measure of actual predictive accuracy. Since a model is always bound to overfit its training data to some degree, I find these metrics of little relevance to the question of prediction; that is, I would not use either of them to choose a model based on its predictive performance. For that purpose we can compare proper predictive metrics on held-out data (MSE in regression, Brier/log score in classification). See Section 2.4 here for a discussion of comparing Brier scores between logistic regression models.
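As a minimal sketch of what I mean (my own illustration on synthetic data with scikit-learn, not taken from the linked discussion; the feature counts and model choices are arbitrary): compare candidate models by a cross-validated Brier score rather than by AIC/BIC.

```python
# Compare a smaller and a larger candidate model by cross-validated Brier score
# (out-of-sample predictive accuracy) instead of in-sample AIC/BIC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for name, Xk in [("5 features", X[:, :5]), ("20 features", X)]:
    # "neg_brier_score" is higher-is-better, so negate to report the Brier score itself
    brier = -cross_val_score(LogisticRegression(max_iter=1000), Xk, y,
                             cv=5, scoring="neg_brier_score").mean()
    print(f"{name}: cross-validated Brier score = {brier:.4f}")
```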
So we're left with explanation. I'm no bigshot Bayesian, but I've had my courses. From a Bayesian perspective, both metrics (they are very similar, differing only in the penalty placed on the number of features) are insufficient. There are three main reasons why:
- Feature number isn't necessarily a true reflection of model complexity
- No consideration of the prior
- No optimality
For a dataset $D=\{x_i,y_i\}_{i=1}^n$ with $k$ features and $n$ samples we have
$$BIC=\ln(n)\cdot k-2\ln P(D|\theta^{MLE})$$
(in AIC it's $2k$ instead of $\ln(n)\cdot k$). Assuming normality ($y=X\theta+\eta,\quad \eta\sim\mathcal{N}(0,\sigma^2 I)$) and denoting the model's prediction for point $x_i$ by $f_\theta(x_i)$, expanding $P(D|\theta^{MLE})$ gives
$$P(D|\theta^{MLE})=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left( -\frac{1}{2\sigma^2}(y_i-f_\theta(x_i))^2 \right)$$
that is,
$$BIC=\ln(n)\cdot k+\frac{1}{\sigma^2}\sum_i(y_i-f_\theta(x_i))^2+C$$
Note that if a simple model is nested inside a more complex one, the sum of squared residuals can never be larger for the more complex model (its MLE fits the training data at least as well), so the likelihood term alone reflects an inherent preference for complex models; the $\ln(n)\cdot k$ penalty is what pushes back.
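A small numerical sketch of this trade-off (my own illustration on synthetic data, assuming the noise variance $\sigma^2$ is known, as in the formula above): for nested linear models the RSS only decreases as features are added, while the penalty term grows.

```python
# For nested linear models, RSS can only decrease as features are added;
# the ln(n)*k (BIC) or 2k (AIC) penalty is what counters this.
# sigma2 is assumed known here, matching the formula above (constant C dropped).
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 200, 1.0
X = rng.normal(size=(n, 10))
theta_true = np.array([2.0, -1.0, 0.5] + [0.0] * 7)   # only the first 3 features matter
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

for k in (2, 3, 6, 10):                               # nested models: first k features
    Xk = X[:, :k]
    theta_mle = np.linalg.lstsq(Xk, y, rcond=None)[0]
    rss = np.sum((y - Xk @ theta_mle) ** 2)
    bic = np.log(n) * k + rss / sigma2
    aic = 2 * k + rss / sigma2
    print(f"k={k:2d}  RSS={rss:7.1f}  BIC~{bic:7.1f}  AIC~{aic:7.1f}")
```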
In the Bayesian way of life, we compare "explanations" via the evidence function $P(D)$ (quotes because "explanation" also names a real branch of AI model explainability, which is not what I mean here). Under the same settings as before, and with a normal prior $\theta\sim\mathcal{N}(\mu_\theta,\Sigma_\theta)$, the evidence function is (by the Laplace approximation, which is exact in this Gaussian case)
$$P(D)=P(\theta^{MAP})P(D|\theta^{MAP})\,|\Sigma_{\theta|D}|^\frac{1}{2}(2\pi)^\frac{k}{2}$$
where
$$\Sigma_{\theta|D}=\left(\frac{1}{\sigma^2}X^TX+\Sigma_\theta^{-1}\right)^{-1}, \quad \theta^{MAP}=\Sigma_{\theta|D}\left(\frac{1}{\sigma^2}X^Ty+\Sigma_\theta^{-1}\mu_\theta\right)$$
Here $P(D|\theta^{MAP})$ is the likelihood defined as above (just evaluated at $\theta^{MAP}$ rather than $\theta^{MLE}$), and $P(\theta^{MAP})$ is the prior density evaluated at the MAP estimate.
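For concreteness, here is a sketch of these quantities in code (my own illustration on synthetic data; the prior mean, prior covariance, and $\sigma^2$ are arbitrary choices, and in this Gaussian-prior/Gaussian-noise setting the Laplace expression for $P(D)$ is exact):

```python
# MAP estimate, posterior covariance, and log evidence for Bayesian linear
# regression with prior N(mu_theta, Sigma_theta) and noise variance sigma2,
# following the formulas above (exact in this conjugate Gaussian setting).
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
n, k, sigma2 = 100, 3, 0.5
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -2.0, 0.0]) + rng.normal(scale=np.sqrt(sigma2), size=n)

mu_theta = np.zeros(k)        # prior mean (an assumption for this sketch)
Sigma_theta = np.eye(k)       # prior covariance (an assumption for this sketch)

Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Sigma_theta))
theta_map = Sigma_post @ (X.T @ y / sigma2 + np.linalg.inv(Sigma_theta) @ mu_theta)

# log P(D) = log P(theta_MAP) + log P(D|theta_MAP) + 0.5*log|Sigma_post| + (k/2)*log(2*pi)
log_prior = multivariate_normal.logpdf(theta_map, mean=mu_theta, cov=Sigma_theta)
log_lik = norm.logpdf(y, loc=X @ theta_map, scale=np.sqrt(sigma2)).sum()
log_occam_vol = 0.5 * np.linalg.slogdet(Sigma_post)[1] + 0.5 * k * np.log(2 * np.pi)
print("log evidence =", log_prior + log_lik + log_occam_vol)
```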
This product can be decomposed into two parts:
- The $P(D|\theta^{MAP})$ term which prefers complex models (as
explained before)
- The product $P(\theta^{MAP})\,|\Sigma_{\theta|D}|^\frac{1}{2}(2\pi)^\frac{k}{2}$, which is known as the Occam factor or Bayesian Occam's razor, and prefers simple models (see the sketch after this list)
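To see the trade-off numerically (same Gaussian assumptions and a synthetic setup like the previous sketch, my own illustration): as features are added, the fit term keeps growing while the Occam factor keeps shrinking, and the evidence favors the model that balances the two.

```python
# Split log P(D) into the fit term log P(D|theta_MAP) and the Occam factor
# log[ P(theta_MAP) * |Sigma_post|^(1/2) * (2*pi)^(k/2) ] for nested models.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(2)
n, sigma2 = 150, 1.0
X_all = rng.normal(size=(n, 8))
y = X_all[:, :2] @ np.array([1.5, -1.0]) + rng.normal(scale=np.sqrt(sigma2), size=n)

for k in (1, 2, 4, 8):                                       # nested models: first k features
    X = X_all[:, :k]
    Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(k))  # unit Gaussian prior
    theta_map = Sigma_post @ (X.T @ y / sigma2)
    fit = norm.logpdf(y, loc=X @ theta_map, scale=np.sqrt(sigma2)).sum()
    occam = (multivariate_normal.logpdf(theta_map, mean=np.zeros(k), cov=np.eye(k))
             + 0.5 * np.linalg.slogdet(Sigma_post)[1] + 0.5 * k * np.log(2 * np.pi))
    print(f"k={k}  fit={fit:8.2f}  Occam factor={occam:7.2f}  log evidence={fit + occam:8.2f}")
```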
Although I identify as a frequentist (never as a Bayesian), I find the evidence function a more comprehensive view than AIC/BIC.