
According to Bishop's Pattern Recognition and Machine Learning (page 164),

"on average the Bayes factor will always favour the correct model."

Given this, how can we use the Bayes factor in practice when we only have one dataset? When comparing multiple models, can we be sure that the model with the highest Bayes factor is closest to the true model? If not, does it even make sense to use the Bayes factor in practice for model comparison?

Gilles
  • 1,032
  • I do not have an answer for you, but I'm pretty certain that using Bayes factors to compare/select models is discouraged nowadays because the comparison is very sensitive to the priors used. – Wayne Jan 26 '16 at 20:31
  • Discouraged by whom? Not by, e.g., Jim Berger. – innisfree Mar 10 '21 at 03:19

1 Answer


First, I'm not sure about the quote. Let me denote the Bayes factor by $$ B_{10} = \frac{P(D|M_1)}{P(D|M_0)} $$ and write $\langle \cdot \rangle_0$ for an expectation taken under model $M_0$, and so on.

There is a trivial lemma, sometimes attributed to Alan Turing, that $$ \langle B_{10} \rangle_0 = \langle B_{01} \rangle_1 = 1, $$ so, no, it is not strictly true that the expected Bayes factor favours the correct model. But it is morally true, in a sense. The above result is misleading, as the distribution of Bayes factors is usually peaked well below $1$ (favouring the correct model) but quite heavy-tailed, with extreme Bayes factors in the wrong direction pulling the average up to one.


More detail on the above result: $$ \langle B_{10} \rangle_0 = \int P(D|M_0) \frac{P(D|M_1)}{P(D|M_0)} dD = \int P(D|M_1) dD = 1. $$


This happens in part because, in the above expression, Bayes factors that favour the correct model are squeezed between 0 and 1, whereas those that favour the wrong model can lie anywhere between 1 and infinity.
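To see both effects at once, here is a minimal simulation sketch under an assumed toy setup of my own choosing: two simple point hypotheses, $M_0: x \sim N(0,1)$ and $M_1: x \sim N(1,1)$, with datasets of five points generated from $M_0$. Because both models are simple, the marginal likelihood of each is just its likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets, n_points = 100_000, 5

def log_normal_pdf(x, mu):
    """Log density of N(mu, 1) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

# Simulate many datasets under M0 and compute B10 = P(D|M1) / P(D|M0) for each.
x = rng.normal(loc=0.0, scale=1.0, size=(n_datasets, n_points))
log_B10 = (log_normal_pdf(x, 1.0) - log_normal_pdf(x, 0.0)).sum(axis=1)
B10 = np.exp(log_B10)

print("mean of B10     :", B10.mean())        # close to 1 (Turing's lemma), though noisy
print("median of B10   :", np.median(B10))    # well below 1: most datasets favour M0
print("P(B10 > 1)      :", (B10 > 1).mean())  # only a minority of datasets favour M1
print("mean of log B10 :", log_B10.mean())    # negative, as discussed below
```

In this toy setup $\log B_{10}$ is exactly normal with mean $-n/2$ and variance $n$, so the median of $B_{10}$ is $e^{-n/2} \approx 0.08$ and only about $13\%$ of datasets favour the wrong model, even though the mean of $B_{10}$ is exactly $1$.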

So instead consider the log, which treats both cases more symmetrically: $$ \langle \log B_{10} \rangle_0 = \int P(D|M_0) \log \frac{P(D|M_1)}{P(D|M_0)} dD. $$ What can we say about this? Well, by Gibbs' inequality we can in fact say $$ \langle \log B_{10} \rangle_0 \le 0, $$ and that the bound is saturated at 0 if and only if $P(D|M_1) = P(D|M_0)$ almost everywhere. This is the sort of result we anticipated: the expected log Bayes factor always favours the true model, except when the two models under consideration are identical, in which case it favours neither.
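For completeness, the quantity on the left is minus a Kullback-Leibler divergence, which is exactly what Gibbs' inequality bounds: $$ \langle \log B_{10} \rangle_0 = -\int P(D|M_0) \log \frac{P(D|M_0)}{P(D|M_1)} dD = -\mathrm{KL}\!\left(P(\cdot|M_0) \,\|\, P(\cdot|M_1)\right) \le 0. $$ In the toy Gaussian sketch above, this divergence works out to $1/2$ per data point, so $\langle \log B_{10} \rangle_0 = -n/2$, matching the simulated mean of $\log B_{10}$.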

Now, with that established, you seem to be asking: how can we use (log) Bayes factors if they only indicate the correct model on average, i.e., if there is no guarantee that they always indicate the correct model?

Well, that objection seems rather churlish to me. We are trying to reason in the light of uncertainty. We cannot eliminate that uncertainty, and must simply live with the chance that we make a mistake. You could consider the plausibility of making a mistake in the case at hand and weigh up the risks of taking various courses of action (i.e., take a Bayesian decision-theoretic approach, sketched below), or try to control long-run error rates (i.e., a frequentist error-theoretic approach). But in either case, you won't ever eliminate the possibility of making a mistake.
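Here is a minimal sketch of that decision-theoretic option. The observed Bayes factor, the prior odds and the loss table are all made-up numbers, purely for illustration; the point is only that the Bayes factor feeds into posterior probabilities, which are then weighed against the costs of each action.

```python
# Sketch: combine a Bayes factor with prior odds and a loss table (illustrative numbers only).
B10 = 5.0                       # observed Bayes factor P(D|M1) / P(D|M0)
prior_odds = 1.0                # prior odds P(M1) / P(M0)
posterior_odds = B10 * prior_odds
p_M1 = posterior_odds / (1.0 + posterior_odds)   # posterior probability of M1
p_M0 = 1.0 - p_M1

# loss[action][true_model]: cost of taking "action" when "true_model" holds.
# Here switching to M1 when M0 is actually true is assumed to be the costly mistake.
loss = {
    "act_as_if_M0": {"M0": 0.0, "M1": 1.0},
    "act_as_if_M1": {"M0": 10.0, "M1": 0.0},
}

expected_loss = {a: c["M0"] * p_M0 + c["M1"] * p_M1 for a, c in loss.items()}
best_action = min(expected_loss, key=expected_loss.get)

print(f"P(M1 | D) = {p_M1:.3f}")
print("expected losses:", expected_loss)
print("best action:", best_action)   # sticks with M0, despite B10 = 5 favouring M1
```

With these (asymmetric) losses the decision is to carry on acting as if $M_0$ were true even though the Bayes factor favours $M_1$ five to one; the Bayes factor summarises the evidence, but the decision also depends on the stakes.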

innisfree
  • 1,480