23

I read that in Bayes rule, the denominator $\Pr(\textrm{data})$ of

$$\Pr(\text{parameters} \mid \text{data}) = \frac{\Pr(\textrm{data} \mid \textrm{parameters}) \Pr(\text{parameters})}{\Pr(\text{data})}$$

is called a normalizing constant. What exactly is it? What is its purpose? Why does it look like $\Pr(\textrm{data})$? Why doesn't it depend on the parameters?

amateur
  • 384
  • 6
When you integrate $f(\text{data} \mid \text{params})f(\text{params})$, you are integrating over the parameters and so the result has no term depending on the parameters, in the same way that $\int_{x=0}^{x=2}xy\;dx = 2y$ does not depend on $x$. – Henry Jun 20 '11 at 18:57

3 Answers

22

The denominator, $\Pr(\textrm{data})$, is obtained by integrating the parameters out of the joint probability, $\Pr(\textrm{data}, \textrm{parameters})$. This is the marginal probability of the data and, of course, it does not depend on the parameters, since these have been integrated out.
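In symbols, writing $\theta$ for the parameters (a shorthand introduced here just for this illustration), the marginalisation reads

$$\Pr(\textrm{data}) = \int \Pr(\textrm{data} \mid \theta) \Pr(\theta) \, d\theta,$$

with the integral replaced by a sum when the parameter space is discrete.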

Now, since:

  • $\Pr(\textrm{data})$ does not depend on the parameters about which one wants to make inference;
  • $\Pr(\textrm{data})$ is generally difficult to calculate in closed form;

one often uses the following adaptation of Bayes' formula:

$\Pr(\textrm{parameters} \mid \textrm{data}) \propto \Pr(\textrm{data} \mid \textrm{parameters}) \Pr(\textrm{parameters})$

Basically, $\Pr(\textrm{data})$ is nothing but a "normalising constant", i.e., a constant that makes the posterior density integrate to one.
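To see the "integrates to one" point numerically, here is a minimal sketch (my own illustration with arbitrary numbers, not part of the answer): a Beta prior on a coin's bias $\theta$ with binomial data.

```python
from scipy import integrate, stats

# Arbitrary assumptions for illustration: Beta(2, 2) prior on the
# coin's bias theta, and data = 7 heads out of 10 flips.
heads, flips = 7, 10
prior = stats.beta(2, 2)

def unnormalised_posterior(theta):
    # Pr(data | theta) * Pr(theta): binomial likelihood times Beta prior
    return stats.binom.pmf(heads, flips, theta) * prior.pdf(theta)

# Pr(data): integrate the parameter out of the joint density.
# The result is a single number -- no theta left in it.
pr_data, _ = integrate.quad(unnormalised_posterior, 0.0, 1.0)

# Dividing by Pr(data) makes the posterior density integrate to one.
posterior_mass, _ = integrate.quad(
    lambda t: unnormalised_posterior(t) / pr_data, 0.0, 1.0
)
print(pr_data)         # the normalising constant
print(posterior_mass)  # ~1.0
```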

ocram
  • 21,851
  • 2
    @nbro: I mean Pr(data) = integral over the parameters of Pr(data, parameters) – ocram Mar 08 '18 at 06:24
  • What do you mean by 'P(data) is generally difficult to calculate in a closed-form'? – unicorn Aug 18 '20 at 07:56
  • @unicorn: To calculate P(data), one has to integrate P(data, parameters) over the parameters. This task is generally difficult. – ocram Aug 19 '20 at 05:31
2

When applying Bayes' rule, we usually wish to infer the "parameters", and the "data" are already given. Thus, $\Pr(\textrm{data})$ is a constant with respect to the parameters, and we can treat it as just a normalizing factor.

Harsh
  • 343
1

Most explanations of Bayes miss the mark. Consider the following for the role of $\Pr(B)$.

The crux of Bayes is the "update factor" $\Pr(B \mid A) / \Pr(B)$. This is the transformation applied to the prior.

If B occurs in all states of the world, observing it carries no information content and the update factor is 1.
In this case, $\Pr(A \mid B) = \Pr(A)$.

However, if B occurs frequently when A has occurred, but the overall probability of B occurring is very low, then there is high information content with respect to $\Pr(A)$.
The update factor will be HIGH and so $\Pr(A \mid B) \gg \Pr(A)$.

For completeness, if B occurs rarely when A has occurred, but the overall probability of B occurring is very high, then there is also information content with respect to $\Pr(A)$, but in the opposite direction.
The update factor will be LOW and so $\Pr(A \mid B) \ll \Pr(A)$.

Purely mechanical explanations of Bayes seem to miss the genius of this simple equation.
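As a quick numeric sketch of the three regimes (the probabilities below are my own arbitrary, jointly consistent assumptions, not taken from this answer):

```python
# Update factor Pr(B | A) / Pr(B) in the three regimes described above.
pr_A = 0.05  # an arbitrary prior Pr(A)

cases = {
    "no information":   (0.5, 0.5),  # Pr(B|A) == Pr(B)   -> factor 1
    "strong evidence":  (0.9, 0.1),  # B common given A, rare overall
    "counter-evidence": (0.1, 0.9),  # B rare given A, common overall
}

for label, (pr_B_given_A, pr_B) in cases.items():
    factor = pr_B_given_A / pr_B   # the update factor
    pr_A_given_B = factor * pr_A   # Bayes' rule: prior times update factor
    print(f"{label}: factor {factor:.2f}, "
          f"posterior Pr(A|B) = {pr_A_given_B:.4f} vs prior {pr_A}")
```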