23

I read that in Bayes rule, the denominator $\Pr(\textrm{data})$ of

$$\Pr(\text{parameters} \mid \text{data}) = \frac{\Pr(\textrm{data} \mid \textrm{parameters}) \Pr(\text{parameters})}{\Pr(\text{data})}$$

is called a normalizing constant. What exactly is it? What is its purpose? Why does it look like $\Pr(\textrm{data})$? Why doesn't it depend on the parameters?

amateur
  • 384
  • 6
When you integrate $f(\text{data} \mid \text{params})f(\text{params})$, you are integrating over the parameters and so the result has no term depending on the parameters, in the same way that $\int_{x=0}^{x=2}xy\;dx = 2y$ does not depend on $x$. – Henry Jun 20 '11 at 18:57

3 Answers

22

The denominator, $\Pr(\textrm{data})$, is obtained by integrating the parameters out of the joint probability, $\Pr(\textrm{data}, \textrm{parameters})$. This is the marginal probability of the data and, of course, it does not depend on the parameters, since these have been integrated out.
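In symbols, writing $\theta$ for the parameters (a shorthand introduced here just for this illustration), the marginalisation reads

$$\Pr(\textrm{data}) = \int \Pr(\textrm{data} \mid \theta) \Pr(\theta) \, d\theta,$$

with the integral replaced by a sum when the parameter space is discrete.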

Now, since:

  • $\Pr(\textrm{data})$ does not depend on the parameters about which one wants to make inference;
  • $\Pr(\textrm{data})$ is generally difficult to calculate in closed form;

one often uses the following adaptation of Bayes' formula:

$\Pr(\textrm{parameters} \mid \textrm{data}) \propto \Pr(\textrm{data} \mid \textrm{parameters}) \Pr(\textrm{parameters})$

Basically, $\Pr(\textrm{data})$ is nothing but a "normalising constant", i.e., a constant that makes the posterior density integrate to one.
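To see the "integrates to one" point numerically, here is a minimal sketch (my own illustration with arbitrary numbers, not part of the answer): a Beta prior on a coin's bias $\theta$ with binomial data.

```python
from scipy import integrate, stats

# Arbitrary assumptions for illustration: Beta(2, 2) prior on the
# coin's bias theta, and data = 7 heads out of 10 flips.
heads, flips = 7, 10
prior = stats.beta(2, 2)

def unnormalised_posterior(theta):
    # Pr(data | theta) * Pr(theta): binomial likelihood times Beta prior
    return stats.binom.pmf(heads, flips, theta) * prior.pdf(theta)

# Pr(data): integrate the parameter out of the joint density.
# The result is a single number -- no theta left in it.
pr_data, _ = integrate.quad(unnormalised_posterior, 0.0, 1.0)

# Dividing by Pr(data) makes the posterior density integrate to one.
posterior_mass, _ = integrate.quad(
    lambda t: unnormalised_posterior(t) / pr_data, 0.0, 1.0
)
print(pr_data)         # the normalising constant
print(posterior_mass)  # ~1.0
```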

ocram
  • 21,851
  • 2
    @nbro: I mean Pr(data) = integral over the parameters of Pr(data, parameters) – ocram Mar 08 '18 at 06:24
  • What do you mean by 'P(data) is generally difficult to calculate in a closed-form'? – unicorn Aug 18 '20 at 07:56
  • @unicorn: To calculate P(data), one has to integrate P(data, parameters) over the parameters. This task is generally difficult. – ocram Aug 19 '20 at 05:31
2

When applying Bayes' rule, we usually wish to infer the "parameters", and the "data" are already given. Thus, $\Pr(\textrm{data})$ is a constant with respect to the parameters, and we can treat it as just a normalizing factor.

Harsh
  • 343
1

Most explanations of Bayes miss the mark. Consider the following for the role of $\Pr(B)$.

The crux of Bayes is the "update factor" $\Pr(B \mid A) / \Pr(B)$. This is the transformation applied to the prior.

If B occurs in all states of the world, observing it carries no information content and the update factor is 1.
In this case, $\Pr(A \mid B) = \Pr(A)$.

However, if B occurs frequently when A has occurred, but the overall probability of B occurring is very low, then there is high information content with respect to $\Pr(A)$.
The update factor will be HIGH and so $\Pr(A \mid B) \gg \Pr(A)$.

For completeness, if B occurs rarely when A has occurred, but the overall probability of B occurring is very high, then there is also information content with respect to $\Pr(A)$, but in the opposite direction.
The update factor will be LOW and so $\Pr(A \mid B) \ll \Pr(A)$.

Purely mechanical explanations of Bayes seem to miss the genius of this simple equation.
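As a quick numeric sketch of the three regimes (the probabilities below are my own arbitrary, jointly consistent assumptions, not taken from this answer):

```python
# Update factor Pr(B | A) / Pr(B) in the three regimes described above.
pr_A = 0.05  # an arbitrary prior Pr(A)

cases = {
    "no information":   (0.5, 0.5),  # Pr(B|A) == Pr(B)   -> factor 1
    "strong evidence":  (0.9, 0.1),  # B common given A, rare overall
    "counter-evidence": (0.1, 0.9),  # B rare given A, common overall
}

for label, (pr_B_given_A, pr_B) in cases.items():
    factor = pr_B_given_A / pr_B   # the update factor
    pr_A_given_B = factor * pr_A   # Bayes' rule: prior times update factor
    print(f"{label}: factor {factor:.2f}, "
          f"posterior Pr(A|B) = {pr_A_given_B:.4f} vs prior {pr_A}")
```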