
I am trying to find a measure-theoretic formulation of Bayes' theorem. When used in statistical inference, Bayes' theorem is usually stated as:

$$p\left(\theta|x\right) = \frac{p\left(x|\theta\right) \cdot p\left(\theta\right)}{p\left(x\right)}$$

where:

  • $p\left(\theta|x\right)$: the posterior density of the parameter.
  • $p\left(x|\theta\right)$: the statistical model (or likelihood).
  • $p\left(\theta\right)$: the prior density of the parameter.
  • $p\left(x\right)$: the evidence.
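
To make sure I understand the roles of these four densities, here is a small numerical check I wrote (a sketch using a toy Beta-Bernoulli model of my own choosing, evaluated on a grid; by conjugacy the posterior should come out as a Beta density):

```python
# Check p(theta | x) = p(x | theta) * p(theta) / p(x) on a grid.
# Toy model: theta ~ Beta(2, 2) prior, x = 7 successes in 10 Bernoulli
# trials; by conjugacy the posterior should be Beta(2 + 7, 2 + 3).
import numpy as np
from scipy.stats import beta

theta = np.linspace(0.001, 0.999, 2000)        # grid over the parameter space
dtheta = theta[1] - theta[0]

prior = beta.pdf(theta, 2, 2)                  # p(theta)
k, n = 7, 10
likelihood = theta**k * (1 - theta)**(n - k)   # p(x | theta)

evidence = np.sum(likelihood * prior) * dtheta # p(x), as a Riemann sum
posterior = likelihood * prior / evidence      # p(theta | x)

assert np.allclose(posterior, beta.pdf(theta, 2 + k, 2 + n - k), atol=1e-3)
```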

Now, how would we state Bayes' theorem in a measure-theoretic way?
I started by defining a probability space:

$$\left(\Theta, \mathcal{F}_\Theta, \mathbb{P}_\Theta\right)$$

such that $\theta \in \Theta$.
I then defined another probability space:

$$\left(X, \mathcal{F}_X, \mathbb{P}_X\right)$$

such that $x \in X$.
From here on I don't know what to do. The joint probability space would be:

$$\left(\Theta \times X, \mathcal{F}_\Theta \otimes \mathcal{F}_X, ?\right)$$

but I don't know what the measure should be.
Bayes' theorem should be written as follows:

$$? = \frac{? \cdot \mathbb{P}_\Theta}{\mathbb{P}_X}$$

where:

$$\mathbb{P}_X = \int_{\theta \in \Theta} ? \, \mathrm{d}\mathbb{P}_\Theta$$

but, as you can see, I don't know the other measures or in which probability spaces they reside.
I stumbled upon this thread, but it was of little help, and I don't know how the following measure-theoretic generalization of Bayes' rule was reached:

$$P_{\Theta \mid y}(A) = \int_{x \in A} \frac{\mathrm{d}P_{\Omega \mid x}}{\mathrm{d}P_\Omega}(y) \, \mathrm{d}P_\Theta$$

I'm self-studying measure-theoretic probability and lack guidance, so please excuse my ignorance.

  • Bayes' theorem is not about "prior", "posterior", "likelihood", and "evidence"; it is about marginal and conditional probabilities. Later research mapped the theorem onto the concepts you mention. – Alecos Papadopoulos Jan 10 '20 at 11:09

1 Answer


One precise formulation of Bayes' theorem is the following, taken verbatim from Schervish's *Theory of Statistics* (1995).

The conditional distribution of $\Theta$ given $X=x$ is called the posterior distribution of $\Theta$. The next theorem shows us how to calculate the posterior distribution of a parameter in the case in which there is a measure $\nu$ such that each $P_\theta \ll \nu$.

Theorem 1.31 (Bayes' theorem). Suppose that $X$ has a parametric family $\mathcal{P}_0$ of distributions with parameter space $\Omega$. Suppose that $P_\theta \ll \nu$ for all $\theta \in \Omega$, and let $f_{X\mid\Theta}(x\mid\theta)$ be the conditional density (with respect to $\nu$) of $X$ given $\Theta = \theta$. Let $\mu_\Theta$ be the prior distribution of $\Theta$. Let $\mu_{\Theta\mid X}(\cdot \mid x)$ denote the conditional distribution of $\Theta$ given $X = x$. Then $\mu_{\Theta\mid X} \ll \mu_\Theta$, a.s. with respect to the marginal of $X$, and the Radon-Nikodym derivative is $$ \tag{1} \label{1} \frac{d\mu_{\Theta\mid X}}{d\mu_\Theta}(\theta \mid x) = \frac{f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)} $$ for those $x$ such that the denominator is neither $0$ nor infinite. The prior predictive probability of the set of $x$ values such that the denominator is $0$ or infinite is $0$, hence the posterior can be defined arbitrarily for such $x$ values.
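
Before unpacking the setup behind this theorem, here is a minimal numerical illustration of what \eqref{1} says (a sketch with a two-point prior of my own choosing, not from Schervish; such a prior has no Lebesgue density, yet the theorem applies because the posterior is differentiated against the prior itself):

```python
# Two-point prior on Omega = {0.3, 0.8}, i.i.d. Bernoulli data; nu is
# counting measure on {0, 1}^4 and f_{X|Theta} is the Bernoulli likelihood.
import numpy as np

thetas = np.array([0.3, 0.8])          # support of the prior mu_Theta
prior = np.array([0.5, 0.5])           # prior weights

x = (1, 1, 0, 1)                       # observed sample

def f(x, t):                           # f_{X|Theta}(x | t), density w.r.t. nu
    return np.prod([t if xi == 1 else 1 - t for xi in x])

lik = np.array([f(x, t) for t in thetas])
evidence = np.sum(lik * prior)         # integral of f_{X|Theta}(x|.) d mu_Theta
posterior = lik * prior / evidence     # posterior weights mu_{Theta|X}({.} | x)

# (1): the Radon-Nikodym derivative d mu_{Theta|X} / d mu_Theta is the
# ratio of posterior to prior weights, and equals likelihood / evidence.
assert np.allclose(posterior / prior, lik / evidence)
```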


Edit 1. The setup for this theorem is as follows:

  1. There is some underlying probability space $(S, \mathcal{S}, \Pr)$ with respect to which all probabilities are computed.
  2. There is a standard Borel space $(\mathcal{X}, \mathcal{B})$ (the sample space) and a measurable map $X : S \to \mathcal{X}$ (the sample or data).
  3. There is a standard Borel space $(\Omega, \tau)$ (the parameter space) and a measurable map $\Theta : S \to \Omega$ (the parameter).
  4. The distribution of $\Theta$ is $\mu_\Theta$ (the prior distribution); this is the probability measure on $(\Omega, \tau)$ given by $\mu_\Theta(A) = \Pr(\Theta \in A)$ for all $A \in \tau$.
  5. The distribution of $X$ is $\mu_X$ (the marginal distribution mentioned in the theorem); this is the probability measure on $(\mathcal{X}, \mathcal{B})$ given by $\mu_X(B) = \Pr(X \in B)$ for all $B \in \mathcal{B}$.
  6. There is a probability kernel $P : \Omega \times \mathcal{B} \to [0, 1]$, denoted $(\theta, B) \mapsto P_\theta(B)$, which represents the conditional distribution of $X$ given $\Theta$. This means that

    • for each $B \in \mathcal{B}$, the map $\theta \mapsto P_\theta(B)$ from $\Omega$ into $[0, 1]$ is measurable,
    • $P_\theta$ is a probability measure on $(\mathcal{X}, \mathcal{B})$ for each $\theta \in \Omega$, and
    • for all $A \in \tau$ and $B \in \mathcal{B}$, $$ \Pr(\Theta \in A, X \in B) = \int_A P_\theta(B) \, d\mu_\Theta(\theta). $$

    This is the parametric family of distributions of $X$ given $\Theta$. (A numerical sanity check of the identity in the last bullet appears after this list.)

  7. We assume that there exists a measure $\nu$ on $(\mathcal{X}, \mathcal{B})$ such that $P_\theta \ll \nu$ for all $\theta \in \Omega$, and we choose a version $f_{X\mid\Theta}(\cdot\mid\theta)$ of the Radon-Nikodym derivative $d P_\theta / d \nu$ (strictly speaking, the guaranteed existence of this Radon-Nikodym derivative might require $\nu$ to be $\sigma$-finite). This means that $$ P_\theta(B) = \int_B f_{X\mid\Theta}(x \mid \theta) \, d\nu(x) $$ for all $B \in \mathcal{B}$. It follows that $$ \Pr(\Theta \in A, X \in B) = \int_A \int_B f_{X \mid \Theta}(x \mid \theta) \, d\nu(x) \, d\mu_\Theta(\theta) $$ for all $A \in \tau$ and $B \in \mathcal{B}$. We may assume without loss of generality (e.g., see exercise 9 in Chapter 1 of Schervish's book) that the map $(x, \theta) \mapsto f_{X\mid \Theta}(x\mid\theta)$ of $\mathcal{X}\times\Omega$ into $[0, \infty]$ is measurable. Then by Tonelli's theorem we can change the order of integration: $$ \Pr(\Theta \in A, X \in B) = \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x) $$ for all $A \in \tau$ and $B \in \mathcal{B}$. In particular, the marginal probability of a set $B \in \mathcal{B}$ is $$ \mu_X(B) = \Pr(X \in B) = \int_B \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x), $$ which shows that $\mu_X \ll \nu$, with Radon-Nikodym derivative $$ \frac{d\mu_X}{d\nu} = \int_\Omega f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta). $$
  8. There exists a probability kernel $\mu_{\Theta \mid X} : \mathcal{X} \times \tau \to [0, 1]$, denoted $(x, A) \mapsto \mu_{\Theta \mid X}(A \mid x)$, which represents the conditional distribution of $\Theta$ given $X$ (i.e., the posterior distribution). This means that
    • for each $A \in \tau$, the map $x \mapsto \mu_{\Theta \mid X}(A \mid x)$ from $\mathcal{X}$ into $[0, 1]$ is measurable,
    • $\mu_{\Theta \mid X}(\cdot \mid x)$ is a probability measure on $(\Omega, \tau)$ for each $x \in \mathcal{X}$, and
    • for all $A \in \tau$ and $B \in \mathcal{B}$, $$ \Pr(\Theta \in A, X \in B) = \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x). $$
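
To see these pieces fit together, the defining identity of the kernel in item 6 can be verified by simulation for a concrete model (a sketch; the normal-normal pair, the sets $A$ and $B$, and the seed are my own choices):

```python
# Monte Carlo check of Pr(Theta in A, X in B) = ∫_A P_theta(B) d mu_Theta.
# Toy model: Theta ~ N(0, 1) and X | Theta = t ~ N(t, 1), so that
# P_theta(B) = Phi(1 - theta) for B = (-inf, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m = 10**6
theta = rng.standard_normal(m)        # draws from the prior mu_Theta
x = theta + rng.standard_normal(m)    # given Theta = t, a draw from P_t

A = theta > 0                         # A = (0, inf), a parameter event
B = x < 1                             # B = (-inf, 1), a sample event

lhs = np.mean(A & B)                      # Pr(Theta in A, X in B)
rhs = np.mean(A * norm.cdf(1 - theta))    # ∫_A P_theta(B) d mu_Theta(theta)
assert abs(lhs - rhs) < 5e-3              # equal up to Monte Carlo error
```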

Edit 2. Given the setup above, the proof of Bayes' theorem is relatively straightforward.

Proof. Following Schervish, let $$ C_0 = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = 0\right\} $$ and $$ C_\infty = \left\{x \in \mathcal{X} : \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \infty\right\} $$ (these are the sets of potentially problematic $x$ values for the denominator of the right-hand-side of \eqref{1}). We have $$ \mu_X(C_0) = \Pr(X \in C_0) = \int_{C_0} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = 0, $$ and $$ \mu_X(C_\infty) = \int_{C_\infty} \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \, d\nu(x) = \begin{cases} \infty, & \text{if $\nu(C_\infty) > 0$,} \\ 0, & \text{if $\nu(C_\infty) = 0$.} \end{cases} $$ Since $\mu_X(C_\infty) = \infty$ is impossible ($\mu_X$ is a probability measure), it follows that $\nu(C_\infty) = 0$, whence $\mu_X(C_\infty) = 0$ as well. Thus, $\mu_X(C_0 \cup C_\infty) = 0$, so the set of all $x \in \mathcal{X}$ such that the denominator of the right-hand-side of \eqref{1} is zero or infinite has zero marginal probability.

Next, consider that, if $A \in \tau$ and $B \in \mathcal{B}$, then $$ \Pr(\Theta \in A, X \in B) = \int_B \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) \, d\nu(x) $$ and simultaneously $$ \begin{aligned} \Pr(\Theta \in A, X \in B) &= \int_B \mu_{\Theta \mid X}(A \mid x) \, d\mu_X(x) \\ &= \int_B \left( \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) \right) \, d\nu(x). \end{aligned} $$ It follows that $$ \mu_{\Theta \mid X}(A \mid x) \int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t) = \int_A f_{X \mid \Theta}(x \mid \theta) \, d\mu_\Theta(\theta) $$ for all $A \in \tau$ and $\nu$-a.e. $x \in \mathcal{X}$, and hence $$ \mu_{\Theta \mid X}(A \mid x) = \int_A \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)} \, d\mu_\Theta(\theta) $$ for all $A \in \tau$ and $\mu_X$-a.e. $x \in \mathcal{X}$. Thus, for $\mu_X$-a.e. $x \in \mathcal{X}$, $\mu_{\Theta\mid X}(\cdot \mid x) \ll \mu_\Theta$, and the Radon-Nikodym derivative is $$ \frac{d\mu_{\Theta \mid X}}{d \mu_\Theta}(\theta \mid x) = \frac{f_{X \mid \Theta}(x \mid \theta)}{\int_\Omega f_{X \mid \Theta}(x \mid t) \, d\mu_\Theta(t)}, $$ as claimed, completing the proof.
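
The proof leans on the defining property of the posterior kernel from item 8; in the same normal-normal toy model as above, that property can also be checked by simulation (the conjugate posterior $N(x/2, 1/2)$ is a standard fact; the sets and seed remain arbitrary choices of mine):

```python
# Monte Carlo check of Pr(Theta in A, X in B) = ∫_B mu_{Theta|X}(A|x) d mu_X.
# For Theta ~ N(0, 1) and X | Theta = t ~ N(t, 1), the posterior is
# N(x/2, 1/2), so mu_{Theta|X}((0, inf) | x) = Phi(x / sqrt(2)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m = 10**6
theta = rng.standard_normal(m)
x = theta + rng.standard_normal(m)    # draws from the marginal mu_X

A = theta > 0
B = x < 1

lhs = np.mean(A & B)                           # Pr(Theta in A, X in B)
rhs = np.mean(B * norm.cdf(x / np.sqrt(2)))    # ∫_B mu_{Theta|X}(A|x) d mu_X(x)
assert abs(lhs - rhs) < 5e-3
```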


Lastly, how do we reconcile the colloquial version of Bayes' theorem found so commonly in statistics/machine learning literature, namely, $$ \tag{2} \label{2} p(\theta \mid x) = \frac{p(\theta) p(x \mid \theta)}{p(x)}, $$ with \eqref{1}?

On the one hand, the left-hand-side of \eqref{2} is supposed to represent a density of the conditional distribution of $\Theta$ given $X$ with respect to some unspecified dominating measure on the parameter space. In fact, none of the dominating measures for the four different densities in \eqref{2} (all named $p$) are explicitly mentioned.

On the other hand, the left-hand-side of \eqref{1} is the density of the conditional distribution of $\Theta$ given $X$ with respect to the prior distribution.

If, in addition, the prior distribution $\mu_\Theta$ has a density $f_\Theta$ with respect to some (let's say $\sigma$-finite) measure $\lambda$ on the parameter space $\Omega$, then $\mu_{\Theta \mid X}(\cdot\mid x)$ is also absolutely continuous with respect to $\lambda$ for $\mu_X$-a.e. $x \in \mathcal{X}$, and if $f_{\Theta \mid X}$ represents a version of the Radon-Nikodym derivative $d\mu_{\Theta\mid X}/d\lambda$, then \eqref{1} yields $$ \begin{aligned} f_{\Theta \mid X}(\theta \mid x) &= \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x) \\ &= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) \frac{d \mu_{\Theta}}{d\lambda}(\theta) \\ &= \frac{d \mu_{\Theta \mid X}}{d\mu_\Theta}(\theta \mid x) f_\Theta(\theta) \\ &= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)} \\ &= \frac{f_\Theta(\theta) f_{X\mid \Theta}(x\mid \theta)}{\int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t)}. \end{aligned} $$ The translation between this new form and \eqref{2} is $$ \begin{aligned} p(\theta \mid x) &= f_{\Theta \mid X}(\theta \mid x) = \frac{d \mu_{\Theta \mid X}}{d\lambda}(\theta \mid x), &&\text{(posterior)}\\ p(\theta) &= f_\Theta(\theta) = \frac{d \mu_\Theta}{d\lambda}(\theta), &&\text{(prior)} \\ p(x \mid \theta) &= f_{X\mid\Theta}(x\mid\theta) = \frac{d P_\theta}{d\nu}(x), &&\text{(likelihood)} \\ p(x) &= \int_\Omega f_\Theta(t) f_{X\mid\Theta}(x\mid t) \, d\lambda(t). &&\text{(evidence)} \end{aligned} $$
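
For a concrete check of this chain of equalities, here is the same normal-normal model once more, with $\lambda$ taken to be Lebesgue measure on $\mathbb{R}$ (a sketch; the observed $x$ and the grid are arbitrary choices of mine):

```python
# Check f_{Theta|X}(theta|x) = (d mu_{Theta|X}/d mu_Theta)(theta|x) * f_Theta(theta).
# Model: Theta ~ N(0, 1), X | Theta = t ~ N(t, 1); the marginal of X is
# N(0, 2) and the posterior is N(x/2, 1/2).
import numpy as np
from scipy.stats import norm

x = 0.7                                  # an arbitrary observed value
theta = np.linspace(-4, 4, 1001)         # grid over the parameter space

f_prior = norm.pdf(theta, 0, 1)          # f_Theta = d mu_Theta / d lambda
f_lik = norm.pdf(x, theta, 1)            # f_{X|Theta}(x | theta)
evidence = norm.pdf(x, 0, np.sqrt(2))    # ∫ f_Theta(t) f_{X|Theta}(x|t) d lambda(t)

rn = f_lik / evidence                    # d mu_{Theta|X} / d mu_Theta, from (1)
f_post = rn * f_prior                    # chain rule for Radon-Nikodym derivatives

assert np.allclose(f_post, norm.pdf(theta, x / 2, np.sqrt(0.5)))
```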

Artem Mavrin
  • Why should $\Omega$ be a Borel space instead of some other measure space? – Dave Jan 09 '20 at 22:53
  • @Dave Borel spaces are easier to work with for technical reasons, while also being fairly general. For example, conditional distributions of random variables taking values in a Borel space always exist, whereas they might not exist for random variables taking values in a non-Borel space. Fortunately, most spaces in practice are Borel spaces. For example, every Borel subset of a complete, separable metric space is a Borel space. – Artem Mavrin Jan 09 '20 at 23:11
  • Hi @ArtemMavrin, thanks for the answer. I have two questions if you don't mind: $1)$ Can we know which measurable spaces $\mu_{\Theta\mid X}$, $\mu_{X\mid \Theta}$, $\mu_{\Theta}$, and $\mu_{X}$ are measures on? $2)$ And is $t$ in $\int_\Omega f_{X\mid\Theta}(x\mid t) \, d\mu_\Theta(t)$ just a variable taking values in the parameter space $\Omega$, including $\theta$? Thanks in advance. – Blg Khalil Jan 09 '20 at 23:38
  • @BlgKhalil please see the edit for 1). Regarding 2), $t$ is an arbitrary element of the parameter space $\Omega$. With Schervish's notation, the parameter is denoted $\Theta$ instead of $\theta$ – Artem Mavrin Jan 10 '20 at 00:07
  • I just checked the edit; your answer is extremely clear and detailed and helped me a lot. Thank you very much for the time and effort, @ArtemMavrin. – Blg Khalil Jan 10 '20 at 00:14
  • Hi @ArtemMavrin, I was just wondering: can $P_\theta(B)$ be written in the form $\mu_{X \mid \Theta}(B \mid \theta)$? Thanks in advance. – Blg Khalil Jan 11 '20 at 20:45
  • @BlgKhalil yes, you could call it that if you want, and it would be more consistent with the rest of the notation. – Artem Mavrin Jan 11 '20 at 21:16
  • Thank you, I greatly appreciate your help and clarification @ArtemMavrin. – Blg Khalil Jan 11 '20 at 21:18
  • @BlgKhalil glad to help :) – Artem Mavrin Jan 11 '20 at 21:19
  • @ArtemMavrin Is the version of Tonelli's theorem you are using what is often called Fubini's theorem for transition kernels (e.g., Klenke, Theorem 14.29)? I always understood Fubini's theorem for transition kernels to mean that you cannot exchange the order of integration, since no author ever writes the exchanged order down. – guest1 Nov 28 '23 at 11:27
  • My naive justification for not being allowed to exchange the order of integration is that one measure is a marginal measure while the other is a conditional distribution. If we integrated over the marginal first, in the inner integral we would not be integrating over the part of the transition kernel that is a measurable function, right? At least the notation would suggest that. – guest1 Nov 28 '23 at 11:27
  • @guest1 I am referring to this statement of Tonelli's theorem: https://en.wikipedia.org/wiki/Fubini%27s_theorem#Tonelli's_theorem_for_non-negative_measurable_functions – Artem Mavrin Nov 28 '23 at 17:40
  • Thanks for the answer! But in your case here, there are transition kernels involved, right? Also, do you know the Fubini theorem for transition kernels? If so, can you verify (or correct me) that it is not possible to change the order of integration there? – guest1 Nov 28 '23 at 18:10
  • @guest1 There are transition kernels involved in some parts of the answer, but in the part where Tonelli's theorem is used, the integrand is a plain nonnegative measurable function integrated against two nested measures. To change the order of integration when integrating against a transition kernel, you just need to be able to "disintegrate" the joint measure in the other order, as stange points out in their comment on your question "Fubini's theorem for transition kernels". – Artem Mavrin Nov 28 '23 at 22:54
  • Thanks for your answer! But if $\nu$ is not a transition kernel but just a normal "marginal" distribution, then I don't understand the definition of the conditional density here: everywhere I have come across a conditional density, it is the Radon-Nikodym derivative between two transition kernels, not, as here, between a transition kernel and a marginal distribution. – guest1 Nov 29 '23 at 08:14
  • @ArtemMavrin So what you mean with your comment on disintegration is that one can only change the order of integration by disintegrating the joint measure, not into $\mathbb{P}_{Y\mid X}$ and $\mathbb{P}_X$ but into $\mathbb{P}_{X\mid Y}$ and $\mathbb{P}_Y$? Is this correct? But that then means that not just one but both conditional distributions must exist, right? And it also means that in fact we always have to integrate with respect to the conditional distribution first, correct? – guest1 Nov 30 '23 at 08:38
  • Yes: you would need both conditional distributions to exist, and the inner integral will always be with respect to the conditional distribution. – Artem Mavrin Dec 01 '23 at 03:39