
Let us recall Bayes' theorem, where $\mathcal{D}$ is the data and $\theta$ is our parameter(s) of interest.

$ \operatorname{p}(\theta|\mathcal{D}) = \frac{\operatorname{p}(\mathcal{D}, \theta)}{\operatorname{p}(\mathcal{D})} = \frac{\operatorname{p}(\mathcal{D}|\theta)\,\operatorname{p}(\theta)}{\operatorname{p}(\mathcal{D})} $

We then have:

  • The conditional probability $\operatorname{p}(\theta|\mathcal{D})$. As a conditional probability, it is only a function of $\theta$, and $\mathcal{D}$ is assumed to be given.
  • The joint probability $\operatorname{p}(\mathcal{D}, \theta)$. As a joint probability, it is a function of two random variables, thus $\mathcal{D}$ is not given, but random.
  • The marginal probability $\operatorname{p}(\mathcal{D})$. As a marginal probability, it is a function of a random variable, thus $\mathcal{D}$ is not given, but random.

It seems that $\mathcal{D}$ has a dual role. On the left-hand side of the equation it is treated as a constant, whereas on the right-hand side it is a (vector of) random variable(s). Is the posterior a function of $\theta$ and $\mathcal{D}$ (as on the right-hand side), or is the posterior just a function of $\theta$ (as on the left-hand side)?
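To make the dual role concrete, here is a minimal sketch (in Python, using a hypothetical Beta-Binomial model that is not part of the question itself): the same posterior density can be read either as a function of both $\theta$ and the data, or as a function of $\theta$ alone once a realisation of the data is plugged in.

```python
import numpy as np
from scipy import stats

# Hypothetical model (illustration only): theta ~ Beta(2, 2), D | theta ~ Binomial(n=10, theta)
n, a, b = 10, 2.0, 2.0

def posterior_density(theta, d):
    """p(theta | D = d): conjugate Beta(a + d, b + n - d) posterior.
    Read as a function of two arguments, its value changes with both theta and d."""
    return stats.beta.pdf(theta, a + d, b + n - d)

# Once a realisation d = 7 has been observed, the posterior is a function of theta only.
d_obs = 7
posterior_given_obs = lambda theta: posterior_density(theta, d_obs)

theta_grid = np.linspace(0.1, 0.9, 5)
print(posterior_given_obs(theta_grid))                        # varies with theta
print(posterior_density(0.5, 3), posterior_density(0.5, 7))   # varies with d as well
```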


Auxiliary Questions

  1. I have omitted the case of the likelihood $\operatorname{p}(\mathcal{D}|\theta)$, but maybe it is at the core of what is happening. The likelihood is a function of $\theta$ once the data have been observed (https://stats.stackexchange.com/a/138707/180158). However, it is also used as a probability density function of the data given $\theta$; otherwise, we would not use the chain rule to factorize the joint distribution $\operatorname{p}(\mathcal{D}, \theta)$ into the conditional and marginal distributions $\operatorname{p}(\mathcal{D}|\theta)\operatorname{p}(\theta)$. Does this dual behavior depend on the presence of the other terms in Bayes' theorem?

  2. In practice we tend to work only with the unnormalized (log) posterior (e.g. in Stan): $\operatorname{p}(\theta|\mathcal{D}) \propto \operatorname{p}(\theta, \mathcal{D})$. Does the use of the unnormalized (log) posterior change anything regarding our terminology or our assumptions about what $\mathcal{D}$ is?
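For the second question, here is a minimal sketch of the kind of quantity a sampler such as Stan evaluates: the unnormalized log posterior $\log \operatorname{p}(\theta) + \log \operatorname{p}(\mathcal{D}|\theta)$, in which the observed data enter only as fixed numbers and the constant $\log \operatorname{p}(\mathcal{D})$ is never computed. (The normal-mean model below is a hypothetical example, not taken from the question.)

```python
import numpy as np
from scipy import stats

# Hypothetical model (illustration only): theta ~ Normal(0, 5), y_i | theta ~ Normal(theta, 1)
y_obs = np.array([1.2, 0.7, 2.1, 1.5])   # the observed realisation of D, treated as fixed

def log_unnormalised_posterior(theta):
    """log p(theta) + log p(D = y_obs | theta); equals the log posterior up to
    the additive constant -log p(y_obs), which never needs to be evaluated."""
    log_prior = stats.norm.logpdf(theta, loc=0.0, scale=5.0)
    log_lik = stats.norm.logpdf(y_obs, loc=theta, scale=1.0).sum()
    return log_prior + log_lik

print(log_unnormalised_posterior(1.0))
print(log_unnormalised_posterior(2.0))
```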

Kuku
  • Is the issue then about the notation abuse that conflates $\operatorname{p}(\theta | \mathcal{D})$ with $\operatorname{p}(\theta | \mathcal{D} = d)$? i.e., the former denotes a family of distributions of $\theta$, indexed by the data, whereas the latter denotes one specific distribution obtained by plugging in the observed value $d$? In that case, the $\mathcal{D}$ on the left-hand side is also a random variable and there is no conflict? – Kuku Aug 03 '22 at 14:41
  • Likewise, and related to the first auxiliary question: if the data must be random and not given, why call $\operatorname{p}(\mathcal{D}|\theta)$ a likelihood, if it is not a function of $\theta$ given some observed data? Does the data become fixed when we apply the theorem to a given problem and sample? But then, as you say, the conditional distribution would be undefined. – Kuku Aug 03 '22 at 14:50

1 Answer


In the beginning, that is, at the Bayesian modelling stage, there are two random entities, $\mathcal D$ and $\theta$, with a joint distribution with density $p(\theta,\mathcal D)$. Then comes the observation of the data, which is a realisation $d$ of the random variable $\mathcal D$. The random variable $\theta$ remains random since it is not observed, but the realisation $d$ brings some degree of information about $\theta$, which is why one considers its distribution conditional on the realisation $d$ of $\mathcal D$, with density $p(\theta|\mathcal D=d)$.

Note that in standard statistical modelling, the data is usually considered as the observed realisation of the random variable, rather than as a random variable. To quote Wikipedia,

It is assumed that there is a "true" probability distribution induced by the process that generates the observed data.

Now, when considering Bayes' theorem, $$\operatorname{p}(\theta|\mathcal{D}) = \frac{\operatorname{p}(\mathcal{D}, \theta)}{\operatorname{p}(\mathcal{D})} = \frac{\operatorname{p}(\mathcal{D}|\theta)\,\operatorname{p}(\theta)}{\operatorname{p}(\mathcal{D})}$$ it is a mere mathematical (functional) identity linking the five density functions involved in it, and is better written $$\operatorname{p}(\theta|d) = \frac{\operatorname{p}(d, \theta)}{\operatorname{p}(d)} = \frac{\operatorname{p}(d|\theta)\,\operatorname{p}(\theta)}{\operatorname{p}(d)}\qquad\forall\,\theta,d$$ as it holds for all possible entries $\theta,d$.

From a mathematical perspective, $\operatorname{p}(\theta|d)$ is a function of both $\theta$ and $d$, since changing the value of $\theta$ or the value of $d$ modifies (in general) the value of $\operatorname{p}(\theta|d)$. The same remark applies to the likelihood function. What may be confusing to the OP is the use of the notation $d$ in this paragraph as a generic data realisation, varying in a sample space, $\mathfrak D$ say, as opposed to the actual observed data realisation, also denoted $d$ in the first paragraph. This is why the observed data is sometimes denoted otherwise, $d^o$ for instance. With such notations, $\operatorname{p}(\theta|d)$ is a function of both $\theta$ and $d$, while $\operatorname{p}(\theta|d^o)$ is a function of $\theta$ only.
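In programming terms (a Python sketch added for illustration, not part of the original answer), the distinction between $\operatorname{p}(\theta|d)$ and $\operatorname{p}(\theta|d^o)$ is essentially partial application: fixing the observed realisation $d^o$ turns a two-argument function into a one-argument function of $\theta$. The flat-prior normal model used here is a hypothetical example.

```python
from functools import partial
from scipy import stats

# Hypothetical example: flat prior on theta, single observation d | theta ~ Normal(theta, 1),
# so p(theta | d) is the Normal(d, 1) density evaluated at theta.
def posterior(theta, d):
    return stats.norm.pdf(theta, loc=d, scale=1.0)

# p(theta | d): a function of both arguments
print(posterior(0.3, 1.0), posterior(0.3, 2.0))   # changes with d

# p(theta | d^o): fixing the observed realisation leaves a function of theta only
d_o = 1.4
posterior_at_obs = partial(posterior, d=d_o)
print(posterior_at_obs(0.3), posterior_at_obs(0.8))
```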

Xi'an
  • Thank you very much for your response. I am still confused about a couple of things. First, the statement that $\operatorname{p}(\theta|d)$ is a function of both $\theta$ and $d$. Let us define a function $f$ such that $f(x) = 2x$. We could replace the number 2 by a variable such as $C$. Then we have $f(x|C=2) = 2x$. Isn't this just a function of $x$? It is true that changing $c$ would change the value of $f(x|C=c)$, but isn't that just because $f$ such that $f(x|C=3) = 3x$ is a different function from $f$ such that $f(x|C=2) = 2x$ (even if they are part of the same family indexed by $C$)? – Kuku Aug 04 '22 at 09:12
  • For example, this answer clearly states a conditional probability as a function of one argument: https://math.stackexchange.com/a/3296156/812938. An answer from you some years ago touches on the same subject (https://stats.stackexchange.com/a/373478/180158), but stating that what we condition on becomes random "for instance in a Bayesian analysis", suggesting that there might be a different function signature for conditional probability depending on the framework? – Kuku Aug 04 '22 at 09:19
  • A small follow-up question: what is the taxonomy of the stages mentioned in this answer? Is it dividing the analysis process into a first stage of modeling followed by a stage of inference? Are there other stages relevant to the transformations or realizations that $\mathcal{D}$ is subjected to during a (Bayesian) statistical analysis? – Kuku Aug 04 '22 at 09:24
  • Most likely it is an ignorance of mathematical notation convention on my side here, but what I feel is missing is a formal definition of the conditional probability that clearly shows the mapping from a family of distributions to one distribution in particular. e.g., being aware of the distinction between $d$ and $d^o$, could we state that $p(\theta | d)$ is a function $f$ from $\mathbb{R}^2 \rightarrow \mathcal{F}$, where $\mathcal{F}$ is the set of distributions indexed by $d$, such that $f(\theta, d) = \operatorname{p}_{d}(\theta)$? – Kuku Aug 04 '22 at 09:41
  • In this manner, we make explicit that $f$ is a function of two arguments $\theta$ and $d$, whereas the implied probability distribution upon observation, $\operatorname{p}_{d^o}(\theta)$, is a function of $\theta$ only. – Kuku Aug 04 '22 at 09:43