26

I know that priors need not be proper and that the likelihood function does not integrate to 1 either. But does the posterior need to be a proper distribution? What are the implications if it is/is not?

ATJ
  • 1,861

6 Answers

23

(It is somewhat of a surprise to read the previous answers, which focus on the potential impropriety of the posterior when the prior is proper, since, as far as I can tell, the question is whether or not the posterior has to be proper (i.e., integrable to one) to be a proper (i.e., acceptable for Bayesian inference) posterior.)

In Bayesian statistics, the posterior distribution has to be a probability distribution, from which one can derive moments like the posterior mean $\mathbb{E}^\pi[h(\theta)|x]$ and probability statements like the coverage of a credible region, $\mathbb{P}(\pi(\theta|x)>\kappa|x)$. If $$\int f(x|\theta)\,\pi(\theta)\,\text{d}\theta = +\infty\,,\qquad (1)$$ the posterior $\pi(\theta|x)$ cannot be normalised into a probability density and Bayesian inference simply cannot be conducted. The posterior simply does not exist in such cases.

Actually, the integral in (1) must be finite for all $x$'s in the sample space, not only for the observed $x$, for otherwise selecting the prior would depend on the data. This means that priors like Haldane's prior, $\pi(p)\propto 1/\{p(1-p)\}$, on the probability $p$ of a Binomial or a Negative Binomial variable $X$ cannot be used, since the posterior is not defined for $x=0$.
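To make the Haldane example concrete, here is a minimal symbolic sketch (the sample size $n=5$ is an arbitrary choice for illustration) checking that the posterior normalising constant under Haldane's prior is finite for an interior observation but diverges at $x=0$:

```python
import sympy as sp

p = sp.symbols('p', positive=True)
n = 5  # arbitrary Binomial sample size, chosen only for illustration

def haldane_normaliser(x):
    """Normalising constant of the posterior kernel p^(x-1) (1-p)^(n-x-1),
    obtained from a Binomial(n, p) likelihood and Haldane's prior."""
    return sp.integrate(p**(x - 1) * (1 - p)**(n - x - 1), (p, 0, 1))

print(haldane_normaliser(2))  # 1/12, i.e. Beta(2, 3): the posterior is proper
print(haldane_normaliser(0))  # oo: the posterior does not exist at x = 0
```

For $0<x<n$ the integral is the Beta function $B(x,\,n-x)$, while at $x=0$ the kernel behaves like $1/p$ near $0$ and the integral diverges.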

I know of one exception when one can consider "improper posteriors": it is found in "The Art of Data Augmentation" by David van Dyk and Xiao-Li Meng. The improper measure is over a so-called working parameter $\alpha$ such that the observation is produced by the marginal of an augmented distribution $$f(x|\theta)=\int_{T(x^\text{aug})=x} f(x^\text{aug}|\theta,\alpha)\,\text{d}x^\text{aug}$$ and van Dyk and Meng put an improper prior $p(\alpha)$ on this working parameter $\alpha$ in order to speed up the simulation of $\pi(\theta|x)$ (which remains well-defined as a probability density) by MCMC.

From another perspective, somewhat related to the answer by eretmochelys, namely that of Bayesian decision theory, a setting where (1) occurs could still be acceptable if it led to optimal decisions. Namely, if $L(\delta,\theta)\ge 0$ is a loss function evaluating the impact of using the decision $\delta$, a Bayesian optimal decision under the prior $\pi$ is given by $$\delta^\star(x)=\arg\min_\delta \int L(\delta,\theta)\, f(x|\theta)\,\pi(\theta)\,\text{d}\theta$$ and all that matters is that this integral is not everywhere (in $\delta$) infinite. Whether or not (1) holds is secondary for the derivation of $\delta^\star(x)$, even though properties like admissibility are only guaranteed when the integral in (1) is finite.
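As a numerical sketch of this decision-theoretic point (the model, prior and loss below are all made up for illustration): take $f(x|\theta)=\mathcal{N}(\theta,1)$ and the improper prior $\pi(\theta)\propto e^{\theta^2/2}$, for which the integral in (1) diverges for every $x$; yet with the weighted loss $L(\delta,\theta)=(\delta-\theta)^2 e^{-\theta^2}$ the integrated loss is finite in $\delta$ and has the unique minimiser $\delta^\star(x)=x/2$:

```python
import numpy as np
from scipy import integrate, optimize

x = 1.3  # an arbitrary observation

def integrated_loss(delta):
    # L(delta, theta) * f(x | theta) * pi(theta); the combined exponent
    # is -theta^2 + x*theta - x^2/2, so the integral over theta is finite
    # even though f(x | theta) * pi(theta) alone is not integrable.
    f = lambda t: (delta - t)**2 * np.exp(-t**2 - (x - t)**2 / 2 + t**2 / 2)
    return integrate.quad(f, -np.inf, np.inf)[0]

delta_star = optimize.minimize_scalar(integrated_loss).x
print(delta_star)  # ≈ x / 2 = 0.65
```

The closed form follows because the product of the three factors is, up to a constant, a Gaussian weight in $\theta$ centred at $x/2$, and the quadratic loss is minimised at the mean of that weight.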

Xi'an
  • 105,342
  • What if $\int \pi(\theta)\,\text{d}\theta = +\infty$ and $\int f(x|\theta)\,\pi(\theta)\,\text{d}\theta = +\infty$, but $\frac{\int \theta\, f(x|\theta)\,\pi(\theta)\,\text{d}\theta}{\int f(x|\theta)\,\pi(\theta)\,\text{d}\theta} < +\infty$??? –  Feb 02 '20 at 21:25
  • Something like a Feynman path integral maybe. –  Feb 03 '20 at 20:19
  • I finally answered the question below. Please tell me whether you accept my counterexample or not. BR –  Feb 04 '20 at 08:36
  • In fact, I'm struggling to interpret my own counterexample because it seems the situation is different a priori and a posteriori!? Indeed, suppose $\mathbf{x}$ has the same kind of prior distribution. By the same reasoning, we would conclude that in any case it has a prior expectation. But this is NOT true. On the contrary, a priori we would rather say that it has NO expectation at all if, for instance, the covariance matrix is a singular matrix obtained from a finite difference scheme without (Dirichlet, Neumann) boundary conditions on the derivatives of the function $x(t)$. –  Feb 04 '20 at 09:09
  • We want to constrain the derivatives only, not the function itself. So, a priori we say that $\mathbf{x}$ has no expectation at all, but a posteriori we want to estimate it by computing its posterior expectation as described below! So it is difficult to interpret, but that's how it works, definitely: were the prior covariance matrix positive definite, $\mathbf{x}$ would have zero prior expectation, and this would be undesirable from the regularization point of view. But it is singular and everything works perfectly. Yet it happily admits a well-defined posterior expectation! Weird. –  Feb 04 '20 at 09:16
  • So, will you finally acknowledge or not that QM/QFT à la Feynman provides the most striking and well-known (counter)examples of the fact that a posterior does not need to be normalized at all in order to yield meaningful and useful posterior moments??? See my answer below for details, whose purpose was just to provide another, more elementary and useful counterexample. –  Feb 05 '20 at 11:02
  • Thanks Prof. For sure, all that would deserve extensive discussions at the blackboard. Let me try another way, please: stating that a posterior has to be proper in order to be proper is EXACTLY the same as stating that an underdetermined system of linear equations has no solutions. That's not true: the system actually has infinitely many solutions. In Bayesian nonparametrics, improper-proper posteriors precisely arise from underdetermined nonparametric models, typically additive models like $h(t) = f(t) + g(t)$. –  Feb 06 '20 at 08:34
  • Those models are underdetermined because the functions $f(t)$ and $g(t)$ are determined only up to additive constants. It follows that the joint posterior for the parameters $(f(t_1),\dots,f(t_n),g(t_1),\dots,g(t_n))$ to be estimated is improper, for instance a degenerate multivariate Gaussian with a positive semi-definite covariance matrix. –  Feb 06 '20 at 08:40
  • But now, if you compute its posterior expectation as described below, given by its Moore-Penrose pseudoinverse, you will just get a particular solution/estimation of the functions $f(t)$ and $g(t)$, from which you can get all solutions upon request. But you typically don't care about those estimations because your goal was just to estimate the function $h(t)$. And it works perfectly. So if you ever want to deal with underdetermined, additive nonparametric models, you have to acknowledge that improper posteriors nevertheless have proper expectations... –  Feb 06 '20 at 08:48
  • ... that just give you one particular solution/estimation among infinitely many of them, just like the Moore-Penrose inverse gives you one particular solution of an underdetermined system $\mathbf{A}x = b$. Is it more understandable now, please? Improper-proper posteriors naturally arise from underdetermined models. –  Feb 06 '20 at 08:50
  • As far as I can understand, in QM/QFT, improper-proper distributions arise for another reason: path integrals are infinite because they are infinite-dimensional, functional (e.g. Gaussian) integrals. But their ratios are NOT. –  Feb 06 '20 at 09:02
  • Perhaps I should better not have spent the last 20 years studying and applying probability theory alone in industry. Improper-proper posteriors are just a starting point, from which there is a very, very deep algebraic theory of Bayesian nonparametric regularization to develop. Everything depends on the algebraic properties of the matrix pencil $(\mathbf{D},\mathbf{R})$, where $\mathbf{D}$ is the data matrix and $\mathbf{R}$ is the regularization matrix. Three cases:... –  Feb 06 '20 at 09:23
  • 1) $\mathbf{D}$ or $\mathbf{R}$ is positive definite: easy. 2) Neither of them is positive definite but the matrix pencil $(\mathbf{D},\mathbf{R})$ is regular: solved. 3) The matrix pencil $(\mathbf{D},\mathbf{R})$ is singular: hardcore but very exciting... –  Feb 06 '20 at 09:25
  • Please forget what I said about the improper-proper priors, it's all the same: they also have prior expectations and moments, but these are just those of one particular point (in an affine subspace of dimension $n$ if we constrain the $n$-th order derivative). You may like to use another name for those improper-proper moments, e.g. generalized or pseudo-moments, but I'm not aware of any standard terminology. –  Feb 06 '20 at 17:08
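The Moore-Penrose point made in the comments above can be sketched with a toy underdetermined system (the $1\times 2$ matrix below is made up purely for illustration):

```python
import numpy as np

# One equation, two unknowns: x1 + x2 = 2 has infinitely many solutions.
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

# The pseudoinverse picks out one particular solution among them:
# the minimum-Euclidean-norm one, here (1, 1).
x_particular = np.linalg.pinv(A) @ b
print(x_particular)  # [1. 1.]

# Every other solution differs by an element of the null space of A:
n_dir = np.array([1.0, -1.0])
print(A @ (x_particular + 3.7 * n_dir))  # still [2.]
```

This mirrors the claim in the thread: the system does not lack solutions, and the pseudoinverse simply selects one representative, from which all others can be recovered by adding null-space directions.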