26

I know that priors need not be proper and that the likelihood function does not integrate to 1 either. But does the posterior need to be a proper distribution? What are the implications if it is/is not?

ATJ
  • 1,861

6 Answers

23

(It is somewhat of a surprise to read the previous answers, which focus on the potential impropriety of the posterior when the prior is proper, since, as far as I can tell, the question is whether or not the posterior has to be proper (i.e., integrable to one) to be a proper (i.e., acceptable for Bayesian inference) posterior.)

In Bayesian statistics, the posterior distribution has to be a probability distribution, from which one can derive moments like the posterior mean $\mathbb{E}^\pi[h(\theta)|x]$ and probability statements like the coverage of a credible region, $\mathbb{P}(\pi(\theta|x)>\kappa|x)$. If $$\int f(x|\theta)\,\pi(\theta)\,\text{d}\theta = +\infty\,,\qquad (1)$$ the posterior $\pi(\theta|x)$ cannot be normalised into a probability density and Bayesian inference simply cannot be conducted. The posterior simply does not exist in such cases.

Actually, the integral in (1) must be finite for all $x$'s in the sample space, not only for the observed $x$, for otherwise selecting the prior would depend on the data. This means that priors like Haldane's prior, $\pi(p)\propto 1/\{p(1-p)\}$, on the probability $p$ of a Binomial or a Negative Binomial variable $X$ cannot be used, since the posterior is not defined for $x=0$.
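To make the Haldane example concrete, here is a minimal symbolic sketch (the sample size $n=5$ is an arbitrary choice for illustration) checking that the posterior normalising constant under Haldane's prior is finite for an interior observation but diverges at $x=0$:

```python
import sympy as sp

p = sp.symbols('p', positive=True)
n = 5  # arbitrary Binomial sample size, chosen only for illustration

def haldane_normaliser(x):
    """Normalising constant of the posterior kernel p^(x-1) (1-p)^(n-x-1),
    obtained from a Binomial(n, p) likelihood and Haldane's prior."""
    return sp.integrate(p**(x - 1) * (1 - p)**(n - x - 1), (p, 0, 1))

print(haldane_normaliser(2))  # 1/12, i.e. Beta(2, 3): the posterior is proper
print(haldane_normaliser(0))  # oo: the posterior does not exist at x = 0
```

For $0<x<n$ the integral is the Beta function $B(x,\,n-x)$, while at $x=0$ the kernel behaves like $1/p$ near $0$ and the integral diverges.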

I know of one exception when one can consider "improper posteriors": it is found in "The Art of Data Augmentation" by David van Dyk and Xiao-Li Meng. The improper measure is over a so-called working parameter $\alpha$ such that the observation is produced by the marginal of an augmented distribution $$f(x|\theta)=\int_{T(x^\text{aug})=x} f(x^\text{aug}|\theta,\alpha)\,\text{d}x^\text{aug}$$ and van Dyk and Meng put an improper prior $p(\alpha)$ on this working parameter $\alpha$ in order to speed up the simulation of $\pi(\theta|x)$ (which remains well-defined as a probability density) by MCMC.

From another perspective, somewhat related to the answer by eretmochelys, namely that of Bayesian decision theory, a setting where (1) occurs could still be acceptable if it led to optimal decisions. Namely, if $L(\delta,\theta)\ge 0$ is a loss function evaluating the impact of using the decision $\delta$, a Bayesian optimal decision under the prior $\pi$ is given by $$\delta^\star(x)=\arg\min_\delta \int L(\delta,\theta)\, f(x|\theta)\,\pi(\theta)\,\text{d}\theta$$ and all that matters is that this integral is not everywhere (in $\delta$) infinite. Whether or not (1) holds is secondary for the derivation of $\delta^\star(x)$, even though properties like admissibility are only guaranteed when the integral in (1) is finite.
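As a numerical sketch of this decision-theoretic point (the model, prior and loss below are all made up for illustration): take $f(x|\theta)=\mathcal{N}(\theta,1)$ and the improper prior $\pi(\theta)\propto e^{\theta^2/2}$, for which the integral in (1) diverges for every $x$; yet with the weighted loss $L(\delta,\theta)=(\delta-\theta)^2 e^{-\theta^2}$ the integrated loss is finite in $\delta$ and has the unique minimiser $\delta^\star(x)=x/2$:

```python
import numpy as np
from scipy import integrate, optimize

x = 1.3  # an arbitrary observation

def integrated_loss(delta):
    # L(delta, theta) * f(x | theta) * pi(theta); the combined exponent
    # is -theta^2 + x*theta - x^2/2, so the integral over theta is finite
    # even though f(x | theta) * pi(theta) alone is not integrable.
    f = lambda t: (delta - t)**2 * np.exp(-t**2 - (x - t)**2 / 2 + t**2 / 2)
    return integrate.quad(f, -np.inf, np.inf)[0]

delta_star = optimize.minimize_scalar(integrated_loss).x
print(delta_star)  # ≈ x / 2 = 0.65
```

The closed form follows because the product of the three factors is, up to a constant, a Gaussian weight in $\theta$ centred at $x/2$, and the quadratic loss is minimised at the mean of that weight.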

Xi'an
  • 105,342
  • What if $\int \pi(\theta)\,\text{d}\theta = +\infty$ and $\int f(x|\theta)\,\pi(\theta)\,\text{d}\theta = +\infty$, but $\frac{\int \theta\, f(x|\theta)\,\pi(\theta)\,\text{d}\theta}{\int f(x|\theta)\,\pi(\theta)\,\text{d}\theta} < +\infty$??? –  Feb 02 '20 at 21:25
  • Something like a Feynman path integral maybe. –  Feb 03 '20 at 20:19
  • I finally answered the question below. Please tell me whether you accept my counterexample or not. BR –  Feb 04 '20 at 08:36
  • In fact, I'm struggling to interpret my own counterexample because it seems the situation is different a priori and a posteriori!? Indeed, suppose $\mathbf{x}$ has the same kind of prior distribution. By the same reasoning, we would conclude that in any case it has a prior expectation. But this is NOT true. On the contrary, a priori we would rather say that it has NO expectation at all if, for instance, the covariance matrix is a singular matrix obtained from a finite difference scheme without (Dirichlet, Neumann) boundary conditions on the derivatives of the function $x(t)$. –  Feb 04 '20 at 09:09
  • We want to constrain the derivatives only, not the function itself. So, a priori we say that $\mathbf{x}$ has no expectation at all, but a posteriori we want to estimate it by computing its posterior expectation as described below! So it is difficult to interpret, but that's how it works, definitely: were the prior covariance matrix positive definite, $\mathbf{x}$ would have zero prior expectation, and this would be undesirable from the regularization point of view. But it is singular and everything works perfectly. Yet it happily admits a well-defined posterior expectation! Weird. –  Feb 04 '20 at 09:16
  • So, will you finally acknowledge or not that QM/QFT à la Feynman provides the most striking and well-known (counter)examples of the fact that a posterior does not need to be normalized at all in order to yield meaningful and useful posterior moments??? See my answer below for details, whose purpose was just to provide another, more elementary and useful counterexample. –  Feb 05 '20 at 11:02
  • Thanks Prof. For sure, all that would deserve extensive discussions at the blackboard. Let me try another way, please: stating that a posterior has to be proper in order to be proper is EXACTLY the same as stating that an underdetermined system of linear equations has no solutions. That's not true: the system actually has infinitely many solutions. In Bayesian nonparametrics, improper-proper posteriors precisely arise from underdetermined nonparametric models, typically additive models like $h(t) = f(t) + g(t)$. –  Feb 06 '20 at 08:34
  • Those models are underdetermined because the functions $f(t)$ and $g(t)$ are determined only up to additive constants. It follows that the joint posterior for the parameters $(f(t_1),\dots,f(t_n),g(t_1),\dots,g(t_n))$ to be estimated is improper, for instance a degenerate multivariate Gaussian with a positive semi-definite covariance matrix. –  Feb 06 '20 at 08:40
  • But now, if you compute its posterior expectation as described below, given by its Moore-Penrose pseudoinverse, you will just get a particular solution/estimation of the functions $f(t)$ and $g(t)$, from which you can get all solutions upon request. But you typically don't care about those estimations because your goal was just to estimate the function $h(t)$. And it works perfectly. So if you ever want to deal with underdetermined, additive nonparametric models, you have to acknowledge that improper posteriors nevertheless have proper expectations... –  Feb 06 '20 at 08:48
  • ... that just give you one particular solution/estimation among infinitely many of them, just like the Moore-Penrose inverse gives you one particular solution of an underdetermined system $\mathbf{A}x = b$. Is it more understandable now, please? Improper-proper posteriors naturally arise from underdetermined models. –  Feb 06 '20 at 08:50
  • As far as I can understand, in QM/QFT, improper-proper distributions arise for another reason: path integrals are infinite because they are infinite-dimensional, functional (e.g. Gaussian) integrals. But their ratios are NOT. –  Feb 06 '20 at 09:02
  • Perhaps I should better not have spent the last 20 years studying and applying probability theory alone in industry. Improper-proper posteriors are just a starting point, from which there is a very, very deep algebraic theory of Bayesian nonparametric regularization to develop. Everything depends on the algebraic properties of the matrix pencil $(\mathbf{D},\mathbf{R})$, where $\mathbf{D}$ is the data matrix and $\mathbf{R}$ is the regularization matrix. Three cases:... –  Feb 06 '20 at 09:23
  • 1) $\mathbf{D}$ or $\mathbf{R}$ is positive definite: easy. 2) Neither of them is positive definite but the matrix pencil $(\mathbf{D},\mathbf{R})$ is regular: solved. 3) The matrix pencil $(\mathbf{D},\mathbf{R})$ is singular: hardcore but very exciting... –  Feb 06 '20 at 09:25
  • Please forget what I said about the improper-proper priors, it's all the same: they also have prior expectations and moments, but these are just those of one particular point (in an affine subspace of dimension $n$ if we constrain the $n$-th order derivative). You may like to use another name for those improper-proper moments, e.g. generalized or pseudo-moments, but I'm not aware of any standard terminology. –  Feb 06 '20 at 17:08
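The Moore-Penrose point made in the comments above can be sketched with a toy underdetermined system (the $1\times 2$ matrix below is made up purely for illustration):

```python
import numpy as np

# One equation, two unknowns: x1 + x2 = 2 has infinitely many solutions.
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

# The pseudoinverse picks out one particular solution among them:
# the minimum-Euclidean-norm one, here (1, 1).
x_particular = np.linalg.pinv(A) @ b
print(x_particular)  # [1. 1.]

# Every other solution differs by an element of the null space of A:
n_dir = np.array([1.0, -1.0])
print(A @ (x_particular + 3.7 * n_dir))  # still [2.]
```

This mirrors the claim in the thread: the system does not lack solutions, and the pseudoinverse simply selects one representative, from which all others can be recovered by adding null-space directions.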