
I've recently encountered the "posterior predictive distribution" $$p(\bar{x}|X)=E_\theta[p(\bar{x}|\theta)]=\int_\theta p(\bar{x}|\theta)\hspace{0.5mm}p(\theta|X)d\theta$$ where $\bar{x}$ is a new data point, $\theta$ is the parameter (scalar or vector) of the distribution, and $X$ is the already observed sample.

However I'm not sure I understand where this formula comes from. I know that we don't know the true value of $\theta$. I think that since we don't know $\theta$, we say "why settle for one estimate of it, and not scan all its possible values?" Is that correct? Even if it is, I'm not certain that I understand the expected value part.

Sorry if my thoughts are a bit jumbled. Any insight would be appreciated, thanks.

thenac

1 Answer


Let $X$ denote the observations and $\theta \in \Theta$ the parameter. In a Bayesian approach, both are considered random quantities. The first step of modeling is to define a statistical model, i.e. the distribution of $X$ given $\theta$, which can be written as $X \mid \theta \sim p(\cdot \mid \theta)$. This is typically done by specifying a likelihood function.
Thus our statistical model describes the conditional distribution of $X$ given $\theta$.
From a Bayesian perspective, we also define a prior distribution for $\theta$ on $\Theta$: $\theta \sim \pi(\theta)$.

The prior predictive distribution

Before observing any data, what we have is simply the chosen model, $p(x \mid \theta)$, and the prior distribution of $\theta$, $\pi(\theta)$. One can then ask what the marginal distribution of $X$ is, that is, the distribution of $X \mid \theta$ averaged over all possible values of $\theta$.
This can be simply written using expectation: \begin{align*} p(x) &= \mathbb{E}_\theta \Big [ p(x \mid \theta) \Big ] \\ &= \int_\Theta p(x \mid \theta) \pi(\theta) d\theta. \end{align*}
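In practice, this expectation can be read as a two-step simulation: draw $\theta$ from the prior, then draw $x$ from the model given that $\theta$; the resulting $x$'s are draws from the prior predictive distribution. Below is a minimal R sketch of this idea, borrowing (purely for concreteness) the Gamma prior and Poisson model of the example further down; the parameter values are arbitrary.

# Monte Carlo view of the prior predictive distribution:
# 1) draw theta from the prior, 2) draw x from p(x | theta).
set.seed(1)
a <- 2; b <- 0.5                          # arbitrary Gamma prior parameters
theta <- rgamma(1e5, shape = a, rate = b) # theta ~ pi(theta)
x     <- rpois(1e5, lambda = theta)       # x | theta ~ p(x | theta)
# The empirical distribution of x approximates p(x) = E_theta[ p(x | theta) ].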

The posterior predictive distribution

The interpretation is the same as for the prior predictive distribution: it is the marginal distribution of $X \mid \theta$ averaged over all values of $\theta$.
But this time the "weighting" function to be used is not $\pi(\theta)$ but our updated knowledge about $\theta$ after observing data $X^*$: $\pi(\theta \mid X^*)$.
Using Bayes' theorem we have: $$ \pi(\theta \mid X^*) = \frac{p(X^* \mid \theta) \pi(\theta)}{p(X^*)}. $$ Thus, the distribution of $X \mid \theta$ averaged over $\Theta$ with respect to this posterior is: $$ p(x \mid X^*) = \int_\Theta p(x \mid \theta) \pi(\theta \mid X^*)\,d\theta. $$

Example: Gamma-Poisson mixture.

Suppose our observations are counts, $X$, and we define a Poisson model, that is: $X \mid \lambda \sim \mathcal{P}(\lambda)$.
From a Bayesian perspective, we also define a prior distribution for $\lambda$.
For mathematical reasons, it is appealing to use a Gamma distribution, $\lambda \sim \mathcal{G}(a,b)$.

The prior predictive distribution

One particularity of this Gamma-Poisson mixture is that the marginal distribution of $X$ is a Negative Binomial distribution.
That is, if $X \mid \lambda \sim \mathcal{P}(\lambda)$ and $\lambda \sim \mathcal{G}(a,b)$, then $X \sim \mathcal{NB}\big(a,\frac{b}{b+1}\big)$.
Thus the prior predictive distribution of $X$ is a Negative Binomial distribution $\mathcal{NB}\big(a,\frac{b}{b+1}\big)$.
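For a quick numerical check of this claim, one can integrate the Poisson pmf against the Gamma density and compare it with dnbinom; a small R sketch, with arbitrary values of $a$, $b$ and $x$:

# Numerical check: integrate p(x | lambda) * pi(lambda) over lambda
# and compare with the Negative Binomial pmf NB(a, b/(b+1)).
a <- 100; b <- 2; x <- 50                  # arbitrary values
marginal <- integrate(function(l) dpois(x, l) * dgamma(l, shape = a, rate = b),
                      lower = 0, upper = Inf)$value
c(integrated = marginal, exact = dnbinom(x, size = a, prob = b / (b + 1)))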

The posterior predictive distribution

Now, say we have observed $n$ counts $X =(X_1,\dots,X_n)$.
First, thanks to the choice of a Gamma prior for $\lambda$, the posterior distribution of $\lambda$ can be easily derived as being also a Gamma distribution: $$ \lambda \mid X \sim \mathcal{G} \bigg ( a + \sum_{i=1}^n X_i , b+n \bigg) $$
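Indeed, this conjugate update follows directly from Bayes' theorem: $$ \pi(\lambda \mid X) \propto \Big( \prod_{i=1}^n e^{-\lambda} \frac{\lambda^{X_i}}{X_i!} \Big) \, \lambda^{a-1} e^{-b\lambda} \propto \lambda^{a + \sum_{i=1}^n X_i - 1} \, e^{-(b+n)\lambda}, $$ which is the kernel of a $\mathcal{G}\big(a + \sum_{i=1}^n X_i,\, b+n\big)$ density.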

From what we saw for the prior predictive distribution, the posterior predictive distribution of $X$ will also be a Negative-Binomial: $$ \mathcal{NB} \bigg ( a + \sum_{i=1}^n X_i, \frac{b+n}{b+1+n} \bigg ) $$
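As a sanity check, one can also simulate from the posterior predictive in two steps (draw $\lambda$ from the Gamma posterior, then a new count from the Poisson model) and compare with this Negative Binomial; a small R sketch, using for illustration the same counts as in the plot below:

# Simulation check of the posterior predictive Negative Binomial.
set.seed(42)
a <- 100; b <- 2
X <- c(85, 80, 70, 65, 71, 92)              # observed counts (as in the plot below)
au <- a + sum(X); bu <- b + length(X)       # posterior Gamma parameters
lambda_post <- rgamma(1e5, shape = au, rate = bu)
x_new <- rpois(1e5, lambda_post)            # draws from p(x | X)
c(simulated = mean(x_new == 75),
  exact     = dnbinom(75, size = au, prob = bu / (bu + 1)))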

Here is an example where $a=100$, $b=2$ and we observe the vector of counts $X=(85,80,70,65,71,92)$:

[Figure: prior and posterior predictive distributions for $a=100$ and $b=2$.]

Here is the R code to produce the plot:

### Gamma-Poisson mixture: prior and posterior predictive distributions
require(ggplot2)

# Parameters of the prior distribution of lambda
a <- 100
b <- 2

# Support on which to evaluate the predictive distributions
x <- 0:150

# Prior predictive distribution: Negative Binomial NB(a, b/(b+1))
y1 <- dnbinom(x, size = a, prob = b / (b + 1))

# Vector of observations and updated (posterior) parameters
X  <- c(85, 80, 70, 65, 71, 92)
n  <- length(X)
XS <- sum(X)
au <- a + XS
bu <- b + n

# Posterior predictive distribution: Negative Binomial NB(a + sum(X), (b+n)/(b+n+1))
y2 <- dnbinom(x, size = au, prob = bu / (bu + 1))

plot1 <- ggplot() + aes(x = x, y = y1, colour = "Prior") + geom_line(size = 1) +
  geom_line(aes(x = x, y = y2, colour = "Post")) +
  scale_colour_manual(breaks = c("Prior", "Post"),
                      values = c("#cd7118", "#1874cd"),
                      labels = c("Prior Predictive", "Posterior Predictive")) +
  ggtitle("Prior and posterior predictive distributions for a=100 and b=2") +
  labs(x = "x", y = "Density") +
  theme(
    panel.background = element_rect(fill = "white", colour = "white",
                                    size = 0.5, linetype = "solid"),
    axis.line = element_line(size = 0.2, linetype = "solid", colour = "black"),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 10),
    legend.title = element_blank(),
    legend.background = element_blank(),
    legend.key = element_blank(),
    legend.position = c(.7, .5)
  )
plot1
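Once the posterior predictive distribution is known in closed form, probabilities about upcoming observations follow directly; for instance, a sketch of the posterior predictive probability that the next count falls between 60 and 70, reusing the updated parameters au and bu from the script above:

# Posterior predictive probability that the next count lies in [60, 70],
# using the updated parameters au and bu computed above.
pnbinom(70, size = au, prob = bu / (bu + 1)) -
  pnbinom(59, size = au, prob = bu / (bu + 1))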
periwinkle
  • Thank you for the detailed answer. So the idea in the prior predictive is: since we are uncertain about $\theta$, we can obtain $p(x)$ as the average of $p(x|\theta)$ in order to take into account all possible values of $\theta$, right? And the same logic is extended to the posterior predictive distribution. – thenac Nov 27 '19 at 17:31
  • @Thomas Yes that's the idea. If you want to study some probabilities about the upcoming observations (for example for the gamma-poisson, the probability that the next count will be between 60 and 70) it's better to consider all possible values of $\theta$ and average using the probability (or density) of these values of $\theta$ than just using a fixed value for $\theta$. – periwinkle Nov 27 '19 at 17:58
  • Alright, thanks a lot. Have a good day! And thanks for the code as well. – thenac Nov 27 '19 at 18:10