
I am wondering why introductory books on statistics use a conjugate distribution family for the prior instead of the same pdf as the distribution whose parameters we are trying to infer.

For example, the binomial distribution has the Beta distribution as its prior. They are very similar except for the normalization constant of the Beta distribution. Why not use another binomial distribution (which can also be uninformative) as the prior?

A case where I see the conjugate prior as necessary is when inferring $\Sigma$ of a multivariate Normal given $\mu$, which certainly needs a distribution over symmetric matrices (e.g. the Wishart/inverse-Wishart distribution).

wd violet
    If the observations follow a binomial distribution, the prior is for the success probability. How would you use a binomial distribution as the prior for a probability? – Accidental Statistician Sep 15 '22 at 11:36
  • @AccidentalStatistician the beta and binomial distributions look nearly identical except for their normalization constants. An interesting question: if I plugged in a binomial prior and computed the constant $1/Z$ such that the resulting integral is $1$, would the normalization constant correspond to that of the beta? – wd violet Sep 15 '22 at 11:39
  • @Han beta and binomial are not the same, it's not only about the constant. Binomial is discrete and sums to one, beta is continuous and integrates to one, so "changing the constant" would make them improper distributions. – Tim Sep 15 '22 at 11:43
  • If the binomial probability mass function is considered as a function of the success probability parameter rather than the observation, it's proportional to the beta density function for that parameter. This is expected, since it's what makes the beta distribution conjugate to the binomial distribution. It does not make the distributions the same. – Accidental Statistician Sep 15 '22 at 15:41
  • The probability spaces are very different. One way to see this: in the binomial, the number of successes is the random variable and you are summing over $k$ in $\sum_k {n\choose k}p^k (1-p)^{n-k} = 1$, while in the beta, the parameter $p$ is the random variable and you are integrating over $p$ in $\int_0^1 \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}\, dp = 1$. But they fit together well as a likelihood and conjugate prior. – Henry Sep 15 '22 at 19:43
  • @Han The parameter of the binomial is defined on the unit interval of the real line, not on the values $\{0,1,2,\ldots,n\}$ nor even on $\{0,\frac{1}{n},\frac{2}{n},\ldots,1\}$. The prior for $p$ and the pmf $f(x)$ you're using for the data are operating in orthogonal directions on $(p,x)$. – Glen_b Sep 16 '22 at 00:42
  • Thank you all for your inputs. I finally realized that the problem is not to infer the probability itself but its parameters. – wd violet Sep 16 '22 at 01:11

3 Answers


For example, the binomial distribution has the Beta distribution as its prior. They are very similar except for the normalization constant of the Beta distribution. Why not use another binomial distribution (which can also be uninformative) as the prior?

Actually, this is a great counterexample. The beta-binomial model is

$$\begin{align} p &\sim \mathsf{Beta}(\alpha, \beta) \\ X &\sim \mathsf{Bin}(p, n) \end{align}$$

where $X$ is a discrete random variable, $X \in \{ 0, 1, 2, 3, \dots , n\}$, and the binomial distribution is parametrized by the probability of success $p$ and the sample size $n$. We cannot use a binomial distribution as a prior for $p$, because $p$ is continuous and bounded, $p \in (0, 1)$. The same applies to many other distributions; for example, a prior for the Poisson's $\lambda$ parameter would need to be continuous and non-negative, unlike the discrete Poisson distribution.
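Here is a minimal sketch of this conjugate update (using scipy, with made-up prior settings and data), showing that a beta prior on $p$ yields another beta distribution as the posterior:

```python
# Beta-binomial conjugate update: prior Beta(alpha, beta) on p,
# observe x successes in n trials; the posterior is
# Beta(alpha + x, beta + n - x).
from scipy import stats

alpha, beta_ = 2.0, 2.0   # hypothetical prior hyperparameters
n, x = 10, 7              # hypothetical data: 7 successes in 10 trials

posterior = stats.beta(alpha + x, beta_ + n - x)
print(posterior.mean())   # posterior mean (alpha + x)/(alpha + beta + n) ≈ 0.643
```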

Tim
  • I think that the beta-binomial model is the consequence of using a beta as the prior for a binomial model. May I ask if it is necessary to have a continuous distribution (that is also conjugate) as the prior for the (discrete) binomial? – wd violet Sep 15 '22 at 11:47
  • @Han beta is a prior for the parameter $p$, the "probability of success". It is not a prior "for" the binomial distribution, but for one of its parameters. – Tim Sep 15 '22 at 11:55

The nature of your question itself suggests a conceptual misunderstanding.

When we consider a binomial PMF, e.g. $$X \sim \operatorname{Binomial}(n, p)$$ with $$\Pr[X = x] = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, \ldots, n\},$$ the support of this random variable is $X \in \{0, 1, 2, \ldots, n\}$. This represents the set of possible elementary outcomes of $X$, and it is in this regard that the sum of the probabilities of all such outcomes equals unity: $$\sum_{x=0}^n \Pr[X = x] = \sum_{x=0}^n \binom{n}{x} p^x (1-p)^{n-x} = (p + (1-p))^n = 1^n = 1$$ by the binomial theorem.

However, for a beta distributed random variable, say $P \sim \operatorname{Beta}(a,b)$, the probability density is $$f_P(p) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} p^{a-1} (1-p)^{b-1}, \quad 0 < p < 1$$ where $a$ and $b$ are shape parameters. Here, $0 < P < 1$ is the support representing the set of elementary outcomes of $P$, and the "sum" of the probabilities of all such outcomes is $$\int_{p=0}^1 f_P(p) \, dp = 1.$$

These distributions do not have the same behavior at all. As other answers have pointed out, it makes no sense to place a binomial prior on the parameter $p$ of a binomial likelihood, because $p$ is a probability, not a count. The notion that $\binom{n}{x}$ and $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}$ are just interchangeable "normalization constants" reflects a fundamental misunderstanding that occasionally occurs among students of Bayesian statistics.
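A quick numerical check makes this concrete (a sketch using scipy, with illustrative numbers; the point is that each function normalizes over its own variable):

```python
# The binomial PMF sums to 1 over the counts x; the beta PDF integrates
# to 1 over p; but the binomial PMF viewed as a function of p does NOT
# integrate to 1, so it is not a density for p.
from scipy import stats
from scipy.integrate import quad

n, p = 5, 0.3
print(sum(stats.binom.pmf(x, n, p) for x in range(n + 1)))   # 1.0

a, b = 2.0, 3.0
print(quad(lambda t: stats.beta.pdf(t, a, b), 0, 1)[0])      # 1.0

x = 3
print(quad(lambda t: stats.binom.pmf(x, n, t), 0, 1)[0])     # 1/6, not 1
```

(In fact $\int_0^1 \binom{n}{x} p^x (1-p)^{n-x} \, dp = \frac{1}{n+1}$ for every $x$, so rescaling by $n+1$ recovers exactly the $\operatorname{Beta}(x+1, n-x+1)$ density.)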

The idea behind conjugate distributions really has to do with the concept of the kernel of a probability function. The kernel is the part that depends on the parameter(s), and excludes any multiplicative factors that are constant with respect to those parameter(s). For instance, with respect to the parameter $p$, the kernel of the binomial PMF is

$$\ker(\Pr[X=x]) = p^x (1-p)^{n-x}.$$ The factor $\binom{n}{x}$ is excluded because it does not depend on $p$. The kernel is the basis for the likelihood function with respect to those same parameter(s); e.g., $$\mathcal L(p \mid x) \propto \ker(\Pr[X = x]).$$ The essence of this idea is that a likelihood is a function of the parameter(s) for some observed data $x$; as such, it is only uniquely determined up to a constant of proportionality.

For instance, if $X \sim \operatorname{Binomial}(n = 5, p)$ and we observed $X = 3$, we could write the likelihood function of $p$ as $$\mathcal L(p \mid X = 3) = p^3 (1-p)^2, \quad 0 < p < 1,$$ or we could write it as $$\mathcal L(p \mid X = 3) = 157839027384 \, p^3 (1-p)^2, \quad 0 < p < 1.$$ It doesn't matter, because $\mathcal L$ need not satisfy $\int_{p=0}^1 \mathcal L(p) \, dp = 1$. There is a choice of constant that makes this integral equal $1$ for general $n$ and $x$, and when we make that choice, we get a beta distribution over $p$, namely $\operatorname{Beta}(x+1, n-x+1)$. This is why the beta distribution is the conjugate prior for a binomial likelihood.
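To see this numerically (a small sketch with scipy, reusing the $n = 5$, $x = 3$ example above): normalizing the kernel $p^3(1-p)^2$ over $p$ produces exactly the $\operatorname{Beta}(4, 3)$ density.

```python
# Normalizing the binomial kernel p^x (1-p)^(n-x) over p yields the
# Beta(x + 1, n - x + 1) density; its constant is 1/B(x + 1, n - x + 1).
from scipy.integrate import quad
from scipy.special import beta as beta_fn

n, x = 5, 3
kernel = lambda p: p**x * (1 - p)**(n - x)

Z = quad(kernel, 0, 1)[0]             # numerically ≈ 1/60
print(Z, beta_fn(x + 1, n - x + 1))   # both ≈ 0.0166667 = B(4, 3)

# kernel(p) / Z is therefore the Beta(4, 3) density over p.
```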

As another example, if we have a Poisson distributed random variable $Y$ with $\Pr[Y = y] = e^{-\lambda} \lambda^y/y!$ and unknown rate $\lambda$, its kernel with respect to $\lambda$ is $$\ker(\Pr[Y = y]) = e^{-\lambda} \lambda^y, \quad \lambda > 0.$$ So its likelihood is $$\mathcal L(\lambda \mid y) \propto e^{-\lambda} \lambda^y,$$ which is proportional to a gamma density with shape $y+1$ and rate $1$; hence the gamma distribution is the conjugate prior for a Poisson likelihood.
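The same closed-form update can be sketched for the Poisson-gamma pair (scipy again, with hypothetical numbers; note that scipy parametrizes the gamma by shape and scale, so rate $b$ becomes scale $1/b$):

```python
# Poisson-gamma conjugacy: prior Gamma(shape=a, rate=b), observe a count y;
# the posterior is Gamma(shape = a + y, rate = b + 1).
from scipy import stats

a, b = 3.0, 1.0   # hypothetical prior hyperparameters
y = 4             # one observed Poisson count

posterior = stats.gamma(a + y, scale=1.0 / (b + 1))
print(posterior.mean())   # posterior mean (a + y)/(b + 1) = 3.5
```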

heropup

While I can't speak for textbook authors, I can think of two reasons why they might choose to do this. Both come from the fact that a conjugate prior leads to a posterior distribution in a known distribution family.

(1) Simplicity: A well-written textbook should distill the content to the key points, and an example where the posterior does not simplify to a convenient form would add clutter that isn't relevant to the discussion. Perhaps the main point is: the posterior distribution will be a valid pdf or pmf (oh hey, it's a distribution we recognize!).

(2) Computational convenience: With conjugate distributions, algorithms such as Gibbs sampling will be more efficient, as samples can be drawn directly from the full conditional distributions without resorting to general-purpose methods like slice or rejection sampling.

(bonus reason) If the posterior distribution simplifies to a known distribution, closed-form methods can be used rather than MCMC; see the sketch below.
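As a sketch of that bonus point (numpy/scipy, illustrative numbers): with a conjugate beta prior, the posterior mean of $p$ is available in closed form, and a brute-force grid approximation merely confirms it.

```python
# Closed-form conjugate posterior vs. a grid approximation of the same
# posterior mean; with conjugacy, no sampling or grid is needed at all.
import numpy as np
from scipy import stats

alpha, beta_, n, x = 2.0, 2.0, 10, 7

# Closed form: the posterior is Beta(alpha + x, beta + n - x).
closed = (alpha + x) / (alpha + beta_ + n)

# Grid approximation: prior density times likelihood, normalized numerically.
p = np.linspace(1e-6, 1 - 1e-6, 10_000)
unnorm = stats.beta.pdf(p, alpha, beta_) * stats.binom.pmf(x, n, p)
grid = np.sum(p * unnorm) / np.sum(unnorm)

print(closed, grid)   # both ≈ 0.642857
```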

In regard to your beta-binomial example, a beta distribution is continuous and bounded on $(0,1)$, which matches the support required for the binomial parameter $p$. Using a binomial distribution as a prior for $p$ wouldn't make sense, because its support is the discrete set of whole numbers $\{0, 1, \ldots, n\}$.

This is by no means a complete answer, and others are welcome to chime in!