
Suppose we know $X_1,\dots, X_n \sim N(\mu,1)$ and $\mu\sim N(1,1)$, so the true prior is $N(1,1)$. Now if we want to compare the true prior with $N(2,1)$, can we say the true prior is better than the wrong one in any sense, if we want to estimate $\mu$ using the mode of the posterior (the MAP estimate)?

My guess is that we can take the expectation of the squared error: $$\mathbb E_{\mu \sim N(1,1)} \left[(\hat \mu - \mu)^2\right]$$ where $\hat \mu$ is the posterior mode. If we use $N(1,1)$ as our prior, will this expectation be smaller?
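A minimal simulation sketch of this comparison, assuming the standard conjugate result that the posterior mode under a $N(m,1)$ prior is $(m + n\bar X_n)/(1+n)$, and using $n = 10$ purely for illustration:

set.seed(1)
n = 10
# squared error of the posterior mode when the prior mean is prior_mean
map_sq_error = function(prior_mean) {
  mu   = rnorm(1, 1, 1)                  # mu drawn from the true N(1, 1)
  xbar = rnorm(1, mu, 1/sqrt(n))         # sample mean of n observations from N(mu, 1)
  map  = (prior_mean + n*xbar)/(1 + n)   # posterior mode under prior N(prior_mean, 1)
  (map - mu)^2
}
mean(replicate(1e5, map_sq_error(1)))    # true prior N(1, 1)
mean(replicate(1e5, map_sq_error(2)))    # wrong prior N(2, 1)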

coolcat
  • I think model comparison is what you are looking for, i.e., compare the two models' evidence (or equivalently the marginal likelihood, the normalizing constant). This is notoriously difficult to estimate though. Another approach is to consider cross-validation methods. – fool Dec 22 '20 at 06:10
  • There is no such thing as a true prior. If you know the truth then you know the value of $\mu$ exactly, i.e. $\mu = 1$ without any uncertainty. – Cagdas Ozgenc Dec 22 '20 at 07:47
  • F. Samaniego has a book working out the use of a "wrong" prior. – Xi'an Dec 22 '20 at 09:07
  • @CagdasOzgenc it's possible to know $\mu$ was drawn from a $N(1, 1)$ distribution without knowing the value of $\mu$. – fblundun Dec 22 '20 at 17:31
  • @fblundun This is true, but then another guy approaches me and tells me that it is drawn from $N(1,0.5)$; which prior is more true? – Cagdas Ozgenc Dec 22 '20 at 19:13
  • @CagdasOzgenc $N(1,1)$ is truer for the exercise where analysts have to do inference about the $\mu$ used in my simulation, which I in fact generate by drawing it from $N(1,1)$. It seems uncontroversial that getting the prior right will lead to better inference in this sense. – CloseToC Dec 22 '20 at 22:35

1 Answer


Recapitulation of the question

Let's consider the following data generating process (considering only the mean $\bar{X}_n$ rather than the individual $X_i$, since the mean is a sufficient statistic and simplifies things considerably; throughout, the second argument of $N(\cdot,\cdot)$ denotes the variance):

$$\begin{array}{rcl} \mu &\sim& N(\mu_t,1)\\ \bar{\epsilon}_n &\sim& N(0,1/n) \\ \bar{X}_n &=& \mu + \bar{\epsilon}_n \end{array} $$

and the aim is to infer $\mu$ based on observations of the average $\bar{X}_n$.
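In R this data generating process is just two draws per replication (a sketch; the concrete values $\mu_t = 1$ and $n = 10$ are arbitrary choices for illustration):

mu_t = 1
n    = 10
mu   = rnorm(1, mu_t, 1)          # mu ~ N(mu_t, 1)
xbar = rnorm(1, mu, 1/sqrt(n))    # X_bar_n = mu + eps_bar_n, with variance 1/n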

The true distribution of $\mu$ is $N(\mu_t,1)$, and the question is whether using a prior equal to this true distribution (which I assume is what is meant by the "true prior") gives the best maximum a posteriori probability (MAP) estimate, as measured by the expected squared error.

The idea of a 'true prior' is not always appropriate, and it is not clear what it means in general. Sometimes the value to be estimated is best regarded as having a degenerate distribution; this is for example the case when we try to measure physical constants. So I rephrase the question here as being about an example where the parameter to be estimated is assumed to follow some distribution, and that distribution is considered the 'true prior'.


Two priors

Below we compare two different priors

$$\mu_{m} \sim N(m,1)$$ and $$\mu_{\tau} \sim N(\mu_t,1/\tau)$$

One of them differs from the true distribution of $\mu$ by assuming a different mean $m$, the other by assuming a different precision $\tau$.

Below we will see that

  • for the prior $\mu_m$ the lowest expectation value for the mean squared error is obtained when $m = \mu_t$ (that is, when the prior equals the 'true distribution').

  • for the prior $\mu_\tau$ the optimum is likewise at the value $\tau = 1$ (again, when the prior equals the 'true distribution').

    Note: While creating this answer I had expected that the optimum would be at some $\tau \neq 1$, and that due to a regularization effect there would be situations where a prior different from the true data generating process is an improvement. But after working it out there turns out to be no improvement. Still, I believe there should be ways to improve the estimate in other situations (different distributions and priors).


Computation stuff

Since the normal distribution is a conjugate prior for the normal likelihood, the posterior is also a normal distribution, and the mean of that distribution is the maximum a posteriori probability (MAP) estimate.
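Spelled out, this is the standard normal–normal conjugate update (with $\mu_0$ and $\tau_0$ denoting a generic prior mean and precision): for a prior $N(\mu_0, 1/\tau_0)$ and an observed mean $\bar{X}_n$ with variance $1/n$,

$$\mu \mid \bar{X}_n \;\sim\; N\!\left(\frac{\tau_0 \mu_0 + n \bar{X}_n}{\tau_0 + n},\; \frac{1}{\tau_0 + n}\right),$$

so the MAP estimate is the precision-weighted average of the prior mean and the observed mean; plugging in $(\mu_0, \tau_0) = (m, 1)$ or $(\mu_t, \tau)$ gives the two expressions below.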

For the two different prior distributions (identified by the subscripts $_m$ and $_\tau$) we can express the MAP estimate as a function of the true value of $\mu$ and the noise term $\bar\epsilon_n$ (where $\bar{X}_n = \mu + \bar{\epsilon}_n$), as follows:

$$ \begin{array}{l} \hat{\mu}_m(\mu,\bar{\epsilon}_n) = \frac{m+n \bar{X}_n}{1 +n}\\ \hat{\mu}_\tau(\mu,\bar{\epsilon}_n) = \frac{\tau\mu_t+n \bar{X}_n}{\tau +n} \end{array}$$

and the error $e = \hat{\mu} - \mu$ is

$$ \begin{array}{l} e_m(\mu,\bar{\epsilon}_n) = \frac{m+n\mu+ n \bar{\epsilon}_n}{1 +n} - \mu = \frac{(m-\mu)+ n \bar{\epsilon}_n}{1 +n}\\ e_\tau(\mu,\bar{\epsilon}_n) = \frac{\tau\mu_t+n \mu + n \bar{\epsilon}_n}{\tau +n} - \mu = \frac{\tau(\mu_t-\mu) + n \bar{\epsilon}_n}{\tau +n} \end{array} $$

The sampling distributions of these errors are normal (since the errors are linear combinations of $\bar{\epsilon}_n$ and $\mu$), and the expected mean squared errors are the raw second moments of those distributions:

$$ \begin{array}{} E_{\mu,\bar{\epsilon}}[e_m^2] = \left(\frac{m-\mu_t}{1+n}\right)^2 + \left(\frac{1}{1+n}\right)^2 + \left(\frac{\sqrt{n}}{1+n}\right)^2 \\ E_{\mu,\bar{\epsilon}}[e_\tau^2] = \left(\frac{\tau}{\tau+n}\right)^2 + \left(\frac{\sqrt{n}}{\tau+n}\right)^2 \end{array} $$
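The first expression is minimized at $m = \mu_t$, i.e. when the prior mean matches the true mean. For the second, differentiating with respect to $\tau$ gives

$$\frac{d}{d\tau}\,\frac{\tau^2+n}{(\tau+n)^2} \;=\; \frac{2n(\tau-1)}{(\tau+n)^3},$$

which is negative for $\tau < 1$ and positive for $\tau > 1$, so the expected squared error is minimized at $\tau = 1$: in both cases the prior matching the true distribution of $\mu$ is optimal.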

Simulations

The code below can help to interpret and verify the formulae above.

[Plot: example with a different mean $m$ in the prior]

[Plot: example with a different precision $\tau$ in the prior]

n = 10
set.seed(1)

# error of the MAP estimate when the prior is N(m, 1); the true mu is drawn from N(0, 1), i.e. mu_t = 0
sim_m = function(n, m = 0) {
  true_mu  = rnorm(1)                       # mu ~ N(mu_t, 1) with mu_t = 0
  x        = rnorm(1, true_mu, 1/sqrt(n))   # observed mean, variance 1/n
  estimate = (m + n*x)/(1 + n)              # MAP estimate under prior N(m, 1)
  return(estimate - true_mu)
}

# error of the MAP estimate when the prior is N(mu_t, 1/tau); the tau*mu_t term vanishes since mu_t = 0
sim_tau = function(n, tau = 1) {
  true_mu  = rnorm(1)
  x        = rnorm(1, true_mu, 1/sqrt(n))
  estimate = (n*x)/(tau + n)
  return(estimate - true_mu)
}

# vary the prior mean m and compare simulated and theoretical expected squared error
m   = seq(-2, 2, 0.2)
e_m = sapply(m, FUN = function(m) { mean(replicate(10000, sim_m(n, m)^2)) })
plot(m, e_m, xlab = expression(mu[prior] - mu[true]), ylab = "expected error^2")
lines(m, (m/(1+n))^2 + (1+n)/(1+n)^2)

# vary the prior precision tau
tau   = seq(0, 2, 0.1)
e_tau = sapply(tau, FUN = function(m) { mean(replicate(10000, sim_tau(n, m)^2)) })
plot(tau, e_tau, xlab = "prior precision", ylab = "expected error^2")
lines(tau, (tau^2+n)/(tau+n)^2)
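As a quick numerical cross-check of the closed-form expressions (using base R's optimize, with the same n = 10 as above), both minima come out at the true values $m - \mu_t = 0$ and $\tau = 1$:

# minimize the theoretical expected squared error over the prior mean offset and over the precision
optimize(function(m)   (m/(1+n))^2 + (1+n)/(1+n)^2, interval = c(-2, 2))  # minimum near m = 0
optimize(function(tau) (tau^2 + n)/(tau + n)^2,     interval = c(0, 2))   # minimum near tau = 1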

  • In the example the prior that is equal to the true model gives the best estimates in terms of expected squared error. I wonder whether there are different cases where this is not true. For example, situations where some biased prior can reduce the variance of the MAP estimate. E.g. a prior $N(1,\sigma_0)$ with $\sigma_0 < 1$. This can be easily computed and I will do that later. I suspect that in the case of a different variance from the true variance, the result might actually be better at a certain point. – Sextus Empiricus Mar 21 '24 at 09:18
  • So, I recomputed the situation with prior $N(1,\sigma_0)$ and still the optimum is when the prior equals the true data generating distribution. I had expected some slightly different optimum. – Sextus Empiricus Mar 25 '24 at 16:12
  • Of course, with some biased cost function, a biased prior should be advantageous. E.g. a cost function like $f(e) = \min(0,e)^2+e^2$ penalizes negative errors $e$ (underestimates) more heavily, so a prior biased towards overestimation would be favored. (I was hoping to find a simple example with the squared error cost function.) – Sextus Empiricus Mar 25 '24 at 16:16