
So, starting from no information besides N trials from a Gaussian with $\mu = 0$, I'd like to know the best Bayesian posterior for the unknown variance, $\sigma^2$.

My approach so far has been to assume a uniform prior for $\tau = \frac{1}{\sigma}$, i.e. the scaling factor for the standard Gaussian. In addition to being more intuitive (to me), this is also necessary, since attempting to use a uniform prior for $\sigma$ or $\sigma^2$ would result in a divergent integral.

Then for N observations, this results in a posterior distribution of:

$$\Pr(\tau|x_1,x_2,...,x_n) = \frac{S^{n+1}\tau^n}{\Gamma(\frac{n+1}{2})2^{\frac{n-1}{2}}}\exp(-\frac{\tau^2}{2}S^2)$$

where $S^2 = x_1^2+x_2^2+\cdots+x_n^2$, i.e. the sum of squares of observations.
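To convince myself the normalising constant is right, here is a quick numerical check (a small scipy sketch; the values of $n$ and $S^2$ are arbitrary and only for illustration):

```python
# Sketch: verify that the stated posterior density for tau integrates to 1.
# n and S2 are arbitrary illustrative values, not taken from any real data.
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

n, S2 = 7, 3.5
S = np.sqrt(S2)

norm_const = S**(n + 1) / (gamma((n + 1) / 2) * 2**((n - 1) / 2))
total, _ = quad(lambda t: norm_const * t**n * np.exp(-t**2 * S2 / 2), 0, np.inf)
print(total)   # should print ~1.0
```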

If we transform variables to the variance, then we have:

$$\Pr(\sigma^2|x_1,x_2,...,x_n) = \frac{(\frac{S^2}{2})^{\frac{n+1}{2}}}{\Gamma(\frac{n+1}{2})(\sigma^2)^{\frac{n+3}{2}}}\exp\left(-\frac{S^2}{2}\frac{1}{\sigma^2}\right) = \text{InvGamma}\left(\sigma^2;\alpha = \frac{n + 1}{2},\beta = \frac{S^2}{2}\right).$$

Thankfully, this is none other than the Inverse Gamma distribution with parameters $\alpha = \frac{n + 1}{2}, \beta = \frac{S^2}{2}$, which comports with this rather cryptic statement on the Wikipedia page https://en.wikipedia.org/wiki/Inverse-gamma_distribution:

"Perhaps the chief use of the inverse gamma distribution is in Bayesian statistics, where the distribution arises as the marginal posterior distribution for the unknown variance of a normal distribution, if an uninformative prior is used"

So my question: Well, this is all well and good, but I haven't found anything resembling this analysis anywhere else online, which is worrying. Moreover, the mean of the posterior distribution I derived is $\frac{S^2}{n-1}$, and the mode is $\frac{S^2}{n+3}$. I'd sort of suspect at least one of those would be $\frac{S^2}{n}$ since we know the mean, and we therefore don't lose a degree of freedom when estimating the variance....
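For what it is worth, a quick numerical check of the density, mean and mode above (a scipy sketch; the data, seed and grid are purely illustrative) agrees with my algebra:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma_true = 20, 1.7
x = rng.normal(0.0, sigma_true, size=n)    # mean known to be 0, variance unknown
S2 = np.sum(x**2)                          # S^2 = sum of the squared observations

# Posterior derived above: InvGamma(alpha = (n+1)/2, beta = S^2/2) for sigma^2.
post = stats.invgamma(a=(n + 1) / 2, scale=S2 / 2)

# Brute-force normalisation of (flat prior on tau) x likelihood on the sigma^2 scale.
v = np.linspace(1e-3, 25.0, 200_000)
dv = v[1] - v[0]
unnorm = v ** (-(n + 3) / 2) * np.exp(-S2 / (2 * v))
numeric = unnorm / (unnorm.sum() * dv)

print(np.max(np.abs(numeric - post.pdf(v))))   # ~0 up to grid error
print(post.mean(), S2 / (n - 1))               # posterior mean = S^2/(n-1)
print(v[np.argmax(numeric)], S2 / (n + 3))     # posterior mode ~ S^2/(n+3)
```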

Help point me in the right direction? I'm confused.

SSD
  • This is very standard, since it corresponds to the non-informative analysis of one Gamma $\mathcal G(n/2,\tau^2/2)$ observation (since $S$ is a sufficient statistic). There is no compelling reason to recover $S^2/n$ as the posterior mean or mode. – Xi'an Jan 31 '24 at 08:45

1 Answer


I find the comment

My approach so far has been to assume a uniform prior for $\tau = 1/\sigma$, i.e. the scaling factor for the standard Gaussian. In addition to being more intuitive (to me), this is also necessary, since attempting to use a uniform prior for $\sigma$ or $\sigma^2$ would result in a divergent integral.

confusing, since all of these priors are improper, i.e. impossible to normalise into proper densities, and yet they all produce proper posteriors for $n$ large enough.

For instance, when $$\kappa=\tau^2=\sigma^{-2}\quad \text{and}\quad \pi(\kappa)=\kappa^{\alpha-1}\quad(\alpha\in\mathbb R),$$ the posterior $\pi(\kappa|n,S_n^2)$ is defined iff $$\int_0^\infty \kappa^{\alpha+\frac{n}{2}-1}e^{-\kappa S_n^2/2}\,\text d\kappa<\infty,$$ that is, when $$\alpha+\frac{n}{2}>0.$$
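A one-line symbolic check of this integrability condition (a SymPy sketch; depending on the version, the output comes back as a Piecewise carrying the convergence condition):

```python
import sympy as sp

kappa, b = sp.symbols("kappa b", positive=True)
c = sp.Symbol("c", real=True)

# With c = alpha + n/2 - 1, the integral below is finite iff c > -1,
# i.e. iff alpha + n/2 > 0, in which case it equals Gamma(c+1) / b^(c+1).
print(sp.integrate(kappa**c * sp.exp(-b * kappa), (kappa, 0, sp.oo)))
```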

Concerning the remainder of the question, this is standard conjugate Bayesian analysis. The posterior (inverse Gamma) distribution will obviously depend on the choice of the (inverse Gamma) prior, and hence so will the posterior mean, mode, and median. There is no reason to automatically recover $S^2/n$, even though some prior $\pi(\sigma^2)=\sigma^{-\alpha}$ (namely $\alpha=4$, for which the posterior mean is $S^2/n$) does.

The posterior mean of $\kappa^\beta$ ($\beta\in\mathbb R$) is then $$\mathbb E[\kappa^\beta|n,S_n^2]=\dfrac{\int_0^\infty \kappa^{\beta+\alpha+\frac{n}{2}-1}e^{-\kappa S_n^2/2}\,\text d\kappa}{\int_0^\infty \kappa^{\alpha+\frac{n}{2}-1}e^{-\kappa S_n^2/2}\,\text d\kappa},$$ which exists when $$\min\left\{\beta+\alpha+\frac{n}{2},\alpha+\frac{n}{2}\right\}>0$$ and is then equal to $$\mathbb E[\kappa^\beta|n,S_n^2]=\dfrac{\Gamma(\beta+\alpha+\frac{n}{2})}{\Gamma(\alpha+\frac{n}{2})}\,(S^2_n/2)^{-\beta}.$$ When $\beta=-1$, \begin{align}\mathbb E[\kappa^{-1}|n,S_n^2]&=\mathbb E[\sigma^2|n,S_n^2]\\ &=\dfrac{\Gamma(\alpha+\frac{n}{2}-1)}{\Gamma(\alpha+\frac{n}{2})}\,(S_n^2/2)\\ &=\left(\alpha+\frac{n}{2}-1\right)^{-1}\,(S_n^2/2)\\ &=\dfrac{S^2_n}{2\alpha+n-2}, \end{align} which is $\frac{S_n^2}n$ when $\alpha=1$, i.e. under a flat prior on $\kappa$.
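To make the dependence on the prior concrete, here is a small quadrature sketch of the last identity (illustrative data only; under the convention $\pi(\kappa)=\kappa^{\alpha-1}$ above, $\alpha=1$ is the flat prior on $\kappa$ and $\alpha=1/2$ corresponds to the question's flat prior on $\tau$):

```python
# Sketch: check E[sigma^2 | n, S2] = S2/(2*alpha + n - 2) under pi(kappa) = kappa^(alpha-1).
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(1)
n = 15
x = rng.normal(0.0, 2.0, size=n)
S2 = np.sum(x**2)

def post_mean_var(alpha):
    # posterior on kappa is proportional to kappa^(alpha + n/2 - 1) * exp(-kappa*S2/2)
    num = quad(lambda k: k**(alpha + n / 2 - 2) * np.exp(-k * S2 / 2), 0, np.inf)[0]
    den = quad(lambda k: k**(alpha + n / 2 - 1) * np.exp(-k * S2 / 2), 0, np.inf)[0]
    return num / den   # E[kappa^(-1)] = E[sigma^2]

for alpha, target in [(1.0, S2 / n), (0.5, S2 / (n - 1))]:
    print(post_mean_var(alpha), S2 / (2 * alpha + n - 2), target)   # all three agree
```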

Concerning

Question: Is there any canonical choice of prior such that my posterior is truly optimal, given only N observations?

there is no answer unless one defines optimality via a criterion that does not depend on the prior, i.e., a frequentist criterion like (loss dependent) admissibility, (loss dependent) minimaxity, scale invariance, minimal length confidence intervals, etc.

And

Question: Jeffreys' prior $1/\sigma^2$ on the variance is equivalent to assuming a uniform distribution on the logarithm of the variance, i.e. $z=\log(\sigma^2)$. Perhaps my assumptions about what constitutes the most reasonable choice of parameter to assume uniformity of are biased and incorrect?

is correctly stating that $\log(\sigma)$ then enjoys a flat prior (and not a Uniform prior, since there is no Uniform distribution over the real line $\mathbb R$), but this is not an absolute argument in defence of this Jeffreys' prior (which should rather be called the Lhoste prior). There is no generally agreed principle (and there should be none) for the selection of an objective or reference prior. As for a "reasonable parametrization", I would personally favour a prior construction procedure that does not depend on the parametrization, which is the case for Jeffreys' prior derivation.
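For concreteness, here is a short SymPy sketch of the Fisher information computation behind this invariance argument (a single $\mathcal N(0,\sigma^2)$ observation):

```python
# Sketch: Jeffreys' prior for the normal scale, and its parametrisation invariance.
import sympy as sp

x, sigma = sp.symbols("x sigma", positive=True)

# log-density of one N(0, sigma^2) observation
loglik = -sp.log(sp.sqrt(2 * sp.pi)) - sp.log(sigma) - x**2 / (2 * sigma**2)

# Fisher information I(sigma) = -E[d^2 loglik / d sigma^2], using E[x^2] = sigma^2
info_sigma = sp.simplify(-sp.diff(loglik, sigma, 2).subs(x**2, sigma**2))
print(info_sigma)   # 2/sigma**2, so Jeffreys' prior is proportional to 1/sigma

# Redoing the same computation in v = sigma^2 gives I(v) = 1/(2 v^2), i.e. a prior
# proportional to 1/v, which is exactly the push-forward of 1/sigma under v = sigma^2.
```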

Xi'an
  • I mean, when you try to marginalize out the variance in the normal distribution:

    $\int_0^\infty \frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{x^2}{2\sigma^2})\,d\sigma^2$ blows up to $\infty$

    Whereas, if you parameterize by $\tau = \frac{1}{\sigma}$ instead and marginalize $\tau$ out, the integral equals $\frac{1}{\sqrt{2\pi}x^2}$.

    Though, this is for $n = 1$. You mentioned that for large enough n, a uniform prior on $\sigma$ or $\sigma^2$ produces proper posteriors? That's intriguing. However, I should be able to produce a posterior, even for $n = 1$, so my point still stands I think.

    – SSD Jan 31 '24 at 16:25
  • I did a little more analysis on my own. It seems that the updating rule for inverse gamma distributions is $\alpha' = \alpha + \frac{n}{2}$ and $\beta' = \beta + \frac{S^2}{2}$. So this corresponds to my uniform prior on $\frac{1}{\sigma}$ being equivalent to $InvGamma(\alpha = \frac{1}{2}, \beta = 0)$. Whereas for the Jeffreys Prior, this corresponds to $InvGamma(\alpha = 0, \beta = 0)$.

    Question: Is there any canonical choice of prior such that my posterior is truly optimal, given only N observations?

    – SSD Jan 31 '24 at 16:29
  • Lastly - it seems that Jeffreys' prior $1/\sigma^2$ on the variance is equivalent to assuming a uniform distribution on the logarithm of the variance, i.e. $z = \log(\sigma^2)$. Perhaps my assumptions about what constitutes the most reasonable choice of parameter to assume uniformity of are biased and incorrect? – SSD Jan 31 '24 at 16:47
  • Thank you very much for your additional insights in the edit.

    Did you perhaps mean that equality is achieved when $\alpha = 1$?

    If I may, I want to wonder out loud: You point out that your preference is to use parameterization independent priors, i.e. Jeffreys priors.

    It seems it is generally possible to find a unique parameterization which enjoys a "flat" Jeffreys prior (not necessarily uniform as you pointed out).

    Subjectively speaking (I can't help it, I'm philosophically minded) - would you say this singles out that parameterization as somehow uniquely appropriate?

    – SSD Feb 01 '24 at 17:02
  • Thanks for pointing out the mistake. Concerning constant information parameterizations, this entry shows that the solution is \begin{equation} \eta(\theta) \propto \int_0^\theta \sqrt{\mathcal{I}(t)} \mathrm{d}t + \text{cst}. \end{equation} in dimension one. – Xi'an Feb 02 '24 at 06:50
  • In larger dimensions, $d>1$, I am unsure this can be achieved since the number of simultaneous differential equations to solve grows as the squared dimension $d^2$. (Link to the entry for the differential equation resolution.) – Xi'an Feb 02 '24 at 06:56
  • Purely subjectively speaking, do you feel the presence of variables with a flat Jeffreys prior indicates that those parameterizations are "special" in a certain sense? E.g. for the Jeffreys prior for Bernoulli trials https://en.wikipedia.org/wiki/Jeffreys_prior#Bernoulli_trial, it turns out that one can think of a uniform distribution on a circle, and then project that onto the x-axis (losing the 2D information in the process). Does that hint that a Bernoulli trial is a reduction of some more fundamental way of looking at the situation? – SSD Feb 03 '24 at 17:55