1

I have a parameter $\theta$ and data $y = f(\theta) + \mathrm{noise}$. My goal is to find the best fit for $\theta$ and to assess the uncertainty of this best fit. I see two competing approaches for doing this:

  1. I can compute the MLE $\widehat \theta = \arg \max_\theta p(y \mid \theta)$ with some optimization algorithm and assess the reliability of the estimator using the Fisher information $I(\theta) = -\nabla^2_\theta \log p(y \mid \theta)$, evaluating its inverse $I^{-1}(\widehat \theta)$ at the MLE. This would be a good option, since I can evaluate the Hessian analytically in my case, and finding the MLE is easy with a Newton-type algorithm.
  2. I can generate a sample $\{\theta^{(i)}\}_{i=1,\ldots, N}$ from $p(y \mid \theta)$, e.g., with MCMC sampling. I can then study the empirical covariance matrix of the sample to determine correlations, uncertainties, etc. (A code sketch of both approaches follows this list.)
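
To make the comparison concrete, here is a minimal, self-contained sketch of both approaches on a toy Gaussian linear model; the design matrix, noise level, iteration counts, and Metropolis step size below are arbitrary illustrative choices, not taken from my actual problem.

```python
import numpy as np
from scipy.optimize import minimize

# Toy Gaussian linear model: y = X @ theta + noise, with known noise scale sigma.
rng = np.random.default_rng(0)
n, d = 400, 8                                   # eight-dimensional parameter, as in my case
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
sigma = 0.5
y = X @ theta_true + sigma * rng.normal(size=n)

def neg_log_lik(theta):
    r = y - X @ theta
    return 0.5 * np.sum(r**2) / sigma**2        # -log p(y | theta) up to an additive constant

# Approach 1: MLE + inverse of the (observed) information matrix.
theta_hat = minimize(neg_log_lik, np.zeros(d)).x
hessian = X.T @ X / sigma**2                    # analytic Hessian of -log p(y | theta)
cov_info = np.linalg.inv(hessian)

# Approach 2: random-walk Metropolis targeting the likelihood (flat prior),
# followed by the empirical covariance of the sample.
log_target = lambda theta: -neg_log_lik(theta)
theta, lp = theta_hat.copy(), log_target(theta_hat)
step, samples = 0.02, []                        # hand-tuned step size for this toy example
for _ in range(20000):
    prop = theta + step * rng.normal(size=d)
    lp_prop = log_target(prop)
    if np.log(rng.uniform()) < lp_prop - lp:    # Metropolis acceptance rule
        theta, lp = prop, lp_prop
    samples.append(theta)
cov_mcmc = np.cov(np.array(samples[5000:]), rowvar=False)   # drop burn-in

print(np.round(cov_info, 5))
print(np.round(cov_mcmc, 5))
```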

For a specific dataset and an eight-dimensional parameter, I get the following results:

[Two images: the covariance estimate $I^{-1}(\widehat\theta)$ from approach 1 and the empirical covariance matrix of the MCMC sample from approach 2.]

There seems to be a pattern here: the two matrices appear to be multiples of each other. I struggle to understand to what extent this is true, and how the two approaches above are connected or distinct.

G. Gare
    Do https://stats.stackexchange.com/questions/68080/basic-question-about-fisher-information-matrix-and-relationship-to-hessian-and-s and https://stats.stackexchange.com/questions/316327/why-is-the-fisher-information-the-inverse-of-the-asymptotic-covariance-and-vi answer your question? The result is not limited to the Bayesian context. – Kuku Feb 24 '23 at 09:54
  • Are you sampling $p(y|\theta)$ or $p(y|\hat\theta)$? – Sextus Empiricus Feb 25 '23 at 08:54

2 Answers

7

MCMC is a computational technique for implementing Bayesian inference. Normally, the difference between MLE and Bayesian inference is that Bayesian inference incorporates prior information. In this case, however, you appear to have omitted steps from both MLE and Bayes, so that the two collapse to the same thing.

First, you have not actually computed the Fisher information but rather the observed information at the MLE. The negative Hessian of the log-likelihood evaluated at the MLE is the observed information; the Fisher information is the expected value of this quantity over the distribution of $y$.
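
In symbols, writing $J$ for the observed information and $I$ for the Fisher information,
$$
J(\widehat\theta) = -\nabla^2_\theta \log p(y \mid \theta)\Big|_{\theta=\widehat\theta},
\qquad
I(\theta) = \operatorname{E}_{y \sim p(\cdot \mid \theta)}\!\left[-\nabla^2_\theta \log p(y \mid \theta)\right].
$$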

Secondly, and more importantly, you don't mention any prior information for the Bayes MCMC approach. By ignoring the prior, you are effectively assuming a uniform improper prior for $\theta$, in which case Bayes and MLE become virtually the same thing. The agreement between the two approaches becomes more complete because you are using observed rather than Fisher information for the MLE, and observed information is what is used in Bayesian inference.
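
Concretely, with a flat (improper) prior $p(\theta) \propto 1$, Bayes' theorem gives
$$
p(\theta \mid y) \;\propto\; p(y \mid \theta)\, p(\theta) \;\propto\; p(y \mid \theta),
$$
so the posterior being sampled by MCMC is, up to normalization, exactly the likelihood that is maximized to get the MLE.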

So the two covariance matrices you estimate at the end are presumably not just multiples of one another but exactly the same, modulo MCMC convergence.
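
One way to see the connection is to expand the log-posterior (here equal to the log-likelihood up to a constant) to second order around its mode $\widehat\theta$, where the gradient vanishes:
$$
\log p(\theta \mid y) \;\approx\; \log p(\widehat\theta \mid y) \;-\; \tfrac{1}{2}\,(\theta - \widehat\theta)^\top J(\widehat\theta)\,(\theta - \widehat\theta),
$$
so that, to this order of approximation, the posterior is $\mathcal N\big(\widehat\theta,\; J(\widehat\theta)^{-1}\big)$ and its covariance matrix is the inverse observed information.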

Gordon Smyth
  • Everything you write is correct (I used a uniform prior on purpose to "sample from the likelihood", if you allow me the imprecision), but it does not answer my doubt: why are the observed information and the posterior covariance (in the case of a uniform prior) exactly the same? Where can I find a proof? – G. Gare Feb 27 '23 at 09:24
  • @G.Gare The likelihood and the posterior functions are identical in this case, so it is not at all clear what proof you could be after that they give the same downstream results. It would seem inevitable. Or are you actually after an explanation/proof of the idea of MCMC sampling as a general principle and why the covariance matrix obtained from the MCMC posterior sample is related to the Hessian? – Gordon Smyth Feb 27 '23 at 22:04
  • @G.Gare To elicit an answer with a detailed mathematical proof you would really need to state your question more precisely, explaining exactly what you did and what aspect of it you need an explanation for. I had to guess that you were doing Bayesian inference and that you were using a uniform prior. What you literally said you did, sampling $\theta^{(i)}$ from $f(y|\theta)$, is of course impossible because $f(y|\theta)$ is a distribution for $y$ rather than for $\theta$. – Gordon Smyth Feb 27 '23 at 22:04
  • The thread https://stats.stackexchange.com/questions/316327/why-is-the-fisher-information-the-inverse-of-the-asymptotic-covariance-and-vi answered my question – G. Gare Feb 28 '23 at 08:34
1

Both the empirical Fisher information matrix and simulation are approximations. Because the two methods approximate the same thing, the results will look similar, but there are differences.

Empirical Fisher information matrix

  • the estimation uses the observed maximum likelihood estimate instead of the true value of the parameters
  • the inverse Fisher information matrix gives a lower bound for the variance (the Cramér–Rao bound), and depending on the distribution, the actual variance is only asymptotically equal to this lower bound (stated as an inequality after this list)
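
For an unbiased estimator $\widehat\theta$ this is the Cramér–Rao inequality, in the matrix sense that the difference is positive semi-definite:
$$
\operatorname{Cov}\big(\widehat\theta\big) \;\succeq\; I(\theta)^{-1}.
$$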

Simulations

  • the estimation uses the observed maximum likelihood estimate instead of the true value of the parameters
  • the sampling has statistical variation and is not an exact computation (see the sketch after this list)
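
As a quick illustration of that statistical variation, the sketch below (i.i.d. draws from a known covariance, with numbers chosen arbitrarily for the illustration) shows how the empirical covariance fluctuates around the target and only stabilizes as the sample grows; an MCMC sample behaves similarly, except that autocorrelation makes the effective sample size smaller.

```python
import numpy as np

# Empirical covariance from a finite sample is itself a noisy estimate.
rng = np.random.default_rng(1)
true_cov = np.array([[1.0, 0.6],
                     [0.6, 2.0]])                 # covariance we are trying to recover

for n in (100, 1000, 10000):
    draws = rng.multivariate_normal(mean=np.zeros(2), cov=true_cov, size=n)
    est = np.cov(draws, rowvar=False)             # empirical covariance of the sample
    err = np.max(np.abs(est - true_cov))          # worst entry-wise deviation
    print(f"n = {n:6d}, max |error| = {err:.3f}")
```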