
I'm confused. The Fisher information is $I(\theta)=\operatorname{Var}_x(s(\theta|x))=\operatorname{Var}_x\left({d \over d \theta} \log L(\theta|x)\right)$ (the variance of the score), and the MLE can be written as the limit of gradient ascent on the log-likelihood, $\theta^*=dt \cdot \sum^\infty_{t=0}{d\over d\theta_t}\log L(\theta_t|x)$, where $dt$ is the optimization step size.

Shouldn't the variance of the MLE estimate $\theta^*$ then be

$$\operatorname{Var}_x(\theta^*)\propto \sum^\infty_{t=0}\operatorname{Var}_x\left({d\over d\theta_t} \log L(\theta_t|x)\right) = \sum^\infty_{t=0}I(\theta_t)$$

by the variance sum law? That is to say, proportional to the Fisher information (summed across the optimization trajectory). But instead (as I understand it) we have $\operatorname{Var}_x(\theta^*)=I(\theta^*)^{-1}$, which says that the Fisher information is inversely proportional to the variance of the MLE estimate.
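For concreteness, the claimed relation $\operatorname{Var}_x(\theta^*)\approx I(\theta^*)^{-1}$ is easy to check numerically. Below is a minimal sketch (my own, not from the linked answer) using a normal model with known $\sigma$, where the MLE of $\mu$ is the sample mean and the per-observation Fisher information is $I(\mu)=1/\sigma^2$; the constants `mu`, `sigma`, `n`, `reps` are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 50, 20000

# For N(mu, sigma^2) with sigma known, the MLE of mu is the sample mean,
# and the per-observation Fisher information is I(mu) = 1/sigma^2.
samples = rng.normal(mu, sigma, size=(reps, n))
mle = samples.mean(axis=1)

empirical_var = mle.var()
# Inverse of the total information from n observations: (n * I(mu))^{-1}
predicted_var = 1.0 / (n * (1.0 / sigma**2))

print(empirical_var, predicted_var)  # both close to sigma^2 / n = 0.18
```

The Monte Carlo variance of the MLE matches the inverse total information, not any sum of score variances along an optimization path.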


My confusion is further compounded by the accepted answer to a similar question: "[...] One often finds the maximum likelihood estimator (mle) of $\theta$ by solving the likelihood equation $\dot\ell(\theta)=0$. When the Fisher information, as the variance of the score $\dot\ell(\theta)$, is large, then the solution to that equation will be very sensitive to the data, giving a hope for high precision of the mle. [...]"

But if the solution were very sensitive to our data (sample), wouldn't there be a greater risk that we had the wrong data and got a wildly incorrect solution, and therefore lower MLE precision?

profPlum
  • Your formula for the MLE is not very clear. What does the $dt$ do there? – Sextus Empiricus Feb 16 '24 at 10:48
  • About your second part. It is not the solution that is sensitive to the data, but the $l'(\theta)$ that is sensitive. The distribution of $l'(\theta)$ will have a large variance. The solution of $l'(\hat\theta) = 0$ will select a different $\hat\theta \approx \theta$ that is closer to the true $\theta$ if small changes in $\theta$ will have great influence on $l'$. – Sextus Empiricus Feb 16 '24 at 11:56
  • @SextusEmpiricus It's the optimization step size. – profPlum Feb 16 '24 at 14:32
  • that right hand side is equal to the steps in the gradient descent, and it is not equal to the MLE. – Sextus Empiricus Feb 16 '24 at 15:08
  • Aside from the duplicate, you may consider the distribution of $\hat{\theta}$ as a function of different $\theta$. A graph like here: https://i.stack.imgur.com/LfQd6.png (from this question). If the distributions $f(\hat{\theta}|\theta)$ are narrow, then this causes steeper changes in the direction of $\theta$. I.e. there is a relation between $\partial f(\hat{\theta}|\theta)/\partial \hat{\theta}$ and $\partial f(\hat{\theta}|\theta)/\partial {\theta}$. A smaller variance of $\hat\theta$ relates to larger steps in gradient descent. – Sextus Empiricus Feb 16 '24 at 15:15
  • ... that was one alternative intuitive reasoning that I had in mind for that duplicate question. But I preferred to express it in terms of the distribution of the data $f(x_1,x_2,\dots,x_n|\theta)$ instead of the distribution of the estimate $f(\hat\theta|\theta)$, since the Fisher information matrix is a property of the former and not the latter. But if you like, the latter might be a good intuitive shortcut to think about it. A smaller variance is a larger rate of change of the density function. – Sextus Empiricus Feb 16 '24 at 15:21
  • @SextusEmpiricus Isn't the MLE estimate the sum of the corresponding gradient ascent steps? That's my question; if we need to integrate (via gradient ascent) the score to get the MLE, and if the score is high variance then why wouldn't the MLE also be high variance (given variance sum law). – profPlum Feb 16 '24 at 16:59
  • I missed the summation. Still, it is not so clear to describe the MLE in that way. The terms in the sum of the gradient steps depend on each other and the score approaches zero as soon as the sum gets close to the MLE. A good way to see that the reasoning is problematic is that the expression $\sum_{t=0}^{t=s} I(\theta_t)$ is a sum of terms with a positive lower limit and diverges for $s \to \infty$. – Sextus Empiricus Feb 16 '24 at 20:10
  • @SextusEmpiricus I realized it diverges too, but technically it's not really infinite. Once it reaches the MLE estimate, the remainder of the summation equals 0. Since the remaining terms are all 0, they are not even close to independent, and instead of a distribution you get Var(0)=0. And while I realize they may not be independent in general, I believe it's an ok approximation. – profPlum Feb 16 '24 at 21:39
  • With or without that expression for the MLE, and the application of the variance sum law, making sense (you will never reach the MLE in a finite number of steps, you can only get closer), in any case you have that when we scale a distribution such that the variance becomes smaller, the gradient becomes bigger. So smaller variance/scale relates to bigger gradient. – Sextus Empiricus Feb 16 '24 at 22:42
  • @SextusEmpiricus I'm not asking about gradient directly. I'm asking how if the MLE estimate is a sum of high variance r.v.s (the score values) is the sum expected to be low variance? – profPlum Feb 26 '24 at 23:13
  • If you make the question focussed about that sum then it is different from the duplicate. The expression is incorrect because the steps are not independent and you evaluate them for a single fixed value of $x$. The score will eventually reach zero in the steps. The Fisher information matrix that you substitute into the equation is the variance of the derivative of the score for a random value of $x$. – Sextus Empiricus Feb 27 '24 at 06:14
  • You can not make that substitution as also the convergence of the expression shows. What you computed with $$ \sum^\infty_{t=0}Var_x({d\over d\theta_t} log(L(\theta_t|x))) = \sum^\infty_{t=0}I(\theta_t)$$ is the value when $\theta_t$ are kept fixed and you draw a new random $x$. – Sextus Empiricus Feb 27 '24 at 06:18
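The point in the last two comments, that for a fixed sample the gradient-ascent steps are dependent and the score shrinks to zero, can be sketched numerically. This is my own illustration, not from the comments: a normal model with known $\sigma$, where the score is $\sum_i (x_i-\mu)/\sigma^2$ and the MLE is the sample mean; `dt`, `mu_true`, `sigma`, and `n` are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, n = 2.0, 3.0, 50
x = rng.normal(mu_true, sigma, n)  # ONE fixed sample

# Score of the normal log-likelihood in mu (sigma known):
# d/dmu log L(mu | x) = sum(x - mu) / sigma^2
def score(mu):
    return (x - mu).sum() / sigma**2

dt = 0.05      # step size ("dt" in the question)
theta = 0.0    # gradient-ascent start
steps = []
for _ in range(2000):
    g = score(theta)
    steps.append(dt * g)
    theta += dt * g

# For a fixed x the steps are dependent and shrink toward zero, so their
# sum converges to the MLE (the sample mean); it does not diverge the way
# the sum of Fisher informations over fixed theta_t values would.
print(theta, x.mean(), abs(steps[-1]))
```

Each term $I(\theta_t)$ in the question's sum is instead the variance of a step over *fresh* draws of $x$ with $\theta_t$ held fixed, which is a different quantity from the variance of this convergent path.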
