
Someone asked a question on Stack Overflow noting a difference between Minitab and R (glm) results for the variance-covariance matrix of the parameters of a probit-link binomial model.

  • Minitab's algorithm is not well described, but fitting the same model in glmmTMB (which uses general MLE rather than IRLS, and inverts the Hessian estimated from finite differences of the (automatically generated) gradient at the MLE) gives very close answers to the Minitab result.
  • The coefficient estimates are very similar.
  • glm() instead uses the QR decomposition based on the weighted least-squares solution at the last step of an IRLS algorithm (see here, here).
  • The data set is very small (2 parameters estimated from 4 observations), so I'd expect any asymptotic approximations to be pretty poor.
  • Refitting the same model with the canonical logit link instead of the probit gives very similar answers between glm (IRLS) and glmmTMB (MLE/Hessian); a sketch of this glm/glmmTMB comparison, with made-up data, follows this list.
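The original data aren't included in the post, so the sketch below uses made-up counts of the same shape (4 observations, 2 parameters) just to show how the glm()/glmmTMB() covariance comparison for the probit and logit links can be run (glmmTMB is assumed to be installed):

```r
## Made-up 4-observation binomial data (not the original data set);
## compare glm()'s IRLS-based vcov with glmmTMB's Hessian-based vcov
## under the probit and the canonical logit link.
library(glmmTMB)

dat <- data.frame(x         = c(-1, 0, 1, 2),
                  successes = c( 2, 5, 8, 9),
                  failures  = c( 8, 5, 2, 1))

for (lnk in c("probit", "logit")) {
  f_glm <- glm(cbind(successes, failures) ~ x,
               family = binomial(link = lnk), data = dat)
  f_tmb <- glmmTMB(cbind(successes, failures) ~ x,
                   family = binomial(link = lnk), data = dat)
  cat("link =", lnk, "\n")
  print(vcov(f_glm))        # expected-information (IRLS) covariance
  print(vcov(f_tmb)$cond)   # Hessian-based (observed-information) covariance
}
```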

I know that the Fisher information matrix is equivalent to the Hessian only in the case of a canonical link (see here); I believe that optimization based on these two approaches reaches the same point, but perhaps the estimates of the covariance matrix still differ?
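To spell out why the canonical link makes the two coincide (standard exponential-family notation; this derivation is mine, not part of the original question): with a canonical link the linear predictor equals the natural parameter, $\eta_i=\theta_i=x_i^T\beta$, so
$$\ell(\beta)=\sum_i \frac{y_i\,x_i^T\beta-b(x_i^T\beta)}{\phi}+\text{const}, \qquad \frac{\partial^2\ell}{\partial\beta\,\partial\beta^T}=-\frac{1}{\phi}\sum_i b''(x_i^T\beta)\,x_i x_i^T,$$
which does not involve $y$: the Hessian equals its own expectation, so observed and expected information coincide. With a non-canonical link such as the probit, the second derivative retains terms in $y-\mu$.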

Ben Bolker
  • Note that there is some difference in the estimates (~0.00013 for b1). While that difference may be negligible for the point estimates, could it be enough to produce a noticeable difference in the variance estimates? – Cliff AB Dec 17 '23 at 18:30
  • Maybe, but the differences seemed too large to be accounted for by that variation (not a proof or even a demonstration, just a feeling). (I haven't checked the glm() vs glmmTMB() differences in point estimates for the logit case, where the covariance matrices are very similar ...) – Ben Bolker Dec 17 '23 at 18:30
  • Yeah, I tried just changing epsilon=1e-20 for glm, and while this moves the point estimates significantly closer to Minitab's in terms of relative difference, it moves the vcov differences by only a very small amount in relative terms. So I don't think it's just differences in precision. – Cliff AB Dec 17 '23 at 18:41

1 Answer


The likelihood score contribution of observation $i$ in a glm is $$U_i= x_i\frac{d\mu_i}{d\eta_i}V^{-1}(\mu_i)(y_i-\mu_i), \qquad \sum_i U_i=D^TV^{-1}S,$$ where $D=\partial\mu/\partial\beta$, $V$ is the diagonal matrix of variance functions, and $S=y-\mu$. The derivative of this product has three components, two of which are multiples of $S$ and so have mean zero; the third is $D^TV^{-1}D$. The expected Fisher information is thus $D^TV^{-1}D$, but the observed Fisher information (the negative Hessian of the log-likelihood) is more complicated.
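Written out more explicitly for the total score $U=\sum_i U_i=D^TV^{-1}S$ (a schematic expansion, treating the derivatives of $D$ and $V$ componentwise; this is implicit in the paragraph above):
$$-\frac{\partial U}{\partial\beta^T}=D^TV^{-1}D\;-\;\frac{\partial D^T}{\partial\beta}V^{-1}S\;+\;D^TV^{-1}\frac{\partial V}{\partial\beta}V^{-1}S,$$
and since $E[S]=0$ the last two terms vanish in expectation, leaving the expected information $D^TV^{-1}D$; the observed information keeps them, and for a non-canonical link such as the probit they do not cancel.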

R (following GLIM and McCullagh & Nelder) uses the expected information. If glmmTMB uses the Hessian directly then it will be getting the observed Fisher information.
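A numerical check of this distinction (my own sketch with made-up data, not from the answer; the numDeriv package is assumed to be available): solve(vcov(fit)) recovers the expected information that glm()'s IRLS fit uses, while a finite-difference Hessian of the negative log-likelihood at the same coefficients gives the observed information that a Hessian-based fit would report.

```r
## Expected vs observed information for a probit binomial glm
## (made-up data; requires the numDeriv package).
library(numDeriv)

dat <- data.frame(x         = c(-1, 0, 1, 2),
                  successes = c( 2, 5, 8, 9),
                  failures  = c( 8, 5, 2, 1))
fit <- glm(cbind(successes, failures) ~ x,
           family = binomial(link = "probit"), data = dat)

## Negative log-likelihood as a function of the coefficients
negll <- function(beta) {
  p <- pnorm(model.matrix(fit) %*% beta)
  -sum(dbinom(dat$successes, dat$successes + dat$failures, p, log = TRUE))
}

I_exp <- solve(vcov(fit))           # expected information, D' V^{-1} D
I_obs <- hessian(negll, coef(fit))  # observed information (numeric Hessian)

solve(I_exp)  # glm()'s covariance matrix (expected information)
solve(I_obs)  # covariance based on the observed information
```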

The point estimates are the same either way: they are the values that solve $\sum_i U_i(\beta)=0$. The variance estimates will not be identical; they will be different consistent estimates of the same true value.

Efron and Hinkley (1978, Biometrika) argued that the observed information should give better confidence intervals than the expected information. The difference is usually small, and in most settings people use whichever one is easier to compute -- just as you see here, where the expected information is easier to compute in the context of glm's IRLS fit and the observed information is easier in the context of automatic-differentiation software designed for more general MLEs.

Thomas Lumley