
I want to compute the variance-covariance matrix manually and then take the square roots of its diagonal elements to derive the coefficient standard errors, as in:

sqrt(diag(vcov(lm_obj)))

My code:

dummy_data <- data.frame(
     x1 = rnorm(100, 0, 1),
     x2 = rnorm(100, 0, 1),
     y = round(runif(100, 0, 1), 0)
)


dummy_glm <- lm(y ~ x1 + x2,
                 data = dummy_data
                )

summary(dummy_glm)

B <- as.matrix(dummy_glm$coefficients, ncol = 1)

X <- as.matrix(cbind(rep(1, times = 100), dummy_data[, c('x1', 'x2')]), ncol = 3)

## manually do regression calc

XtX <- t(X) %*% X

det(XtX) # > 0 hence there is inverse

XtX_inv <- solve(XtX)

XtX_inv_diag <- diag(XtX_inv)

# standard errors for each coefficient

st_err_B <- sqrt(XtX_inv_diag)

st_err_B_alternative <- sqrt(diag(vcov(dummy_glm)))

vcov(dummy_glm) == XtX_inv

I found that my calculation of XtX_inv does not match vcov(dummy_glm). Could you please give me a hint about my mistake?

  • Hint: Don't you think the standard errors of the estimates ought to have something to do with the response vector y? – whuber Aug 25 '17 at 16:50
  • Sounds reasonable. I think I forgot to add the residuals: XtX_inv <- solve(XtX) * var(dummy_glm$residuals). In this case I get very close (but not perfect) to the vcov() result, but I guess there is something hidden that has something to do with the number of coefficients... – Alexey Burnakov Aug 25 '17 at 17:01
  • That's right. You might find it quickest to consult the code rather than look up the formula. vcov.lm refers to summary.lm which shows the role of rdf ("residual degrees of freedom") in the estimate R makes. – whuber Aug 25 '17 at 17:02
  • I got it. XtX_inv <- solve(XtX) * sum(dummy_glm$residuals ^ 2) / (nrow(dummy_data) - ncol(dummy_data) + 1 - 1) # n - m - 1 (mean squared error). So it is not just a variance: under the assumption that the residuals have zero expectation, we compute the mean of the squared errors while accounting for the number of coefficients. Perfect fit this time. Thanks for the help! – Alexey Burnakov Aug 25 '17 at 17:10
  • Very good! That's basically right, but if you are ever inclined to use such a formula, you will need to take more care in counting the variables. Two of the traps awaiting you are (1) dealing with collinear variables--many of which R will automatically detect and remove--and (2) counting categorical variables. That's why the variable count is originally computed by the fitting procedure (such as lm) and passed along to other procedures like summary.lm and vcov.lm. You will notice, too, that solve is not invoked: for numeric accuracy, R uses a QR decomposition performed by lm. – whuber Aug 25 '17 at 17:18
  • I never knew that. Thank you for this insight. I took this dive into the math not just for student homework: I actually wanted to show a Python colleague of mine how to compute the covariance matrix, since they do not have any of vcov.lm, lm, or summary.lm. It appears I learned a lot myself. – Alexey Burnakov Aug 25 '17 at 17:22
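
For reference, here is a minimal sketch that pulls the comments together: the QR-based route mentioned above, reusing the X and dummy_glm objects defined in the question. It only mirrors the idea behind summary.lm, not its exact internals.

qr_X <- qr(X)                        # X = QR with R upper triangular
XtX_inv_qr <- chol2inv(qr.R(qr_X))   # (R'R)^{-1} = (X'X)^{-1}

rdf <- df.residual(dummy_glm)        # residual degrees of freedom, n - p
sigma2 <- sum(residuals(dummy_glm)^2) / rdf   # estimate of the error variance

all.equal(unname(sigma2 * XtX_inv_qr), unname(vcov(dummy_glm)))  # TRUE up to numerical tolerance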

1 Answer


Solution for linear regression (lm):

XtX_inv <- solve(XtX) * sum(dummy_glm$residuals ^ 2) / (nrow(dummy_data) - ncol(dummy_data) + 1 - 1) # (X'X)^{-1} * RSS / (n - p)
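
As a check, here is the same quantity written with an explicit residual-degrees-of-freedom term, reusing dummy_glm and XtX from the question (== is too strict for floating-point comparisons, so all.equal is used instead):

n <- nrow(dummy_data)
p <- length(coef(dummy_glm))                      # number of estimated coefficients, here 3
sigma2 <- sum(residuals(dummy_glm)^2) / (n - p)   # residual variance estimate
vcov_manual <- sigma2 * solve(XtX)

all.equal(unname(vcov_manual), unname(vcov(dummy_glm)))   # TRUE up to numerical tolerance
sqrt(diag(vcov_manual))                                   # coefficient standard errors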

Solution for logistic regression (glm(..., family = binomial(link = 'logit'))):

pi <- dummy_glm$fitted.values

w <- pi * (1 - pi)

v <- diag(w, length(w), length(w))

XtX_inv <- solve(t(X) %*% v %*% X)

Both seem to match the results of vcov() almost perfectly.
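
For completeness, a runnable sketch of the logistic case on the same dummy data, assuming the X matrix from the question; the object name dummy_logit is only for illustration:

dummy_logit <- glm(y ~ x1 + x2,
                   data = dummy_data,
                   family = binomial(link = 'logit'))

pi_hat <- fitted(dummy_logit)          # fitted probabilities
W <- diag(pi_hat * (1 - pi_hat))       # IRLS weights on the diagonal

vcov_manual_logit <- solve(t(X) %*% W %*% X)   # (X'WX)^{-1}

all.equal(unname(vcov_manual_logit), unname(vcov(dummy_logit)))  # TRUE up to numerical tolerance
sqrt(diag(vcov_manual_logit))          # standard errors of the logistic coefficients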
