
I would like to estimate the variance of the error term in an ordinary linear regression model. The obvious estimate is the sample variance of the residuals, but this estimate consistently underestimates the true value. A simple simulation in R confirms this:

    # Design: 10 observations in each of two groups
    x = rep(0:1, c(10, 10))
    # Simulate data with error variance 1, fit the model, and take
    # the sample variance of the residuals; repeat 1000 times
    a = replicate(1000, var(residuals(lm(rnorm(20, mean = x, sd = 1) ~ x))))
    mean(a)
    0.9418632

This of course makes sense: fitting the linear model minimizes the residual sum of squares, so the residuals are systematically smaller than the errors produced by the true generating model. However, is there a good way to get an unbiased estimate of this variance? One possibility would be some kind of cross-validation scheme, but is there any way to calculate it analytically?

Raivo Kolde
  • Hint: have you ever noticed the "residual standard error" part of lm's summary output? It is not the square root of the variance of the residuals. It's computed with the command resvar <- rss/(n - p) in summary.lm, where p is the "degrees of freedom" reported, rather than as rss/(n - 1), which is what var is doing. – whuber Sep 11 '12 at 18:40
  • Cool! At least on the example above this works. Thanks! – Raivo Kolde Sep 11 '12 at 18:47
  • It is right. It will work for any regression problem. – Michael R. Chernick Sep 11 '12 at 18:54
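
Following whuber's hint above: dividing the residual sum of squares by the residual degrees of freedom n - p, as summary.lm does, rather than by n - 1 removes the bias. Here n = 20 and p = 2, so the expected sample variance of the residuals is (n - p)/(n - 1) = 18/19 ≈ 0.947 times the true value, consistent with the simulated 0.94. A minimal sketch that reuses the simulation from the question with this correction:

    x = rep(0:1, c(10, 10))
    # Divide the residual sum of squares by the residual degrees of
    # freedom (n - p) instead of n - 1; this is the square of the
    # "residual standard error" that summary.lm reports
    b = replicate(1000, {
      fit = lm(rnorm(20, mean = x, sd = 1) ~ x)
      sum(residuals(fit)^2) / df.residual(fit)
    })
    mean(b)  # close to the true error variance of 1
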

0 Answers