I would like to estimate the variance of the error term in the ordinary linear regression model. The obvious estimate is the sample variance of the residuals, but this estimate consistently underestimates the true value. A simple simulation in R confirms this:
x = rep(0:1, c(10, 10))   # binary predictor: two groups of 10
a = replicate(1000, var(residuals(lm(rnorm(20, mean = x, sd = 1) ~ x))))
mean(a)                   # sample variance of residuals, averaged over 1000 fits
0.9418632
This of course makes sense: when fitting a linear model we are minimizing exactly this error, so the true generating model will always produce somewhat larger errors than the fitted one. However, is there a good way to get an unbiased estimate of this variance? One possibility is some kind of cross-validation scheme, but is there a way to calculate it analytically?
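For reference, the size of the bias can be worked out directly; a short calculation, assuming the model includes an intercept (so the residuals sum to zero and var(residuals) equals RSS/(n - 1)) and using the standard fact that E[RSS] = sigma^2 * (n - p):

n <- 20; p <- 2; sigma2 <- 1   # sample size, coefficients (intercept + slope), true error variance
sigma2 * (n - p) / (n - 1)     # expected value of var(residuals): 18/19, about 0.947

This agrees with the simulated mean of about 0.942 up to Monte Carlo error.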
Have you looked at the residual standard error in lm's summary output? It is not the square root of the variance of the residuals. It's computed with the command resvar <- rss/(n - p) in summary.lm, where p is the "degrees of freedom" reported, rather than as rss/(n - 1), which is what var is doing. – whuber Sep 11 '12 at 18:40
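To illustrate what the comment describes, a minimal check against the simulated design above (fit is a hypothetical name for a single fitted model; summary(fit)$sigma and df.residual are standard R accessors):

fit <- lm(rnorm(20, mean = x, sd = 1) ~ x)        # one replication of the simulation
rss <- sum(residuals(fit)^2)
c(summary(fit)$sigma^2, rss / df.residual(fit))   # identical: both compute RSS/(n - p)

Since df.residual(fit) returns n - p, this is the unbiased estimator the comment points to.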