3

Here is a simple example in R using the iris dataset:

summary(lm(Sepal.Length ~ Species, data=iris))

Output is:

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals: Min 1Q Median 3Q Max -1.6880 -0.3285 -0.0060 0.3120 1.3120

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 *** Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 *** Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5148 on 147 degrees of freedom Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135 F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16

Why is the Std. Error for Speciesversicolor the same as the Std. Error for Speciesvirginica (0.1030)? I cannot get my head around that...

1 Answers1

8

The default linear model implied by lm assumes that the residual standard deviation in sepal length is the same for all plants. Under this assumption, the residual standard deviation for any plant is estimated as 0.5148, and so the standard error for the difference between two groups is

$$ 0.5148\times\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$$

For all of the pairwise comparisons in this dataset, $n_1=n_2=50$, and so the standard error is the same $(0.5148\times\sqrt{2/50}=0.103)$.

If you were to estimate the difference between setosa and versicolor separately (without the virginica data) then it would have a different standard error based only on the residual standard deviation in those species. As it is in your model, the setosa vs versicolor test is being influenced by the variation in the viriginca group! This isn't necessarily wrong, but it's important to understand this and to be happy with the assumption that the pooled standard deviation can be found in this way.

This is partly why we say that equal residual variances across all groups is an important assumption for linear regression. If its not true, even in some groups, it can affect all of the tests based on that model.

Is it OK in this case? The standard deviations across groups do vary here, but not enormously. There may be a case for a log-transformation before estimating this model which would stabilise the variances across groups, or the use of a different model.

> aggregate(data=iris, Sepal.Length ~ Species, sd)
     Species Sepal.Length
1     setosa    0.3524897
2 versicolor    0.5161711
3  virginica    0.6358796

enter image description here

George Savva
  • 2,290
  • Good point. I will clarify that I meant the linear model implied by lm. – George Savva Mar 08 '23 at 16:15
  • 2
    I would say the default homoskedastic errors assumption implies that the residual standard deviation in sepal length is the same for all plants. You can estimate a linear model with heteroskedastic robust standard errors which will allow the variance of the error term to vary by plant (and hence you will almost certainly then obtain different standard errors for your coefficients).

    (I previously posted this comment, but it had typos.)

    – Matthew Gunn Mar 08 '23 at 17:07