Why do the calculated means in lm() differ from emmeans()?

Question

Let's assume we have the data:

> d <- as.data.frame(list(y = c(10.9510128075768, 9.61076281828162, 9.71566933820043, 
10.8574097780798, 11.7196272991206, 10.2700549009372, 9.57781599021236, 
8.81088670514042, 9.66896702112099, 9.06017067348998, 9.74106741688122, 
10.3943791682216, 9.14814290797614, 12.6491668810949, 10.1560116756651, 
11.1302072674549, 7.71087602015989, 10.7410011571954, 8.68375483954844, 
10.9198036776091, 13.398130155452, 12.5924714207302, 14.3242586301773, 
12.2987683307531, 12.4193856957595, 11.9989278189746, 12.3318213932466, 
13.9451849533731, 13.4337021495452, 14.005159217677, 12.6098813359463, 
13.3763702917746, 13.2441649244865, 11.5737426576175, 14.7784292874755, 
13.1344476609337, 13.7655989991579, 13.955136676909, 12.9494342985577, 
12.694184580233, 13.8936737024255, 11.9527018509387, 14.9713373862242, 
12.6163678937116, 14.654145302275, 14.5122126939506, 13.0829657335859, 
13.5672209148515, 11.9754515204655, 13.3230065030224, 14.0436124583562, 
13.0990784868972, 12.5458630908456, 12.3442181475496, 12.9640775773775, 
14.0691614606768, 12.5160250696987, 12.8789898886726, 11.7058599961792, 
13.4943128360149, 4.30790152011415, 4.49704100940278, 3.81470273089736, 
1.13021120979739, 3.48202950412376, 3.4561356033012, 2.64659971417009, 
3.17048947094798, 2.1359640458731, 3.67923077401566), x1 = c("T1", 
"T1", "T1", "T1", "T1", "T1", "T1", "T1", "T1", "T1", "T1", "T1", 
"T1", "T1", "T1", "T1", "T1", "T1", "T1", "T1", "T2", "T2", "T2", 
"T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", 
"T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", 
"T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", 
"T2", "T2", "T2", "T2", "T3", "T3", "T3", "T3", "T3", "T3", "T3", 
"T3", "T3", "T3"), x2 = c(0.672898985346896, -0.569082185143952, 
0.632549243829518, 2.36443492906985, 0.665718635268836, 1.7327500422091, 
1.94658564019779, 1.00439870432637, 0.647677694450151, 0.470304490866496, 
1.7395892255748, -0.0634574154828123, 1.2462108435364, 0.710500633435688, 
-1.26488935648794, -0.408850456073194, 1.91601932879257, 0.808721049464801, 
1.80328321613365, 2.8874744633086, 13.4738811811091, 12.677268492313, 
12.3799626865667, 11.8072015735427, 13.5778917949044, 12.5962341093185, 
10.8264230591286, 11.8443574651097, 10.0810901797302, 11.8047411538894, 
9.40767233005401, 13.3140021671981, 11.3644569989679, 11.5700211613058, 
11.830681667698, 12.6122181739891, 12.6783401772227, 12.5679519724717, 
11.4274573960739, 10.6367087437217, 11.6112777556621, 12.2779141324505, 
11.176918878428, 11.9311590655215, 10.8323376738702, 11.9916909857844, 
12.1288554015974, 11.854124371539, 11.8360890432639, 13.7635520027849, 
12.7625865124183, 13.1114310807306, 11.0767930471702, 12.164341838428, 
13.1548251870973, 11.9434785754735, 9.87063935176535, 12.3448457620995, 
10.0950445544145, 11.1888298468598, 4.3240043212996, 3.61563684930267, 
4.09166895553536, 3.30660486151363, 2.88984123751714, 2.07568722687272, 
4.59291375371919, 3.04501059812188, 2.28487159933212, 3.86522309971714
), ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 
13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 
20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 
33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L)))

I'm interested in values of y at X1=T1...T3 adjusted for X2.

The means taken from lm():

> summary(lm(y~x1+x2,data=d))
Call:
lm(formula = y ~ x1 + x2, data = d)
Residuals:
     Min       1Q   Median       3Q      Max 
-2.39798 -0.60836  0.01702  0.73175  2.58139
Coefficients:
            Estimate Std. Error t value            Pr(>|t|)

(Intercept) 10.04356    0.24784  40.524 <0.0000000000000002 ***
x1T2         2.72726    1.33580   2.042              0.0452 *

x1T3        -6.92770    0.48213 -14.369 <0.0000000000000002 ***
x2           0.03408    0.11955   0.285              0.7765

Signif. codes:  0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9859 on 66 degrees of freedom
Multiple R-squared:  0.9265,    Adjusted R-squared:  0.9232 
F-statistic: 277.3 on 3 and 66 DF,  p-value: < 0.00000000000000022

are:

> with(as.data.frame(t(coef(lm(y~x1+x2,data=d)))), data.frame(x1T1=`(Intercept)`, x1T2=`(Intercept)`+x1T2, x1T3=`(Intercept)`+x1T3))
      x1T1     x1T2    x1T3
1 10.04356 12.77083 3.11586

This differs from raw means, so I assume this is corrected for X2:

> tapply(d$y, d$x1, mean)
       T1        T2        T3 
10.075839 13.175987  3.232031

So why the LS-Means are different from the lm()? Isn't lm() corrected for X2?

emmeans(lm(y~x1+x2,data=d), "x1")
 x1 emmean    SE df lower.CL upper.CL
 T1  10.30 0.820 66     8.66    11.94
 T2  13.03 0.542 66    11.95    14.11
 T3   3.37 0.585 66     2.20     4.54

So what are the means taken from lm(), if not raw means and not LS-means? I am interested in the means of y at each time point, corrected for x2

I think I found the answer: Why are emmeans package means different than regular means?

So lm() gives me the ordinary marginal means, based on data, and EM - estimated ones, based on a model. But which one are better in my case? If you are asked to report adjusted means, would you pick those form lm or emmeans? I believe SAS users choose LS-means (emmeans), while R users lm()? Or is there any good guideline?

I don't think that @RichardHardy's posting has any relevance to this question; because this question involves a single factor and a covariate, and Hardy's posting expresses concerns about the way that EMMs average over factor levels when there are two or more factors involved. — Russ Lenth, Dec 10 '19 at 02:58
@rvl, thanks; I did not intend it as a duplicate but as a side reference. — Richard Hardy, Dec 10 '19 at 06:20
RE: "Isn't lm() corrected for X2?" It is not corrected for X2, but you set X2 to zero. — Frans Rodenburg, Dec 11 '19 at 05:35

score 5 · Accepted Answer · answered Dec 10 '19 at 01:10

EMMs are a linear function of the regression coefficients. You can always tell what that linear function is by looking at the linfct slot of the object that emmeans() creates:

> mod = lm(formula = y ~ x1 + x2, data = d)
> (emm = emmeans(mod, "x1"))
 x1 emmean    SE df lower.CL upper.CL
 T1  10.30 0.820 66     8.66    11.94
 T2  13.03 0.542 66    11.95    14.11
 T3   3.37 0.585 66     2.20     4.54

Confidence level used: 0.95 
> emm@linfct
     (Intercept) x1T2 x1T3       x2
[1,]           1    0    0 7.551851
[2,]           1    1    0 7.551851
[3,]           1    0    1 7.551851

Note that the value for x2 is just its overall mean:

> mean(d$x2)
[1] 7.551851

In fact, in this example, the EMMs are just predictions for the three x1 levels at the mean of x2:

> predict(mod, newdata = data.frame(x1 = c("T1", "T2", "T3"), x2 = 7.551851))
        1         2         3 
10.300902 13.028164  3.373198

These are also known as the "adjusted means" in the context of analysis of covariance.

Why do the calculated means in lm() differ from emmeans()?

1 Answers1