R linear regression categorical variable "hidden" value

Question

This is just an example that I have come across several times, so I don't have any sample data. Running a linear regression model in R:

a.lm = lm(Y ~ x1 + x2)

x1 is a continuous variable. x2 is categorical and has three values e.g. "Low", "Medium" and "High". However the output given by R would be something like:

summary(a.lm)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.521     0.20       1.446   0.19        
x1            -0.61     0.11       1.451   0.17
x2Low         -0.78     0.22       -2.34   0.005
x2Medium      -0.56     0.45       -2.34   0.005

I understand that R introduces some sort of dummy coding on such factors (x2 being a factor). I'm just wondering, how do I interpret the x2 value "High"? For example, what effect does "High" x2s have on the response variable in the example given here?

I've seen examples of this elsewhere (e.g. here) but haven't found an explanation I could understand.

You may get a good answer here, but I'm going to flag this for migration to stats.SE, as the answer to this question essentially boils down to understanding how linear regression works. — joran, Apr 15 '12 at 18:53
Yeah that's fair enough. Would it be better if I deleted it and moved it myself? Or is that unnecessary? — , Apr 15 '12 at 19:02
You shouldn't need to do anything. I flagged it, but it may take an hour or two before a mod gets to it, it being a Sunday and all. — joran, Apr 15 '12 at 20:15
I won't provide an answer here, because the question will be moved. But you can try a few things to understand what's going on: 1. Run lm(Y ~ x1 + x2 - 1); the "-1" removes the intercept. 2. Use relevel to change the reference category of x2. — Manoel Galdino, Apr 15 '12 at 21:06

DWin · Accepted Answer · 2016-01-21T18:01:13.013

Q: "... how do I interpret the x2 value "High"? For example, what effect does "High" x2s have on the response variable in the example given here?"

A: You have no doubt noticed that there is no mention of x2="High" in the output. That is because "High" has been chosen as the "base case" (the reference level). You supplied a factor with the default ordering of its levels, and although L/M/H is the order that comes most naturally to the human mind, the default is alphabetical: "H" sorts before both "L" and "M", so R picked "High" as the base case.
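A quick way to see the default ordering is to build a factor from the three labels and inspect it. This is only a sketch: the vector below is made up, since the question has no sample data.

```r
# Hypothetical values; only the three labels come from the question
x2 <- c("Low", "Medium", "High", "Low", "High")
f  <- factor(x2)

levels(f)      # levels are sorted alphabetically, so "High" comes first
contrasts(f)   # default treatment coding: the "High" row is all zeros,
               # i.e. "High" has no dummy column of its own
```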

Since x2 was not an ordered factor, each of the reported coefficients is a contrast relative to x2="High": x2="Low" was estimated at -0.78 relative to x2="High", and x2="Medium" at -0.56. The Intercept is the estimated value of Y when x2="High" and x1=0. You probably want to re-run your regression after changing the order of the levels (but without making the factor ordered):

x2a = factor(x2, levels=c("Low", "Medium", "High"))

Then your 'Medium' and 'High' estimates, now reported relative to 'Low', will be more in line with what you expect.
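To see that reordering the levels changes only the parameterisation and not the fit, here is a minimal sketch on simulated data. The sample size, seed and effect sizes are invented for illustration and are not the asker's data.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n  <- 90
x1 <- rnorm(n)
x2 <- rep(c("Low", "Medium", "High"), each = n / 3)
# Made-up effects: Low = 0, Medium = 1, High = 2
Y  <- 1 + 0.5 * x1 + unname(c(Low = 0, Medium = 1, High = 2)[x2]) + rnorm(n)

fit_default <- lm(Y ~ x1 + x2)    # x2 becomes a factor with base level "High"
x2a <- factor(x2, levels = c("Low", "Medium", "High"))
fit_relevel <- lm(Y ~ x1 + x2a)   # base level is now "Low"

names(coef(fit_default))          # "(Intercept)" "x1" "x2Low" "x2Medium"
names(coef(fit_relevel))          # "(Intercept)" "x1" "x2aMedium" "x2aHigh"

# Same model either way: the fitted values agree, and High-vs-Low is
# the sign-flipped Low-vs-High estimate from the default fit
all.equal(fitted(fit_default), fitted(fit_relevel))
all.equal(unname(coef(fit_relevel)["x2aHigh"]),
          -unname(coef(fit_default)["x2Low"]))
```

relevel(factor(x2), ref = "Low"), as suggested in the comments, also makes "Low" the reference level.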

Edit: There are alternative coding arrangements (or, more accurately, arrangements of the model matrix). The default choice for contrasts in R is "treatment contrasts", which specify one factor level (or one particular combination of factor levels) as the reference level and report estimated mean differences for the other levels or combinations. You can, however, do without a reference level, either by forcing the Intercept to be 0 (not recommended), in which case each level gets its own coefficient, or by using one of the other contrast choices, some of which compare each level with the mean of the level means:

?contrasts
?C   # which also means you should _not_ use either "c" or "C" as variable names.

You can choose different contrasts for different factors, although doing so would seem to impose an additional interpretive burden. S-Plus uses Helmert contrasts by default, and SAS uses treatment contrasts but chooses the last factor level rather than the first as the reference level.
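The coding schemes mentioned above can be inspected directly with the contrast functions in base R's stats package. This is a sketch on a toy three-level factor, not a recommendation of any particular scheme.

```r
# Toy factor with the levels in their natural order
f <- factor(c("Low", "Medium", "High"), levels = c("Low", "Medium", "High"))

contr.treatment(levels(f))  # R's default for unordered factors: first level is the reference
contr.SAS(levels(f))        # treatment coding with the last level as the reference
contr.sum(levels(f))        # sum-to-zero coding: columns sum to zero over the levels
contr.helmert(levels(f))    # Helmert coding

# Set the coding for this one factor only; the global default is unchanged
contrasts(f) <- "contr.sum"
contrasts(f)
options("contrasts")        # shows the session-wide defaults
```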

That makes sense. I suppose obviously x2 couldn't have "no value" since it must be one of "High", "Medium" or "Low". Thanks for your answer. — , Apr 16 '12 at 12:42
