How to apply coefficient term for factors and interactive terms in a linear equation?

Question

Using R, I have fitted a linear model for a single response variable from a mix of continuous and discrete predictors. This is uber-basic, but I'm having trouble grasping how a coefficient for a discrete factor works.

Concept: Obviously, the coefficient of the continuous variable 'x' is applied in the form y = coefx(varx) + intercept but how does that work for a factor z if the factor is non-numeric? y = coefx(varx) + coefz(factorz???) + intercept

Specific: I have fitted a model in R as lm(log(c) ~ log(d) + h + a + f + h:a) where h and f are discrete, non-numeric factors. The coefficients are:

Coefficients:
              Estimate 
(Intercept)  -0.679695 
log(d)        1.791294 
h1            0.870735  
h2           -0.447570  
h3            0.542033   
a             0.037362  
f1           -0.588362  
f2            0.816825 
f3            0.534440
h1:a         -0.085658
h2:a         -0.034970 
h3:a         -0.040637

How do I use these to create the predictive equation:

log(c) =  1.791294(log(d)) + 0.037362(a) + h??? + f???? + h:a???? + -0.679695

Or am I doing it wrong?

I THINK the concept is that if the subject falls in categories h1 and f2, the equation becomes:

log(c) =  1.791294(log(d)) + 0.037362(a) +  0.870735  + 0.816825  + h:a???? + -0.679695

But I'm really not clear on how the h:a interactive term gets parsed. Thanks for going easy on me.

whuber · Accepted Answer · 2012-03-07T20:19:11.753

This is not a problem specific to R. R uses a conventional display of coefficients.

When you read such regression output (in a paper, textbook, or from statistical software), you need to know which variables are "continuous" and which are "categorical":

The "continuous" ones are explicitly numeric and their numeric values were used as-is in the regression fitting.
The "categorical" variables can be of any type, including those that are numeric! What makes them categorical is that the software treated them as "factors": that is, each distinct value that is found is considered an indicator of something distinct.

Most software will treat non-numerical values (such as strings) as factors. Most software can be persuaded to treat numerical values as factors, too. For example, a postal service code (ZIP code in the US) looks like a number but really is just a code for a set of mailboxes; it would make no sense to add, subtract, and multiply ZIP codes by other numbers! (This flexibility is the source of a common mistake: if you are not careful, or unwitting, your software may treat a variable you consider to be categorical as continuous, or vice-versa. Be careful!)

Nevertheless, categorical variables have to be represented in some way as numbers in order to apply the fitting algorithms. There are many ways to encode them. The codes are created using "dummy variables." Find out more about dummy variable encoding by searching on this site; the details don't matter here.
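To make the numeric-versus-factor distinction concrete, here is a minimal sketch in R. The data frame `d` and its `zip` column are made up for illustration; only `model.matrix()` and `factor()` are standard R, and the column names shown assume R's default treatment contrasts.

```r
# Hypothetical data: 'zip' holds numeric-looking codes (invented for this sketch).
d <- data.frame(zip = c(10001, 10001, 60601, 60601, 94105, 94105))

# Treated as continuous: a single column, so one slope would multiply the code itself.
X_numeric <- model.matrix(~ zip, data = d)

# Treated as a factor: one 0/1 dummy column per non-baseline code.
X_factor <- model.matrix(~ factor(zip), data = d)

colnames(X_numeric)  # "(Intercept)" "zip"
colnames(X_factor)   # "(Intercept)" "factor(zip)60601" "factor(zip)94105"
```

Comparing the two design matrices is a quick way to check how the software will actually treat a variable before fitting anything.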

In the question we are told that h and f are categorical ("discrete") values. By default, log(d) and a are continuous. That's all we need to know. The model is

$$\eqalign{ y &= \color{red}{-0.679695} & \\ &+ \color{RoyalBlue}{1.791294}\ \log(d) \\ &+ 0.870735 &\text{ if }h=h_1 \\ & -0.447570 &\text{ if }h=h_2 \\ &+ \color{green}{0.542033} &\text{ if }h=h_3 \\ &+ \color{orange}{0.037362}\ a \\ & -0.588362 &\text{ if }f=f_1 \\ &+ \color{purple}{0.816825} &\text{ if }f=f_2 \\ &+ 0.534440 &\text{ if }f=f_3 \\ & -0.085658\ a &\text{ if }h=h_1 \\ & -0.034970\ a &\text{ if }h=h_2 \\ & -\color{brown}{0.040637}\ a &\text{ if }h=h_3 \\ }$$

The rules applied here are:

The "intercept" term, if it appears, is an additive constant (first line).
Continuous variables are multiplied by their coefficients, even in "interactions" like the h1:a, h2:a, and h3:a terms. (This answers the original question.)
Any categorical variable (or factor) is included only for cases where the value of that factor appears.

For example, suppose that $\log(d)=2$, $h=h_3$, $a=-1$, and $f=f_2$. The fitted value in this model is

$$\hat{y} = \color{red}{-0.6797} + \color{RoyalBlue}{1.7913}\times (2) + \color{green}{0.5420} + \color{orange}{0.0374}\times (-1) + \color{purple}{0.8168} -\color{brown}{0.0406}\times (-1).$$

Notice how most of the model coefficients simply do not appear in the calculation, because h can take on exactly one of the three values $h_1$, $h_2$, $h_3$ and therefore only one of the three coefficients $(0.870735, -0.447570, 0.542033)$ applies to h and only one of the three coefficients $(-0.085658, -0.034970, -0.040637)$ will multiply a in the h:a interaction; similarly, only one coefficient applies to f in any particular case.
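The worked example can be reproduced in R with a few lines. This is only a sketch: the vector `b` is typed in by hand from the question's output (it is not extracted from a fitted model), and the variable names `log_d`, `h`, `a`, and `f` are chosen here for illustration.

```r
# Coefficients copied from the question's output.
b <- c("(Intercept)" = -0.679695, "log(d)" = 1.791294,
       h1 = 0.870735, h2 = -0.447570, h3 = 0.542033,
       a = 0.037362,
       f1 = -0.588362, f2 = 0.816825, f3 = 0.534440,
       "h1:a" = -0.085658, "h2:a" = -0.034970, "h3:a" = -0.040637)

# The example case: log(d) = 2, h = h3, a = -1, f = f2.
log_d <- 2; h <- "h3"; a <- -1; f <- "f2"

y_hat <- unname(
  b["(Intercept)"] +
  b["log(d)"] * log_d +
  b[h] +                      # only the coefficient for the observed level of h
  b["a"] * a +
  b[f] +                      # only the coefficient for the observed level of f
  b[paste0(h, ":a")] * a      # interaction: the matching coefficient multiplies a
)
y_hat  # 4.265026
```

Looking coefficients up by the observed level name mirrors the rule that only one coefficient per factor enters any particular calculation.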

score 8 · Answer 2 · answered Mar 07 '12 at 16:16

This is just a comment but it won't fit as such in the limited edit boxes we have at our disposal.

I like seeing a regression equation clearly written in plain text, as @whuber did in his reply. Here is a quick way to do this in R, with the Hmisc package. (I'll be using rms too, but that does not really matter.) Basically, it only assumes that a $\LaTeX$ typesetting system is available on your machine.

Let's simulate some data first,

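library(Hmisc)  # provides cut2() and latex()
library(rms)    # provides ols()

n <- 200
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
g1 <- gl(2, 100, n, labels=letters[1:2])  # two-level factor: "a", "b"
g2 <- cut2(runif(n), g=4)                 # four-level factor from quantile groups
y <- x1 + x2 + rnorm(n)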

then fit a regression model,

f <- ols(y ~ x1 + x2 + x3 + g1 + g2 + x1:g1)

which yields the following results:

Linear Regression Model

ols(formula = y ~ x1 + x2 + x3 + g1 + g2 + x1:g1)

                Model Likelihood     Discrimination    
                   Ratio Test           Indexes        
Obs      200    LR chi2     35.22    R2       0.161    
sigma 0.9887    d.f.            8    R2 adj   0.126    
d.f.     191    Pr(> chi2) 0.0000    g        0.487    

Residuals

    Min      1Q  Median      3Q     Max 
-3.1642 -0.7109  0.1015  0.7363  2.7342 

                   Coef    S.E.   t     Pr(>|t|)
Intercept           0.0540 0.2932  0.18 0.8541  
x1                  1.1414 0.3642  3.13 0.0020  
x2                  0.8546 0.2331  3.67 0.0003  
x3                 -0.0048 0.2472 -0.02 0.9844  
g1=b                0.2099 0.2895  0.73 0.4692  
g2=[0.23278,0.553)  0.0609 0.1988  0.31 0.7598  
g2=[0.55315,0.777) -0.2615 0.1987 -1.32 0.1896  
g2=[0.77742,0.985] -0.2107 0.1986 -1.06 0.2901  
x1 * g1=b          -0.2354 0.5020 -0.47 0.6396

Then, to print the corresponding regression equation, just use the generic latex function, like this:

latex(f)

Upon conversion of the dvi to png, you should get something like this:

[Image: the typeset regression equation for the model above]

IMO, this has the merit of showing how to compute predicted values depending on actual or chosen values for numerical and categorical predictors. For the latter, factor levels are indicated in bracket near the corresponding coefficient.
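If a LaTeX installation is not available, a rough plain-text equation can be pasted together from `coef()` in base R. This is a sketch of an alternative, not what `latex()` does; the simulated data, the `lm()` fit, and the names `dat`, `fit`, `pieces`, and `eqn` are all assumptions made for illustration.

```r
set.seed(42)
dat <- data.frame(x1 = runif(50), g1 = gl(2, 25, labels = c("a", "b")))
dat$y <- dat$x1 + rnorm(50)

fit <- lm(y ~ x1 + g1 + x1:g1, data = dat)
b <- coef(fit)

# One signed "coefficient*[term]" piece per non-intercept coefficient.
pieces <- sprintf("%+.4f*[%s]", b[-1], names(b)[-1])
eqn <- paste0("y = ", sprintf("%.4f", b[1]), " ", paste(pieces, collapse = " "))
eqn
```

Terms such as `[g1b]` are to be read as indicators (1 if g1 equals "b", 0 otherwise), which is the same convention the typeset version uses.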

+1 That's a nice capability. The syntax of terms like $+0.2099013{b}$, though, is potentially confusing: there is no evident relationship between this expression and the categorical variable g1, nor is it entirely evident that ${b}$ really stands for an indicator that $g_1=b$ rather than for the numerical value of $b$! (Here, $b$ really means "b"--the letter--which may be sufficient warning, but when the categories are coded by numbers, such as $0$ and $1$, watch out...) — whuber, Mar 07 '12 at 17:46
@whuber The above image has been cropped but there's sort of a footnote recalling that "{c} = 1 if subject is in group c, 0 otherwise" (the choice of c might be confusing in this particular case, because I choose two letters to represent g1 levels, but usually it's quite intuitive--and that's pure tex so we can still edit the source file afterwards). Attached is another summary where I altered g1 so that it is now a four-level factor. Yet, with 0/1 labels that might be more confusing. — chl, Mar 07 '12 at 19:35

Peter Ellis · Answer 3 · 2012-03-07T13:16:15.720

You can check that your "contrasts" are set to the default by running options() and looking for:

$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly"

If your unordered contrasts are set as contr.treatment (as they should be unless you've changed them), then the first level of each of your factors will be set as a baseline. You will only be given estimates for the coefficients in front of the dummy variables created for other levels of the factor. In effect, those coefficients will be "how different on average is the response variable at this level of the factor, compared to the baseline level of the factor, having controlled for everything else in the model".
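A small sketch of what treatment contrasts do to a factor. The factor `h` below is hypothetical, with levels "0" to "3" chosen so that the column names resemble those in the question; the contrast type is requested explicitly so the result does not depend on the current options().

```r
# Hypothetical factor; "0" sorts first, so it becomes the baseline level.
h <- factor(c("0", "1", "2", "3"))

X <- model.matrix(~ h, contrasts.arg = list(h = "contr.treatment"))
colnames(X)  # "(Intercept)" "h1" "h2" "h3"
X            # the row for level "0" has zeros in every dummy column
```

There is no column for the baseline level: its effect is carried by the intercept, which is why no coefficient is reported for it.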

I am guessing from your output that there is an h0 and an f0, which are the baseline levels for h and f (unless you have a non-default option for contrasts, in which case there are several possibilities; try ?contr.treatment for some help).

It's similar with the interaction. If my previous paragraph is correct, the estimate given for a will really be the slope for a when h=h0. The estimates given in the summary that apply to the interactions are how much that slope changes for different levels of h.

So in your example where h=h1 and f=f2, try:

log(c) =  1.791294(log(d)) + (0.037362 - 0.085658) (a) +  0.870735  + 0.816825  -0.679695
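The slope of a at each level of h follows the same pattern. A short sketch, assuming (as guessed above, not confirmed by the question) that a baseline level h0 exists; the coefficients are typed in from the question's output.

```r
a_main <- 0.037362                                           # slope of a at the baseline
a_by_h <- c(h1 = -0.085658, h2 = -0.034970, h3 = -0.040637)  # interaction estimates

# Effective slope of a at each level: main effect plus that level's interaction term.
slopes <- c(h0 = a_main, a_main + a_by_h)
slopes
```

For h1 this gives 0.037362 - 0.085658 = -0.048296, the bracketed coefficient of a in the equation above.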

Oh, and you can use predict() to do a lot of useful things too... if you actually want to predict something (rather than write out the equation for a report). Try ?predict.lm to see what predict() does to an object created by lm.
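As a sanity check, a hand-written equation can be compared against predict(). The sketch below uses simulated data (not the asker's), and the names `dat`, `fit`, and `new` are invented for illustration; h is listed before a in the formula so that the interaction coefficient is named "h1:a", as in the question.

```r
set.seed(1)
n <- 60
dat <- data.frame(
  a = runif(n),
  h = factor(rep(c("0", "1", "2"), times = 20))  # "0" is the baseline level
)
dat$y <- 1 + 2 * dat$a + rnorm(n)

fit <- lm(y ~ h + a + h:a, data = dat)

# Predict for a = 0.5 at level h = "1".
new <- data.frame(a = 0.5, h = factor("1", levels = levels(dat$h)))
by_predict <- unname(predict(fit, newdata = new))

# The same prediction assembled by hand from the coefficients.
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["h1"] + (b["a"] + b["h1:a"]) * 0.5)

all.equal(by_predict, by_hand)
```

If the two values disagree, the usual culprit is a misread baseline level or a non-default contrast setting.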

+1 (actually, I upvoted this a month ago & just happen to be back rereading it now) at any rate, it occurs to me that you recommend checking the contrast type by options(). You will have to scroll through a lot of junk to find what you need. You might try options()$contrasts, which will only output what you want. — gung - Reinstate Monica, Apr 10 '12 at 01:03
You know, I often answer CV questions right before I go to bed... — gung - Reinstate Monica, Apr 10 '12 at 02:48

Kaleb Coberly · Answer 4 · 2020-12-20T22:01:25.980

Rather than thinking of some of the coefficients being included and some not, resulting in a number of different equations depending on the values of the variables, another way to think about it is that all coefficients are included in a single equation. But, they are multiplied by either 1 or 0 depending on whether that condition is true.

That is, each possible value of each factor variable (i.e. discrete variable) appears in the final equation as either a 0 or a 1 depending on whether the variable has that value, and its coefficient is applied to it.

So, as already mentioned, each factor variable is split into dummy variables, one for each level in the factor (e.g. h becomes h1, h2, h3, ... hn), representing all possible conditions of the variable. (Under R's default treatment contrasts the baseline level gets no dummy of its own; its effect is absorbed into the intercept.) Then, each dummy variable gets its own unique coefficient.

So lm(log(c) ~ log(d) + h + a + f + h:a) becomes

Coefficients:
              Estimate 
(Intercept)  -0.679695 
log(d)        1.791294 
h1            0.870735  
h2           -0.447570  
h3            0.542033   
a             0.037362  
f1           -0.588362  
f2            0.816825 
f3            0.534440
h1:a         -0.085658
h2:a         -0.034970 
h3:a         -0.040637

which becomes

log(c) == -0.679695 + 1.791294*log(d) + 0.870735*h1 - 0.447570*h2 + 0.542033*h3 + 0.037362*a - 0.588362*f1 + 0.816825*f2 + 0.534440*f3 - 0.085658*h1*a - 0.034970*h2*a - 0.040637*h3*a

Now plug in ones and zeroes:

If h equals h1, then h1 equals 1 (or TRUE), else h1 equals 0 (or FALSE).

If h equals h2, then h2 equals 1, else h2 equals 0.

... etc.
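The single-equation view translates directly into an R function. This is a sketch: the function name `predict_log_c` and its argument names are made up here, the level labels "h1", "f2", etc. are assumed to match the coefficient names, and the coefficients are typed in from the question's output.

```r
predict_log_c <- function(log_d, a, h, f) {
  # Dummy variables: 1 if the condition holds, 0 otherwise.
  h1 <- as.numeric(h == "h1"); h2 <- as.numeric(h == "h2"); h3 <- as.numeric(h == "h3")
  f1 <- as.numeric(f == "f1"); f2 <- as.numeric(f == "f2"); f3 <- as.numeric(f == "f3")

  # Every coefficient appears; the dummies switch the irrelevant ones off.
  -0.679695 + 1.791294 * log_d +
    0.870735 * h1 - 0.447570 * h2 + 0.542033 * h3 +
    0.037362 * a -
    0.588362 * f1 + 0.816825 * f2 + 0.534440 * f3 -
    0.085658 * h1 * a - 0.034970 * h2 * a - 0.040637 * h3 * a
}

predict_log_c(log_d = 2, a = -1, h = "h3", f = "f2")  # 4.265026
```

With these inputs the function reproduces the fitted value from the accepted answer's worked example. A level that matches none of the dummies (a baseline, if there is one) simply contributes zero.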
