I have 2 continuous variables $(x, y)$ in a dataset of $n = 186$ observations with a strong, positive linear relationship (slope $= .46$, $R^2 = .96$, $p < .05$). Within this dataset I have 3 different habitat types (B, F, S), each with its own $x$ and $y$ values. I have fit individual linear models for each habitat type and the slopes are slightly different (approximately $.50$, $.47$, and $.43$). I want to know whether these slopes are significantly different, i.e. whether the relationship between $x$ and $y$ depends on habitat.

I have used the following code to fit each regression and display the coefficients. How do I now compare the coefficients for significance?

lm.type <- data %>%
  group_by(type) %>%
  do(broom::tidy(lm(y ~ x, data = .)))  # data = . fits the model within each group

I have tried an ANCOVA based on my research, but I am confused by the output. I think I am missing a step.

> ancova.type <- lm(y ~ x * type, data)
> summary(ancova.type)

Call:
lm(formula = y ~ x * type, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.0231  -1.1883  -0.3191   0.7897   7.1740 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.88251    0.66863   1.320 0.188552    
x             0.46685    0.02096  22.273  < 2e-16 ***
typeF        -0.18040    0.76641  -0.235 0.814177    
typeS        -2.80182    0.81508  -3.437 0.000729 ***
x:typeF      -0.03032    0.02415  -1.256 0.210844    
x:typeS       0.04212    0.02764   1.524 0.129341    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.276 on 180 degrees of freedom
Multiple R-squared:  0.9377,	Adjusted R-squared:  0.936 
F-statistic: 541.8 on 5 and 180 DF,  p-value: < 2.2e-16

  • The p-values are in the final column [Pr(>|t|)]. – mkt Sep 05 '23 at 18:45
  • https://stats.stackexchange.com/questions/31/what-is-the-meaning-of-p-values-and-t-values-in-statistical-tests – mkt Sep 05 '23 at 18:45
  • Use drop1() to test the null hypothesis that all slopes are identical. – Michael M Sep 05 '23 at 18:49
  • I think the question here is more centered around how to meaningfully interpret the comparison of factor variables in a linear regression, which the links provided so far don't answer directly @mkt – Shawn Hemelstrand Sep 05 '23 at 19:03
  • @ShawnHemelstrand True, I couldn't find exactly what I was after. It must be out there though... – mkt Sep 05 '23 at 19:07
  • I've changed the title of this question so if somebody cannot find a good duplicate, others will have a better way to find it in the future. – Shawn Hemelstrand Sep 05 '23 at 19:59

1 Answer


I am answering what I believe is your main question: the meaningful interpretation of factor variables in a linear regression. Recall the form of a typical Gaussian linear regression:

$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon $$

where $\beta$ is a regression coefficient, $x$ is a predictor of interest, and $\epsilon$ is the leftover error term. Here $\beta_0$ signifies the intercept, which can be thought of as a conditional mean. If we simply enter a factor into a model, the levels of the factor are contrasted against this conditional mean. That probably makes more sense with a specific example. Here I use R's common toy dataset iris, which includes petal and sepal measurements for three species of flower. First, we can look at the group means:

library(tidyverse)

iris %>% group_by(Species) %>% summarise(Mean.Petal = mean(Petal.Length))

The output shows that the mean setosa petal length is around $1.46$, versicolor $4.26$, and virginica $5.55$:

# A tibble: 3 × 2
  Species    Mean.Petal
  <fct>           <dbl>
1 setosa           1.46
2 versicolor       4.26
3 virginica        5.55

Now we fit the same data with a regression in R:

fit <- lm(Petal.Length ~ Species, iris)
summary(fit)

You can see the intercept here is simply the mean of the reference level (by default the first level of the factor, here setosa), the $1.46$ we already looked at:

Call:
lm(formula = Petal.Length ~ Species, data = iris)

Residuals:
   Min     1Q Median     3Q    Max 
-1.260 -0.258  0.038  0.240  1.348 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.46200    0.06086   24.02   <2e-16 ***
Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4303 on 147 degrees of freedom
Multiple R-squared:  0.9414,	Adjusted R-squared:  0.9406 
F-statistic:  1180 on 2 and 147 DF,  p-value: < 2.2e-16

Because the species are dummy coded, we can simply plug the coefficients into the linear equation, setting each dummy variable to 0 or 1, to recover the mean of each flower species:

#### Means ####
coef(fit)[1] + coef(fit)[2]*0 + coef(fit)[3]*0 # setosa
coef(fit)[1] + coef(fit)[2] + coef(fit)[3]*0 # versicolor
coef(fit)[1] + coef(fit)[2]*0 + coef(fit)[3] # virginica

You can see they do not differ from the means already obtained:

> coef(fit)[1] + coef(fit)[2]*0 + coef(fit)[3]*0 # setosa
(Intercept) 
      1.462 
> coef(fit)[1] + coef(fit)[2] + coef(fit)[3]*0 # versicolor
(Intercept) 
       4.26 
> coef(fit)[1] + coef(fit)[2]*0 + coef(fit)[3] # virginica
(Intercept) 
      5.552 
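Equivalently, and as a quick sanity check, predict() reproduces the same fitted group means without assembling the coefficients by hand. This is just a minimal sketch; the data frame of species names is only for illustration.

#### Same means via predict() ####
# Predicted Petal.Length for each species equals the group means above
predict(fit, newdata = data.frame(Species = c("setosa", "versicolor", "virginica")))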

When entering additional continuous predictors into the model, you get the same kinds of adjustments to the means, but now you are controlling for the influence of the other predictors in the model. As an example, we can enter both species and petal width as predictors in the regression.

fit2 <- lm(Petal.Length ~ Species + Petal.Width, iris)
summary(fit2)

Here the summary is slightly different. Rather than the raw means, we now have adjusted means that account for the extra variable, petal width. Notice our intercept and factor estimates have changed, but are still quite similar to what they were before; they are simply corrected for the influence of petal width. A model such as yours with : operators simply adds interactions between the factor and the continuous variable, and the interaction terms test whether the slope of the continuous predictor differs across the factor levels.

Call:
lm(formula = Petal.Length ~ Species + Petal.Width, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.02977 -0.22241 -0.01514  0.18180  1.17449 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.21140    0.06524  18.568  < 2e-16 ***
Speciesversicolor  1.69779    0.18095   9.383  < 2e-16 ***
Speciesvirginica   2.27669    0.28132   8.093 2.08e-13 ***
Petal.Width        1.01871    0.15224   6.691 4.41e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3777 on 146 degrees of freedom
Multiple R-squared:  0.9551,	Adjusted R-squared:  0.9542 
F-statistic:  1036 on 3 and 146 DF,  p-value: < 2.2e-16
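If your goal is a single omnibus test of whether the slopes differ at all (as suggested in the comments above), you can compare the additive and interaction models directly before moving to pairwise contrasts. This is a minimal sketch using the iris models from this answer; anova() performs the nested-model F-test and drop1() tests dropping the interaction term.

#### Omnibus test: do the slopes differ across groups? ####
fit.add <- lm(Petal.Length ~ Species + Petal.Width, iris) # one common slope
fit.int <- lm(Petal.Length ~ Species * Petal.Width, iris) # per-species slopes

# Nested-model F-test: a small p-value indicates the
# Petal.Width slope differs across species
anova(fit.add, fit.int)

# Equivalent single-term test of the interaction
drop1(fit.int, test = "F")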

You really don't need more than this to get a gauge on what the regression is saying about factor variables, but I'm guessing from your question that you want to apply some form of contrasts to directly test factor-level differences with an inferential test. One way to achieve this is the emmeans package, which applies ANOVA-style pairwise comparisons between factor levels.

#### Estimated Marginal Means ####
library(emmeans)

# Estimated marginal means for each species, plus
# Tukey-adjusted pairwise comparisons between them
emm.basic <- emmeans(fit2, specs = pairwise ~ Species)
emm.basic

Interaction Model

fit3 <- lm(Petal.Length ~ Species*Petal.Width, iris)
summary(fit3)
emm.int <- emmeans(fit3, specs = pairwise ~ Species:Petal.Width)
emm.int$contrasts

Here I have used two examples: one that fits the last model we explored (fit2), and one that includes a model similar to yours (the interaction model, fit3). Below you can see the Tukey pairwise comparisons from the interaction model:

 contrast                                                                        estimate    SE  df t.ratio p.value
 setosa Petal.Width1.19933333333333 - versicolor Petal.Width1.19933333333333       -2.040 0.474 144  -4.306  0.0001
 setosa Petal.Width1.19933333333333 - virginica Petal.Width1.19933333333333        -3.034 0.498 144  -6.097  <.0001
 versicolor Petal.Width1.19933333333333 - virginica Petal.Width1.19933333333333    -0.994 0.175 144  -5.692  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates

Each row here represents a pairwise comparison between two flower species after adjusting for the mean value of petal width $(1.1993)$, and each comparison is statistically significant. You can see, for example, that after adjusting for petal width, versicolor has a greater mean than setosa (subtracting the versicolor mean from the setosa mean gives the negative estimate here).
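Since your original question is specifically about comparing slopes rather than adjusted means, it is worth noting that emmeans also offers emtrends(). As a minimal sketch using the interaction model fit3 from above, this estimates the Petal.Width slope within each species and applies Tukey-adjusted pairwise comparisons to the slopes themselves:

#### Comparing the slopes directly ####
# Estimate the Petal.Width slope within each Species and test
# all pairwise differences between those slopes (Tukey-adjusted)
emtrends(fit3, pairwise ~ Species, var = "Petal.Width")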

You can also plot these easily with the following code, which I've customized to remove the annoying mean value on the axis. The dots are the estimated marginal means and the blue bands are the confidence intervals.

labels <- c("Virginica","Versicolor","Setosa")

plot(emm.int) +
  labs(x = "Estimated Marginal Means",
       y = "Species x Petal Width",
       title = "EMMs of Species x Petal Width") +
  theme_minimal() +
  scale_y_discrete(labels = labels)

[Plot: "EMMs of Species x Petal Width", showing the estimated marginal means as points with blue confidence-interval bands]