I have 2 continuous variables $(x, y)$ in a dataset of $n = 186$ observations with a strong, positive linear relationship (slope $= .46$, $R^2 = .96$, $p < .05$). Within this dataset I have 3 different habitat types (B, F, S), each with its own $x$ and $y$ values. I have fit individual linear models for each habitat type and the slopes are slightly different (approximately $.50$, $.47$, and $.43$). I want to know whether these slopes are significantly different, i.e. whether the relationship between $x$ and $y$ depends on habitat.

I have used the following code to fit each regression and display the coefficients. How do I now compare the coefficients for significance?

lm.type <- data %>%
  group_by(type) %>%
  do(broom::tidy(lm(y ~ x, data = .)))  # data = . fits the model within each group

I have tried an ANCOVA based on my research, but I am confused by the output. I think I am missing a step.

> ancova.type <- lm(y ~ x * type, data)
> summary(ancova.type)

Call:
lm(formula = y ~ x * type, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.0231  -1.1883  -0.3191   0.7897   7.1740 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.88251    0.66863   1.320 0.188552    
x             0.46685    0.02096  22.273  < 2e-16 ***
typeF        -0.18040    0.76641  -0.235 0.814177    
typeS        -2.80182    0.81508  -3.437 0.000729 ***
x:typeF      -0.03032    0.02415  -1.256 0.210844    
x:typeS       0.04212    0.02764   1.524 0.129341    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.276 on 180 degrees of freedom
Multiple R-squared:  0.9377,	Adjusted R-squared:  0.936 
F-statistic: 541.8 on 5 and 180 DF,  p-value: < 2.2e-16

  • The p-values are in the final column [Pr(>|t|)]. – mkt Sep 05 '23 at 18:45
  • https://stats.stackexchange.com/questions/31/what-is-the-meaning-of-p-values-and-t-values-in-statistical-tests – mkt Sep 05 '23 at 18:45
  • Use drop1() to test the null hypothesis that all slopes are identical. – Michael M Sep 05 '23 at 18:49
  • I think the question here is more centered around how to meaningfully interpret the comparison of factor variables in a linear regression, which the links provided so far don't answer directly @mkt – Shawn Hemelstrand Sep 05 '23 at 19:03
  • @ShawnHemelstrand True, I couldn't find exactly what I was after. It must be out there though... – mkt Sep 05 '23 at 19:07
  • I've changed the title of this question so if somebody cannot find a good duplicate, others will have a better way to find it in the future. – Shawn Hemelstrand Sep 05 '23 at 19:59

1 Answer


I am answering what I believe is your main question: the meaningful interpretation of factor variables in a linear regression. Recall the form of a typical Gaussian linear regression:

$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon $$

where $\beta$ is a regression coefficient, $x$ is a predictor of interest, and $\epsilon$ is the leftover error term. Here $\beta_0$ signifies the intercept, which can be thought of as a conditional mean. If we simply enter a factor into a model, the levels of the factor are contrasted against this conditional mean. That probably makes more sense with a specific example. Here I use R's common toy dataset iris, which includes petal and sepal measurements for three species of flower. First, we can look at the group means:

library(tidyverse)

iris %>% group_by(Species) %>% summarise(Mean.Petal = mean(Petal.Length))

The output shows that the mean setosa petal length is around $1.46$, versicolor $4.26$, and virginica $5.55$:

# A tibble: 3 × 2
  Species    Mean.Petal
  <fct>           <dbl>
1 setosa           1.46
2 versicolor       4.26
3 virginica        5.55

Now we fit the same data with a regression in R:

fit <- lm(Petal.Length ~ Species, iris)
summary(fit)

You can see the intercept here is simply the mean of the reference level (by default the first level of the factor, here setosa), the $1.46$ we already looked at:

Call:
lm(formula = Petal.Length ~ Species, data = iris)

Residuals:
   Min     1Q Median     3Q    Max 
-1.260 -0.258  0.038  0.240  1.348 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.46200    0.06086   24.02   <2e-16 ***
Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4303 on 147 degrees of freedom
Multiple R-squared:  0.9414,	Adjusted R-squared:  0.9406 
F-statistic:  1180 on 2 and 147 DF,  p-value: < 2.2e-16

Because the species are dummy coded, we can simply plug the coefficients into the linear equation, setting each dummy variable to 0 or 1, to recover the mean of each flower species:

#### Means ####
coef(fit)[1] + coef(fit)[2]*0 + coef(fit)[3]*0 # setosa
coef(fit)[1] + coef(fit)[2] + coef(fit)[3]*0 # versicolor
coef(fit)[1] + coef(fit)[2]*0 + coef(fit)[3] # virginica

You can see they do not differ from the means already obtained:

> coef(fit)[1] + coef(fit)[2]*0 + coef(fit)[3]*0 # setosa
(Intercept) 
      1.462 
> coef(fit)[1] + coef(fit)[2] + coef(fit)[3]*0 # versicolor
(Intercept) 
       4.26 
> coef(fit)[1] + coef(fit)[2]*0 + coef(fit)[3] # virginica
(Intercept) 
      5.552 
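Equivalently, and as a quick sanity check, predict() reproduces the same fitted group means without assembling the coefficients by hand. This is just a minimal sketch; the data frame of species names is only for illustration.

#### Same means via predict() ####
# Predicted Petal.Length for each species equals the group means above
predict(fit, newdata = data.frame(Species = c("setosa", "versicolor", "virginica")))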

When entering additional continuous predictors into the model, you get the same kinds of adjustments to the means, but now you are controlling for the influence of the other predictors in the model. As an example, we can enter both species and petal width as predictors in the regression.

fit2 <- lm(Petal.Length ~ Species + Petal.Width, iris)
summary(fit2)

Here the summary is slightly different. Rather than the raw means, we now have adjusted means that account for the extra variable, petal width. Notice our intercept and factor estimates have changed, but are still quite similar to what they were before; they are simply corrected for the influence of petal width. A model such as yours with : operators simply adds interactions between the factor and the continuous variable, and the interaction terms test whether the slope of the continuous predictor differs across the factor levels.

Call:
lm(formula = Petal.Length ~ Species + Petal.Width, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.02977 -0.22241 -0.01514  0.18180  1.17449 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.21140    0.06524  18.568  < 2e-16 ***
Speciesversicolor  1.69779    0.18095   9.383  < 2e-16 ***
Speciesvirginica   2.27669    0.28132   8.093 2.08e-13 ***
Petal.Width        1.01871    0.15224   6.691 4.41e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3777 on 146 degrees of freedom
Multiple R-squared:  0.9551,	Adjusted R-squared:  0.9542 
F-statistic:  1036 on 3 and 146 DF,  p-value: < 2.2e-16
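If your goal is a single omnibus test of whether the slopes differ at all (as suggested in the comments above), you can compare the additive and interaction models directly before moving to pairwise contrasts. This is a minimal sketch using the iris models from this answer; anova() performs the nested-model F-test and drop1() tests dropping the interaction term.

#### Omnibus test: do the slopes differ across groups? ####
fit.add <- lm(Petal.Length ~ Species + Petal.Width, iris) # one common slope
fit.int <- lm(Petal.Length ~ Species * Petal.Width, iris) # per-species slopes

# Nested-model F-test: a small p-value indicates the
# Petal.Width slope differs across species
anova(fit.add, fit.int)

# Equivalent single-term test of the interaction
drop1(fit.int, test = "F")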

You really don't need more than this to get a gauge on what the regression is saying about factor variables, but I'm guessing from your question that you want to apply some form of contrasts to directly test factor-level differences with an inferential test. One way to achieve this is the emmeans package, which applies ANOVA-style pairwise comparisons between factor levels.

#### Estimated Marginal Means ####
library(emmeans)

# Estimated marginal means for each species, plus
# Tukey-adjusted pairwise comparisons between them
emm.basic <- emmeans(fit2, specs = pairwise ~ Species)
emm.basic

Interaction Model

fit3 <- lm(Petal.Length ~ Species*Petal.Width, iris)
summary(fit3)
emm.int <- emmeans(fit3, specs = pairwise ~ Species:Petal.Width)
emm.int$contrasts

Here I have used two examples: one that fits the last model we explored (fit2), and one that includes a model similar to yours (the interaction model, fit3). Below you can see the Tukey pairwise comparisons from the interaction model:

 contrast                                                                        estimate    SE  df t.ratio p.value
 setosa Petal.Width1.19933333333333 - versicolor Petal.Width1.19933333333333       -2.040 0.474 144  -4.306  0.0001
 setosa Petal.Width1.19933333333333 - virginica Petal.Width1.19933333333333        -3.034 0.498 144  -6.097  <.0001
 versicolor Petal.Width1.19933333333333 - virginica Petal.Width1.19933333333333    -0.994 0.175 144  -5.692  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates

Each row here represents a pairwise comparison between two flower species after adjusting for the mean value of petal width $(1.1993)$, and each comparison is statistically significant. You can see, for example, that after adjusting for petal width, versicolor has a greater mean than setosa (subtracting the versicolor mean from the setosa mean gives the negative estimate here).
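Since your original question is specifically about comparing slopes rather than adjusted means, it is worth noting that emmeans also offers emtrends(). As a minimal sketch using the interaction model fit3 from above, this estimates the Petal.Width slope within each species and applies Tukey-adjusted pairwise comparisons to the slopes themselves:

#### Comparing the slopes directly ####
# Estimate the Petal.Width slope within each Species and test
# all pairwise differences between those slopes (Tukey-adjusted)
emtrends(fit3, pairwise ~ Species, var = "Petal.Width")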

You can also plot these easily with the following code, which I've customized to remove the annoying mean value on the axis. The dots are the estimated marginal means and the blue bands are the confidence intervals.

labels <- c("Virginica","Versicolor","Setosa")

plot(emm.int) +
  labs(x = "Estimated Marginal Means",
       y = "Species x Petal Width",
       title = "EMMs of Species x Petal Width") +
  theme_minimal() +
  scale_y_discrete(labels = labels)

[Plot: "EMMs of Species x Petal Width", showing the estimated marginal means as points with blue confidence-interval bands]