In a case-control study including men and women of various ages, I wish to investigate if there is a difference in a measured variable (X) between cases and controls. The data are stored in a dataframe/tibble d as such:
# A tibble: 1,103 × 4
CaCo Gender Age X
<fctr> <fctr> <dbl> <dbl>
1 Case Woman 59 1.225700
2 Case Woman 61 1.153512
3 Case Woman 50 1.125951
4 Case Woman 30 1.316410
5 Case Man 28 1.248292
6 Case Man 52 1.226141
7 Case Woman 45 1.332503
8 Case Man 31 1.272777
9 Case Man 30 1.150000
10 Case Woman 41 1.186069
# ... with 1,093 more rows
xtabs(~ CaCo + Gender, data = d)
Gender
CaCo Man Woman
Control 401 271
Case 256 175
The reference category for the CaCo-term is Control and for the Gender-term it is Man.
I use linear regression lm in R to apply model m1:
#-----
Call:
lm(formula = "X ~ CaCo + Age + Gender + CaCo:Gender", data = d)
Residuals:
Min 1Q Median 3Q Max
-0.5736 -0.1111 -0.0128 0.1007 1.1256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0924392 0.0276614 39.493 < 2e-16 ***
CaCoCase 0.0117859 0.0141087 0.835 0.404
Age -0.0029474 0.0004465 -6.601 6.36e-11 ***
GenderWoman 0.0037238 0.0138262 0.269 0.788
CaCoCase:GenderWoman 0.0325746 0.0220949 1.474 0.141
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1757 on 1098 degrees of freedom
Multiple R-squared: 0.05002, Adjusted R-squared: 0.04655
F-statistic: 14.45 on 4 and 1098 DF, p-value: 1.662e-11
#----
It is my understanding that the coefficients should be interpreted as follows:
The
CaCoCase-term represents the differences between cases and controls, among males (male cases have 0.0117859 higher levels than male controls - not significant).The
GenderWoman-term represents the difference between genders, among controls (female controls have 0.0037238 higher levels than male - not significant)The
CaCoCase:GenderWoman-term represents how much greater the difference between cases and controls is among females than among males (i.e. female cases have 0.0117859 + 0.0325746 higher levels than male controls).
I hope I am right so far...?
Now, because I don't believe there is an effect of gender on X, but I suspect that the difference between cases and controls is mainly observed among women, I drop the main Gender-term and keep only the interaction CaCo:Gender, to get model m2:
#-----
Call:
lm(formula = "X ~ CaCo + Age + CaCo:Gender", data = d)
Residuals:
Min 1Q Median 3Q Max
-0.5736 -0.1111 -0.0128 0.1007 1.1256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0924392 0.0276614 39.493 < 2e-16 ***
CaCoCase 0.0117859 0.0141087 0.835 0.4037
Age -0.0029474 0.0004465 -6.601 6.36e-11 ***
CaCoControl:GenderWoman 0.0037238 0.0138262 0.269 0.7877
CaCoCase:GenderWoman 0.0362984 0.0172355 2.106 0.0354 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1757 on 1098 degrees of freedom
Multiple R-squared: 0.05002, Adjusted R-squared: 0.04655
F-statistic: 14.45 on 4 and 1098 DF, p-value: 1.662e-11
#-----
The model statistics are identical (as far as I can tell) between m1 and m2. It appears that the GenderWoman-term from m1 (the main effect of gender) have become CaCoControl:GenderWoman in m2, but I am assuming the interpretation is the same.
The only other difference between the models is the interaction term CaCoCase:GenderWoman, where the coefficient is slightly larger with a smaller error and consequently a much lower p-value.
The effect of the interaction term appears identical between m1 and m2 when illustrated using the effects-package:
library(effects)
plot(effect("CaCo:Gender", m1))
(As a side note, when modelling men and women separately by X ~ CaCo + Age, it appears clear that there is a difference between cases and controls among women, but not among men)
My questions are:
Are the interpretations of the coefficients the same between the models m1and m2? If so, what are the reasons they differ? If not, how should the coefficients be interpreted?
Any help is much appreciated!
