
Imagine I'm trying to model traffic stops with a logistic regression.

race = relevel(as.factor(c(rep("white", 3000), rep("black", 3000))), ref = "white")
income = relevel(as.factor(c(rep("low", 1500), rep("high", 1500), rep("low", 2000), rep("high", 1000))), ref = "high")
stop = c(rbinom(1500, 1, 0.5), rbinom(1500, 1, 0.4), rbinom(2000, 1, 0.8), rbinom(1000, 1, 0.5))
model = glm(stop ~ race + income + race:income, family = "binomial")
summary(model)
#Coefficients:
#                    Estimate Std. Error z value Pr(>|z|)    
#(Intercept)          0.38600    0.01212  31.841  < 2e-16 ***
#raceblack            0.10800    0.01917   5.634 1.84e-08 ***
#incomelow            0.14400    0.01714   8.399  < 2e-16 ***
#raceblack:incomelow  0.14250    0.02499   5.702 1.24e-08 ***
#---
#    Null deviance: 1469.3  on 5999  degrees of freedom
#Residual deviance: 1321.8  on 5996  degrees of freedom
#AIC: 7960.5

Further imagine I wanted to report the marginal effect of being a low-income Black individual on the probability of being stopped, something like: "compared to a high-income white individual, a low-income Black individual has an odds ratio (OR) of 1.29 of being stopped."

If there were no interaction term in the model, I could just add the coefficients:

sum(coef(model)[c("raceblack", "incomelow")])
#[1] 0.252
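(To get the OR quoted above, the summed log-odds is exponentiated. A quick check of that arithmetic, in plain Python since it is just exponentiation of the coefficients reported in the summary:)

```python
import math

# Coefficients as reported in summary(model) above
raceblack = 0.10800
incomelow = 0.14400

log_odds = raceblack + incomelow
print(round(log_odds, 3))            # 0.252
print(round(math.exp(log_odds), 2))  # 1.29 -- the OR quoted above
```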

I also learned from this Q&A that I can use general linear hypothesis testing.

library(multcomp)
glht(model, linfct = "raceblack + incomelow = 0")
#    General Linear Hypotheses
#Linear Hypotheses:
#                           Estimate
#raceblack + incomelow == 0    0.252

My question is, since the model includes an interaction coefficient, should I also include the interaction coefficient when calculating the marginal effect?

sum(coef(model)[c("raceblack", "incomelow", "raceblack:incomelow")])
#[1] 0.3945
glht(model, linfct = "raceblack + incomelow + raceblack:incomelow = 0") 
#    General Linear Hypotheses
#Linear Hypotheses:
#                                                 Estimate
#raceblack + incomelow + raceblack:incomelow == 0   0.3945
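(To see what each sum represents, I can write out the linear predictors for the two profiles under treatment coding. Plain Python, since it is only arithmetic with the coefficients reported above:)

```python
import math

# Coefficients as reported in summary(model) above
intercept   = 0.38600
raceblack   = 0.10800
incomelow   = 0.14400
interaction = 0.14250

# Linear predictors (log-odds) for the two profiles being compared
logit_black_low  = intercept + raceblack + incomelow + interaction
logit_white_high = intercept  # both reference levels

# The comparison vs. the reference cell therefore includes the interaction
diff = logit_black_low - logit_white_high
print(round(diff, 4))            # 0.3945
print(round(math.exp(diff), 2))  # OR of about 1.48
```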

1 Answer


This is best addressed by forming the contrasts of interest, estimating them, and computing confidence intervals (simultaneous intervals when appropriate) for them. Also do chunk tests to assess the evidence against the supposition that all elements of the vector of contrasts are zero. Here are examples using the R rms package.

require(rms)
# Assemble the simulated variables from the question into a data frame
mydata <- data.frame(stop, race, income)
dd <- datadist(mydata); options(datadist='dd')
# Need x=TRUE, y=TRUE for likelihood ratio chi-square tests
f <- lrm(stop ~ race * income, data=mydata, x=TRUE, y=TRUE)
anova(f, test='LR')  # provides all chunk (multiple d.f.) tests
# Compare white vs. black at each of two income levels
contrast(f, list(race='white', income=c('low', 'high')),
            list(race='black', income=c('low', 'high')))
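(For intuition, each such contrast is just the difference between two rows of the design matrix. A minimal sketch, assuming treatment coding with reference levels white and high, so the columns are intercept, raceblack, incomelow, raceblack:incomelow:)

```python
# Design-matrix rows for the two cells being compared at income = low
row_white_low = [1, 0, 1, 0]
row_black_low = [1, 1, 1, 1]

# The contrast vector is the elementwise difference of the two rows
contrast = [a - b for a, b in zip(row_white_low, row_black_low)]
print(contrast)  # [0, -1, 0, -1]

# Applied to the coefficient vector this gives -(raceblack + interaction):
# the white-minus-black log odds at low income, which is what
# contrast() estimates for that cell.
```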

One of the anova chunk tests assesses whether there is a race effect for any income group, without assuming the effects are the same across income groups.

Note that categorizing income into only two levels entails a severe loss of information, creates more confounding, and limits interpretation.

Frank Harrell
  • The actual story is that I'm reviewing a paper where the authors are making assertions as described in the question (I know the subject, but I'm not a particularly great statistician). "Redo the entire analysis with continuous income" may not be a reasonable ask. – Ian Campbell Jul 10 '23 at 15:00
  • If continuous income is available, categorizing it is a statistical sin. – Frank Harrell Jul 10 '23 at 21:10