
I am using the UCBAdmissions dataset and glm from R to check for Simpson's paradox. I am aware that, with the data I have, it is easier to visualise the paradox, but I am trying to inspect it via a Poisson/log-linear model too.

My attempt

Visualising this is easy:

library(broom)
library(dplyr)
library(ggplot2)
library(scales)

ucb_tidy <- tidy(UCBAdmissions) %>%
  group_by(Gender, Dept) %>%
  mutate(prop = n / sum(n))

ggplot(ucb_tidy, aes(x = Gender, y = prop, fill = Admit)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Dept, nrow = 2) +
  guides(fill = guide_legend(reverse = TRUE)) +
  scale_y_continuous(labels = percent_format()) +
  ggtitle("Acceptance Rate Split by Gender & Department") +
  xlab("Gender") +
  ylab("% of Applicants")

[Plot: acceptance rate split by gender and department, one panel per department]

But to check and show this with a Poisson model, I tried the following:

# Postulate that Gender and Admission are conditionally independent, given Dept.
cond_ind1 <- glm(n ~ (Gender + Admit) * Dept, family = poisson, data = ucb_tidy)
summary(cond_ind1)

Call:
glm(formula = n ~ (Gender + Admit) * Dept, family = poisson, data = ucb_tidy)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.4776  -0.4144   0.0098   0.3089   2.2321

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          6.27557    0.04248 147.744  < 2e-16 ***
GenderFemale        -2.03325    0.10233 -19.870  < 2e-16 ***
AdmitRejected       -0.59346    0.06838  -8.679  < 2e-16 ***
DeptB               -0.40575    0.06770  -5.993 2.06e-09 ***
DeptC               -1.53939    0.08305 -18.536  < 2e-16 ***
DeptD               -1.32234    0.08159 -16.207  < 2e-16 ***
DeptE               -2.40277    0.11014 -21.816  < 2e-16 ***
DeptF               -3.09624    0.15756 -19.652  < 2e-16 ***
GenderFemale:DeptB  -1.07581    0.22860  -4.706 2.52e-06 ***
GenderFemale:DeptC   2.63462    0.12343  21.345  < 2e-16 ***
GenderFemale:DeptD   1.92709    0.12464  15.461  < 2e-16 ***
GenderFemale:DeptE   2.75479    0.13510  20.391  < 2e-16 ***
GenderFemale:DeptF   1.94356    0.12683  15.325  < 2e-16 ***
AdmitRejected:DeptB  0.05059    0.10968   0.461    0.645
AdmitRejected:DeptC  1.20915    0.09726  12.432  < 2e-16 ***
AdmitRejected:DeptD  1.25833    0.10152  12.395  < 2e-16 ***
AdmitRejected:DeptE  1.68296    0.11733  14.343  < 2e-16 ***
AdmitRejected:DeptF  3.26911    0.16707  19.567  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2650.095  on 23  degrees of freedom
Residual deviance:   21.736  on  6  degrees of freedom
AIC: 216.8

Number of Fisher Scoring iterations: 4

This, however, is not a good fit if we use pchisq to check the goodness of fit, and it does not directly show how the gender-admission relationship changes across departments.
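For example, a goodness-of-fit check along these lines (a sketch based on the residual deviance and degrees of freedom reported above) rejects the model:

# Compare the residual deviance to a chi-squared distribution on the residual df
pchisq(deviance(cond_ind1), df.residual(cond_ind1), lower.tail = FALSE)
# equivalently, using the values from the summary: pchisq(21.736, 6, lower.tail = FALSE)
# p is roughly 0.001, so the conditional-independence model fits poorly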

I thought of another formulation:

# Postulate that Gender and Admission are jointly independent of Dept (or not)
joint_ind1 <- glm(n ~ Dept + (Gender*Admit), family = poisson, data = ucb_tidy)
summary(joint_ind1)

Call:
glm(formula = n ~ Dept + (Gender * Admit), family = poisson, data = ucb_tidy)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-19.7226   -7.8760    0.6475    7.1298   14.7146

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                  5.50922    0.04106 134.188  < 2e-16 ***
DeptB                       -0.46679    0.05274  -8.851  < 2e-16 ***
DeptC                       -0.01621    0.04649  -0.349 0.727355
DeptD                       -0.16384    0.04832  -3.391 0.000696 ***
DeptE                       -0.46850    0.05276  -8.879  < 2e-16 ***
DeptF                       -0.26752    0.04972  -5.380 7.44e-08 ***
GenderFemale                -0.76584    0.05128 -14.933  < 2e-16 ***
AdmitRejected                0.22013    0.03879   5.675 1.38e-08 ***
GenderFemale:AdmitRejected   0.61035    0.06389   9.553  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2650.1  on 23  degrees of freedom
Residual deviance: 2004.2  on 15  degrees of freedom
AIC: 2181.3

Number of Fisher Scoring iterations: 5

This has an even poorer fit, and it lacks the logical appeal of the conditional-independence model.


My question is:

  1. How can I study Simpson's paradox via the Poisson regression model? Would you recommend another model (e.g., binomial logistic regression)?

1 Answer


Here are a few ideas. This answer draws on a post by Norm Matloff, who fits log-linear models to the same dataset in R using loglin() to show how the paradox relates to the order in which the variables are considered.
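For reference, a rough sketch of the loglin() version of the model with all two-way interactions (the equivalent of m2b fitted below; the margin indices assume the dimension order Admit, Gender, Dept of the UCBAdmissions array):

# Sketch only: all two-way interactions, no three-way term, via stats::loglin()
# margins: 1 = Admit, 2 = Gender, 3 = Dept
ll_fit <- loglin(UCBAdmissions, margin = list(c(1, 2), c(1, 3), c(2, 3)), param = TRUE)
ll_fit$param   # sum-to-zero parameter estimates, comparable to the contr.sum glm() fits below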

First we'll load the packages and data. I'll also set up deviation contrasts to mirror the output we'd get from a traditional log-linear analysis (as in the post above), but using glm().

# Import Packages
library(tidyverse)

# Import data
ucb_tidy <- broom::tidy(UCBAdmissions) %>%
  mutate_at(c("Gender", "Dept", "Admit"), as.factor)

# Specify Deviation Contrasts
contrasts(ucb_tidy$Admit) <- contrasts(ucb_tidy$Gender) <- contr.sum(2)
contrasts(ucb_tidy$Dept) <- contr.sum(6)

Poisson Regression

Now if you're interested in model fit, we can fit three nested models.

  • The first will include the admission-by-gender terms while ignoring the effect of department.
  • The second will also include department and the department-admission interaction.
  • The third will do the same, but additionally include an interaction between gender and department.

We can then use likelihood ratio tests (using the lmtest package) to evaluate whether these additional department-interaction terms progressively improve our model fit.

library(lmtest)

# Model Building Approach
m1 <- glm(n ~ Admit + Gender + Admit:Gender, data = ucb_tidy, family = poisson)
m2a <- update(m1, . ~ . + Dept + Admit:Dept)
m2b <- update(m2a, . ~ . + Gender:Dept)

# Test model fit using lmtest
lrtest(m1, m2a, m2b)

Incorporating the department as a predictor that also interacts with admissions (i.e. m2a) improves the model fit relative to the model ignoring the department (i.e. m1), $\chi^2=1014.8, p < .0001$. The model that additionally incorporates an interaction between gender and department (i.e. m2b) improves model fit over and above this, $\chi^2=1128.7, p < .0001$. We would therefore select m2b as our final model [1], although you could continue on to the three-way interaction if you're feeling so bold (that model would be saturated, however).
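If you did want to look at that saturated model, a rough sketch would be:

# Saturated model: add the three-way interaction (sketch only)
m3 <- update(m2b, . ~ . + Admit:Gender:Dept)
lrtest(m2b, m3)    # LR test for the three-way term
deviance(m3)       # ~0, since the saturated model reproduces the observed counts exactly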

The 'paradox' is probably easier to understand, though, when you plot the model predictions. I'll use emmeans for this, and put the predictions on the response scale (predicted counts) rather than the default log scale. The patchwork package just joins the two plots at the end using +.

# Plot model predictions using emmeans and patchwork
library(emmeans)
library(patchwork)

m1_plot <- emmip(m1, Gender ~ Admit, CIs = TRUE, type = "response") +
  labs(title = "m1 - No Department Term") +
  theme_minimal() +
  ylim(c(50, 300)) +
  theme(legend.position = "none")

m2b_plot <- emmip(m2b, Gender ~ Admit, CIs = TRUE, type = "response") +
  labs(title = "m2b - Department Interactions") +
  theme_minimal() +
  ylim(c(50, 300))

m1_plot + m2b_plot

[Plot: model-predicted counts by gender and admission status, two panels: "m1 - No Department Term" and "m2b - Department Interactions"]

While fewer women were applying overall in this UCB dataset (roughly half the number of men), the model that ignores which departments they applied to suggests that women were rejected at a higher rate than men. The model that does include department as a predictor, and particularly the department-gender interaction, suggests the opposite effect (when averaging over the departments) [2]. You can check how close the model predictions are to the raw data using emmeans once more:

# Predictions
emmeans(m2b, ~ Dept:Admit + Dept:Gender, type = "response")

# Observed
ucb_tidy

Logistic Regression

Finally, it is indeed possible to model the proportions of people admitted by department and gender using logistic regression (rather than the raw counts, as in the Poisson regression above). You'd do that like so:

# Convert Data
ucb_logit <- spread(ucb_tidy, Admit, n)

# Model Building Approach
ml1 <- glm(cbind(Admitted, Rejected) ~ Gender, data = ucb_logit, family = binomial)
ml2a <- update(ml1, . ~ . + Dept)
ml2b <- update(ml2a, . ~ . + Gender:Dept)

# Test Nested Models
lrtest(ml1, ml2a, ml2b)

# Plot Probability of Acceptance by Dept & Gender
emmip(ml2b, Gender ~ Dept, type = "response", CIs = TRUE)

This plot shows the model can replicate the observed proportions in the dataset fairly well.
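If you'd rather check numerically than by eye, you can compare the observed admission proportions against the fitted probabilities directly (a sketch; obs_prop and fit_prop are just illustrative column names). Because ml2b contains the gender-by-department interaction it is saturated for these 12 aggregated rows, so the two should match essentially exactly:

# Sketch: observed admission proportions vs fitted probabilities from ml2b
ucb_logit %>%
  mutate(obs_prop = Admitted / (Admitted + Rejected),  # observed proportion admitted
         fit_prop = fitted(ml2b)) %>%                  # fitted probability of admission
  select(Dept, Gender, obs_prop, fit_prop)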


[1] This model gives us parameter estimates identical to Norm's loglin() results.
[2] Note that the scales are slightly different because including department in model m2b weights the estimates differently.

awhug
  • (+1) Nice illustration! P.S. Shouldn't this be broom::tidy instead of just tidy? – chl Oct 27 '20 at 10:58
  • Nice answer and good plots! – StupidWolf Oct 27 '20 at 11:29
  • @awhug I am indebted to you for having educated me :) I think in this use case a logistic regression model is probably more intuitive, as it lets us see the difference in the probability of being admitted directly, while a Poisson/log-linear model only shows the difference in terms of counts. Perhaps that is also because a log-linear model does not directly model the relationship between dependent and independent variables (instead assuming no such distinction exists). – info_seeker Oct 27 '20 at 16:43
  • My pleasure @info_seeker. Yeah, the logistic model seems a little 'neater' to me in this case; I just wanted to make sure to answer your original question about the Poisson model. And thank you @chl -- I had assumed broom was loaded with the tidyverse, but have just edited to be explicit now. – awhug Oct 27 '20 at 23:34