The dataset ISLR2::Default contains one observation per individual.
- The variable `default` is "Yes" if the individual defaulted on their debt, and "No" otherwise.
- The variable `student` is "Yes" if the person is a student, and "No" otherwise.
- The variable `balance` is the amount of outstanding debt.
- The variable `income` is the income of the individual.
Students are poorer and more indebted than non-students.
I first estimate a model of default on student alone.
This would give me the total effect of being a student, taking into account that students are poorer and more indebted.
Then I estimate a model of default on student+balance+income.
This would give me the direct effect of being a student alone, without taking into account that becoming a student makes you poorer and more indebted.
Now from the second model, I would like to recover the total effect as follows: direct effect of student + effect mediated by balance + effect mediated by income.
I set up some code that tests this.
In the code I (1) obtain the total effect from the first model, (2) obtain the total effect from the second model (direct + mediated by income + mediated by balance), and (3) compare the two coefficients.
Using OLS to estimate the models, the comparison returns TRUE. However, using logistic regression to estimate the models, the comparison returns FALSE.
Why?
EQUATIONS
Here is what I want to do in equations.
Let's start with OLS.
I have Model 1:
$$p_{\text{DEF}} = \alpha_0 + \alpha_1 \text{Student}$$
and Model 2:
$$p_{\text{DEF}} = \beta_0 + \beta_1 \text{Student} + \beta_2 \text{Income} + \beta_3 \text{Balance}$$
If I write: $$ \text{Income} = \gamma_0 + \gamma_1 \text{Student} $$ $$ \text{Balance} = \omega_0 + \omega_1 \text{Student} $$
and substitute in Model 2, I obtain:
$$ \begin{align} p_{\text{DEF}} &= \beta_0 + \beta_1 \text{Student} + \beta_2 (\gamma_0 + \gamma_1 \text{Student}) + \beta_3 (\omega_0 + \omega_1 \text{Student}) \\ &= \beta_0 + \beta_1 \text{Student} + \beta_2 \gamma_0 + \beta_2 \gamma_1 \text{Student} + \beta_3 \omega_0 + \beta_3 \omega_1 \text{Student} \\ &= \underbrace{(\beta_0 + \beta_2 \gamma_0+\beta_3 \omega_0)}_{=\alpha_0} + \underbrace{(\beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1)}_{=\alpha_1} \text{Student} \end{align} $$
So:
$$ \alpha_1 = \beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1 $$
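Strictly speaking, the substitution above drops the regression residuals. Writing (with $u$ and $v$ my notation for the residuals)
$$ \text{Income} = \gamma_0 + \gamma_1 \text{Student} + u \qquad \text{Balance} = \omega_0 + \omega_1 \text{Student} + v $$
and substituting into Model 2 gives the same expression as above plus an extra term $\beta_2 u + \beta_3 v$. Since OLS residuals are, in sample, exactly orthogonal to the regressors (here the intercept and Student), this extra term does not change the coefficients of the short regression, so the identity $\alpha_1 = \beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1$ holds exactly in sample: it is the classic omitted-variable-bias identity.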
Checking with R, this last comparison is TRUE.
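The same identity can also be checked independently of the Default data on simulated numbers. Here is a quick sketch in Python with NumPy (every coefficient in the data-generating process below is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
student = rng.integers(0, 2, n).astype(float)
# Invented data-generating process: students are poorer and more indebted
income = 40_000 - 15_000 * student + rng.normal(0, 5_000, n)
balance = 500 + 400 * student + rng.normal(0, 200, n)
y = 0.01 + 0.002 * student + 1e-7 * income + 1e-4 * balance + rng.normal(0, 0.05, n)

def ols(y, *cols):
    """OLS with an intercept; returns the coefficient vector."""
    X = np.column_stack((np.ones(len(y)),) + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

alpha = ols(y, student)                    # Model 1: y ~ student
beta = ols(y, student, income, balance)    # Model 2: y ~ student + income + balance
gamma = ols(income, student)               # income ~ student
omega = ols(balance, student)              # balance ~ student

total = beta[1] + beta[2] * gamma[1] + beta[3] * omega[1]
print(alpha[1], total)  # equal up to floating-point error
```

Because the identity is algebraic, it holds for any sample, not just in expectation.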
Now, let's re-write the models using logistic regression.
Model 1:
$$ \text{logit}(p_{\text{DEF}}) = \alpha_0 + \alpha_1 \text{Student} $$
Model 2:
$$ \text{logit}(p_{\text{DEF}}) = \beta_0 + \beta_1 \text{Student} + \beta_2 \text{Income} + \beta_3 \text{Balance} $$
If I write:
$$ \text{Income} = \gamma_0 + \gamma_1 \text{Student} $$ $$ \text{Balance} = \omega_0 + \omega_1 \text{Student} $$
Then:
$$ \begin{align} \text{logit}(p_{\text{DEF}}) &= \beta_0 + \beta_1 \text{Student} + \beta_2 (\gamma_0 + \gamma_1 \text{Student}) + \beta_3 (\omega_0 + \omega_1 \text{Student}) \\ &= \beta_0 + \beta_1 \text{Student} + \beta_2 \gamma_0 + \beta_2 \gamma_1 \text{Student} + \beta_3 \omega_0 + \beta_3 \omega_1 \text{Student} \\ &= \underbrace{(\beta_0 + \beta_2 \gamma_0+\beta_3 \omega_0)}_{=\alpha_0} + \underbrace{(\beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1)}_{=\alpha_1} \text{Student} \end{align} $$
Thus I obtain the same comparison as above:
$$ \alpha_1 = \beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1 $$
But, checking with R, this time the comparison returns FALSE.
Why?
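I suspect the nonlinearity of the link matters here, since the logistic function does not commute with averaging. A toy illustration in Python (the two balance values and the coefficients are invented, not estimated from the data):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Invented example: two equally likely balance values and made-up coefficients
b_lo, b_hi = 500.0, 1500.0
beta0, beta3 = -6.0, 0.004

# The average of the two conditional probabilities...
p_marginal = 0.5 * logistic(beta0 + beta3 * b_lo) + 0.5 * logistic(beta0 + beta3 * b_hi)

# ...is not the probability evaluated at the average balance
p_plugin = logistic(beta0 + beta3 * (b_lo + b_hi) / 2)

print(p_marginal, p_plugin)  # the two differ: the logit link is nonlinear
```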
CODE
indf <- ISLR2::Default
indf$default <- as.logical(indf$default=="Yes")
indf$student <- as.logical(indf$student=="Yes")

# WORKS: the comparison at the end of the code returns TRUE.
# A GLM with gaussian family and identity link is an OLS, see:
# https://stats.stackexchange.com/questions/211585/how-does-ols-regression-relate-to-generalised-linear-modelling
myfamily <- gaussian(identity)

# DOES NOT WORK: the comparison at the end of the code returns FALSE. Why?
#myfamily <- binomial(link="logit")

lmod1.fit <- glm(default ~ student,
                 data=indf,
                 family=myfamily)

# Total effect from the first model
alpha1 <- lmod1.fit$coefficients["studentTRUE"]

lmod2.fit <- glm(default ~ student+income+balance,
                 data=indf,
                 family=myfamily)

beta1 <- lmod2.fit$coefficients["studentTRUE"]
beta2 <- lmod2.fit$coefficients["income"]
beta3 <- lmod2.fit$coefficients["balance"]

mod.stu.inc <- lm(income~student, data=indf)
gamma1 <- mod.stu.inc$coefficients["studentTRUE"]

mod.stu.bal <- lm(balance~student, data=indf)
omega1 <- mod.stu.bal$coefficients["studentTRUE"]

# Total effect from the second model:
# direct effect + effect mediated by income + effect mediated by balance
tot <- beta1 + gamma1*beta2 + omega1*beta3

print(alpha1)
print(tot)

# Compare the total effects obtained from the two models.
# With a gaussian GLM (i.e. OLS) this returns TRUE, but with
# logistic or probit regression it returns FALSE. Why?
print(all.equal(alpha1, tot))