The dataset ISLR2::Default contains one observation per individual.
- The variable `default` is "Yes" if the individual defaulted on their debt, and "No" otherwise.
- The variable `student` is "Yes" if the person is a student, and "No" otherwise.
- The variable `balance` is the amount of outstanding debt.
- The variable `income` is the income of the individual.
Students are poorer and more indebted than non-students.
I first estimate a model of default on student alone.
This would give me the total effect of being a student, taking into account that students are poorer and more indebted.
Then I estimate a model of default on student+balance+income.
This would give me the direct effect of being a student alone, without taking into account that becoming a student makes you poorer and more indebted.
Now from the second model, I would like to recover the total effect as follows: direct effect of student + effect mediated by balance + effect mediated by income.
I set up some code that tests this.
In the code I (1) obtain the total effect from the first model, (2) obtain the total effect from the second model (direct + mediated by income + mediated by balance), and (3) compare the two coefficients.
Using OLS to estimate the models, the comparison returns TRUE. However, using logistic regression to estimate the models, the comparison returns FALSE.
Why?
EQUATIONS
Here is what I want to do in equations.
Let's start with OLS.
I have Model 1:
$$p_{\text{DEF}} = \alpha_0 + \alpha_1 \text{Student}$$
and Model 2:
$$p_{\text{DEF}} = \beta_0 + \beta_1 \text{Student} + \beta_2 \text{Income} + \beta_3 \text{Balance}$$
If I write: $$ \text{Income} = \gamma_0 + \gamma_1 \text{Student} $$ $$ \text{Balance} = \omega_0 + \omega_1 \text{Student} $$
and substitute in Model 2, I obtain:
$$ \begin{align} p_{\text{DEF}} &= \beta_0 + \beta_1 \text{Student} + \beta_2 (\gamma_0 + \gamma_1 \text{Student}) + \beta_3 (\omega_0 + \omega_1 \text{Student}) \\ &= \beta_0 + \beta_1 \text{Student} + \beta_2 \gamma_0 + \beta_2 \gamma_1 \text{Student} + \beta_3 \omega_0 + \beta_3 \omega_1 \text{Student} \\ &= \underbrace{(\beta_0 + \beta_2 \gamma_0+\beta_3 \omega_0)}_{=\alpha_0} + \underbrace{(\beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1)}_{=\alpha_1} \text{Student} \end{align} $$
So:
$$ \alpha_1 = \beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1 $$
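Strictly speaking, the substitution above drops the regression residuals. Writing (with $u$ and $v$ my notation for the residuals)
$$ \text{Income} = \gamma_0 + \gamma_1 \text{Student} + u \qquad \text{Balance} = \omega_0 + \omega_1 \text{Student} + v $$
and substituting into Model 2 gives the same expression as above plus an extra term $\beta_2 u + \beta_3 v$. Since OLS residuals are, in sample, exactly orthogonal to the regressors (here the intercept and Student), this extra term does not change the coefficients of the short regression, so the identity $\alpha_1 = \beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1$ holds exactly in sample: it is the classic omitted-variable-bias identity.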
Checking with R, this last comparison is TRUE.
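The same identity can also be checked independently of the Default data on simulated numbers. Here is a quick sketch in Python with NumPy (every coefficient in the data-generating process below is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
student = rng.integers(0, 2, n).astype(float)
# Invented data-generating process: students are poorer and more indebted
income = 40_000 - 15_000 * student + rng.normal(0, 5_000, n)
balance = 500 + 400 * student + rng.normal(0, 200, n)
y = 0.01 + 0.002 * student + 1e-7 * income + 1e-4 * balance + rng.normal(0, 0.05, n)

def ols(y, *cols):
    """OLS with an intercept; returns the coefficient vector."""
    X = np.column_stack((np.ones(len(y)),) + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

alpha = ols(y, student)                    # Model 1: y ~ student
beta = ols(y, student, income, balance)    # Model 2: y ~ student + income + balance
gamma = ols(income, student)               # income ~ student
omega = ols(balance, student)              # balance ~ student

total = beta[1] + beta[2] * gamma[1] + beta[3] * omega[1]
print(alpha[1], total)  # equal up to floating-point error
```

Because the identity is algebraic, it holds for any sample, not just in expectation.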
Now, let's re-write the models using logistic regression.
Model 1:
$$ \text{logit}(p_{\text{DEF}}) = \alpha_0 + \alpha_1 \text{Student} $$
Model 2:
$$ \text{logit}(p_{\text{DEF}}) = \beta_0 + \beta_1 \text{Student} + \beta_2 \text{Income} + \beta_3 \text{Balance} $$
If I write:
$$ \text{Income} = \gamma_0 + \gamma_1 \text{Student} $$ $$ \text{Balance} = \omega_0 + \omega_1 \text{Student} $$
Then:
$$ \begin{align} \text{logit}(p_{\text{DEF}}) &= \beta_0 + \beta_1 \text{Student} + \beta_2 (\gamma_0 + \gamma_1 \text{Student}) + \beta_3 (\omega_0 + \omega_1 \text{Student}) \\ &= \beta_0 + \beta_1 \text{Student} + \beta_2 \gamma_0 + \beta_2 \gamma_1 \text{Student} + \beta_3 \omega_0 + \beta_3 \omega_1 \text{Student} \\ &= \underbrace{(\beta_0 + \beta_2 \gamma_0+\beta_3 \omega_0)}_{=\alpha_0} + \underbrace{(\beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1)}_{=\alpha_1} \text{Student} \end{align} $$
Thus I obtain the same comparison as above:
$$ \alpha_1 = \beta_1 + \beta_2 \gamma_1 + \beta_3 \omega_1 $$
But, checking with R, this time the comparison returns FALSE.
Why?
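I suspect the nonlinearity of the link matters here, since the logistic function does not commute with averaging. A toy illustration in Python (the two balance values and the coefficients are invented, not estimated from the data):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Invented example: two equally likely balance values and made-up coefficients
b_lo, b_hi = 500.0, 1500.0
beta0, beta3 = -6.0, 0.004

# The average of the two conditional probabilities...
p_marginal = 0.5 * logistic(beta0 + beta3 * b_lo) + 0.5 * logistic(beta0 + beta3 * b_hi)

# ...is not the probability evaluated at the average balance
p_plugin = logistic(beta0 + beta3 * (b_lo + b_hi) / 2)

print(p_marginal, p_plugin)  # the two differ: the logit link is nonlinear
```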
CODE
indf <- ISLR2::Default
indf$default <- as.logical(indf$default=="Yes")
indf$student <- as.logical(indf$student=="Yes")

# WORKS: the comparison at the end of the code returns TRUE.
# A GLM with gaussian family and identity link is an OLS, see:
# https://stats.stackexchange.com/questions/211585/how-does-ols-regression-relate-to-generalised-linear-modelling
myfamily <- gaussian(identity)

# DOES NOT WORK: the comparison at the end of the code returns FALSE. Why?
#myfamily <- binomial(link="logit")

lmod1.fit <- glm(default ~ student,
                 data=indf,
                 family=myfamily)

# Total effect from the first model
alpha1 <- lmod1.fit$coefficients["studentTRUE"]

lmod2.fit <- glm(default ~ student+income+balance,
                 data=indf,
                 family=myfamily)

beta1 <- lmod2.fit$coefficients["studentTRUE"]
beta2 <- lmod2.fit$coefficients["income"]
beta3 <- lmod2.fit$coefficients["balance"]

mod.stu.inc <- lm(income~student, data=indf)
gamma1 <- mod.stu.inc$coefficients["studentTRUE"]

mod.stu.bal <- lm(balance~student, data=indf)
omega1 <- mod.stu.bal$coefficients["studentTRUE"]

# Total effect from the second model:
# direct effect + effect mediated by income + effect mediated by balance
tot <- beta1 + gamma1*beta2 + omega1*beta3

print(alpha1)
print(tot)

# Compare the total effects obtained from the two models.
# With a gaussian GLM (i.e. OLS) this returns TRUE, but with
# logistic or probit regression it returns FALSE. Why?
print(all.equal(alpha1, tot))