If I understand correctly, it seems you are trying to determine
- The best linear regression model from a set of several variables
- Which variables are statistically significant (and which are not)
One word of caution, however:
> We want to do some inference, in particular we're interested in understanding if race and gender are factors that determine the salary. The underlying hypothesis is that the company is racist/sexist.
Unless your data comes from a carefully designed experiment with controlled variables, statistical significance does not necessarily imply causality. With that out of the way, I don't see why you couldn't combine the two approaches. I see two main steps:
Decide on the comparison criteria
The following criteria will help you compare different models with one another:
- Mallows' $C_p$
- Akaike's Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
- Adjusted $R^2$
- The Fisher (F) test for overall model significance
- The Fisher (F) statistic for the test of significance of nested models
You can also evaluate the significance of individual variables using:
- The Student t statistic for the test of significance of each parameter (see the sketch just below)
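For concreteness, here is a minimal sketch of where these quantities come from in practice, assuming Python with `statsmodels` and some toy salary data (the column names `experience`, `gender`, `race` are placeholders, not your actual variables):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data standing in for your salary data (placeholder column names)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "experience": rng.uniform(0, 20, n),
    "gender": rng.integers(0, 2, n),
    "race": rng.integers(0, 2, n),
})
df["salary"] = 30000 + 2000 * df["experience"] + rng.normal(0, 5000, n)

full = smf.ols("salary ~ experience + gender + race", data=df).fit()

print(full.aic, full.bic, full.rsquared_adj)  # information criteria and adjusted R^2
print(full.fvalue, full.f_pvalue)             # F test for overall model significance
print(full.summary())                         # t statistics / p-values per coefficient

# F test for a nested (restricted) model against the full model
restricted = smf.ols("salary ~ experience", data=df).fit()
f_stat, p_value, df_diff = full.compare_f_test(restricted)
print(f_stat, p_value)
```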
Theoretical justification
$C_p$, AIC and BIC all have rigorous theoretical justifications that rely on asymptotic arguments, i.e. they hold when the sample size $m$ grows very large, whereas the adjusted $R^2$, although quite intuitive, is not as well motivated by statistical theory.
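For reference, the standard definitions are roughly as follows (writing $m$ for the number of observations, $p$ for the number of predictors, $\hat{L}$ for the maximised likelihood, RSS for the residual sum of squares and $\hat{\sigma}^2$ for an estimate of the error variance; conventions differ slightly on whether the intercept is counted in $p$):

$$\mathrm{AIC} = 2p - 2\ln\hat{L}, \qquad \mathrm{BIC} = p\ln(m) - 2\ln\hat{L},$$
$$C_p = \frac{1}{m}\left(\mathrm{RSS} + 2p\,\hat{\sigma}^2\right), \qquad R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(m - 1)}{m - p - 1}.$$

The first three explicitly trade goodness of fit against the number of parameters, while the adjusted $R^2$ simply corrects the ordinary $R^2$ for the degrees of freedom used.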
Decide on the iterative procedure
Best subset selection
Fit a separate model for each possible combination of the $K$ predictors and then select the best subset (a code sketch follows the list below). That is, we fit:
- All models that contain exactly one predictor: $\binom{K}{1} = K$ of them
- All models that contain exactly two predictors at the second step: $\binom{K}{2}$ of them
- And so on, until the final step where all $K$ predictors are included in the model
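As an illustration, a brute-force best subset search could look like the sketch below (assuming Python with `statsmodels`, a pandas DataFrame `X` of predictors and a response `y`; BIC is used as the criterion here, but any of the measures above would do):

```python
from itertools import combinations

import statsmodels.api as sm

def best_subset(X, y):
    """Fit an OLS model for every non-empty subset of the columns of X
    and return the fit with the lowest BIC (2^K - 1 fits in total)."""
    best_fit, best_vars = None, None
    cols = list(X.columns)
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            design = sm.add_constant(X[list(subset)])  # add the intercept
            fit = sm.OLS(y, design).fit()
            if best_fit is None or fit.bic < best_fit.bic:
                best_fit, best_vars = fit, subset
    return best_fit, best_vars

# e.g. with the toy data from the earlier sketch:
# fit, vars_ = best_subset(df[["experience", "gender", "race"]], df["salary"])
```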
Given $K$ explanatory variables, you will have $2^K$ different models to compare, which quickly becomes large: for example, $2^{11} = 2048$. For computational reasons, best subset selection cannot be applied when $K$ is large. You can use instead:
Forward stepwise selection
Forward stepwise selection begins with a model containing no predictors and then adds predictors one at a time. At each step, the variable that gives the greatest additional improvement to the fit is added to the model.
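A forward stepwise sketch under the same assumptions (Python, `statsmodels`, predictor DataFrame `X`, response `y`, BIC as the selection and stopping criterion) could be:

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y):
    """Greedily add, one at a time, the predictor that most improves BIC;
    stop when no remaining predictor improves the criterion."""
    remaining, selected = list(X.columns), []
    best_bic = sm.OLS(y, np.ones(len(y))).fit().bic   # intercept-only model
    while remaining:
        # score each candidate addition by the BIC of the enlarged model
        scores = [(sm.OLS(y, sm.add_constant(X[selected + [col]])).fit().bic, col)
                  for col in remaining]
        bic, col = min(scores)
        if bic >= best_bic:
            break                                     # no further improvement
        best_bic = bic
        selected.append(col)
        remaining.remove(col)
    return selected
```

Rather than $2^K$ fits, this requires at most $K(K+1)/2$ fits, which is why it scales to larger numbers of predictors.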
Backward stepwise, hybrids and many more...
In practice
Since you have a small number of explanatory variables, I would perform a best subset procedure, compare all the models using the various criteria mentioned above, and then see whether your preferred variables are indeed statistically significant and/or lead to the best model.
Don't forget to check for multicollinearity, report confidence intervals, and validate the linear regression assumptions, such as normality of the errors and homoscedasticity (constant error variance).
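A quick diagnostic pass on the chosen model could look like this sketch (again Python with `statsmodels`/`scipy`, reusing the toy `df` and placeholder column names from the first sketch):

```python
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix (with intercept) and fit of the chosen model
design = sm.add_constant(df[["experience", "gender", "race"]])
fit = sm.OLS(df["salary"], design).fit()

# Multicollinearity: variance inflation factor for each predictor
vifs = {col: variance_inflation_factor(design.values, i)
        for i, col in enumerate(design.columns) if col != "const"}
print(vifs)

print(fit.conf_int(alpha=0.05))             # 95% confidence intervals for coefficients
print(stats.shapiro(fit.resid))             # normality of residuals (Shapiro-Wilk)
print(het_breuschpagan(fit.resid, design))  # homoscedasticity (Breusch-Pagan)
```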
Here is a graph of what best subset selection could look like using the Residual Sum of Squares (RSS) and $R^2$ criteria:

And here is what a forward stepwise selection would look like on the same data:

Code can be found here