
Problem Statement

So I read in Elements of Statistical Learning, p. 54, that another way of doing OLS with multiple predictors is the following. Assume two predictors $X_1, X_2$ and a target variable $Y$:

  • Regress: $Y \sim X_1$, that gives you $\beta_1$

  • then take the residuals from that, $e_1$, and run $X_2 \sim e_1$ to get your $\beta_2$ (see the minimal sketch just below)
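A minimal sketch of these two steps (my own condensed restatement of the procedure above, not the book's exact algorithm), using np.polyfit for the simple-regression slopes:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = x1 + x2

# step 1: regress Y ~ X1 and keep the residuals e1
fit1 = np.polyfit(x1, y, 1)           # returns [slope, intercept]
beta_1 = fit1[0]
e1 = y - np.polyval(fit1, x1)

# step 2: regress X2 ~ e1 (X2 as the response), as described above
beta_2 = np.polyfit(e1, x2, 1)[0]
print(beta_1, beta_2)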


Intuitive Explanation

This makes sense intuitively: the first step answers "how much of $Y$ is explainable by $X_1$?", and the second step answers "how much of the part of $Y$ not explained by $X_1$ can be explained by $X_2$?", i.e. how much residual variance in $Y$ is explainable by $X_2$.


Empirical Simulation

But then I had to ask: what happens when you flip things the other way? So I wrote a small bit of code to simulate this.

Trivial example: if $X_1$ and $X_2$ are standard Gaussian, I get 1's all around...

But it gives different values if I flip the order and the two predictors don't have the same variance...

Here is the code (Python 3) that runs what I just said for a fictional $Y = X_1 + X_2$:

where $X_1 \sim N(0, 4)$ and $X_2 \sim N(0, 1)$

($X_1$ in the code is just $2\cdot N(0,1)$)

import numpy as np
from sklearn import linear_model

# Generating Y = X1 + X2
N = int(1e4)

X1 = []
X2 = []
X1X2 = []
Y = []

for i in range(N):
    x1 = np.random.normal() * 2  # multiplying by 2 makes variance = 4
    x2 = np.random.normal()
    y = x1 + x2

    X1.append(x1)
    X2.append(x2)
    # stack them both for the full regression:
    X1X2.append([x1, x2])
    Y.append(y)

X1 = np.array(X1).reshape(-1, 1)
X2 = np.array(X2).reshape(-1, 1)
X1X2 = np.array(X1X2)
Y = np.array(Y)

# Run the regression starting with X1 first
reg = linear_model.LinearRegression()
reg.fit(X1, Y)
resid = Y - reg.predict(X1)
print(f"Y~X1  --> Beta_1 is: {reg.coef_}")

resid_fit = linear_model.LinearRegression()
resid_fit.fit(resid.reshape(-1, 1), X2)
print(f"X2~e1 --> Beta_2 is: {resid_fit.coef_}")
print("-------------")

# Then flip the order:
reg = linear_model.LinearRegression()
reg.fit(X2, Y)
resid = Y - reg.predict(X2)
print(f"Y~X2  --> Beta_2 is: {reg.coef_}")

resid_fit = linear_model.LinearRegression()
resid_fit.fit(resid.reshape(-1, 1), X1)
print(f"X1~e2 --> Beta_1 is: {resid_fit.coef_}")
print("-------------")
print("-------------")

# Then both predictors at once:
reg = linear_model.LinearRegression()
reg.fit(X1X2, Y)
print("Full Regression Coeffs: ", reg.coef_)


If I start with $X_1$, I get the true coefficients, $\beta_1=2, \beta_2=1$; but if I start with $X_2$, I get $\beta_1=1, \beta_2=0.5$, half the real values. Why?

I ran it with $X_1 \sim N(0,4)$ and $X_2 \sim N(0, 9)$, and got the following output (rounded for clarity):

Y~X1  --> Beta_1 is: [2.01558849]
X2~e1 --> Beta_2 is: [[0.33333333]]
-------------
Y~X2  --> Beta_2 is: [3.00462912]
X1~e2 --> Beta_1 is: [[0.5]]
-------------
-------------
Full Regression Coeffs:  [ 2.00000000e+00  3.00000000e+00 -3.33066907e-16]

So it looks like it's doing:

$$\hat{\beta_i} = \frac{\beta_i}{Var(X_i)}$$

for whichever $X_i$ goes second... why?


Question:

  • What's going on here?
  • Is there an intuitive explanation?
  • How can I derive that final formula for $\beta_i$ I wrote above?
JoeVictor

1 Answer


The authors couldn't be clearer: before presenting the algorithm, they gave a decent elaboration of what they were up to.

But let me break it down into smaller pieces.


Let $\mathbb V(F)$ be an inner-product space over the field $F.$ Let $\mathbf u, \mathbf v\in \mathbb V; ~\mathbf v\ne \mathbf 0.$ Then $\mathbf u$ can be decomposed orthogonally in the form $\mathbf u = c\mathbf v + \mathbf w$ for some suitable $c\in F, \mathbf w\in \mathbb V$ as

$$ \mathbf u = \underbrace{\frac{\langle \mathbf u, \mathbf v\rangle}{\Vert \mathbf v\Vert^2}}_{:=c}\mathbf v + \underbrace{\left(\mathbf u -\frac{\langle \mathbf u, \mathbf v\rangle}{\Vert \mathbf v\Vert^2}\mathbf v \right)}_{:=\mathbf w}$$ where $\mathbf v\perp \mathbf w.$
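As a quick numerical check of this decomposition (my own sketch, not part of the original argument), in $\mathbb R^5$ with the usual dot product:

import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)
v = rng.standard_normal(5)

c = np.dot(u, v) / np.dot(v, v)     # <u, v> / ||v||^2
w = u - c * v                       # the remaining component of u

print(np.allclose(u, c * v + w))    # True: u = c*v + w
print(np.isclose(np.dot(v, w), 0))  # True: w is orthogonal to v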


Before specifically delving into the authors' content, let us have a look at how to evaluate the coefficients for two sets of variables $\mathbf X_1, ~\mathbf X_2$, i.e. we are interested in estimating $\boldsymbol{\beta}_i,~ i = 1,2$ in $\mathbf y = \mathbf X_1\boldsymbol\beta_1 +\mathbf X_2\boldsymbol\beta_2 +\boldsymbol\varepsilon.$ The corresponding normal equations are

$$\begin{bmatrix}\mathbf X_1^\top\mathbf X_1 & \mathbf X_1^\top\mathbf X_2\\\mathbf X_2^\top\mathbf X_1 & \mathbf X_2^\top\mathbf X_2 \end{bmatrix}\begin{bmatrix}\mathbf b_1\\ \mathbf b_2\end{bmatrix} = \begin{bmatrix}\mathbf X_1^\top \mathbf y\\\mathbf X_2^\top \mathbf y \end{bmatrix}\tag 1\label 1;$$ if in $\eqref 1,$ the off-diagonal elements are $\bf 0$ i.e. $\mathbf X_1\perp \mathbf X_2,$ then it's easy to check $\mathbf b_i = \left(\mathbf X_i^\top\mathbf X_i\right)^{-1}(\mathbf X_i^\top\mathbf y), ~i=1,2.$
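Here is a small numpy illustration of the orthogonal case (my own sketch with made-up data): after forcing $\mathbf X_1^\top\mathbf X_2 = \mathbf 0$, the coefficients from the joint normal equations coincide with the block-by-block estimates $\left(\mathbf X_i^\top\mathbf X_i\right)^{-1}\mathbf X_i^\top\mathbf y$.

import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = rng.standard_normal((n, 2))
X2 = rng.standard_normal((n, 2))
X2 = X2 - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ X2)   # make X2 orthogonal to X1

y = X1 @ np.array([2.0, -1.0]) + X2 @ np.array([0.5, 3.0]) + 0.1 * rng.standard_normal(n)

X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)             # joint normal equations
b1 = np.linalg.solve(X1.T @ X1, X1.T @ y)              # block-by-block
b2 = np.linalg.solve(X2.T @ X2, X2.T @ y)

print(np.allclose(b_full, np.concatenate([b1, b2])))   # True because X1'X2 = 0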

When $\mathbf X_1\not\perp\mathbf X_2,$ a routine calculation yields $$\mathbf b_2 = \left[\mathbf X_2^\top\left(\mathbf I-\mathbf X_1\left(\mathbf X_1^\top\mathbf X_1\right)^{-1}\mathbf X_1^\top\right)\mathbf X_2\right]^{-1}\left[\mathbf X_2^\top\left(\mathbf I-\mathbf X_1\left(\mathbf X_1^\top\mathbf X_1\right)^{-1}\mathbf X_1^\top\right)\mathbf y\right]\tag 2\label 2;$$ in $\eqref 2,~\mathbf M_1:= \mathbf I-\mathbf X_1\left(\mathbf X_1^\top\mathbf X_1\right)^{-1}\mathbf X_1^\top$ is the projection operator that produces residuals from a regression on $\mathbf X_1.$ Therefore, re-writing $\eqref 2$ using $\mathbf X_2^\star:= \mathbf M_1\mathbf X_2$ (the matrix of residuals when $\mathbf X_2$ is regressed on $\mathbf X_1$; similarly $\mathbf y^\star := \mathbf M_1\mathbf y$), we obtain

$$\mathbf b_2 = \left({\mathbf X_2^\star}^\top\mathbf X_2^\star\right)^{-1}{\mathbf X_2^\star}^\top\mathbf y^\star\tag 3.$$ This is the essence of the FWL theorem, as was also pointed out in the comment. It says that $\mathbf b_2$ is obtained by regressing the residuals of $\mathbf y$ (after regressing on $\mathbf X_1$) on the residuals of $\mathbf X_2$ (after regressing, again, on $\mathbf X_1$).
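A compact numpy sketch of $\eqref 3$ (again my own illustration with made-up data): the coefficients on $\mathbf X_2$ from the full regression equal those from regressing the residualized $\mathbf y$ on the residualized $\mathbf X_2$. np.linalg.solve is used instead of explicit inverses purely for numerical stability.

import numpy as np

rng = np.random.default_rng(3)
n = 1000
X1 = rng.standard_normal((n, 2))
X2 = rng.standard_normal((n, 2)) + X1 @ np.array([[0.5, 0.2], [0.1, 0.7]])  # correlated with X1
y = X1 @ np.array([1.0, -2.0]) + X2 @ np.array([3.0, 0.5]) + rng.standard_normal(n)

# Full regression: b = (X'X)^{-1} X'y; the last two entries are b2
X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)

# FWL: residualize y and X2 with respect to X1, then regress residual on residual
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2s, ys = M1 @ X2, M1 @ y
b2_fwl = np.linalg.solve(X2s.T @ X2s, X2s.T @ ys)

print(np.allclose(b_full[2:], b2_fwl))                 # True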


When $\mathbf x_2$ is regressed on $\mathbf x_1,$ let $\mathbf z$ be the resulting residual vector, so that $\mathbf z\perp \mathbf x_1.$ Then, from the above discussion, project $\mathbf y$ on $\mathbf x_1$ and $\mathbf z$ to get the required estimates.

If you look into the algorithm, it is nothing but constructing $\mathbf z_i$ such that $$\operatorname{span}(\mathbf x_1,\mathbf x_2, \ldots, \mathbf x_p) = \operatorname{span}(\mathbf z_1,\mathbf z_2, \ldots, \mathbf z_p) ~\wedge~ \langle \mathbf z_i,\mathbf z_j\rangle = 0 \text{ for } i\ne j.$$ This makes life easier in estimating the coefficients, as shown above.
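For concreteness, a short numpy sketch of this successive orthogonalization (my own, without an intercept column), together with a check that regressing $\mathbf y$ on the last $\mathbf z_p$ alone reproduces the multiple-regression coefficient of $\mathbf x_p$:

import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 4
X = rng.standard_normal((n, p))
X[:, 3] += 0.6 * X[:, 0] - 0.4 * X[:, 1]               # make the columns correlated
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(n)

# Successive orthogonalization: z_j = x_j minus its projections on z_1, ..., z_{j-1}
Z = np.empty_like(X)
for j in range(p):
    z = X[:, j].copy()
    for k in range(j):
        z -= (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k]) * Z[:, k]
    Z[:, j] = z

# The z's are mutually orthogonal (and span the same column space by construction)
G = Z.T @ Z
print(np.allclose(G - np.diag(np.diag(G)), 0))

# The simple regression of y on z_p gives the multiple-regression coefficient of x_p
b_full = np.linalg.solve(X.T @ X, X.T @ y)
b_p = (Z[:, p - 1] @ y) / (Z[:, p - 1] @ Z[:, p - 1])
print(np.isclose(b_full[-1], b_p))                     # True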


Reference:

$\rm [I]$ Econometric Analysis, William H. Greene, sec. $3.3,$ pp. $35-37,$ Pearson, $2018.$

User1865345