
Problem Statement

So I read in Elements of Statistical Learning, p. 54, that another way of doing OLS with multiple predictors is the following. Assume two predictors $X_1, X_2$ and a target variable $Y$:

  • Regress: $Y \sim X_1$, that gives you $\beta_1$

  • then take the residuals from that, $e_1$, and run $X_2 \sim e_1$ to get your $\beta_2$ (see the minimal sketch just below)
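A minimal sketch of these two steps (my own condensed restatement of the procedure above, not the book's exact algorithm), using np.polyfit for the simple-regression slopes:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = x1 + x2

# step 1: regress Y ~ X1 and keep the residuals e1
fit1 = np.polyfit(x1, y, 1)           # returns [slope, intercept]
beta_1 = fit1[0]
e1 = y - np.polyval(fit1, x1)

# step 2: regress X2 ~ e1 (X2 as the response), as described above
beta_2 = np.polyfit(e1, x2, 1)[0]
print(beta_1, beta_2)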


Intuitive Explanation

This makes sense intuitively: the first step answers "how much of $Y$ is explainable by $X_1$?", and the second step answers "how much of the part of $Y$ not explained by $X_1$ can be explained by $X_2$?", i.e. how much residual variance in $Y$ is explainable by $X_2$.


Empirical Simulation

But then I had to ask: what happens when you flip things the other way? So I wrote a small bit of code to simulate this.

Trivial example: if $X_1$ and $X_2$ are standard Gaussian, I get 1's all around...

But it gives different values if I flip the order and the two predictors don't have the same variance...

Here is the code (Python 3) that runs what I just said for a fictional $Y = X_1 + X_2$:

where $X_1 \sim N(0, 4)$ and $X_2 \sim N(0, 1)$

($X_1$ in the code is just $2\cdot N(0,1)$)

import numpy as np
from sklearn import linear_model

# Generating Y = X1 + X2
N = int(1e4)

X1 = []
X2 = []
X1X2 = []
Y = []

for i in range(N):
    x1 = np.random.normal() * 2  # multiplying by 2 makes variance = 4
    x2 = np.random.normal()
    y = x1 + x2

    X1.append(x1)
    X2.append(x2)
    # stack them both for the full regression:
    X1X2.append([x1, x2])
    Y.append(y)

X1 = np.array(X1).reshape(-1, 1)
X2 = np.array(X2).reshape(-1, 1)
X1X2 = np.array(X1X2)
Y = np.array(Y)

# Run the regression starting with X1 first
reg = linear_model.LinearRegression()
reg.fit(X1, Y)
resid = Y - reg.predict(X1)
print(f"Y~X1  --> Beta_1 is: {reg.coef_}")

resid_fit = linear_model.LinearRegression()
resid_fit.fit(resid.reshape(-1, 1), X2)
print(f"X2~e1 --> Beta_2 is: {resid_fit.coef_}")
print("-------------")

# Then flip the order:
reg = linear_model.LinearRegression()
reg.fit(X2, Y)
resid = Y - reg.predict(X2)
print(f"Y~X2  --> Beta_2 is: {reg.coef_}")

resid_fit = linear_model.LinearRegression()
resid_fit.fit(resid.reshape(-1, 1), X1)
print(f"X1~e2 --> Beta_1 is: {resid_fit.coef_}")
print("-------------")
print("-------------")

# Then both predictors at once:
reg = linear_model.LinearRegression()
reg.fit(X1X2, Y)
print("Full Regression Coeffs: ", reg.coef_)


If I start with $X_1$, I get the true coefficients, $\beta_1=2, \beta_2=1$; but if I start with $X_2$, I get $\beta_1=1, \beta_2=0.5$, half the real values. Why?

I ran it with $X_1 \sim N(0,4)$ and $X_2 \sim N(0, 9)$, and got the following output (rounded for clarity):

Y~X1  --> Beta_1 is: [2.01558849]
X2~e1 --> Beta_2 is: [[0.33333333]]
-------------
Y~X2  --> Beta_2 is: [3.00462912]
X1~e2 --> Beta_1 is: [[0.5]]
-------------
-------------
Full Regression Coeffs:  [ 2.00000000e+00  3.00000000e+00 -3.33066907e-16]

So it looks like it's doing:

$$\hat{\beta_i} = \frac{\beta_i}{Var(X_i)}$$

for whichever $X_i$ goes second... why?


Question:

  • What's going on here?
  • Is there an intuitive explanation?
  • How can I derive that final formula for $\beta_i$ I wrote above?
JoeVictor

1 Answer


The authors couldn't be clearer: before presenting the algorithm, they gave a decent elaboration of what they were up to.

But let me break it down into smaller pieces.


Let $\mathbb V(F)$ be an inner-product space over the field $F.$ Let $\mathbf u, \mathbf v\in \mathbb V; ~\mathbf v\ne \mathbf 0.$ Then $\mathbf u$ can be decomposed orthogonally in the form $\mathbf u = c\mathbf v + \mathbf w$ for some suitable $c\in F, \mathbf w\in \mathbb V$ as

$$ \mathbf u = \underbrace{\frac{\langle \mathbf u, \mathbf v\rangle}{\Vert \mathbf v\Vert^2}}_{:=c}\mathbf v + \underbrace{\left(\mathbf u -\frac{\langle \mathbf u, \mathbf v\rangle}{\Vert \mathbf v\Vert^2}\mathbf v \right)}_{:=\mathbf w}$$ where $\mathbf v\perp \mathbf w.$
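As a quick numerical check of this decomposition (my own sketch, not part of the original argument), in $\mathbb R^5$ with the usual dot product:

import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)
v = rng.standard_normal(5)

c = np.dot(u, v) / np.dot(v, v)     # <u, v> / ||v||^2
w = u - c * v                       # the remaining component of u

print(np.allclose(u, c * v + w))    # True: u = c*v + w
print(np.isclose(np.dot(v, w), 0))  # True: w is orthogonal to v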


Before specifically delving into the authors' content, let us have a look at how to evaluate the coefficients for two sets of variables $\mathbf X_1, ~\mathbf X_2$, i.e. we are interested in estimating $\boldsymbol{\beta}_i,~ i = 1,2$ in $\mathbf y = \mathbf X_1\boldsymbol\beta_1 +\mathbf X_2\boldsymbol\beta_2 +\boldsymbol\varepsilon.$ The corresponding normal equations are

$$\begin{bmatrix}\mathbf X_1^\top\mathbf X_1 & \mathbf X_1^\top\mathbf X_2\\\mathbf X_2^\top\mathbf X_1 & \mathbf X_2^\top\mathbf X_2 \end{bmatrix}\begin{bmatrix}\mathbf b_1\\ \mathbf b_2\end{bmatrix} = \begin{bmatrix}\mathbf X_1^\top \mathbf y\\\mathbf X_2^\top \mathbf y \end{bmatrix}\tag 1\label 1;$$ if in $\eqref 1,$ the off-diagonal elements are $\bf 0$ i.e. $\mathbf X_1\perp \mathbf X_2,$ then it's easy to check $\mathbf b_i = \left(\mathbf X_i^\top\mathbf X_i\right)^{-1}(\mathbf X_i^\top\mathbf y), ~i=1,2.$
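Here is a small numpy illustration of the orthogonal case (my own sketch with made-up data): after forcing $\mathbf X_1^\top\mathbf X_2 = \mathbf 0$, the coefficients from the joint normal equations coincide with the block-by-block estimates $\left(\mathbf X_i^\top\mathbf X_i\right)^{-1}\mathbf X_i^\top\mathbf y$.

import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = rng.standard_normal((n, 2))
X2 = rng.standard_normal((n, 2))
X2 = X2 - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ X2)   # make X2 orthogonal to X1

y = X1 @ np.array([2.0, -1.0]) + X2 @ np.array([0.5, 3.0]) + 0.1 * rng.standard_normal(n)

X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)             # joint normal equations
b1 = np.linalg.solve(X1.T @ X1, X1.T @ y)              # block-by-block
b2 = np.linalg.solve(X2.T @ X2, X2.T @ y)

print(np.allclose(b_full, np.concatenate([b1, b2])))   # True because X1'X2 = 0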

When $\mathbf X_1\not\perp\mathbf X_2,$ a routine calculation yields $$\mathbf b_2 = \left[\mathbf X_2^\top\left(\mathbf I-\mathbf X_1\left(\mathbf X_1^\top\mathbf X_1\right)^{-1}\mathbf X_1^\top\right)\mathbf X_2\right]^{-1}\left[\mathbf X_2^\top\left(\mathbf I-\mathbf X_1\left(\mathbf X_1^\top\mathbf X_1\right)^{-1}\mathbf X_1^\top\right)\mathbf y\right]\tag 2\label 2;$$ in $\eqref 2,~\mathbf M_1:= \mathbf I-\mathbf X_1\left(\mathbf X_1^\top\mathbf X_1\right)^{-1}\mathbf X_1^\top$ is the projection operator that produces residuals from a regression on $\mathbf X_1.$ Therefore, re-writing $\eqref 2$ using $\mathbf X_2^\star:= \mathbf M_1\mathbf X_2$ (the matrix of residuals when $\mathbf X_2$ is regressed on $\mathbf X_1$; similarly $\mathbf y^\star := \mathbf M_1\mathbf y$), we obtain

$$\mathbf b_2 = \left({\mathbf X_2^\star}^\top\mathbf X_2^\star\right)^{-1}{\mathbf X_2^\star}^\top\mathbf y^\star\tag 3.$$ This is the essence of the FWL theorem, as was also pointed out in the comment. It says that $\mathbf b_2$ is obtained by regressing the residuals of $\mathbf y$ (after regressing on $\mathbf X_1$) on the residuals of $\mathbf X_2$ (after regressing, again, on $\mathbf X_1$).
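A compact numpy sketch of $\eqref 3$ (again my own illustration with made-up data): the coefficients on $\mathbf X_2$ from the full regression equal those from regressing the residualized $\mathbf y$ on the residualized $\mathbf X_2$. np.linalg.solve is used instead of explicit inverses purely for numerical stability.

import numpy as np

rng = np.random.default_rng(3)
n = 1000
X1 = rng.standard_normal((n, 2))
X2 = rng.standard_normal((n, 2)) + X1 @ np.array([[0.5, 0.2], [0.1, 0.7]])  # correlated with X1
y = X1 @ np.array([1.0, -2.0]) + X2 @ np.array([3.0, 0.5]) + rng.standard_normal(n)

# Full regression: b = (X'X)^{-1} X'y; the last two entries are b2
X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)

# FWL: residualize y and X2 with respect to X1, then regress residual on residual
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2s, ys = M1 @ X2, M1 @ y
b2_fwl = np.linalg.solve(X2s.T @ X2s, X2s.T @ ys)

print(np.allclose(b_full[2:], b2_fwl))                 # True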


When $\mathbf x_2$ is regressed on $\mathbf x_1,$ let $\mathbf z$ be the resulting residual vector, so that $\mathbf z\perp \mathbf x_1.$ Then, from the above discussion, project $\mathbf y$ on $\mathbf x_1$ and $\mathbf z$ to get the required estimates.

If you look into the algorithm, it is nothing but constructing $\mathbf z_i$ such that $$\operatorname{span}(\mathbf x_1,\mathbf x_2, \ldots, \mathbf x_p) = \operatorname{span}(\mathbf z_1,\mathbf z_2, \ldots, \mathbf z_p) ~\wedge~ \langle \mathbf z_i,\mathbf z_j\rangle = 0 \text{ for } i\ne j.$$ This makes life easier in estimating the coefficients, as shown above.
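For concreteness, a short numpy sketch of this successive orthogonalization (my own, without an intercept column), together with a check that regressing $\mathbf y$ on the last $\mathbf z_p$ alone reproduces the multiple-regression coefficient of $\mathbf x_p$:

import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 4
X = rng.standard_normal((n, p))
X[:, 3] += 0.6 * X[:, 0] - 0.4 * X[:, 1]               # make the columns correlated
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(n)

# Successive orthogonalization: z_j = x_j minus its projections on z_1, ..., z_{j-1}
Z = np.empty_like(X)
for j in range(p):
    z = X[:, j].copy()
    for k in range(j):
        z -= (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k]) * Z[:, k]
    Z[:, j] = z

# The z's are mutually orthogonal (and span the same column space by construction)
G = Z.T @ Z
print(np.allclose(G - np.diag(np.diag(G)), 0))

# The simple regression of y on z_p gives the multiple-regression coefficient of x_p
b_full = np.linalg.solve(X.T @ X, X.T @ y)
b_p = (Z[:, p - 1] @ y) / (Z[:, p - 1] @ Z[:, p - 1])
print(np.isclose(b_full[-1], b_p))                     # True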


Reference:

$\rm [I]$ Econometric Analysis, William H. Greene, sec. $3.3,$ pp. $35-37,$ Pearson, $2018.$

User1865345