4

I've rewritten this question because my phrasing and notation were confusing.

We're assuming OLS regression throughout this post.

If we have the data $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{X}\in \mathbb{R}^{N \times M}$, and $\mathbf{Z} \in \mathbb{R}^{N \times L}$, then consider the following three different procedures to regress the data:

  1. Regress $\mathbf{X}$ on $\mathbf{Z}$ to get $\mathbf{X} = \mathbf{Z}\mathbf{\Gamma} + \mathbf{E}$. Define $\tilde{\mathbf{X}} \equiv \mathbf{E}$. Regress $\mathbf{y}$ on $\tilde{\mathbf{X}}$ to get $\mathbf{y} = \tilde{\mathbf{X}} \hat{\boldsymbol{\beta}}_1 + \boldsymbol{\epsilon}_1$.

  2. Concatenate the columns of $\mathbf{X}$ and $\mathbf{Z}$ to get an $N \times (M+L)$ matrix. Call this matrix $[\mathbf{X}\mathbf{Z}]$. Regress $\mathbf{y}$ on $[\mathbf{X}\mathbf{Z}]$ to get $\mathbf{y} = [\mathbf{X}\mathbf{Z}]\hat{\boldsymbol{\beta}}_{2,\text{total}} + \boldsymbol{\epsilon}_2$. Since we did column concatenation, we can write this as $\mathbf{y} = \mathbf{X} \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} + \mathbf{Z} \hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} + \boldsymbol{\epsilon}_2$.

  3. With $\tilde{\mathbf{X}}$ defined above, concatenate the columns of $\tilde{\mathbf{X}}$ and $\mathbf{Z}$ to get $[\tilde{\mathbf{X}} \mathbf{Z}]$. Regress $\mathbf{y}$ on $[\tilde{\mathbf{X}} \mathbf{Z}]$ to get $\mathbf{y} = [\tilde{\mathbf{X}} \mathbf{Z}] \hat{\boldsymbol{\beta}}_{3, \text{total}} + \boldsymbol{\epsilon}_3$. We can write this as $\mathbf{y} = \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} + \mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} + \boldsymbol{\epsilon}_3$.

It turns out that:

  1. $\hat{\boldsymbol{\beta}}_1 = \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} = \hat{\boldsymbol{\beta}}_{3, \tilde{\mathbf{X}}}$
  2. $\boldsymbol{\epsilon} _3 = \boldsymbol{\epsilon}_2 = \boldsymbol{\epsilon}_1 - \mathbf{Z}(\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}$

I found these equalities by calculating the coefficients from the OLS regression coefficient formula $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y}$ (assuming that $\mathbf{X}^\intercal\mathbf{X}$ is invertible), and I'll show the steps at the end; it's just some linear algebra.

My question is: can we prove these equalities without taking pains to do the matrix algebra? In other words, although I know these equalities hold, I don't know why they hold. It might be worth noting that $\tilde{\mathbf{X}}$ is orthogonal to $\mathbf{Z}$ (that is, $\mathbf{Z}^\intercal\tilde{\mathbf{X}} = \mathbf{0}$), but I'm not sure how to build a general argument around that.
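
For concreteness, here is a minimal numerical sketch of the claimed equalities (simulated data; the array names just mirror the symbols and dimensions above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 50, 3, 2
X = rng.normal(size=(N, M))
Z = rng.normal(size=(N, L))
y = rng.normal(size=N)

# Procedure 1: residualize X on Z, then regress y on X_tilde.
Gamma = np.linalg.solve(Z.T @ Z, Z.T @ X)
X_tilde = X - Z @ Gamma
beta_1 = np.linalg.lstsq(X_tilde, y, rcond=None)[0]
eps_1 = y - X_tilde @ beta_1

# Procedure 2: regress y on the concatenation [X Z].
beta_2 = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)[0]
eps_2 = y - np.hstack([X, Z]) @ beta_2

# Procedure 3: regress y on the concatenation [X_tilde Z].
beta_3 = np.linalg.lstsq(np.hstack([X_tilde, Z]), y, rcond=None)[0]
eps_3 = y - np.hstack([X_tilde, Z]) @ beta_3

S = Z @ np.linalg.solve(Z.T @ Z, Z.T)  # projection onto the column space of Z

print(np.allclose(beta_1, beta_2[:M]), np.allclose(beta_1, beta_3[:M]))  # True True
print(np.allclose(eps_2, eps_3), np.allclose(eps_2, eps_1 - S @ y))      # True True
```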

----------------Below is the calculation----------------

Since the regressions are all OLS, we have:

$\mathbf{\Gamma} = (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}$,

$\hat{\boldsymbol{\beta}}_1 = (\tilde{\mathbf{X}}^\intercal \tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal\mathbf{y}$

$\hat{\boldsymbol{\beta}}_{2,\text{total}} = ([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1}[\mathbf{X}\mathbf{Z}]^\intercal\mathbf{y}$,

$\hat{\boldsymbol{\beta}}_{3,\text{total}} = ([\tilde{\mathbf{X}}\mathbf{Z}]^\intercal[\tilde{\mathbf{X}}\mathbf{Z}])^{-1}[\tilde{\mathbf{X}}\mathbf{Z}]^\intercal\mathbf{y}$.

We may express $\hat{\boldsymbol{\beta}}_1$, $\hat{\boldsymbol{\beta}}_{2,\text{total}}$, and $\hat{\boldsymbol{\beta}}_{3,\text{total}}$ in terms of $\mathbf{X}$, $\mathbf{y}$, and $\mathbf{Z}$ for the sake of comparison.

First, let's calculate $\hat{\boldsymbol{\beta}}_1$.

Plugging $\mathbf{\Gamma}$ into the definition of $\tilde{\mathbf{X}}$, we get $\tilde{\mathbf{X}} \equiv \mathbf{E} = \mathbf{X} - \mathbf{Z}\mathbf{\Gamma} = \mathbf{X} - \mathbf{Z} (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}$.

To keep the equations compact, define $\mathbf{S} \equiv \mathbf{Z} (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal$. So, $\tilde{\mathbf{X}} = (\mathbf{I} - \mathbf{S})\mathbf{X}$.

Note the following properties of $\mathbf{S}$ (a quick numerical check follows this list):

  1. $\mathbf{S}^\intercal = \mathbf{S}$

  2. $(\mathbf{I}-\mathbf{S})^\intercal = \mathbf{I}-\mathbf{S}$

  3. $\mathbf{S} \mathbf{S} = \mathbf{S}$

  4. $(\mathbf{I}-\mathbf{S})(\mathbf{I}-\mathbf{S}) = \mathbf{I}-\mathbf{S}$

  5. $\mathbf{S}(\mathbf{I}-\mathbf{S}) = (\mathbf{I}-\mathbf{S})\mathbf{S} = 0$
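
A minimal sketch of that check, with a randomly generated $\mathbf{Z}$:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 2))
S = Z @ np.linalg.inv(Z.T @ Z) @ Z.T  # projection onto the column space of Z
I = np.eye(len(S))

print(np.allclose(S.T, S))                    # 1. S is symmetric
print(np.allclose((I - S).T, I - S))          # 2. I - S is symmetric
print(np.allclose(S @ S, S))                  # 3. S is idempotent
print(np.allclose((I - S) @ (I - S), I - S))  # 4. I - S is idempotent
print(np.allclose(S @ (I - S), 0))            # 5. S(I - S) = (I - S)S = 0
```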

Now, plugging $\tilde{\mathbf{X}}$ into the equation for $\hat{\boldsymbol{\beta}}_1$, we get:

$$\begin{align}\hat{\boldsymbol{\beta}}_1 &= (\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})^\intercal (\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})^\intercal\mathbf{y} \\ &=(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y}\end{align}$$

To calculate $\hat{\boldsymbol{\beta}}_{2,\text{total}}$, we use the following equation:

$\begin{align}\begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}^{-1}=\begin{bmatrix}\mathbf{P} & -\mathbf{P} \mathbf{B} \mathbf{D}^{-1}\\-\mathbf{D}^{-1}\mathbf{C}\mathbf{P} & \mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{B}\mathbf{D}^{-1}\end{bmatrix}\end{align}$

where $\mathbf{P} = (\mathbf{A} - \mathbf{B} \mathbf{D}^{-1}\mathbf{C})^{-1}$, assuming $\mathbf{D}$ invertible.
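
As a sanity check, the block-inversion identity can be verified on randomly generated blocks (a minimal sketch; the block sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, l = 3, 2
A = rng.normal(size=(m, m))
B = rng.normal(size=(m, l))
C = rng.normal(size=(l, m))
D = rng.normal(size=(l, l))

Dinv = np.linalg.inv(D)
P = np.linalg.inv(A - B @ Dinv @ C)
blockwise = np.block([[P, -P @ B @ Dinv],
                      [-Dinv @ C @ P, Dinv + Dinv @ C @ P @ B @ Dinv]])
direct = np.linalg.inv(np.block([[A, B], [C, D]]))
print(np.allclose(blockwise, direct))  # True
```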

With this equation, we can expand $([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1}$ in the equation for $\hat{\boldsymbol{\beta}}_{2,\text{total}}$.

Since $([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1} = \begin{bmatrix}\mathbf{X}^\intercal\mathbf{X} & \mathbf{X}^\intercal\mathbf{Z} \\ \mathbf{Z}^\intercal \mathbf{X} & \mathbf{Z}^\intercal \mathbf{Z}\end{bmatrix}^{-1}$, let $\mathbf{A} = \mathbf{X}^\intercal\mathbf{X}$, $\mathbf{B} = \mathbf{X}^\intercal\mathbf{Z}$, $\mathbf{C} = \mathbf{Z}^\intercal \mathbf{X}$, and $\mathbf{D} = \mathbf{Z}^\intercal\mathbf{Z}$.

So, $\mathbf{P} = (\mathbf{X}^\intercal\mathbf{X}-\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X})^{-1} = (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}$.

As mentioned in the description of the procedure, we can write $\hat{\boldsymbol{\beta}}_{2,\text{total}}$ as the row concatenation of $\hat{\boldsymbol{\beta}}_{2,\mathbf{X}}$ and $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$. We calculate them separately here.

$$\begin{align}\hat{\boldsymbol{\beta}}_{2,\mathbf{X}} &= \mathbf{P}\mathbf{X}^\intercal\mathbf{y} - \mathbf{P}\mathbf{B}\mathbf{D}^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} - (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}\end{align}$$

$$\begin{align}\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} &= -\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{X}^\intercal\mathbf{y} + (\mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{B}\mathbf{D}^{-1})\mathbf{Z}^\intercal\mathbf{y}\\&=-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} + (\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y} + (\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\end{align}$$
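
Both of these closed-form expressions can be checked against a direct least-squares fit; here is a minimal sketch with simulated data (`S` is constructed exactly as defined above):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, L = 50, 3, 2
X = rng.normal(size=(N, M))
Z = rng.normal(size=(N, L))
y = rng.normal(size=N)

S = Z @ np.linalg.solve(Z.T @ Z, Z.T)  # S = Z (Z'Z)^{-1} Z'
I_S = np.eye(N) - S

# Closed-form expressions derived above.
b2_X = np.linalg.solve(X.T @ I_S @ X, X.T @ I_S @ y)
b2_Z = -np.linalg.solve(Z.T @ Z, Z.T @ X) @ b2_X + np.linalg.solve(Z.T @ Z, Z.T @ y)

# Direct least-squares fit of y on [X Z].
b2 = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)[0]
print(np.allclose(b2[:M], b2_X), np.allclose(b2[M:], b2_Z))  # True True
```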

To calculate $\hat{\boldsymbol{\beta}}_{3,\text{total}}$, we again calculate $\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}}$ and $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$ separately.

To do this, note that we only need to replace every $\mathbf{X}$ in the expressions for $\hat{\boldsymbol{\beta}}_{2,\mathbf{X}}$ and $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$ with $\tilde{\mathbf{X}}$.

$$\begin{align}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} &= (\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} \\ &= (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})^\intercal(\mathbf{I}-\mathbf{S})(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} \\ &= (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}\end{align}$$

We're not going to explicitly do the substitution for $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$, and the reason will be clear as we compare the $\boldsymbol{\epsilon}$'s.

So far, we have demonstrated that $\hat{\boldsymbol{\beta}}_1 = \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} = \hat{\boldsymbol{\beta}}_{3, \tilde{\mathbf{X}}}=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}$. We can now demonstrate the relation between $\boldsymbol{\epsilon}_1$, $\boldsymbol{\epsilon}_2$, and $\boldsymbol{\epsilon}_3$.

We directly calculate $\boldsymbol{\epsilon}_1$ and $\boldsymbol{\epsilon}_2$ as follows.

$$\begin{align}\boldsymbol{\epsilon}_1 &= \mathbf{y} - \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_1 \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y}\end{align}$$

$$\begin{align}\boldsymbol{\epsilon}_2 &= \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} - \mathbf{Z} \hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} \\ &= \mathbf{y} -\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{Z}\left\{-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\right\} \\ &= \mathbf{y} -\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} + \mathbf{S}\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y} \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y} \end{align}$$

Indeed, $\boldsymbol{\epsilon}_2 = \boldsymbol{\epsilon}_1 -\mathbf{S}\mathbf{y}$.

To calculate $\boldsymbol{\epsilon}_3$, we first examine the term $\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$. Remember that we substitute $\mathbf{X}$ in $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$ with $\tilde{\mathbf{X}}$ to get $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$. So,

$\begin{align}\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} &= -\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\tilde{\mathbf{X}}(\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\ &= -\mathbf{S}\tilde{\mathbf{X}}(\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+\mathbf{S}\mathbf{y}\end{align}$

But $\mathbf{S}\tilde{\mathbf{X}} = \mathbf{S}(\mathbf{I}-\mathbf{S})\mathbf{X} = 0$ by the fifth property of $\mathbf{S}$. Therefore, $\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} = \mathbf{S}\mathbf{y}$, and

$$\begin{align}\boldsymbol{\epsilon}_3 &= \mathbf{y} - \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} - \mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y}\end{align}$$

Thus, $\boldsymbol{\epsilon}_3=\boldsymbol{\epsilon}_2=\boldsymbol{\epsilon}_1-\mathbf{S}\mathbf{y}$.

  • Your post might be a little easier to follow if you formulated the problem using standard notation such as $y=X\beta+\varepsilon$ instead of $r=Fb+\epsilon$ and $x_t$ instead of $x(t)$. (If not this time, then at least in the future.) – Richard Hardy Jun 12 '22 at 16:05
  • Something here doesn't look quite right. If $F$ and $B$ are uncorrelated you should get $N \approx 0$, but in the second approach you should still get non-zero coefficients because you are regressing on $F$. So the results of (1) and (2) can't be the same – J. Delaney Jun 12 '22 at 17:27
  • I'll rewrite my question to make it clear. – whoknowsnot Jun 13 '22 at 01:04
  • Your approach seems closely related to the FWL theorem, see e.g. https://bookdown.org/ts_robinson1994/10_fundamental_theorems_for_econometrics/frisch.html Maybe it helps to read in that direction? – Christoph Hanck Jun 13 '22 at 10:03
  • @ChristophHanck Thank you for pointing me to this theorem! I do have a question regarding the proof of this theorem presented in the link. Although it shows that $M_1y=M_2X_2\hat{\beta}_2+\hat{\epsilon}$, it hasn't (at least explicitly) proved that $\hat{\beta}_2$ minimizes $|M_1y-M_2X_2\beta_2|^2$? I understand that by proving this $\hat{\epsilon}$ is the same as that in $y=X_1\hat{\beta}_1+X_2\hat{\beta}_2+\hat{\epsilon}$, we can equivalently show that $\hat{\beta}_2$ minimizes the norm squared of the residual of the latter, but how do we deal with the vanished $\hat{\beta}_1$? – whoknowsnot Jun 14 '22 at 01:33
  • @ChristophHanck As a clarification to my question above, I meant we can equivalently try to show that $\hat{\beta}_2$ minimizes the norm squared, but I'm not sure how to do that without getting bogged down in the explicit matrix expressions that reduce the elegance of the proof. – whoknowsnot Jun 14 '22 at 01:59
  • https://stats.stackexchange.com/a/113207/919 gives a purely geometric answer. https://stats.stackexchange.com/a/46508/919 gives a more statistical account, and https://stats.stackexchange.com/a/444058/919 is an algebraic one. – whuber Jun 14 '22 at 15:52
  • It should be $M_1$, no? Indeed, to my knowledge, some explicit matrix expressions are necessary to show that $\hat\beta_2$ is the same in both regressions. – Christoph Hanck Jun 15 '22 at 06:48
  • @ChristophHanck Yes, it was a typo. Thanks for clarifying the proof as well! – whoknowsnot Jun 15 '22 at 06:59

3 Answers

5

@Sextus Empiricus gives a geometric perspective, while @J. Delaney gives a clever reparameterization. Let me provide an alternative algebraic perspective that appeals to the first order condition of OLS, but avoids tedious matrix computations. Throughout, I will let $\mathbf Y \in \mathbb R^{n\times 1}$ be the column vector of outcomes and $\mathbf X \in \mathbb R^{n\times k}$ be the design matrix such that each row is the values of $\mathbf x$ for a single observation. Let $\beta \in \mathbb R^{k\times 1}$ be the vector of parameters. Also, throughout, I will adopt the econometrician's convention of taking $'$ to mean transpose.

Recall that the OLS estimator is obtained by solving the following minimization problem: $$\hat\beta^{OLS} = \mathrm{argmin}_\beta\, (\mathbf Y - \mathbf X\beta)'(\mathbf Y-\mathbf X\beta)$$ Since this is a strictly convex objective in $\beta$, it suffices to take a first order condition (FOC). Differentiating with respect to $\beta$, the FOC can be rearranged to state that $$\mathbf X'(\mathbf Y - \mathbf X\beta) = 0$$

Consider now the partition $\mathbf X = [\mathbf X_1\ \mathbf X_2]$ where $\mathbf X_1\in\mathbb R^{n\times k_1}$, $\mathbf X_2\in\mathbb R^{n\times k_2}$ with $k_1 + k_2 = k$. Similarly, partition $\beta = [\beta_1'\,\beta_2']'$, so that $\mathbf X\beta = \mathbf X_1\beta_1 + \mathbf X_2\beta_2$. In that case, we can split the OLS FOC into the two sets of FOCs: $$\mathbf X_1'(\mathbf Y - \mathbf X_1\beta_1 - \mathbf X_2\beta_2) = 0$$ $$\mathbf X_2'(\mathbf Y - \mathbf X_1\beta_1 - \mathbf X_2\beta_2) = 0$$

Let us now rearrange the second set of FOCs to solve for $\beta_2$ in terms of everything else: $$\hat\beta_2^{OLS} = (\mathbf X_2'\mathbf X_2)^{-1}\mathbf X_2'(\mathbf Y - \mathbf X_1\beta_1)$$ We can now plug this expression for $\beta_2$ back into the first set of FOCs to obtain $$0 = \mathbf X_1'[\mathbf Y - \mathbf X_1\beta_1 -\mathbf X_2(\mathbf X_2'\mathbf X_2)^{-1}\mathbf X_2'(\mathbf Y-\mathbf X_1\beta_1)]$$ Grouping together the terms involving $\mathbf Y$ and the terms involving $\mathbf X_1\beta_1$, we get $$0 = \mathbf X_1'[(\mathbf I - \mathbf X_2(\mathbf X_2'\mathbf X_2)^{-1}\mathbf X_2') \mathbf Y - (\mathbf I - \mathbf X_2(\mathbf X_2'\mathbf X_2)^{-1}\mathbf X_2')\mathbf X_1\beta_1]$$

But now, note that $\mathbf M_2 = \mathbf I - \mathbf X_2(\mathbf X_2'\mathbf X_2)^{-1}\mathbf X_2'$ is the so-called annihilator matrix for $\mathbf X_2$, in the sense that for any vector $\mathbf A\in\mathbb R^{n\times 1}$, $\mathbf M_2 \mathbf A$ can be interpreted as the residuals from the OLS regression of $\mathbf A$ on $\mathbf X_2$. We can solve the FOC above to obtain: $$\hat\beta_1^{OLS} = (\mathbf X_1'\mathbf M_2\mathbf X_1)^{-1}(\mathbf X_1' \mathbf M_2 \mathbf Y)$$ Noting that $\mathbf M_2$ is symmetric ($\mathbf M_2'=\mathbf M_2$) and idempotent ($\mathbf M_2 \mathbf M_2 = \mathbf M_2$), this final expression can, given all of the above, be reinterpreted as the regression of the $\mathbf Y$ residuals on the $\mathbf X_1$ residuals, as stated in your question.
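
For illustration, here is a minimal numerical sketch of this conclusion with simulated `X1`, `X2`, and `Y`: the FWL-style expression matches the $\mathbf X_1$ block of the coefficients from the full regression.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k1, k2 = 60, 3, 2
X1 = rng.normal(size=(n, k1))
X2 = rng.normal(size=(n, k2))
Y = rng.normal(size=n)

# Full OLS of Y on [X1 X2]; keep the X1 block of the coefficient vector.
beta_full = np.linalg.lstsq(np.hstack([X1, X2]), Y, rcond=None)[0][:k1]

# FWL expression: (X1' M2 X1)^{-1} X1' M2 Y, with M2 the annihilator of X2.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
beta_fwl = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ Y)

print(np.allclose(beta_full, beta_fwl))  # True
```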

stats_model
  • Thank you! I really appreciate that your method avoids the block inversion of $[\mathbf{X}_1\mathbf{X}_2]'[\mathbf{X}_1\mathbf{X}_2]$, which makes mine so tedious. And the reinterpretation in the end also allows a nice linkage to the statement of the FWL theorem suggested by @ChristophHanck. – whoknowsnot Jun 14 '22 at 01:54
3

When you have a linear regression of the form

$$ y = X\beta_1 + Z\beta_2 + \varepsilon$$

and $X^TZ = 0$, then it is equivalent to independently regressing $y$ on $X$ and $Z$. This is easy to see because the matrix

$$([XZ]^T[XZ])^{-1} = \begin{pmatrix} X^TX & 0 \\ 0 & Z^TZ\end{pmatrix}^{-1} = \begin{pmatrix} (X^TX)^{-1} & 0 \\ 0 & (Z^TZ)^{-1}\end{pmatrix}$$

is block diagonal, and its inverse is therefore also block diagonal, so we get $\beta_1 = (X^T X)^{-1} X^Ty$ and $\beta_2 = (Z^T Z)^{-1} Z^Ty$.
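
Here is a minimal numerical sketch of this block-diagonal argument; `X` is explicitly orthogonalized against `Z` so that $X^TZ = 0$ holds (up to rounding):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
Z = rng.normal(size=(n, 2))
X = rng.normal(size=(n, 3))
X = X - Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)  # force X^T Z = 0
y = rng.normal(size=n)

joint = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)[0]  # y on [X Z]
beta1 = np.linalg.lstsq(X, y, rcond=None)[0]                  # y on X alone
beta2 = np.linalg.lstsq(Z, y, rcond=None)[0]                  # y on Z alone
print(np.allclose(joint, np.concatenate([beta1, beta2])))     # True
```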

Now notice that by your construction $Z^T \tilde X = 0$

($Z^T\tilde X = Z^T(X - Z\hat\Gamma) = Z^T X - Z^T Z (Z^T Z)^{-1}Z^TX = Z^T X - Z^T X = 0$)

so the coefficient $\beta_1$ of the regression $y = \tilde X \beta_1$ is the same as in the regression $y = \tilde X \beta_1 + Z\beta_2$, which explains the equivalence of (1) and (3).

In case (2) we can re-parametrize the second coefficient: $y = X\beta_1 + Z\beta_2 = (X-Z\hat\Gamma)\beta_1 + Z(\beta_2 +\hat\Gamma \beta_1) \equiv \tilde X\beta_1 + Z\beta_3$, which brings this to a similar form, so $\beta_1$ here is also equivalent to the two other cases.
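
A minimal numerical sketch of this reparametrization with simulated data (`Gamma` plays the role of $\hat\Gamma$):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
X = rng.normal(size=(n, 3))
Z = rng.normal(size=(n, 2))
y = rng.normal(size=n)

Gamma = np.linalg.solve(Z.T @ Z, Z.T @ X)  # coefficients from regressing X on Z
X_tilde = X - Z @ Gamma

b_case2 = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)[0]        # (beta_1, beta_2)
b_case3 = np.linalg.lstsq(np.hstack([X_tilde, Z]), y, rcond=None)[0]  # (beta_1, beta_3)

print(np.allclose(b_case2[:3], b_case3[:3]))                        # same beta_1
print(np.allclose(b_case3[3:], b_case2[3:] + Gamma @ b_case2[:3]))  # beta_3 = beta_2 + Gamma beta_1
```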

J. Delaney
  • 5,380
  • Thank you! The reparametrization is a good perspective, but it seems that it doesn't explicitly prove that $[\beta_1' \beta_2']'$ and $[\beta_1' \beta_3']'$ simultaneously minimize their respective squared residuals? In other words, how are we sure that if $[\beta_1' \beta_2']'$ satisfies the OLS condition, then $[\beta_1' \beta_3']'$ also satisfies it? – whoknowsnot Jun 14 '22 at 02:27
  • @whoknows I've edited the answer to show the explicit inversion of the block-diagonal matrix. Using the standard OLS formula you can solve for $\beta_1$ and $\beta_2$. Is this the part you are asking about ? – J. Delaney Jun 14 '22 at 07:14
  • Thanks, but I was asking about the last paragraph, where you reparametrized $y = X\beta_1 + Z\beta_2$ (case 2, so $X^{\intercal}Z \neq 0$) to get $\tilde{X} \beta_1 + Z \beta_3$. I get that these two are equal, but taking a step back, I'm thinking about how I'd get the coefficients if they were treated as two separate regressions: one is $y$ on $[XZ]$, and the other is $y$ on $[\tilde{X}Z]$--I'd solve for the vector that minimizes the residual squared, respectively... – whoknowsnot Jun 14 '22 at 07:38
  • ... But your parametrization implies that once I've done the first regression and got $\beta_1$ and $\beta_2$, I don't need to do the second regression, but can directly calculate the coefficients as $\beta_1 = \beta_1$ and $\beta_3=\beta_2+\hat{\Gamma}\beta_1$. Why is that? – whoknowsnot Jun 14 '22 at 07:40
  • Not entirely sure I understand the question, but since there is a one-to-one transformation between $(\beta_1,\beta_2)$ and $(\beta_1,\beta_3)$, they are both parametrizing the same space - you can use either one to find the minimum of the objective function. You can always apply such transformations in optimization problem in order to find a more convenient parametrization – J. Delaney Jun 14 '22 at 12:11
  • Thanks for the explanation. It seems that I had some misunderstanding of the word "parametrization". Anyways, I'm not very familiar with the advanced topics of linear algebra; what does "parametrizing the same space" mean? I'm still a bit confused because $X$ and $\tilde{X}$ are right multiplied by the same vector $\beta_1$, but they are different matrices, so what does "the same space" refer to? – whoknowsnot Jun 14 '22 at 13:03
  • What also confuses me is this following case. $y=X\beta_1+Z\beta_2=(X-X)\beta_1+Z(\beta_2+A\beta_1)$, where $ZA=X$. But the coefficient of the regression of $y$ on $Z$ alone of course cannot depend on $X$, whereas $\beta_2+A\beta_1$ does. – whoknowsnot Jun 14 '22 at 13:27
  • In general there will not be a solution for $X=ZA$ (if $N \gt L$), but if there is, then the problem will indeed be underdetermined - just like in the case $y=Z\beta_1 + Z\beta_2$: you can only determine the combination $\beta_1+\beta_2$ but not the individual $\beta$'s. – J. Delaney Jun 14 '22 at 13:51
  • $(\beta_1,\beta_2)$ define a space of possible coefficients. We are searching for a point in this space that minimizes the objective function. There is a one-to-one mapping between every point in this space to a point in $(\beta_1,\beta_3)$, so you can think of it as just different labeling of points in the same space. This is what "parametrizing the same space" means. (just like you can describe points in $\mathbb R^2$ using $(x,y)$ or $(r,\theta)$, for example) – J. Delaney Jun 14 '22 at 14:05
  • Thanks for the clarification. But aren't the objective function for $(\beta_1,\beta_2)$ and the objective function for $(\beta_1,\beta_3)$ different, since $[XZ]$ is different from $[\tilde{X}Z]$? – whoknowsnot Jun 14 '22 at 14:27
  • The objective function is the same: $|y - (X\beta_1+Z\beta_2)|^2 = |y - (\tilde X\beta_1+Z\beta_3)|^2$ – J. Delaney Jun 14 '22 at 17:25
2

Below is a geometric viewpoint similar to an answer to a different question: Intuition behind $(X^TX)^{-1}$ in closed form of w in Linear Regression

[Image from the linked question: $Y$ projected onto the span of the columns of $X$ and $Z$, with the alternative vector $\bar{X}$ drawn in red on the right side.]

The regression is a perpendicular projection onto the vectors in the columns of $X$ and $Z$. What you are basically doing is defining a different vector $\bar{X}$ such that the coordinates associated with the projection remain the same.

This alternative vector is drawn in red on the right side of the image.

The vector $\bar{X}$ is perpendicular to $Z$ and that is why all those coefficients $\beta$ turn out to be the same.

If $Z$ and $\bar{X}$ are perpendicular, then (see also the numerical sketch after this list):

  • The regression $$Y \sim \beta_1 \bar{X} + \beta_2 Z$$ and $$Y \sim \beta_1^\prime \bar{X} $$ will be the same in the sense $\beta_1 = \beta_1^\prime$

  • The regression $$Y \sim \beta_1 \bar{X} + \beta_2 Z$$ and $$Y \sim \beta_1^{\prime\prime} (\bar{X} + a Z) + \beta_2 Z$$ with $a$ some constant, will be the same in the sense $\beta_1 = \beta_1^{\prime\prime}$.

    Note that we can write $X = \bar{X} + a Z$. The difference between $X$ and $\bar{X}$ is some multiple of $Z$.
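
Here is the minimal numerical sketch referred to above (simulated one-column $X$ and $Z$, with `a` an arbitrary constant): the coefficient on $\bar{X}$ is the same whether we regress $Y$ on $\bar{X}$ alone, on $\bar{X}$ and $Z$, or on $\bar{X} + aZ$ and $Z$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
Z = rng.normal(size=(n, 1))
X = rng.normal(size=(n, 1))
Xbar = X - Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # component of X perpendicular to Z
Y = rng.normal(size=n)
a = 0.7                                              # some constant

b_joint = np.linalg.lstsq(np.hstack([Xbar, Z]), Y, rcond=None)[0]          # Y ~ Xbar + Z
b_alone = np.linalg.lstsq(Xbar, Y, rcond=None)[0]                          # Y ~ Xbar
b_shift = np.linalg.lstsq(np.hstack([Xbar + a * Z, Z]), Y, rcond=None)[0]  # Y ~ (Xbar + aZ) + Z

print(np.isclose(b_joint[0], b_alone[0]), np.isclose(b_joint[0], b_shift[0]))  # True True
```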

  • This is a great perspective! I had almost forgotten how regression can be viewed as a perpendicular projection. I like the illustration, especially that from the geometric relation between the red solid arrow and the red dotted arrow, we can intuitively see how the geometric argument for the $\beta$ of $\tilde{X}$ can be naturally extended to the $\beta$ of $\tilde{X}+aZ$. – whoknowsnot Jun 14 '22 at 16:00