
I am taking a regression course. Suppose there are two linear regression models

\begin{aligned} M_1: & y = \beta_0 + x_2\beta_2 + \epsilon \\ M_2: & y = \beta_0 + x_1\beta_1 + x_2\beta_2 + \epsilon \end{aligned}

We want to know whether we should add $X_1$ to the model (i.e., choose between $M_1$ and $M_2$). To this end, we use the statistic $R_{Y1|2}^2$, the coefficient of partial determination. It represents the proportion of the variation left unexplained by $X_2$ that can be explained by adding $X_1$ to a model already containing $X_2$, and it can be defined in two ways.

$$ R_{Y1|2}^2 = \frac{SSE(X_2) - SSE(X_1, X_2)}{SSE(X_2)} $$

$$ R_{Y1|2}^2 = \frac{(r_{Y,X_1} - r_{Y,X_2}r_{X_1,X_2})^2}{(1-r_{X_1,X_2}^2)(1-r_{Y,X_2}^2)} $$
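A quick numerical sketch (my own, using simulated data and plain `numpy` least squares; all variable names are mine) confirms the two definitions give the same value:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)        # correlated predictors
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def sse(y, *cols):
    """Residual sum of squares from OLS of y on an intercept plus cols."""
    X = np.column_stack((np.ones_like(y),) + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

# Definition 1: extra sum of squares relative to SSE(X2)
r2_def1 = (sse(y, x2) - sse(y, x1, x2)) / sse(y, x2)

# Definition 2: in terms of pairwise Pearson correlations
r_y1 = np.corrcoef(y, x1)[0, 1]
r_y2 = np.corrcoef(y, x2)[0, 1]
r_12 = np.corrcoef(x1, x2)[0, 1]
r2_def2 = (r_y1 - r_y2 * r_12) ** 2 / ((1 - r_12 ** 2) * (1 - r_y2 ** 2))

assert np.isclose(r2_def1, r2_def2)       # the two definitions agree
```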

I was just wondering why they are equivalent. To prove it, I tried to use some linear algebra.

$$ \frac{SSE(X_2) - SSE(X_1, X_2)}{SSE(X_2)} = \frac{Y^\top(H_{12} - H_{2})Y}{Y^\top(I - H_{2})Y} $$ and

$$ \frac{(r_{Y,X_1} - r_{Y,X_2}r_{X_1,X_2})^2}{(1-r_{X_1,X_2}^2)(1-r_{Y,X_2}^2)} = \frac{Y^\top(I - H_{2})X_1}{Y^\top(I - H_{2})Y X_1^\top(I - H_{2})X_1} $$

where $H_2$ is the hat matrix of $X = [1, X_2]$ and $H_{12}$ is the hat matrix of $X' = [1, X_1, X_2]$. However, I am stuck here. If anyone can help me out, I would be very grateful.
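For what it is worth, the first matrix identity above can be checked numerically (a sketch with explicit hat matrices; names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1, x2 = rng.normal(size=(2, n))
y = 1 + x1 + x2 + rng.normal(size=n)
one = np.ones(n)

def hat(X):
    """Projection (hat) matrix X (X'X)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def sse(X):
    """Residual sum of squares of OLS of y on X, computed via lstsq."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

X2 = np.column_stack([one, x2])
X12 = np.column_stack([one, x1, x2])
H2, H12 = hat(X2), hat(X12)
I = np.eye(n)

ratio_sse = (sse(X2) - sse(X12)) / sse(X2)           # SSE definition
ratio_hat = y @ (H12 - H2) @ y / (y @ (I - H2) @ y)  # quadratic-form version
assert np.isclose(ratio_sse, ratio_hat)
```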

1 Answer


This is a problem much harder than it looks (+1).

The expression $\frac{(r_{Y,X_1} - r_{Y,X_2}r_{X_1,X_2})^2}{(1-r_{X_1,X_2}^2)(1-r_{Y,X_2}^2)} = \frac{Y^\top(I - H_{2})X_1}{Y^\top(I - H_{2})Y X_1^\top(I - H_{2})X_1}$ is off; the correct expansion is much more verbose than that. Let me first correct it for you. Using the fact that in simple linear regression the sample correlation coefficient equals the Pearson correlation between $y$ and $x$, we have (to save some typing, from now on I will write $Y, X_1, X_2$ as $y, x_1, x_2$, and denote the transpose $\top$ by $'$):
\begin{align} & r_{y, x_1} = \frac{y'Cx_1}{\sqrt{y'Cy \cdot x_1'Cx_1}}, \\ & r_{y, x_2} = \frac{y'Cx_2}{\sqrt{y'Cy \cdot x_2'Cx_2}}, \\ & r_{x_1, x_2} = \frac{x_1'Cx_2}{\sqrt{x_1'Cx_1 \cdot x_2'Cx_2}}, \end{align} where $C = I - q_1q_1'$ and $q_1 = \mathbf{1}/\sqrt{n}$ (by convention, $\mathbf{1}$ is the $n$-long column vector of all ones). Plugging these into $\frac{(r_{y,x_1} - r_{y,x_2}r_{x_1,x_2})^2}{(1-r_{x_1,x_2}^2)(1-r_{y,x_2}^2)}$ and doing some tedious elementary algebra gives \begin{align} & \frac{(r_{y,x_1} - r_{y,x_2}r_{x_1,x_2})^2}{(1-r_{x_1,x_2}^2)(1-r_{y,x_2}^2)} \\ =& \frac{(y'Cx_1 \cdot x_2'Cx_2 - y'Cx_2 \cdot x_1'Cx_2)^2}{(x_1'Cx_1 \cdot x_2'Cx_2 - (x_1'Cx_2)^2)(y'Cy \cdot x_2'Cx_2 - (y'Cx_2)^2)}. \tag{1} \end{align}

To establish the goal equality, consider the QR decomposition of the raw design matrix $X = \begin{bmatrix} \mathbf{1} & x_2 & x_1\end{bmatrix}$. Suppose $X = QR$, where the columns of $Q = \begin{bmatrix} q_1 & q_2 & q_3\end{bmatrix}$ are the orthonormalized versions of $\mathbf{1}, x_2$ and $x_1$ obtained by the Gram-Schmidt procedure. More explicitly, there exist real numbers $a, b, c, d, e$ such that \begin{align} & \mathbf{1} = \sqrt{n}q_1, \\ & x_2 = aq_1 + bq_2, \\ & x_1 = cq_1 + dq_2 + eq_3. \tag{2} \end{align}

In terms of $q_1, q_2, q_3$, $\frac{SSE(x_2) - SSE(x_1, x_2)}{SSE(x_2)}$ can be expressed as follows. Note that the hat matrix satisfies $H = X(X'X)^{-1}X' = QQ' = q_1q_1' + q_2q_2' + q_3q_3'$; moreover, writing $\tilde{X} = \begin{bmatrix} \mathbf{1} & x_2 \end{bmatrix}$ and $\tilde{H} = \tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}'$, it is easy to verify that $\tilde{H} = q_1q_1' + q_2q_2'$. Hence \begin{align} \frac{SSE(x_2) - SSE(x_1, x_2)}{SSE(x_2)} = \frac{y'(H - \tilde{H})y}{y'(I - \tilde{H})y} = \frac{(y'q_3)^2}{y'(I - q_1q_1' - q_2q_2')y}. \tag{3} \end{align}
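The QR structure above is easy to verify numerically (a sketch of my own; note that `np.linalg.qr` may flip the signs of the $q_i$ relative to Gram-Schmidt, which affects neither the projections $q_iq_i'$ nor the squared term $(y'q_3)^2$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1, x2 = rng.normal(size=(2, n))
y = rng.normal(size=n)
one = np.ones(n)

# QR of the design matrix with columns ordered [1, x2, x1], as in the answer
X = np.column_stack([one, x2, x1])
Q, R = np.linalg.qr(X)            # reduced QR: Q is n x 3
q1, q2, q3 = Q.T

# Hat matrix equals QQ' = q1q1' + q2q2' + q3q3'
H = X @ np.linalg.solve(X.T @ X, X.T)
assert np.allclose(H, Q @ Q.T)

# Reduced-model hat matrix equals q1q1' + q2q2'
Xt = np.column_stack([one, x2])
Ht = Xt @ np.linalg.solve(Xt.T @ Xt, Xt.T)
assert np.allclose(Ht, np.outer(q1, q1) + np.outer(q2, q2))

# Equality (3): both sides of the SSE ratio in terms of q3
lhs = y @ (H - Ht) @ y / (y @ (np.eye(n) - Ht) @ y)
rhs = (y @ q3) ** 2 / (y @ (np.eye(n) - np.outer(q1, q1) - np.outer(q2, q2)) @ y)
assert np.isclose(lhs, rhs)
```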

To simplify $(1)$, replace $x_1$ and $x_2$ in $(1)$ with their expressions in terms of $q_1, q_2, q_3$ from $(2)$ and use the orthonormality of $\{q_1, q_2, q_3\}$. It can be shown that (the actual computation is quite lengthy, yet not difficult, so I omit the details here): \begin{align} & (y'Cx_1 \cdot x_2'Cx_2 - y'Cx_2 \cdot x_1'Cx_2)^2 = b^4e^2(y'q_3)^2, \\ & (x_1'Cx_1 \cdot x_2'Cx_2 - (x_1'Cx_2)^2)(y'Cy \cdot x_2'Cx_2 - (y'Cx_2)^2) = b^4e^2\,y'(I - q_1q_1' - q_2q_2')y. \end{align} The factor $b^4e^2$ cancels when forming the ratio, so $(1)$ reduces exactly to the right-hand side of $(3)$, and the goal equality holds. This completes the proof.
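The two omitted identities can also be spot-checked numerically (a sketch; I read $b$ and $e$ off the diagonal of the QR factor $R$, since $x_2 = aq_1 + bq_2$ and $x_1 = cq_1 + dq_2 + eq_3$ imply $b = R_{22}$ and $e = R_{33}$; numpy's sign conventions may differ from Gram-Schmidt, but only $b^4e^2$ enters):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x1, x2 = rng.normal(size=(2, n))
y = rng.normal(size=n)
one = np.ones(n)

X = np.column_stack([one, x2, x1])       # columns ordered [1, x2, x1]
Q, R = np.linalg.qr(X)
q1, q2, q3 = Q.T
b, e = R[1, 1], R[2, 2]                  # coefficients from (2)

C = np.eye(n) - np.outer(q1, q1)         # centering matrix I - q1 q1'

# First identity: the squared numerator of (1)
lhs1 = ((y @ C @ x1) * (x2 @ C @ x2) - (y @ C @ x2) * (x1 @ C @ x2)) ** 2
rhs1 = b**4 * e**2 * (y @ q3) ** 2
assert np.isclose(lhs1, rhs1)

# Second identity: the denominator of (1)
lhs2 = ((x1 @ C @ x1) * (x2 @ C @ x2) - (x1 @ C @ x2) ** 2) \
     * ((y @ C @ y) * (x2 @ C @ x2) - (y @ C @ x2) ** 2)
rhs2 = b**4 * e**2 * (y @ (np.eye(n) - np.outer(q1, q1) - np.outer(q2, q2)) @ y)
assert np.isclose(lhs2, rhs2)
```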

Zhanxiong