
In Elements of Statistical Learning, equation 3.11, the distribution of the (scaled) sum of squared prediction errors is approximated by the actual error variance times a chi-squared distribution with $N-p-1$ degrees of freedom. Can you please explain how this approximation comes about?

$$(N-p-1)\hat\sigma^2\sim\sigma^2\chi_{N-p-1}^2,\qquad (3.11)$$

$$\hat\sigma^2=\frac{1}{N-p-1}\sum_{i=1}^N(y_i-\hat{y}_i)^2.$$
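To make the claim concrete, here is a minimal simulation sketch comparing the empirical distribution of $(N-p-1)\hat\sigma^2/\sigma^2$ with $\chi^2_{N-p-1}$; the design matrix, coefficients, noise level, and seed are arbitrary choices of mine, not anything from the book:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, p, sigma = 50, 3, 2.0                       # arbitrary sample size, #features, error sd
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p features
beta = rng.normal(size=p + 1)                  # arbitrary true coefficients

draws = []
for _ in range(20_000):
    eps = rng.normal(scale=sigma, size=N)
    y = X @ beta + eps
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)      # this equals (N-p-1) * sigma_hat^2
    draws.append(rss / sigma**2)

# If (3.11) holds, draws should follow chi^2 with N-p-1 = 46 degrees of freedom
print(np.mean(draws))                          # should be close to 46
print(stats.kstest(draws, stats.chi2(N - p - 1).cdf).pvalue)  # should not be tiny
```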

utobi
    The "$\sim$" in (3.11) does not mean "approximated" or "approximation", it should be read as "distributed as". – Zhanxiong Nov 08 '23 at 17:26

2 Answers

1

We first derive a nice formulation of $\sum_{i=1}^N(y_i-\hat{y}_i)^2$:

\begin{align*} \sum_{i=1}^N(y_i-\hat{y}_i)^2 &=\|y-X\hat{\beta}\|^2\\ &=\|X\beta+\epsilon-X(X^TX)^{-1}X^T(X\beta+\epsilon)\|^2\\ &=\|(I_N-H)\epsilon\|^2\\ &=\epsilon^T(I_N-H)^T(I_N-H)\epsilon\\ &=\epsilon^T(I_N-H)^2\epsilon\\ &=\epsilon^T(I_N-H)\epsilon \end{align*} Here the first equality is the definition of $\hat{y}_i$; the second uses the model $y=X\beta+\epsilon$ and $\hat{\beta}=(X^TX)^{-1}X^Ty$; the third holds because, writing $H=X(X^TX)^{-1}X^T$, we have $HX=X$, so $X\beta+\epsilon-H(X\beta+\epsilon)=(I_N-H)\epsilon$; the fourth is the definition of $\|\bullet\|^2$; the fifth uses $H^T=H$; and the sixth uses the fact that $I_N-H$ is the orthogonal projection onto the orthogonal complement of the column space of $X$, and projecting twice is the same as projecting once, so $(I_N-H)^2=I_N-H$.
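As a numerical sanity check on this identity (a minimal sketch with an arbitrary design matrix and noise vector, not part of the original derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 4                                   # arbitrary sizes
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)
eps = rng.normal(size=N)
y = X @ beta + eps

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix X(X^T X)^{-1} X^T
rss = np.sum((y - H @ y) ** 2)                 # sum_i (y_i - y_hat_i)^2
quad = eps @ (np.eye(N) - H) @ eps             # epsilon^T (I_N - H) epsilon

print(np.isclose(rss, quad))                   # True: the two expressions agree
```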

With this nice formulation in hand, we have $(N-p-1)\hat{\sigma}^2=\epsilon^T(I_N-H)\epsilon$, where $H$ is the projection matrix onto the column space of $X$. So $I_N-H$ is the projection onto the orthogonal complement of the column space of $X$, which has dimension $N-p-1$ (assuming $X$ has full column rank $p+1$). Let $v_1,\dots,v_{p+1}$ be an orthonormal basis of the column space of $X$ and $w_1,\dots,w_{N-p-1}$ an orthonormal basis of its orthogonal complement; then $$\epsilon^T(I_N-H)\epsilon=\sum_{i=1}^{N-p-1}(w_i^T\epsilon)^2.$$ Since $\epsilon\sim N(0,\sigma^2 I_N)$ and the $w_i$ are orthonormal, the $w_i^T\epsilon$ are independent $N(0,\sigma^2)$ variables, so this is $\sigma^2$ times a sum of squares of $N-p-1$ independent standard normals, i.e. $\sigma^2\chi^2_{N-p-1}$, exactly as in (3.11).
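This decomposition can also be checked numerically; the sketch below uses `scipy.linalg.null_space(X.T)`, which returns an orthonormal basis of the orthogonal complement of the column space of $X$ (the $w_i$ above), with an arbitrary design matrix of my choosing:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
N, p = 30, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
eps = rng.normal(size=N)

H = X @ np.linalg.solve(X.T @ X, X.T)
W = null_space(X.T)                  # columns w_1..w_{N-p-1} span col(X)^perp
print(W.shape)                       # (30, 25), i.e. dimension N - p - 1

quad = eps @ (np.eye(N) - H) @ eps
coords = W.T @ eps                   # the coordinates w_i^T epsilon
print(np.isclose(quad, np.sum(coords**2)))   # True
```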

kid111
-2

I think I have understood how this approximation works.

The change in degrees of freedom from $n$ to $n-p-1$ comes from Cochran's theorem, explained here, where $n$ is the number of samples and $p+1$ is the number of features.

[Image: derivation applying Cochran's theorem, in which the chi-squared degrees of freedom change from $n$ to $n-p-1$.]
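For what it's worth, one concrete way to see where $n-p-1$ enters, in the spirit of Cochran's theorem: the degrees of freedom of the quadratic form $\epsilon^T(I_n-H)\epsilon$ equal the rank of the idempotent matrix $I_n-H$, and for an idempotent matrix rank equals trace, with $\operatorname{tr}(I_n-H)=n-\operatorname{tr}(H)=n-(p+1)$. A minimal numerical sketch with an arbitrary design matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 30, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])

H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(N) - H                    # residual-maker matrix, a projection

print(np.allclose(M @ M, M))                          # True: M is idempotent
print(np.linalg.matrix_rank(M), round(np.trace(M)))   # both equal N - p - 1 = 25
```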

  • The value $n-p-1$ makes a magical appearance when the DF for the chi-squared distribution suddenly changes from $n$ to $n-p-1$. I don't think anybody would find this enlightening or convincing without further explanation. – whuber Feb 02 '18 at 19:18
  • The DF changed from $n$ to $n-p-1$ because we are using the sample mean instead of the population mean. Here $n$ is the number of samples and $p+1$ the number of features; if we have only one feature, then DF $=n-1$ for the approximation. This approximation is from Cochran's theorem and is better explained here – Paras Malik Feb 03 '18 at 08:04
  • Understood: but the DF doesn't magically change from one equation to the next. When you changed "$\chi^2_n$" to "$\chi^2_{n-p-1}$" you failed to change "$\sigma$" in the denominator of the left-hand side to "$\hat \sigma$". – whuber Feb 03 '18 at 13:15
  • It should be $\sigma$ only; this is explained here and here. – Paras Malik Feb 04 '18 at 03:12