
Excerpt from "Elements of Statistical Learning", p.47

Assume that the conditional expectation of $Y$ is linear in $X_1, \ldots, X_p$. Also assume that the deviations of $Y$ around its expectation are additive and Gaussian. Hence $$Y = E(Y \mid X_1, \ldots, X_p) + \varepsilon = \beta_0 + \sum_{j = 1}^p X_j \beta_j + \varepsilon,$$ where the error $\varepsilon \sim N(0,\sigma^2)$ is a Gaussian random variable.

It is then easy to show that $\hat \beta \sim N(\beta, (X^t X)^{-1} \sigma^2)$ (1) and that $(N - p - 1)\hat \sigma^2 \sim \sigma^2 \chi^2_{N-p-1}$ (2).

Earlier they also assume that the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and the $x_i$ are fixed.
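In matrix notation, writing $X$ for the $N \times (p + 1)$ matrix whose $i$-th row is $(1, x_{i1}, \ldots, x_{ip})$ and $y$ for the vector of responses, this model is $$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_N),$$ and the estimates whose distributions (1) and (2) describe are the least-squares and unbiased variance estimates $$\hat \beta = (X^t X)^{-1} X^t y, \qquad \hat \sigma^2 = \frac{1}{N - p - 1} \sum_{i = 1}^N (y_i - \hat y_i)^2.$$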


Question

This part of ESL covers basics that I'm trying to refresh, since it has been a long time since I studied this material.

For (1) I can show that $\mathrm{Cov}(\hat \beta) = (X^t X)^{-1} \sigma^2$ and of course $E(\hat \beta) = \beta$, but don't I need to know that $Y$ is normally distributed in order to conclude that $\hat \beta \sim N(E(\hat \beta), \mathrm{Cov}(\hat \beta))$?

For (2) I know that the hat matrix $X(X^t X)^{-1}X^t$ is idempotent, has rank $p + 1$, and that $X^t \hat\varepsilon = 0$, where $\hat\varepsilon$ is the residual vector. How can I finish (2)?

Lejoon
  • More rigorously, it means $\hat{\beta} | X \sim N(\beta, (X^TX)^{-1}\sigma^2)$ and $(N - p - 1)\hat{\sigma}^2 | X \sim \sigma^2\chi_{N - p - 1}^2$. That is, these are conditional distributions instead of marginal distributions. – Zhanxiong Jan 08 '23 at 20:32
  • That makes sense. Here we can assume that the $x_i$ are fixed/deterministic. – Lejoon Jan 08 '23 at 20:34
  • Note that the book also mentions that "$x_i$ are fixed (non random)". In this case, the conditional distributions coincide with the marginal distributions. If you still have difficulty deriving them, I can post an answer. – Zhanxiong Jan 08 '23 at 20:34
  • I have trouble seeing why I can assume that $\hat \beta$ is normally distributed. I do know its expectation and covariance matrix.

    As for the second question I assume it has something to do with $X(X^tX)^{-1}X^t$ being a projection matrix, hence splitting $\mathbb{R}^N$ into the column space of $X$, which has dimension $p + 1$, and its orthogonal complement, which has dimension $N - (p + 1)$; however, I can't seem to hammer down the details.

    – Lejoon Jan 08 '23 at 20:37

1 Answer


The normality of $\hat{\beta}$ follows from the fact that any affine transformation of a normal random vector is still normal. Here, in matrix form, $\hat{\beta} = (X'X)^{-1}X'Y = \beta + (X'X)^{-1}X'\epsilon$ is an affine transformation of the $N$-dimensional random vector $\epsilon \sim N(0, \sigma^2 I_{(N)})$. Hence the first result.
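Spelling out the parameters of that affine transformation (with $X$ fixed): $$E(\hat{\beta}) = \beta + (X'X)^{-1}X'\,E(\epsilon) = \beta, \qquad \operatorname{Cov}(\hat{\beta}) = (X'X)^{-1}X'\,(\sigma^2 I_{(N)})\,X(X'X)^{-1} = \sigma^2(X'X)^{-1},$$ which are exactly the mean vector and covariance matrix in the stated normal distribution.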

The second one regards the distribution of the residual vector $$\hat{\epsilon} = (I_{(N)} - H)Y = (I_{(N)} - H)(X\beta + \epsilon) = (I_{(N)} - H)\epsilon,$$

where $H = X(X'X)^{-1}X'$.

Since $(N - p - 1)\hat{\sigma}^2 = \hat{\epsilon}'\hat{\epsilon} = \epsilon'(I_{(N)} - H)\epsilon$ and $\sigma^{-1}\epsilon \sim N(0, I_{(N)})$, the second result follows from the distribution of quadratic forms of multivariate normal random vectors (see, for example, Theorem 1.4.2 in Aspects of Multivariate Statistical Theory by R. Muirhead).
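In more detail, the middle equality above uses that $I_{(N)} - H$ is symmetric and idempotent: $$\hat{\epsilon}'\hat{\epsilon} = \epsilon'(I_{(N)} - H)'(I_{(N)} - H)\epsilon = \epsilon'(I_{(N)} - H)\epsilon, \qquad \text{so} \qquad \frac{(N - p - 1)\hat{\sigma}^2}{\sigma^2} = (\sigma^{-1}\epsilon)'(I_{(N)} - H)(\sigma^{-1}\epsilon).$$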


For your convenience (and also because of the importance of this theorem), I copy it here:

If $X$ is $N_m(\mu, I_m)$ and $B$ is an $m \times m$ symmetric matrix then $X'BX$ has a noncentral $\chi^2$ distribution if and only if $B$ is idempotent, in which case the degrees of freedom and the noncentrality parameter are respectively $k = \operatorname{rank}(B) = \operatorname{tr}(B)$ and $\delta = \mu'B\mu$.

The idea of the proof is that $B$ has the canonical form $B = Q\operatorname{diag}(I_{(k)}, 0)Q'$ with $Q$ orthogonal (written $Q$ here to avoid confusion with the hat matrix $H$) when $B$ is idempotent of rank $k$. The closedness of the multivariate normal distribution under affine transformations again plays a role here.
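Applied to the present case with $B = I_{(N)} - H$ and mean $\mu = 0$ (so the noncentrality parameter is zero), and assuming $X$ has full column rank $p + 1$, the degrees of freedom are $$k = \operatorname{rank}(I_{(N)} - H) = \operatorname{tr}(I_{(N)} - H) = N - \operatorname{tr}\big((X'X)^{-1}X'X\big) = N - (p + 1),$$ which gives $(N - p - 1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{N - p - 1}$, i.e., the second result.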

Zhanxiong