Why are least-squares parameters normally distributed?

Question

I am trying to figure out why the parameter

$$\hat\beta = (X^TX)^{-1}X^TY$$

is normally distributed in least-squares prediction. (Where Y is a linear function plus normal noise.) All the examples I've found have said that since

$$\begin{aligned} \hat\beta &= (X^TX)^{-1}X^TY \\ &= (X^TX)^{-1}X^T(X\beta + \varepsilon) \\ &= \beta + (X^TX)^{-1}X^T\varepsilon \end{aligned}$$

we know that

$$\hat\beta-\beta \sim \mathcal{N}(0,\sigma^2 (X^TX)^{-1})$$

I can see how the mean and variance are calculated, but why is this a normal distribution?

Possible duplicate of Help clarify the implication of normality in an Ordinary Least Square (OLS) Regression — Xi'an, Jan 05 '17 at 05:02
Also be aware that under certain regularity conditions, the distribution of $\hat{\beta}$ will be asymptotically normal as the number of observations $n \rightarrow \infty$. For the asymptotic argument, you don't need $\epsilon$ to be normal (but you do need conditions such that a central limit theorem and other asymptotic arguments apply). — Matthew Gunn, Jan 05 '17 at 08:39

Ben · Answer 1 · 2020-03-25T11:53:29.677

4

In classical statistics the parameter value $\beta$ in a linear regression model is an unknown constant. The value $\hat{\beta}$ is not a parameter; it is an estimator of the parameter, which is a function of the data. The reason this estimator is normally distributed is that it is a linear (more precisely, affine) function of the underlying error vector, as written in the equation you have shown, and that error vector is normally distributed under the model assumptions. Any affine function of a multivariate normal vector is itself multivariate normal, so normality of $\varepsilon$ carries over to $\hat{\beta}$. (Note that even if you relax the normality assumption, the parameter estimator will still be a summation quantity for which you can invoke the CLT under fairly general conditions; so the distribution of the parameter estimator will converge to the normal distribution under broad conditions even if the error distribution is misspecified.)
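As a rough numerical check of this argument, here is a minimal simulation sketch using NumPy. The sample size, coefficients, noise level and design matrix are all made-up values chosen for illustration; nothing here comes from the original question. It holds $X$ fixed, redraws normal errors many times, and compares the empirical covariance of $\hat\beta - \beta$ with $\sigma^2 (X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (illustrative values only).
n, p = 50, 3                          # observations, regressors
sigma = 2.0                           # noise standard deviation
beta = np.array([1.0, -0.5, 3.0])     # "true" coefficients
X = rng.normal(size=(n, p))           # design matrix, drawn once and then held fixed

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                     # beta_hat = A @ Y = beta + A @ eps

# Simulate many datasets that share X but have fresh normal errors.
n_sims = 20000
eps = rng.normal(scale=sigma, size=(n_sims, n))   # one row per dataset
Y = X @ beta + eps                                # shape (n_sims, n)
beta_hat = Y @ A.T                                # shape (n_sims, p)

errors = beta_hat - beta
empirical_cov = np.cov(errors, rowvar=False)
theoretical_cov = sigma**2 * XtX_inv

# Fraction of standardized errors (first coefficient) inside +/- 1.96,
# which should be close to 0.95 if the estimator is normal.
z = errors[:, 0] / np.sqrt(theoretical_cov[0, 0])
coverage = np.mean(np.abs(z) < 1.96)

print("max abs covariance gap:", np.abs(empirical_cov - theoretical_cov).max())
print("coverage of +/-1.96:", coverage)
```

With normal errors the agreement holds at any sample size, because the result is exact rather than asymptotic.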

edited Mar 25 '20 at 11:53

answered Jul 26 '18 at 04:59


What do you mean by the estimator being a summation quantity? Do you mean that it's a summation over the random unobserved errors? – 24n8 Jul 10 '20 at 03:06
Yes. You can see from the above that it is of the form $\hat{\beta} = \beta + A \varepsilon$ where $A$ is an appropriate transformation matrix. This means that the estimator is an affine transformation of $\varepsilon$ – Ben Jul 10 '20 at 03:13
Ah yes, so this seems to imply that the CLT also applies to weighted sums? Also, I think one issue with the CLT here could potentially be if the errors aren't independently distributed. In that case, the CLT would not work? – 24n8 Jul 10 '20 at 03:22
Yes, but it depends on specifics. There are broad CLTs that apply for weighted sums and for correlated errors. Broadly speaking, these work only if the correlation is weak enough, and the weighting is diffuse enough that no finite set of error terms "dominate" the sum in the limit. – Ben Jul 10 '20 at 03:33
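The weighted-sum point in these comments can be illustrated with a small simulation. This is only a sketch under assumed conditions: independent centered-exponential errors (skewness 2), a made-up asymmetric design, and a hypothetical helper `slope_skewness`. It does not cover the correlated-error case. It measures how skewed the OLS slope estimate is for a small and a larger sample.

```python
import numpy as np

def slope_skewness(n, n_sims=20000, seed=0):
    """Sample skewness of the OLS slope estimate under skewed, non-normal errors."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n) ** 2        # hypothetical, deliberately asymmetric design
    X = np.column_stack([np.ones(n), x])     # intercept plus one regressor
    A = np.linalg.inv(X.T @ X) @ X.T         # beta_hat = beta + A @ eps
    # Independent centered exponential errors: mean 0, variance 1, skewness 2.
    eps = rng.exponential(scale=1.0, size=(n_sims, n)) - 1.0
    slope_err = eps @ A[1]                   # slope_hat - slope, one value per dataset
    z = (slope_err - slope_err.mean()) / slope_err.std()
    return float(np.mean(z ** 3))

skew_small = slope_skewness(n=5)
skew_large = slope_skewness(n=200)
print(f"skewness of slope estimate, n=5:   {skew_small:.3f}")
print(f"skewness of slope estimate, n=200: {skew_large:.3f}")
```

In this setup the skewness should shrink toward zero as $n$ grows, since no single weight in the row of $A$ dominates the sum. A design where one observation has very high leverage would behave differently.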
