What is the distribution of $E[Y|X]$ (=$X\hat{\beta}$) and $X\hat{\beta} + \epsilon$ in a multivariate linear regression?

This question has been answered in several places, implicitly or explicitly. I am asking it again because:

  1. The other answers never prove why the model prediction is t-distributed or normally distributed.
  2. Some sources claim it is normally distributed, but most claim it is t-distributed.

References

  1. Show confidence limits and prediction limits in scatter plot - uses the t-distribution in the code
  2. Obtaining a formula for prediction limits in a linear model (i.e.: prediction intervals) - claims the model predictions are normally distributed
  3. What is the distribution of the predictions in linear regression? - The answer references section 3.5 of Faraway (2002), but that section does not justify the t-distribution. There is also a comment on the answer that says "The distribution of estimates and predictions is Gaussian. But for the computation of confidence intervals or prediction intervals we use a t-distribution. The question asks for the former." - why?

1 Answer

It is common to assume the errors are $iid$ Gaussian (so the conditional distributions $Y\vert X$ are Gaussian) because maximum likelihood estimation is then equivalent to minimizing the sum of squares. With these $iid$ Gaussian errors, the coefficient estimates themselves are Gaussian, and their t-statistics (each estimate minus the true coefficient, divided by its estimated standard error) are t-distributed in the familiar way. Under the null hypothesis that a coefficient equals zero, this statistic is the usual ratio of the estimate to its standard error.
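To spell out the "familiar way", here is a sketch of the standard derivation. The notation is an assumption of this sketch rather than something fixed above: $X$ is a fixed $n \times p$ design matrix of full column rank, $Y = X\beta + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I_n)$, $x_0$ is a new feature vector, and $y_0 = x_0^T\beta + \epsilon_0$ is a new observation whose error $\epsilon_0$ is independent of $\epsilon$.

$$\hat{\beta} = (X^TX)^{-1}X^TY \sim N\left(\beta,\ \sigma^2 (X^TX)^{-1}\right)$$

$$s^2 = \frac{\lVert Y - X\hat{\beta} \rVert^2}{n-p}, \qquad \frac{(n-p)\,s^2}{\sigma^2} \sim \chi^2_{n-p}, \qquad s^2 \text{ independent of } \hat{\beta}$$

The independence holds because $\hat{\beta}$ and the residual vector $Y - X\hat{\beta}$ are jointly Gaussian and uncorrelated. A standard normal variable divided by the square root of an independent $\chi^2_{n-p}/(n-p)$ variable is $t_{n-p}$, so the unknown $\sigma$ cancels:

$$\frac{x_0^T\hat{\beta} - x_0^T\beta}{s\sqrt{x_0^T (X^TX)^{-1} x_0}} \sim t_{n-p}, \qquad \frac{y_0 - x_0^T\hat{\beta}}{s\sqrt{1 + x_0^T (X^TX)^{-1} x_0}} \sim t_{n-p}$$

The first ratio gives the confidence interval for the conditional mean $x_0^T\beta$ and the second gives the prediction interval for $y_0$. In both cases the numerator is exactly Gaussian; the t-distribution enters only because $\sigma$ is replaced by its estimate $s$.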

However, the Gaussian assumption does not have to hold for you to fit an OLS linear regression and get a good fit. Because of various convergence theorems, even if the errors are fairly non-Gaussian, in large samples the coefficient t-statistics will be nearly t-distributed (again, at least under the null hypothesis). The Gauss-Markov theorem does not even assume Gaussian errors or Gaussian conditional distributions.
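A small simulation can illustrate the large-sample claim. This is only a sketch under assumptions chosen for illustration (a simple regression with one Gaussian feature, centered exponential errors, true slope zero, $n = 200$); the function names are hypothetical. It checks how often the slope's t-statistic exceeds the nominal 5% critical value when the null hypothesis is true.

```python
import math
import random
from statistics import NormalDist, fmean


def slope_t_statistic(rng, n):
    """t-statistic for the slope in a simple regression whose true slope is 0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Intercept 1 plus centered exponential errors: mean 0, variance 1,
    # strongly right-skewed, so clearly non-Gaussian.
    y = [1.0 + (rng.expovariate(1.0) - 1.0) for _ in range(n)]
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)  # unbiased estimate of the error variance
    return slope / math.sqrt(s2 / sxx)


def rejection_rate(n=200, reps=2000, seed=0):
    rng = random.Random(seed)
    # With n - 2 = 198 degrees of freedom the t critical value (about 1.97)
    # is close to the normal one, so the normal quantile is used as a stand-in.
    crit = NormalDist().inv_cdf(0.975)
    hits = sum(abs(slope_t_statistic(rng, n)) > crit for _ in range(reps))
    return hits / reps


rate = rejection_rate()
print(f"empirical rejection rate at nominal 5%: {rate:.3f}")
```

If the approximation is good, the printed rate should sit close to 0.05 despite the skewed errors. Shrinking `n` shows how the approximation degrades in small samples.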

Thus, there isn’t a guarantee about how the conditional distributions are shaped.

And there really isn’t a guarantee or assumption about how the overall distribution of $Y$ is shaped, even in the situation where the errors are $iid$ Gaussian.
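As a concrete illustration of that last point, here is a minimal sketch with made-up numbers: a binary feature and a slope of 10 (both assumptions for this example only). The errors are exactly Gaussian, yet the marginal distribution of $Y$ is bimodal, with almost no mass near its own mean.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)
n = 10_000

# Binary feature, iid standard Gaussian errors, and a large slope.
x = [rng.choice([0, 1]) for _ in range(n)]
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [10.0 * xi + ei for xi, ei in zip(x, eps)]


def share_within_half_sd(values):
    """Fraction of values lying within half a standard deviation of the mean."""
    m, s = fmean(values), pstdev(values)
    return sum(abs(v - m) <= 0.5 * s for v in values) / len(values)


share_eps = share_within_half_sd(eps)  # a Gaussian sample gives about 0.38
share_y = share_within_half_sd(y)      # far smaller: Y has two separate clusters

print(f"errors: {share_eps:.3f}, marginal Y: {share_y:.3f}")
```

The conditional distributions $Y \vert X = 0$ and $Y \vert X = 1$ are both Gaussian here; only the marginal distribution of $Y$ is not, because it is a mixture driven by the distribution of the feature.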

Dave
  • I'm sorry, but I do not understand what you are trying to say. Would it be possible to derive, or point me to a resource that derives, the distribution of $\hat{y}$ and $E[Y|X]$? I know how the parameters are t-distributed under the iid Gaussian error assumption. – figs_and_nuts Mar 02 '24 at 17:46
  • Why should those have any particular distribution? Those quantities depend on the features. – Dave Mar 02 '24 at 17:51
  • I do not know if they have a particular distribution, but if they do not, how can all the answers and the literature (some of which I have linked in my answer) claim a confidence interval? Confidence intervals would be a consequence of a distribution. – figs_and_nuts Mar 02 '24 at 20:57
  • $1)$ There are many incorrect and ambiguous claims out there. $2)$ What exactly is claimed about the confidence interval? (Or do you mean a prediction interval?) $//$ It will help if you clarify exactly what distribution you want to know and what assumptions you are making. Assuming $iid$ Gaussian errors and a model with correct specification leads somewhere different than assuming arbitrary error terms. – Dave Mar 02 '24 at 21:00
  • I am looking for a justification of using the t-distribution for both the confidence interval (for $E[Y|X]$) and the prediction interval (for $\hat{y}$). I am making all the inference assumptions that are made for linear regression: the true DGP is linear, errors are iid normal, homoscedasticity, no multicollinearity, no serial correlation, etc. If I'm making all these assumptions, then what can be said about the distributions of the predictions and of the expected value of the prediction? – figs_and_nuts Mar 02 '24 at 21:12