
I am trying to learn how to derive the distributions of the terms in linear regression models (both the theoretical model and the observed/fitted model):

For example, here is a simple linear regression model: $$y = \beta_0 + \beta_1x + \epsilon$$

Theoretical Model:

$$\epsilon \sim N(0, \sigma^2)$$

$$f(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{\epsilon^2}{2\sigma^2}}$$

$$y|x \sim N(\beta_0 + \beta_1x, \sigma^2)$$

$$f(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\beta_0 - \beta_1x)^2}{2\sigma^2}}$$

Observed Model:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x$$

$$\hat{\epsilon} = y - \hat{y}$$ $$\hat{\epsilon} \sim N(0, \sigma^2)$$

$$\hat{y}|x \sim N(\hat{\beta}_0 + \hat{\beta}_1x, \hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right))$$

$$\hat{\beta}_1 \sim N\left(\hat{\beta}_1, \frac{\hat{\sigma}^2}{\sum (x_i - \bar{x})^2}\right)$$

$$\hat{\beta}_0 \sim N\left(\hat{\beta}_0, \hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)\right)$$

Apparently I have done this incorrectly (see comments: How does simulation help check if model assumptions are met?).

Can someone please show me where this is wrong? We don't define the marginal distribution of $Y$ or $\hat{Y}$, correct?

1 Answer

The only error I can see is in how you specify the distribution of $\hat{\epsilon}$. You wrote $\hat{\epsilon} \sim N(0, \sigma^2)$, which would imply that in any sample the residuals $\hat{\epsilon}$ (or $e$) have the same distribution as the true, unknown error terms $\epsilon$. That would be too good to be true. Consider the following: \begin{align} e &= Y-\hat{Y} \\&= Y - Xb \\&= Y - X(X'X)^{-1}X'Y\\&=(I-X(X'X)^{-1}X')Y \\&= MY \\&= M(X\beta + \epsilon)\\&= MX\beta + M\epsilon \\&= 0 + M\epsilon \\&= M\epsilon \end{align} where the step $MX\beta = 0$ uses $MX = X - X(X'X)^{-1}X'X = X - X = 0$.
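A quick numerical sketch of this identity (the sample size, coefficients, and error scale below are arbitrary choices for illustration, not anything from the question): the OLS residuals computed as $y - Xb$ coincide with $M\epsilon$ (and with $My$) up to floating-point error.

```python
# Verify numerically that e = M·ε, where M = I - X(X'X)^{-1}X' is the
# "residual maker" matrix. All numbers here are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
beta = np.array([1.0, 2.0])                            # true coefficients (assumed)
eps = rng.normal(scale=0.5, size=n)                    # true errors
y = X @ beta + eps

M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)      # residual-maker matrix
b = np.linalg.solve(X.T @ X, X.T @ y)                  # OLS coefficient estimate
e = y - X @ b                                          # residuals

print(np.allclose(e, M @ y))    # True: e = My
print(np.allclose(e, M @ eps))  # True: e = Mε, since MX = 0
```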

So the residuals $e$ are a linear transformation of the true errors $\epsilon$, not the errors themselves. Also, the sum of squared residuals (SSE), $e'e$, is not equal to the sum of squared errors: $e'e = \epsilon'M'M\epsilon = \epsilon'M\epsilon$, with $M'M = M$ because $M$ is symmetric and idempotent. This leads to the conclusion that the expected value of the sum of squared residuals is $E[e'e] = (N-K)\sigma^2$, and hence that an unbiased estimate of the variance of the true error terms is obtained by $s^2 = \frac{e'e}{N-K} = \frac{\sum e_i^2}{N-K}$, with $N$ and $K$ denoting the number of cases and the number of regression coefficients (including the intercept).
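This expectation can also be checked by simulation (a sketch; the design matrix, $\sigma$, and replication count are assumed values): averaging $e'e$ over many draws of $\epsilon$ and dividing by $N-K$ should recover $\sigma^2$.

```python
# Monte Carlo check that E[e'e] = (N-K)·σ², so s² = e'e/(N-K) is unbiased.
# Sample size, σ, and replication count are assumed for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 40, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design, K = 2
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)      # residual-maker matrix

sse = []
for _ in range(20_000):
    eps = rng.normal(scale=sigma, size=n)  # fresh true errors each replication
    e = M @ eps                            # residuals, using e = M·ε
    sse.append(e @ e)

print(np.mean(sse) / (n - k))  # ≈ σ² = 2.25
```

Note that $y$ itself never needs to be formed here, precisely because $e = M\epsilon$ regardless of $\beta$.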

BenP
  • Thank you so much for your answer; I accepted it. If you have time, can you please expand on this? I am still a bit confused. Could you derive the distribution of the observed residuals from the start? – Uk rain troll Mar 02 '24 at 16:53