I am trying to understand how simulation can be used to check if (regression) model assumptions are met.
For example here is a linear regression model: $$y = \beta_0 + \beta_1x + \epsilon$$
I understand that the errors and the response have the following distributions:
$$\epsilon \sim N(0, \sigma^2)$$
$$f(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{\epsilon^2}{2\sigma^2}}$$
$$y|x \sim N(\beta_0 + \beta_1x, \sigma^2)$$
$$f(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\beta_0 - \beta_1x)^2}{2\sigma^2}}$$
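To make this setup concrete, here is a minimal Python sketch (with made-up illustrative values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$) that draws many responses at a single fixed $x$ and checks the conditional mean and variance against the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "true" parameters, purely for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Draw many responses at one fixed x; the model says y|x ~ N(beta0 + beta1*x, sigma^2)
x = 3.0
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=100_000)

print(y.mean())  # close to beta0 + beta1*x = 3.5
print(y.var())   # close to sigma^2 = 1.0
```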
Once the model has been fit to some data, we obtain the estimates and their sampling distributions:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x$$
$$\hat{\epsilon} = y - \hat{y}$$ $$\hat{\epsilon} \sim N(0, \sigma^2)$$ (approximately; the variance of the $i$-th residual is actually $\sigma^2(1 - h_{ii})$, slightly less than $\sigma^2$, where $h_{ii}$ is that point's leverage)
$$\hat{y}|x \sim N\left(\beta_0 + \beta_1x,\; \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)\right)$$
$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\right)$$
$$\hat{\beta}_0 \sim N\left(\beta_0, \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)\right)$$
(note that the sampling distributions are centered at the true parameters, not at the estimates; in practice $\sigma^2$ is replaced by its estimate $\hat{\sigma}^2$)
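The sampling distribution of $\hat{\beta}_1$ can itself be checked by simulation. The sketch below (assumed true values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$ and a fixed design $x = 1, \dots, 20$) refits the closed-form OLS slope on many simulated data sets and compares its variance to $\sigma^2 / \sum (x_i - \bar{x})^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

beta0, beta1, sigma = 2.0, 0.5, 1.0   # assumed true values, for illustration
x = np.arange(1, 21, dtype=float)     # fixed design points
Sxx = np.sum((x - x.mean()) ** 2)

b1_hats = np.empty(20_000)
for i in range(b1_hats.size):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    # closed-form OLS slope for simple linear regression
    b1_hats[i] = np.sum((x - x.mean()) * (y - y.mean())) / Sxx

print(b1_hats.mean())        # close to beta1 = 0.5
print(b1_hats.var())         # close to sigma^2 / Sxx
print(sigma ** 2 / Sxx)
```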
From here, using the properties of the normal distribution (sums and differences of independent normal random variables are themselves normal), I can see that $\hat{\epsilon}$ can ONLY be normally distributed if the distribution of the observed $y$ conditional on $x$ (i.e. $y|x$), $\hat{\beta}_0$, and $\hat{\beta}_1$ are ALL normally distributed. If even one of them is not normally distributed, then (in theory) $\hat{\epsilon}$ will not be normally distributed. (I can also see that if $\sigma$ is non-constant, i.e. the errors are heteroskedastic, then the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ might not be normal.)
So here is my question: How can simulating new data points be used to help check if the model assumptions are being met?
I can understand the logic behind a cross-validation style approach: fit the model on 70% of the data, and check whether the prediction errors (i.e. residuals) on the other 30% of the data are normally distributed (e.g. using a QQ plot or a Kolmogorov-Smirnov test).
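That hold-out check can be sketched as follows; the data here are simulated (in practice $(x, y)$ would be observed), and since this is a dependency-free sketch it uses crude moment checks (mean and skewness of the hold-out residuals) in place of a QQ plot or KS test:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data set standing in for observed (x, y) pairs
n = 1000
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

# 70/30 train/test split
idx = rng.permutation(n)
train, test = idx[:700], idx[700:]

# Closed-form OLS fit on the training portion
xt, yt = x[train], y[train]
b1 = np.sum((xt - xt.mean()) * (yt - yt.mean())) / np.sum((xt - xt.mean()) ** 2)
b0 = yt.mean() - b1 * xt.mean()

# Hold-out residuals; under the assumptions they should look N(0, sigma^2)
resid = y[test] - (b0 + b1 * x[test])
print(resid.mean())                      # near 0
skew = np.mean(((resid - resid.mean()) / resid.std()) ** 3)
print(skew)                              # near 0 for normal residuals
```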
But how can you simulate completely new pairs of data points ($x_i, y_i$) and use them to test if the model assumptions are being met? And how will this help?
- Step 1: For example, I can choose some value $x_i$ and simulate multiple values of $y_i$ for this $x_i$. Each $y_i$ will be a random realization from $N\left(\hat{\beta}_0 + \hat{\beta}_1x_i,\; \hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_j - \bar{x})^2} \right)\right)$.
- Step 2: I can then have my model predict $\hat{y}_i$ for each simulated point.
- Step 3: Then, I can check the distribution of all the $\hat{y}_i$ to see if they are normally distributed.
- I can repeat Steps 1-3 for new values of $x_i$ and use a QQ plot to check if all these errors/residuals are (collectively) normally distributed.
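The steps above can be sketched as follows. All fitted quantities ($\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\sigma}$, $n$, $\bar{x}$, $\sum (x_j - \bar{x})^2$) are made-up values standing in for the output of an earlier regression fit:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed fitted quantities from an earlier regression (illustrative values)
b0_hat, b1_hat, sigma_hat = 2.0, 0.5, 1.0
n, xbar, Sxx = 50, 5.0, 400.0            # sample size, mean(x), sum((x_j - xbar)^2)

sim_resid = []
for x_star in [1.0, 3.0, 5.0, 7.0]:      # Step 1: pick x values
    var_star = sigma_hat**2 * (1 / n + (x_star - xbar) ** 2 / Sxx)
    y_sim = rng.normal(b0_hat + b1_hat * x_star, np.sqrt(var_star), size=5000)
    y_hat = b0_hat + b1_hat * x_star     # Step 2: model prediction at x_star
    sim_resid.extend(y_sim - y_hat)      # Step 3: collect simulated residuals

sim_resid = np.asarray(sim_resid)
print(sim_resid.mean())                  # ~ 0, since each y_sim is centered at y_hat
```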
The way I see it, in this simulation, all $\hat{\epsilon}_i = y_i - \hat{y}_i$ will only be normally distributed if my regression model always produces normally distributed $\hat{y}_i$, and my regression model can only produce normally distributed $\hat{y}_i$ if the data used to fit it were conditionally normal.
Is this the correct understanding? Or is this a circular argument, such that the errors from the simulated points will necessarily be normally distributed even if the model assumptions are not met?
A comment from Sextus Empiricus (Feb 26 '24) sketches this simulation in R:

```r
# 10^4 replications: simulate a small data set, fit lm, keep two residuals
r <- replicate(10^4, {
  x <- 1:5
  y <- x + rnorm(5)        # true model: slope 1, N(0, 1) errors
  m <- lm(y ~ x)
  m$residuals[c(1, 3)]
})
plot(t(r))                 # joint distribution of the two residuals
var(r[1, ])                # their variances differ (leverage), even under normality
var(r[2, ])
```