
I am trying to understand how simulation can be used to check if (regression) model assumptions are met.

For example here is a linear regression model: $$y = \beta_0 + \beta_1x + \epsilon$$

I understand that the errors and the response have the following distributions:

$$\epsilon \sim N(0, \sigma^2)$$

$$f(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{\epsilon^2}{2\sigma^2}}$$

$$y|x \sim N(\beta_0 + \beta_1x, \sigma^2)$$

$$f(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\beta_0 - \beta_1x)^2}{2\sigma^2}}$$

Once the model has been fit on some data, the estimates have the following sampling distributions:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x$$

$$\hat{\epsilon} = y - \hat{y}$$

$$\hat{\epsilon} \sim N(0, \sigma^2)$$

$$\hat{y}|x \sim N\left(\beta_0 + \beta_1x,\ \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)\right)$$

$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\right)$$

$$\hat{\beta}_0 \sim N\left(\beta_0, \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)\right)$$
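These sampling distributions can themselves be checked by Monte Carlo: fix true parameters, repeatedly generate data from the model, refit, and compare the empirical mean and variance of $\hat{\beta}_1$ against the formula. A minimal sketch (assuming Python/NumPy; the true parameters, design points, and replication count below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma, n = 1.0, 2.0, 1.5, 50   # illustrative "true" values
x = np.linspace(0, 10, n)
Sxx = np.sum((x - x.mean()) ** 2)

# Repeatedly draw y from the true model and refit by least squares.
b1_hats = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b1_hats.append(b1_hat)

b1_hats = np.array(b1_hats)
# Empirical mean and variance should match beta1 and sigma^2 / Sxx.
print(b1_hats.mean(), b1_hats.var())
print(beta1, sigma**2 / Sxx)
```

The empirical moments converge to the theoretical ones as the number of replications grows; a histogram or QQ plot of `b1_hats` would likewise look normal here because the errors were generated as normal.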

From here, using the properties of the normal distribution (sums and differences of jointly normal random variables are again normal), I can see that $\hat{\epsilon}$ can ONLY be normally distributed if the distribution of the observed $y$ conditional on $x$ (i.e. $y|x$), $\hat{\beta}_0$ and $\hat{\beta}_1$ are ALL normally distributed. If even one of them is not normally distributed, then (in theory) $\hat{\epsilon}$ will not be normally distributed. (I can also see that if $\sigma$ is non-constant, i.e. the errors are heteroskedastic, then the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ might not be normal.)
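One way to probe this claim directly is to simulate data with deliberately non-normal errors and look at the residuals of the fitted line. A sketch (assuming Python/NumPy/SciPy; the mean-zero exponential errors and the sample size are arbitrary choices made to violate normality on purpose):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, n)
# Skewed, mean-zero errors: the normality assumption is violated on purpose.
eps = rng.exponential(scale=1.0, size=n) - 1.0
y = 1.0 + 2.0 * x + eps

# Ordinary least squares fit.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# A normality test on the residuals should reject for this skewed sample.
print(stats.shapiro(resid).pvalue)
```

With skewed errors the residuals inherit the skewness, so a QQ plot of `resid` bends away from the diagonal and the Shapiro-Wilk p-value is small, which is exactly the behaviour the argument above predicts.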

So here is my question: How can simulating new data points be used to help check if the model assumptions are being met?

I can understand the logic behind a cross-validation style approach: fit the model on 70% of the data, and check whether the prediction errors (i.e. residuals) on the other 30% of the data are normally distributed (e.g. using QQ plots or a Kolmogorov-Smirnov test).
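That holdout idea can be sketched as follows (assuming Python/NumPy/SciPy; the 70/30 split mirrors the text, and the data here are generated from a model that actually satisfies the assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

# 70/30 train/test split.
idx = rng.permutation(n)
train, test = idx[:int(0.7 * n)], idx[int(0.7 * n):]

# Fit OLS on the training set only.
xt, yt = x[train], y[train]
b1 = np.sum((xt - xt.mean()) * (yt - yt.mean())) / np.sum((xt - xt.mean()) ** 2)
b0 = yt.mean() - b1 * xt.mean()

# Out-of-sample residuals, then a KS test against a fitted normal.
resid = y[test] - (b0 + b1 * x[test])
pval = stats.kstest(resid, "norm", args=(resid.mean(), resid.std(ddof=1))).pvalue
print(pval)
```

One caveat: estimating the mean and standard deviation from the same residuals makes the plain KS critical values only approximate (a Lilliefors-type correction would be needed for a strict test), so this is a diagnostic sketch rather than a rigorous procedure.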

But how can you simulate completely new pairs of data points ($x_i, y_i$) and use them to test if the model assumptions are being met? And how will this help?

  • Step 1: For example, I can choose some value of $x_i$ and simulate multiple values of $y_i$ for this $x_i$. Each $y_i$ is a random realization from $N\left(\hat{\beta}_0 + \hat{\beta}_1x_i, \hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_j - \bar{x})^2} \right)\right)$.
  • Step 2: I can then make my model predict the value of each $\hat{y}_i$.
  • Step 3: Then, I can check the distribution of all $\hat{y}_i$ to see if they are normally distributed.
  • I can repeat Steps 1-3 for new values of $x_i$ and use a QQ plot to check if all these errors/residuals are (collectively) normally distributed.
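The steps above can be sketched like this (assuming Python/NumPy; the fitted coefficients, training design, and chosen new $x$ values are all illustrative placeholders, not estimates from real data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for quantities from a fitted model (illustrative values only).
b0_hat, b1_hat, sigma_hat = 1.1, 1.9, 1.0
x_train = np.linspace(0, 10, 50)
n = len(x_train)
Sxx = np.sum((x_train - x_train.mean()) ** 2)

def simulate_ys(x_new, reps=1000):
    """Step 1: draw y values at x_new from the normal stated in the question."""
    var = sigma_hat**2 * (1.0 / n + (x_new - x_train.mean()) ** 2 / Sxx)
    return rng.normal(b0_hat + b1_hat * x_new, np.sqrt(var), reps)

# Steps 2-3: predict at each new x and collect the residuals to inspect.
resids = []
for x_new in [2.0, 5.0, 8.0]:
    ys = simulate_ys(x_new)
    y_pred = b0_hat + b1_hat * x_new      # Step 2: model prediction
    resids.extend(ys - y_pred)            # Step 3: residuals to check

resids = np.array(resids)
# Note: these residuals are normal by construction, since each batch of
# y values was drawn from a normal distribution in Step 1.
print(resids.mean(), resids.std())
```

As the final comment in the code notes, the simulated $y_i$ here are drawn from a normal by construction, so a QQ plot of these residuals will look normal whatever the original data looked like.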

The way I see it, in this simulation, all $\hat{\epsilon}_i = y_i - \hat{y}_i$ will only be normally distributed if my regression model always produces normally distributed $\hat{y}_i$. And my regression model can only produce normally distributed $\hat{y}_i$ if the data used to create the regression model was conditionally normal.

Is this the correct understanding? Or is this a circular argument, such that the errors from the simulated points will necessarily be normally distributed even if the model assumptions are not met?

  • You state that $$\hat{\epsilon} \sim N(0, \sigma^2)$$ If the $y$ are normally distributed, then indeed $\hat{\epsilon}$ follows a normal distribution as well; however, it is a slightly more complex expression. (And also, the true relationship between $y$ and $x$ needs to be in the space of the models, or otherwise there will be a bias as well.) The distribution of the residuals is not the same as the distribution of the error terms. – Sextus Empiricus Feb 26 '24 at 15:39
  • Also the expressions for $\hat{y}|x$, $\hat{\beta}_0$ and $\hat{\beta}_1$ are not so clear. – Sextus Empiricus Feb 26 '24 at 15:44
  • As an example of the residuals, you can consider the simulation `r = replicate(10^4, {x = 1:5; y = x + rnorm(5); m = lm(y~x); m$residuals[c(1,3)]}); plot(t(r)); var(r[1,]); var(r[2,])` – Sextus Empiricus Feb 26 '24 at 15:55
  • By $\hat{\epsilon}$ you are referring to the residuals or to estimates of the error terms? – Sextus Empiricus Feb 26 '24 at 15:58
  • $\hat{\epsilon}$ is the estimate of the error terms – Uk rain troll Feb 26 '24 at 18:19
  • $\hat{\beta}_0$, $\hat{\beta}_1$ are the distributions of the parameter estimates (I think they are asymptotically normal) – Uk rain troll Feb 26 '24 at 18:21
  • I am feeling lost and overwhelmed by the different notations and distributions... can you please help me sort this out? What are the correct theoretical/empirical distributions? And how can simulating new data points assist in confirming whether the assumptions are met or not? Thank you so much. – Uk rain troll Feb 26 '24 at 18:23
  • Do you mean simulating new data points or measuring new data points? – Sextus Empiricus Feb 26 '24 at 20:02
  • I may post an answer about the distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$, but I do not understand what you mean by $\hat{\epsilon}$. – Sextus Empiricus Feb 26 '24 at 20:03
  • I thought that epsilon hat is the observed residual? In practice, these observed residuals are supposed to have a normal distribution... does this mean that the model fits well? – Uk rain troll Feb 26 '24 at 20:56
  • I am just confused... is it really possible to simulate completely new pairs of points ... and then use them to check if the model and data satisfy the required assumptions? – Uk rain troll Feb 26 '24 at 20:56
  • I always hear that simulation is important in statistics and simulation can be used to see if statistical assumptions are being met... but it is unclear to me how and why? – Uk rain troll Feb 26 '24 at 20:59
  • The hat symbol $\hat{}$ is used for estimates. The errors $\epsilon$ are normally not estimated (instead the parameters of the error distribution, like variance, are estimated). – Sextus Empiricus Feb 26 '24 at 21:18
  • "is it really possible to simulate completely new pairs of points" I don't understand what you mean by simulation in this comment and in the question. – Sextus Empiricus Feb 26 '24 at 21:19
  • My apologies... I think I have overcomplicated everything. Let me restart. Suppose I fit a simple linear regression model to some data. How can simulation help me determine if the necessary statistical assumptions are being met? Thank you so much. – Uk rain troll Feb 26 '24 at 22:40
  • Hi Sextus... does my latest comment make sense? – Uk rain troll Feb 27 '24 at 16:00
  • There is no point simulating based on the statistical assumptions to see whether the simulated output satisfies these assumptions, if that is what you are asking. – Graham Bornholt Feb 27 '24 at 18:43

0 Answers