
When we have a multivariate regression function, which assumption has to hold so that the OLS assumptions are not violated:

a) each variable must follow a normal distribution
b) all variables together must follow a normal distribution
c) only the dependent variable must follow a normal distribution
d) the residuals of the regression function must follow a normal distribution?

I'm a little confused, to be honest. Thanks in advance for your answers!

TFT
  • Welcome to Cross Validated! Is this a homework problem or part of some [tag:self-study]? Please say what progress you have made. – Dave May 22 '22 at 17:59
  • @Dave Thanks! No, this is not a homework problem. Why are you asking that? Are people usually posting homework problems here? I just thought that enumerating the possible answers would make it easier to reply. Maybe none of the options I gave is correct. – TFT May 22 '22 at 18:03
  •
    The term "multivariate regression" is best reserved for a model with multiple outcome or response variables. It's often confused with multiple regression, which implies multiple predictors. The term "multiple" is itself losing point as (1) several predictors are used in most applications (2) regression can be explained directly as allowing one predictor or many. – Nick Cox May 22 '22 at 18:25
  • @NickCox Thank you for your clarification. I didn't know that but will use both terms as you explained from now on. – TFT May 22 '22 at 18:42
  • who makes up these assumptions? – Aksakal May 22 '22 at 22:51

1 Answer


The first two are totally wrong but are common misconceptions about the normality assumption in OLS regression (when we choose to make such an assumption, which we don’t have to do).

There is no distribution assumption about the predictor variables, and there certainly is no normality assumption. For instance, ANOVA can be seen as an OLS regression, and ANOVA uses binary predictor variables, which certainly aren’t normal. Thus, A is false.
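The ANOVA point is easy to verify numerically. Here is a minimal numpy sketch (the group labels, coefficients, and seed are made up for illustration): with a binary predictor, the OLS slope is exactly the difference of group means, i.e. the ANOVA contrast, even though the predictor is as far from normal as a variable can be.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-group "ANOVA" as OLS: the predictor is a binary group indicator,
# which is obviously not normally distributed.
group = rng.integers(0, 2, size=200)            # 0/1 group labels
y = 5.0 + 2.0 * group + rng.normal(0, 1, 200)   # true group difference = 2

X = np.column_stack([np.ones(200), group])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fitted slope equals the difference of the two group means.
diff_of_means = y[group == 1].mean() - y[group == 0].mean()
print(beta[1], diff_of_means)
```

The two printed numbers agree to machine precision, which is the sense in which a two-group ANOVA is just an OLS regression on a dummy variable.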

Since none of the features have to be normal, the features do not have to be jointly normal, and B is false.

C represents another common misconception. The marginal distribution of $y$ has no particular assumption. It is common to see people transform a non-normal $y$ to achieve marginal normality. While there are legitimate reasons for transforming $y$, it is far from a necessity.
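A small simulation makes the point about the marginal distribution of $y$ concrete (a sketch with made-up parameters and seed): give the model a skewed predictor and exactly normal errors, and $y$ comes out marginally skewed. Transforming $y$ toward normality here would attack a symptom of the predictor's distribution, not any model violation.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.exponential(scale=2.0, size=5000)   # highly skewed predictor
eps = rng.normal(0, 1, size=5000)           # errors are exactly normal
y = 1.0 + 3.0 * x + eps                     # y inherits x's skewness

def skewness(v):
    """Sample skewness: ~0 for normal data."""
    v = v - v.mean()
    return (v**3).mean() / (v**2).mean() ** 1.5

# y is strongly right-skewed even though the error term is Gaussian.
print(skewness(y), skewness(eps))
```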

D is the closest to a correct answer, but I dispute it on a technicality. The observed residuals are not the quantities assumed to be normal; the unobserved errors are. The residuals satisfy linear constraints imposed by the fit (with an intercept, they sum to exactly zero and are orthogonal to the predictors), so they cannot be independent normal draws even when the errors are. However, many people use "residuals" and "errors" as synonyms. This slang tends not to cause problems in practice, but only once we know the real definitions.
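The residual/error distinction can be seen directly in simulation (a sketch with arbitrary seed and sample size): the true errors are free to sum to anything, while the fitted residuals from a model with an intercept are forced to sum to exactly zero.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
x = rng.normal(size=n)
errors = rng.normal(size=n)        # the unobserved, truly normal errors
y = 1.0 + 2.0 * x + errors

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta           # the observed residuals

# With an intercept, residuals are constrained to sum to (numerically) zero;
# the true errors are under no such constraint.
print(residuals.sum(), errors.sum())
```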

When the errors are independent, identically distributed Gaussians, the OLS solution coincides with maximum likelihood estimation of the regression parameters, and that justifies the usual inference on individual coefficients and on nested models via t-statistics and F-statistics, respectively.
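As a sketch of what that inference looks like in the simple-regression case (made-up data and seed; the standard-error formula is the classical one under the i.i.d. Gaussian error model):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 50
x = rng.normal(size=n)
y = 1.0 + 0.0 * x + rng.normal(size=n)   # true slope is zero

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Classical OLS standard errors under the i.i.d. Gaussian error model
sigma2 = resid @ resid / (n - 2)          # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)     # estimated Var(beta_hat)
t_slope = beta[1] / np.sqrt(cov[1, 1])    # t-stat for H0: slope = 0

print(t_slope)
```

Under the Gaussian error model this statistic has a $t_{n-2}$ distribution, which is what licenses the usual p-values and confidence intervals.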

(The t-stats and F-stats wind up being pretty robust to violations of the normality assumption, particularly with large sample sizes.)
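That robustness claim can be checked with a quick Monte Carlo (a sketch; the uniform error distribution, sample size, and replication count are arbitrary choices): even with markedly non-normal errors, the slope t-test rejects a true null at roughly its nominal 5% rate.

```python
import numpy as np

rng = np.random.default_rng(4)

n, reps, crit = 100, 2000, 1.984   # crit ~ t_{0.975, 98}
rejections = 0
for _ in range(reps):
    x = rng.normal(size=n)
    y = 1.0 + rng.uniform(-2, 2, size=n)   # true slope = 0, uniform errors
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    se = np.sqrt(resid @ resid / (n - 2) * np.linalg.inv(X.T @ X)[1, 1])
    rejections += abs(beta[1] / se) > crit

print(rejections / reps)   # close to the nominal 0.05
```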

Dave
  •
    Thank you so much for your super quick reply. Now it's perfectly clear to me. I am writing my bachelor thesis right now and got confused about the normality assumption. Thank you so much! – TFT May 22 '22 at 18:05
  •
    When the errors are independent and identical Gaussians <...> that allows us to do inference on the coefficients and on nested models may mislead some readers to think that inference on the coefficients and on nested models is not allowed when the errors are something else than i.i.d. Gaussian. – Richard Hardy May 22 '22 at 19:46