0

This is the formalization for a continuous regression model as I understand it:

Assume that your outcome $Y$ is normally distributed. Assume that your predictors, $X_1, X_2, \cdots X_n$ are normally distributed and uncorrelated.

We then model $$ Y \sim a_1 X_1 + a_2 X_2 + \cdots a_n X_n + \epsilon $$ with $\epsilon \sim N(0, \sigma_\epsilon$). Since the sum of normal random variables is normal, $Y$ will be normal.

However, most regressions also include categorical variables. These are encoded as zero one vectors. In the simple case of a binary categorical that takes on one of two values, I can model it as a Bernoulli random variable, $B \sim \text{Ber}(p)$.

But when I write $$ Y \sim aX + bB + \epsilon $$

then $Y$ is no longer normally distributed. In fact according to this question, it appears that the $Y$ is a mixture of Gaussians.

I am confused because I thought that one of the assumptions of regression is that all of your observations (independent and dependent) are normally distributed. Yet, the above formulation suggests that $Y$ does not need to be normal.

I have two questions. First, does this result mean that a regression with categorical variables is a different way of fitting a mixture of gaussians as suggested in this question?

Second, how could I include non-binary categorical variables in the above formalization? What distribution would I use?

  • Welcome to Cross Validated! I believe my linked answer addresses your question. If any questions remain, please do comment. A closed post can be reopened if it turns out not to be a duplicate (which comments here might reveal to be the case). – Dave Apr 26 '23 at 19:45
  • Specifically, your first question is answered by noting that the regression assumptions concerns the conditional distributions of the Y variable, not the marginal distribution. – BigBendRegion Apr 27 '23 at 11:23

0 Answers0