This is something which has always confused me.

Suppose we take a standard linear regression model:

\begin{equation} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \end{equation}

where:

  • $Y_i$ is the response variable
  • $X_i$ is the predictor variable
  • $\beta_0$ and $\beta_1$ are the coefficients
  • $\epsilon_i$ is the error term.

The error terms $\epsilon_i$ are assumed to follow a normal distribution with mean $0$ and constant variance $\sigma^2$, i.e.,

\begin{equation} \epsilon_i \sim N(0, \sigma^2) \end{equation}
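
For concreteness, here is a minimal simulation sketch of this model in Python (the parameter values are arbitrary, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(42)

    # Arbitrary illustrative values (not implied by the model itself)
    beta0, beta1, sigma, n = 1.0, 2.0, 0.5, 1000

    x = rng.uniform(0, 10, n)          # predictor values
    eps = rng.normal(0, sigma, n)      # error terms: N(0, sigma^2)
    y = beta0 + beta1 * x + eps        # response generated by the model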

Given this information, we can find the Expected Value of $Y_i$ (treating $X_i$ as a fixed, known value, so that $E(X_i) = X_i$):

\begin{align*} E(Y_i) &= E(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= E(\beta_0) + E(\beta_1 X_i) + E(\epsilon_i) \\ &= \beta_0 + \beta_1 E(X_i) + 0 \\ &= \beta_0 + \beta_1 X_i \end{align*}

Given this information, we can also find the Variance of $Y_i$ (again treating $X_i$ as fixed, so that $Var(X_i) = 0$):

\begin{align*} Var(Y_i) &= Var(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= Var(\beta_0) + Var(\beta_1 X_i) + Var(\epsilon_i) \\ &= 0 + \beta_1^2 Var(X_i) + \sigma^2 \\ &= 0 + 0 + \sigma^2 \\ &= \sigma^2 \end{align*}
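
If I make the conditioning on $X_i$ explicit, the two results above become:

\begin{align*} E(Y_i \mid X_i = x_i) &= \beta_0 + \beta_1 x_i \\ Var(Y_i \mid X_i = x_i) &= \sigma^2 \end{align*}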

Using the mathematical properties of the Normal Distribution (i.e. an affine transformation of a Normally distributed variable is also Normally distributed), and writing $\mathbf{x}_i = (1, X_i)^T$ and $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$ so that $\mathbf{x}_i^T \boldsymbol{\beta} = \beta_0 + \beta_1 X_i$, we usually write the following:

\begin{equation} P(Y_i) \sim N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2) \end{equation}

My Question: When I look at the above formula, I always think that it is referring to the Marginal Probability Distribution of $Y_i$. But since this distribution depends on $\mathbf{x}_i$, is this not actually the Conditional Distribution?

Thus, I think it would be more correct to write the above formula as:

\begin{equation} P(Y_i | \mathbf{x}_i) \sim N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2) \end{equation}
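
As a quick sanity check of this reading, here is a small numerical sketch in Python (with made-up parameter values): fixing a single value $x_i$ and simulating $Y_i$ repeatedly should give values whose mean and standard deviation match $\beta_0 + \beta_1 x_i$ and $\sigma$.

    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up illustrative values
    beta0, beta1, sigma = 1.0, 2.0, 0.5
    x_fixed = 3.0                      # condition on X_i = x_i
    reps = 100_000

    # Simulate Y_i | X_i = x_fixed many times
    y_given_x = beta0 + beta1 * x_fixed + rng.normal(0, sigma, reps)

    print(y_given_x.mean(), beta0 + beta1 * x_fixed)   # ~7.0 vs 7.0
    print(y_given_x.std(), sigma)                      # ~0.5 vs 0.5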

Can someone please tell me if I am correct?

Thanks!

  • Note: Assuming the analysis I provided here is correct, it can serve to clarify one of the common misconceptions in Probability: in a standard regression model, $Y_i$ by itself does NOT need to have a Normal Distribution! Rather, $Y_i \mid \mathbf{x}_i$ needs to be Normally Distributed!

  • This means that if I plot a histogram of the response variable (i.e. $Y_i$) and it does not look Normally Distributed, this in and of itself is not a solid indication that a standard regression model is unsuitable. Rather, we need to study the conditional distribution $P(Y_i \mid \mathbf{x}_i)$ before drawing such conclusions (see the simulation sketch below).
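
For illustration, here is a minimal simulation sketch in Python (again with made-up parameter values): the marginal histogram of $Y_i$ looks clearly non-Normal simply because the predictor is bimodal, even though the conditional distribution of $Y_i$ given $x_i$ (equivalently, the errors) is exactly Normal.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up illustrative values
    beta0, beta1, sigma, n = 1.0, 3.0, 0.5, 10_000

    # A strongly bimodal predictor, so the marginal distribution of Y is bimodal too
    x = np.concatenate([rng.normal(-2, 0.3, n // 2), rng.normal(2, 0.3, n // 2)])

    # Data generated exactly according to Y_i = beta0 + beta1*X_i + eps_i, eps_i ~ N(0, sigma^2)
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)

    # Marginal Y: the histogram counts show two separate modes (clearly non-Normal)
    counts, edges = np.histogram(y, bins=30)
    print(counts)

    # Conditional distribution: the errors y - (beta0 + beta1*x) are N(0, sigma^2) by construction
    resid = y - (beta0 + beta1 * x)
    print(resid.mean(), resid.std())   # approximately 0 and sigma

In this sketch every model assumption holds exactly, yet a histogram of $y$ alone would look alarming; only the conditional (error) distribution is Normal.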

  • In your very first formula, the right hand side includes $X_i.$ It therefore either is an expectation parameterized by $X_i$ or it is conditional on $X_i.$ To avoid confusion, your notation on the left ought to reflect that. – whuber Jul 23 '23 at 17:02
  • This was a question that I read prior to posting my question. My question is more about how to correctly write the notation i.e. P(Y) vs P(Y given X) . But I think I understand it now. Thank you so much! – stats_noob Jul 23 '23 at 17:08
  • Perhaps, then, it is worth reopening this to allow you to close this out with a self-answer about the notation. Note that self-answers are less common than answers by others but are completely within the etiquette of Cross Validated. I have asked questions on here only to discover the answer and post a self-answer at a later time. – Dave Jul 23 '23 at 17:12
  • @Whuber: just to clarify : is the correct notation: E(Y given X = xi) = b0 + b1*xi ? – stats_noob Jul 23 '23 at 20:11
  • You may use any notation you wish, provided it's sufficiently clear and that you define it. A conventional notation for conditional probabilities would be $\Pr(Y_i\mid x_i)$ while a notation for parameterized probabilities would be $\Pr(Y_i;x_i).$ But many people don't even distinguish the two concepts, despite the fact that in the first instance $x_i$ is a random variable and in the second instance it is not. – whuber Jul 24 '23 at 11:42
