This is something which has always confused me.
Suppose we take a standard statistical regression model:
\begin{equation} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \end{equation}
where:
- $Y_i$ is the response variable
- $X_i$ is the predictor variable
- $\beta_0$ and $\beta_1$ are the coefficients
- $\epsilon_i$ is the error term.
The error terms $\epsilon_i$ are assumed to follow a normal distribution with mean 0 and constant variance $\sigma^2$, i.e.,
\begin{equation} \epsilon_i \sim N(0, \sigma^2) \end{equation}
Given this information, and treating $X_i$ as a fixed, known value (so that $E(X_i) = X_i$), we can find the expected value of $Y_i$:
\begin{align*} E(Y_i) &= E(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= E(\beta_0) + E(\beta_1 X_i) + E(\epsilon_i) \\ &= \beta_0 + \beta_1 E(X_i) + 0 \\ &= \beta_0 + \beta_1 X_i \end{align*}
Similarly, we can find the variance of $Y_i$ (again with $X_i$ fixed, so that $Var(X_i) = 0$ and no covariance terms appear):
\begin{align*} Var(Y_i) &= Var(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= Var(\beta_0) + Var(\beta_1 X_i) + Var(\epsilon_i) \\ &= 0 + \beta_1^2 Var(X_i) + \sigma^2 \\ &= 0 + 0 + \sigma^2 \\ &= \sigma^2 \end{align*}
Using the properties of the Normal Distribution (i.e. a Normally Distributed variable shifted by a constant is still Normally Distributed), and writing $\mathbf{x}_i = (1, X_i)^T$ and $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$, we usually write the following:
\begin{equation} Y_i \sim N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2) \end{equation}
My Question: When I look at the above formula, I always think that it refers to the Marginal Probability Distribution of $Y_i$. But since this distribution depends on $\mathbf{x}_i$, is it not actually the Conditional Distribution?
Thus, I think it would be more correct to write the above formula as:
\begin{equation} Y_i \mid \mathbf{x}_i \sim N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2) \end{equation}
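To make the distinction concrete, here is a sketch of what I think the marginal moments would be if $X_i$ were treated as random rather than fixed. The symbols $\mu_X$ and $\sigma_X^2$ (the mean and variance of $X_i$) and the independence of $X_i$ and $\epsilon_i$ are my own assumptions, not part of the model above:
\begin{align*} E(Y_i) &= \beta_0 + \beta_1 E(X_i) + E(\epsilon_i) = \beta_0 + \beta_1 \mu_X \\ Var(Y_i) &= \beta_1^2 Var(X_i) + Var(\epsilon_i) = \beta_1^2 \sigma_X^2 + \sigma^2 \end{align*}
If this is right, these differ from $\beta_0 + \beta_1 X_i$ and $\sigma^2$, which would then be the conditional mean and variance given $X_i$, and the marginal distribution of $Y_i$ would generally not be Normal unless $X_i$ is.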
Can someone please tell me if I am correct?
Thanks!
Note: Assuming the analysis above is correct, it can help clarify a common misconception about regression: in a standard regression model, $Y_i$ by itself does NOT need to have a Normal Distribution! Rather, $Y_i \mid \mathbf{x}_i$ needs to be Normally Distributed!
This means that if I plot a histogram of the response variable (i.e. $Y_i$) and it does not look Normally Distributed - this in and of itself is not a solid indication that a standard regression model is not suitable. Rather, we need to study the conditional distribution $P(Y_i | \mathbf{x}_i)$ before making such conclusions.
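A small simulation can illustrate the point above. This is only a sketch under my own assumptions: the predictor is drawn from an exponential distribution, the coefficient values and sample size are arbitrary, and the helper names (`sample_skewness`, `simulate`) are hypothetical. It compares the skewness of the marginal $Y_i$ with the skewness of $Y_i - (\beta_0 + \beta_1 X_i)$, using the true coefficients rather than fitted ones for simplicity. A Normal Distribution has skewness 0.

```python
import random
import statistics


def sample_skewness(values):
    """Return the (biased) sample skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


def simulate(n=20000, beta0=1.0, beta1=2.0, sigma=1.0, seed=0):
    """Simulate Y = beta0 + beta1 * X + eps with a skewed (exponential) X.

    All parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0) for _ in range(n)]          # non-Normal predictor
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
    # Deviations of Y from its conditional mean given X (true coefficients).
    errors = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    return xs, ys, errors


xs, ys, errors = simulate()
marginal_skew = sample_skewness(ys)         # skewness of Y on its own
conditional_skew = sample_skewness(errors)  # skewness of Y around its conditional mean

print(f"skewness of marginal Y:          {marginal_skew:.3f}")
print(f"skewness of Y - E(Y | X):        {conditional_skew:.3f}")
```

In this setup a histogram of `ys` should look clearly right-skewed, while the deviations from the conditional mean should look roughly symmetric, even though the Normal-error model holds exactly by construction.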