In linear regression, do the errors overall have a normal distribution, or do the errors at each value of x have a normal distribution?

Question

In linear regression with fixed effects (i.e. with constant $x_i$, not random $X_i$), the model states that

$\epsilon_i \sim N(0, \sigma^2), \quad i = 1, 2, \ldots, n$

Does this say that

1. the overall set of errors has a normal distribution?

or

2. the errors at each value of $x_i$ have a normal distribution?

I think that the answer is #2. I know that the responses ($Y_i$) are normal at each value of $x_i$, not overall. Therefore, it only seems right that the errors are normal at each value of $x_i$, too.

I'm not sure about this. When I look at the answer to this question, I think that #1 may be right.

It is important to understand mathematically what #1 even means. — usul, Mar 23 '24 at 20:06
A pedantic note, but "fixed effects" does not mean the same as "fixed regressors". You seem to refer to the latter. — Durden, Mar 24 '24 at 16:06
Fitting a model using least squares involves no distributional assumptions, it is a pure exercise in calculus. The resulting residuals can be tested for normality, and if that's plausible you can go on to use statistical inference to assess significance, but while assuming i.i.d. normality is great for writing statistics texts it's not a given for real-world data and it's not a necessity for doing least squares regression. — pjs, Mar 24 '24 at 17:16

score 8 · Answer 1 · answered Mar 23 '24 at 00:33

You are correct that this notation addresses each individual error term. Given an $i$, $\varepsilon_i$ is normal.

However, with an added assumption of independence of the error terms, the errors are jointly multivariate normal.
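
To spell out that step, here is a short derivation. It assumes the errors are independent and share the common variance $\sigma^2$ from the question; the vector notation $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ is introduced here only for illustration. Under independence the joint density is the product of the marginal densities:

$$
f(\varepsilon_1, \ldots, \varepsilon_n)
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right)
= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{\varepsilon'\varepsilon}{2\sigma^2}\right)
$$

The right-hand side is the density of the $n$-dimensional normal distribution $N_n(0, \sigma^2 I_n)$. Without independence (or some other statement about the joint distribution), normal marginals alone would not imply joint normality.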

Dave

BenP · Answer 2 · 2024-03-23T17:52:06.323

The second statement (#2) is one of the assumptions made in linear regression. This assumption is typically written as

$\epsilon \mid X \sim N(0, \sigma^2 I)$

where $\epsilon \mid X$ denotes the error given the values of the X variables. See e.g. William Greene, Econometric Analysis, Fourth Edition, page 222. A normal distribution of all error terms pooled together does not guarantee that normality also holds at each separate X value! To give an example, suppose there is only one X variable with two values, 0 and 1.

For $X=0$ the error terms come from a truncated standard normal with lower bound -1 and upper bound +1, which could be written as "truncnorm(-1,+1)". So the mean of these error terms is zero.

For $X=1$, half of the error terms come from truncnorm(-Infinity, -1) and the other half comes from truncnorm(+1, +Infinity). The mean of these error terms is also zero.

So the error terms do not have the same distribution, or the same variance, at the two X values, but that is not the issue in this question.

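The point is that the entire set of errors, pooled over both X values, does have a normal distribution (provided the share of cases with $X=0$ matches $P(|Z|<1) \approx 0.68$ for a standard normal $Z$), whereas for each X value this is NOT true.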
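
A short calculation shows why the pooled errors come out normal. The symbols $\phi$ and $\Phi$ (standard normal density and CDF), $p_0$ (share of cases with $X=0$) and $f_0$, $f_1$ (error densities at $X=0$ and $X=1$) are notation introduced here for illustration, and the sketch assumes $p_0$ is exactly $P(|Z|<1)$:

$$
p_0 = P(|Z| < 1) = 2\Phi(1) - 1 \approx 0.6827, \qquad 1 - p_0 = 2\Phi(-1)
$$

$$
f_0(e) = \frac{\phi(e)\,\mathbf{1}\{|e| < 1\}}{p_0}, \qquad
f_1(e) = \frac{1}{2}\,\frac{\phi(e)\,\mathbf{1}\{e < -1\}}{\Phi(-1)} + \frac{1}{2}\,\frac{\phi(e)\,\mathbf{1}\{e > 1\}}{\Phi(-1)} = \frac{\phi(e)\,\mathbf{1}\{|e| > 1\}}{1 - p_0}
$$

$$
p_0 f_0(e) + (1 - p_0) f_1(e) = \phi(e)\,\mathbf{1}\{|e| < 1\} + \phi(e)\,\mathbf{1}\{|e| > 1\} = \phi(e)
$$

So the mixture over both X values is standard normal, although neither $f_0$ nor $f_1$ is a normal density. With rounded shares such as 0.68 and 0.32 the pooled distribution is normal only approximately, but the difference is far too small to see in a histogram.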

Here is an R script which generates data as in this example.

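library(truncnorm)
library(ggplot2)
set.seed(12345)
n <- 10000
# 68% of the errors (those for x = 0) come from the centre of a standard normal
e0 <- rtruncnorm(0.68*n, a=-1, b=1)
# 32% of the errors (those for x = 1) come from the two tails, half from each
e1 <- c(rtruncnorm(0.16*n, a=-Inf, b=-1), rtruncnorm(0.16*n, a=1, b=Inf))
e <- c(e0, e1)
x <- c(rep(0,0.68*n), rep(1,0.32*n))
y <- 1 + x + e
model <- lm(y ~ x)
resid <- residuals(model)
# histogram of all residuals pooled together
hist(resid, breaks=50)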

The histogram of the estimated errors (residuals) is:

For $X=0$ we get:

hist(resid[x==0], breaks=20)

For $X=1$ we get:

hist(resid[x==1], breaks=20)
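
The visual impression from these histograms can also be checked numerically. The following is a minimal sketch in base R, not part of the original script: `rtrunc()` and `ks_dist()` are hypothetical helpers defined here, and the errors are drawn by inverse-CDF sampling so that the `truncnorm` package is not needed. It computes the Kolmogorov-Smirnov distance between each set of errors and a normal distribution with the same mean and standard deviation. Because the mean and standard deviation are estimated from the data, only the distances are compared; the p-values of `ks.test` would not be reliable in that setting.

```r
# Hypothetical helper: draw n values from a standard normal truncated to (a, b)
# by inverse-CDF sampling (uniform on the CDF scale, mapped back with qnorm).
rtrunc <- function(n, a, b) qnorm(runif(n, pnorm(a), pnorm(b)))

set.seed(12345)
n0 <- 6800  # cases with x = 0 (68%)
n1 <- 3200  # cases with x = 1 (32%)

e0 <- rtrunc(n0, -1, 1)                                      # centre of the normal
e1 <- c(rtrunc(n1 / 2, -Inf, -1), rtrunc(n1 / 2, 1, Inf))    # the two tails
e  <- c(e0, e1)                                              # pooled errors

# Hypothetical helper: KS distance to a normal with matching mean and sd.
ks_dist <- function(z) {
  unname(ks.test(z, "pnorm", mean = mean(z), sd = sd(z))$statistic)
}

d_all <- ks_dist(e)   # pooled errors
d_x0  <- ks_dist(e0)  # errors at x = 0
d_x1  <- ks_dist(e1)  # errors at x = 1
print(c(pooled = d_all, x0 = d_x0, x1 = d_x1))
```

A small distance for the pooled errors together with clearly larger distances within each X value is the numerical counterpart of the three histograms.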

The idea that the errors for each separate X value should be normally distributed is sometimes graphically shown as follows:

For each separate (combination of) X value(s) there is one and the same normal distribution from which the data are randomly and independently drawn. That is the idea behind "ordinary" regression. So, it is not enough to formulate the condition as: "across all cases the errors should be normally distributed".

score 3 · Answer 3 · answered Mar 23 '24 at 00:42

I would say that both statements are correct, but I would read the model as statement #1; statement #2 is perhaps not fully "appropriate".

Note that $y_i \sim N(x_i'\beta, \sigma^2)$ for some fixed $x_i$, and this indeed holds at each value $x_i$, as you correctly point out. On the other hand, $\epsilon_i \sim N(0, \sigma^2)$, which does not depend on $x_i$. Here, $x_i$ is fixed, so it is independent of $\epsilon_i$. Therefore, you are right that the errors at each value of $x_i$ have a normal distribution, but this is because normality holds in general and therefore also for any value of $x_i$. In that sense, it is "superfluous" to say that the normality holds for each value of $x_i$. For $y_i$, this is not the case, as its distribution actually depends on $x_i$.
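
The contrast between $y_i$ and $\epsilon_i$ can be illustrated with a small simulation. This is only a sketch under assumed values: the design (two levels of $x$, 0 and 5), the coefficients (intercept 1, slope 3) and $\sigma = 1$ are made up for illustration. The errors are drawn from one normal distribution regardless of $x$, so they look normal when pooled, while the pooled responses form a mixture of two normals with different means and do not.

```r
set.seed(1)
n <- 2000
x <- rep(c(0, 5), each = n / 2)      # fixed design with two levels (assumed values)
eps <- rnorm(n, mean = 0, sd = 1)    # same N(0, 1) error distribution at every x
y <- 1 + 3 * x + eps                 # assumed coefficients: intercept 1, slope 3

# KS distance to a normal with matching mean and sd, for pooled errors and pooled y.
d_eps <- unname(ks.test(eps, "pnorm", mean = mean(eps), sd = sd(eps))$statistic)
d_y   <- unname(ks.test(y,   "pnorm", mean = mean(y),   sd = sd(y))$statistic)
print(c(errors = d_eps, responses = d_y))

# Within one level of x the responses are just shifted errors, hence normal again.
d_y_x0 <- unname(ks.test(y[x == 0], "pnorm",
                         mean = mean(y[x == 0]), sd = sd(y[x == 0]))$statistic)
print(d_y_x0)
```

The pooled responses are far from normal, whereas the errors are close to normal both overall and within each level of $x$.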

Sorry, I don't follow. How do you know that "the errors at each value of xi have a normal distribution"? — Iterator516, Mar 23 '24 at 02:54
The errors are normally distributed. This is what the equation above says. Moreover, they are independent of $x_i$. Thus, this holds for every value of $x_i$. Think about the equation 1+1=2, this holds in general. This is independent of the value of $x_i$. As 1+1=2 holds in general, it also holds for every value of $x_i$. — Stan, Mar 23 '24 at 09:09
