Revised Answer
I originally deleted this answer because I didn't feel it was helpful, but I will leave it here in case your discussion with whuber clarifies things (deleting the answer unfortunately wipes out all of its comments, so I have simply marked the portion that isn't helpful). Since I seem to be misconstruing the language used in this setting, I will leave it to better minds to provide a fuller answer.
Old Answer
Both are generally saying the same thing; however, they carry slightly different connotations, which may or may not matter.
- $E(u|x)=0$ says that the error term, conditional on $x$, is expected to average to zero. This is the general statement of the assumption, without reference to any particular observation. The assumption fails if $x$ and $u$ are correlated (a short derivation of why is sketched after this list).
- $E(u_i|x_i)=0$ says essentially the same thing, but the index $i$ refers to each individual observation. This can matter in specific regression contexts, for example once you add time $t$ or cluster $j$ to the linear equation. Your notation may then switch to something like $y_{ij}$, the response of individual $i$ in group $j$.
- Usually $U$ and $X$ mean the same thing here (the vectors $u$ and $x$); they are just not indexed ($x_i$ signifies an individual value of $x$, whereas $X$ refers to the variable as a whole).
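As referenced in the first bullet, here is a minimal sketch of why correlation between $x$ and $u$ rules out the assumption: by the law of iterated expectations, $E(u \mid x)=0$ forces both the mean of $u$ and its covariance with $x$ to be zero.

$$
\begin{aligned}
E(u) &= E\big[E(u \mid x)\big] = E(0) = 0, \\
E(xu) &= E\big[x\,E(u \mid x)\big] = E(x \cdot 0) = 0, \\
\operatorname{Cov}(x, u) &= E(xu) - E(x)\,E(u) = 0 - 0 = 0.
\end{aligned}
$$

So if $\operatorname{Cov}(x,u) \neq 0$, the conditional-mean-zero assumption cannot hold (the converse is not true: zero covariance alone does not imply $E(u \mid x)=0$).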
For your last question:
> On the other hand for the assumption which two error terms have an expected value of $0$, I wonder how this is calculated?
I am not sure what you mean, but I guess you are referring to error terms for both variables. You would not have those: $x$ is treated as fixed and known, whereas $y$ is random, hence the single error term we usually attach to a linear regression equation. This of course assumes that $x$ is measured with perfect precision, but that is a completely different topic.
Edit
Some of my answer may not have been very illustrative, so I will show how this works in practice with an example in R, fitting a regression and inspecting its residuals.
```r
#### Load Library ####
library(tidyverse)
library(broom)

#### Save Data as Tibble ####
cars <- mtcars %>%
  as_tibble()
cars

#### Plot Points ####
cars %>%
  ggplot(aes(x = wt,
             y = mpg)) +
  geom_point()

#### Fit Model ####
fit <- lm(mpg ~ wt, cars)
resid(fit)
```
The residuals here, our estimate of $u$, are as we said not actually zero but fluctuate around zero. This is why we speak of the errors in terms of expectation, both for each $u_i$ individually and for $u$ as a whole vector:
```
         1          2          3          4          5          6          7
-2.2826106 -0.9197704 -2.0859521  1.2973499 -0.2001440 -0.6932545 -3.9053627
         8          9         10         11         12         13         14
 4.1637381  2.3499593  0.2998560 -1.1001440  0.8668731 -0.0502472 -1.8830236
        15         16         17         18         19         20         21
 1.1733496  2.1032876  5.9810744  6.8727113  1.7461954  6.4219792 -2.6110037
        22         23         24         25         26         27         28
-2.9725862 -3.7268663 -3.4623553  2.4643670  0.3564263  0.1520430  1.2010593
        29         30         31         32
-4.5431513 -2.7809399 -3.2053627 -1.0274952
```
However, if you take their mean or sum, you get something very close to zero, provided the model is not horribly misspecified (note that the result below is in scientific notation, hence the e):
```r
> sum(resid(fit))
[1] 2.303713e-15
```
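Since the assumption concerns the conditional mean of $u$ given $x$, and not just the overall sum, a common informal check is to plot the residuals against the predictor and see whether they scatter around zero across the range of $x$. The snippet below is just a sketch of that diagnostic using the `fit` object from above; it is not a formal test of the assumption.

```r
#### Residuals vs. Predictor ####
# If E(u | x) = 0 is plausible, the residuals should scatter around zero
# at every level of wt, with no obvious trend or curvature.
tibble(wt = cars$wt, residual = resid(fit)) %>%
  ggplot(aes(x = wt, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. Weight",
       x = "Weight of Car",
       y = "Residual")
```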
You can also plot the distance of each observed point from the regression line (its fitted value) to visualize the errors and how they sum to roughly zero:
```r
#### Show Residuals ####
cars %>%
  lm(mpg ~ wt, data = .) %>%
  augment() %>%
  ggplot(aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  geom_segment(aes(xend = wt, yend = .fitted)) +
  labs(title = "Fitted Values and Distance from Line",
       x = "Weight of Car",
       y = "Miles per Gallon")
```
In the resulting plot, the black segments above the regression line are the positive residuals and the segments below it are the negative residuals:
