
Suppose an independent variable that is influential in the data-generating process for some dependent variable is omitted from the model. The omitted variable has the following characteristics:

It is NOT correlated with any of the included independent variables

It is correlated with the dependent variable (implied by the fact that it is influential in the data generating process)

What effect do we expect this omission to have on the model? My intuition tells me that it should not significantly impact the coefficients of the included independent variables; instead, the influence that would have been described by its coefficient is captured in the error term.

Is my understanding correct?

Can this be proven?

Does the answer differ if we consider Bayesian regression instead of classical regression?

Patrick C

1 Answer


You are correct. The omitted variable explains part of the dependent variable, so including this "new" variable in your equation makes the unexplained part of your model "smaller". In other words: after including it, the estimate of the standard error of the regression, $\hat{\sigma}$, would be smaller, as there is less unexplained variance left in your data. This has a consequence for the standard errors of the estimated regression coefficients of the "old", already included $X$ variables, because it can be proven that their squared standard errors lie on the diagonal of the matrix $\hat{\sigma}^2(X'X)^{-1}$. So if $\hat{\sigma}^2$ is smaller, those diagonal terms will be smaller, meaning the estimated regression coefficients will more easily be significant.
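
To make this concrete, here is a minimal simulation sketch (a hypothetical data-generating process, plain numpy; the names `x`, `z`, and `ols` are all illustrative). It fits the regression with and without an uncorrelated predictor `z` and compares $\hat{\sigma}^2$ and the standard error of the slope:

```python
# Minimal sketch: adding a predictor z that is (near-)uncorrelated with x
# leaves b1 essentially unchanged but shrinks sigma-hat and hence se(b1).
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)                            # generated independently of x
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

def ols(X, y):
    """OLS fit: returns coefficients, their standard errors, and sigma-hat^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])        # sigma-hat squared
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # diag of s^2 (X'X)^-1
    return beta, se, sigma2

X_small = np.column_stack([np.ones(n), x])        # z omitted
X_full  = np.column_stack([np.ones(n), x, z])     # z included

b_s, se_s, s2_s = ols(X_small, y)
b_f, se_f, s2_f = ols(X_full, y)
print(f"omit z:    b1={b_s[1]:.3f}, se(b1)={se_s[1]:.3f}, sigma^2={s2_s:.2f}")
print(f"include z: b1={b_f[1]:.3f}, se(b1)={se_f[1]:.3f}, sigma^2={s2_f:.2f}")
```

With this setup you should see $b_1$ essentially unchanged, while $\hat{\sigma}^2$ and the standard error of $b_1$ both drop once `z` is included.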

Further, adding an independent variable that is uncorrelated with the already included variables does not change the regression coefficients of the latter. This can be explained intuitively with a graph:

[Figure: scatter plot of Y against X with a black regression line for the pooled data and parallel blue and red regression lines, one for each gender group]

The black line is the estimated regression line when only X is in regression equation (1) below:

$Y = b_0 + b_1X + error_{black}$      (1)

When adding Gender, we have the equation (2):

$Y = b^*_0 + b_1X + b_2Gender + error_{colored}$      (2)

and now there are two predicted regression lines, one for each gender: the blue and the red line, which are parallel, meaning they have the same slope $b_1$. But these two colored lines are also parallel to the black line, which likewise has slope $b_1$! This is because Gender and X are completely uncorrelated, their correlation coefficient being exactly 0. So whether or not we "control for" Gender is not relevant for the slope of the regression line on X: given the value of Gender (controlling for Gender, that is), the slope of the two colored lines is the same as the slope of the uncontrolled black line.

More interesting: the distances from the blue and red dots to the black line are much larger than the distances to the blue and red lines, respectively. That means the sum of squared errors is smaller for the "colored" model (2) than for the "black" model (1).
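
Continuing the same hypothetical sketch, now with a binary Gender variable in place of `z`: the slope on X is essentially the same in models (1) and (2) (exact equality holds only when the sample correlation between Gender and X is exactly 0, as in the graph), while the sum of squared errors is clearly smaller for model (2):

```python
gender = rng.integers(0, 2, size=n)               # 0/1, generated independently of x
y2 = 1.0 + 2.0 * x + 4.0 * gender + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])             # model (1): X only
X2 = np.column_stack([np.ones(n), x, gender])     # model (2): X and Gender
beta1, *_ = ols(X1, y2)
beta2, *_ = ols(X2, y2)
print(f"slope b1 without Gender: {beta1[1]:.3f}")
print(f"slope b1 with Gender:    {beta2[1]:.3f}")             # (near-)identical
print(f"SSE model (1): {np.sum((y2 - X1 @ beta1) ** 2):.1f}")
print(f"SSE model (2): {np.sum((y2 - X2 @ beta2) ** 2):.1f}")  # much smaller
```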

Now look at the following graph:

[Figure: the same scatter plot, but with the red cloud and its regression line shifted seven X-units to the right; the black regression line for the pooled data now has a negative slope, while the two colored lines keep their original slope]

The red cloud and line have been moved seven X-units to the right, with the Y values left unchanged. You can see that for Gender=0 the X values are higher than for Gender=1, meaning Gender and X are correlated, their correlation being -0.77. This negative relation is also visible in the negative slope of the black regression line for the whole cloud. However, the two colored regression lines still have $b_1$ as their slope, like in the earlier graph. So here, adding Gender leads to a very different slope on X!
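
The same sketch can reproduce this second graph: shifting the X values of one gender group seven units to the right (Y unchanged) makes Gender and X strongly negatively correlated, and the slope from the simple regression on X alone now differs sharply from the controlled slope (the exact numbers depend on the simulated data, not on the -0.77 from the figure):

```python
x_shift = x + 7.0 * (gender == 0)                 # move one group's cloud 7 X-units right
Xs1 = np.column_stack([np.ones(n), x_shift])
Xs2 = np.column_stack([np.ones(n), x_shift, gender])
bs1, *_ = ols(Xs1, y2)                            # Y values left unchanged
bs2, *_ = ols(Xs2, y2)
print(f"corr(Gender, X): {np.corrcoef(gender, x_shift)[0, 1]:.2f}")  # strongly negative
print(f"slope omitting Gender:  {bs1[1]:.3f}")    # biased, negative
print(f"slope including Gender: {bs2[1]:.3f}")    # still close to the true 2.0
```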

BenP
  • This is correct for ordinary least squares but not for logistic regression or other models that have noncollapsibility. See this page for an explicit proof in the case of a probit model. For such models, omitting a predictor associated with outcome but uncorrelated with included predictors will lead to coefficient estimates that are biased toward 0 for the included predictors. – EdM Mar 22 '24 at 19:30