I am having a bit of trouble understanding endogeneity.
Suppose the "true" model is:
$y = \beta_1 x + \beta_2 z + \epsilon $
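and we lack data on $z$, so we instead run the short regression, in which the omitted term is absorbed into the error:
$y = \beta_1 x + u, \qquad u = \beta_2 z + \epsilon $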
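For reference, this is the standard omitted-variable result as I understand it. It assumes $\epsilon$ is uncorrelated with both $x$ and $z$, and that the variables are demeaned or an intercept is included. The label $\hat\beta_1^{\text{short}}$ is my own notation for the OLS slope when $y$ is regressed on $x$ alone:

$$\operatorname{plim}\, \hat\beta_1^{\text{short}} = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(x, z)}{\operatorname{Var}(x)}$$

So the bias term vanishes only when $\beta_2 = 0$ or $\operatorname{Cov}(x, z) = 0$.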
What confuses me is that omitting $z$ is only problematic when $x$ and $z$ are correlated. So suppose $x$ is a function of $z$, $x = f(z) + \delta$, where the error term $\delta$ has mean zero and is uncorrelated with everything else, and assume that we do have data on $z$. Then, would running the regression:
$y = \beta_1 x + \beta_2 z + \epsilon $
mean that I no longer have an endogeneity problem, so that the only remaining issue is multicollinearity if $x$ and $z$ are highly correlated? Further, would I just be able to run the two-stage regression:
$\hat x = \gamma_0 + \gamma_1 z $
$y = \beta_1 \hat x + \beta_2 z + \epsilon $
without worrying about it? $\hat x$ and $z$ would be correlated with each other, but that should not be an issue.

The idea I want to explore is that one variable, $z$, affects both the dependent variable $y$ and the independent variable $x$, and I want to capture both effects.

My thinking is that the two-stage approach I outlined above is incorrect and that an IV approach is appropriate here, so I should look for an instrument that is uncorrelated with $x$ but is correlated with $z$ in the first stage. Then, if I know the impact $z$ has on $x$ and the impact $x$ has on $y$, that is one pathway. Further, by instrumenting $x$, I can find the impact $x$ has on $y$ outside of $z$. Is this thinking correct?
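To make the setup concrete, here is a minimal simulation sketch of the three regressions above. Everything numeric in it is my own illustrative assumption, not part of the problem: a linear $f(z) = 0.8z$, true coefficients $\beta_1 = 2$ and $\beta_2 = 3$, and independent standard normal draws for $z$, $\delta$ and $\epsilon$. The helper `ols` is a hypothetical convenience wrapper around `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical true parameters (assumptions for illustration only).
beta1, beta2 = 2.0, 3.0
slope_xz = 0.8  # assumes a linear f(z) = 0.8 * z

z = rng.normal(size=n)
delta = rng.normal(size=n)   # mean zero, independent of z and eps
eps = rng.normal(size=n)     # mean zero, independent of x and z
x = slope_xz * z + delta
y = beta1 * x + beta2 * z + eps


def ols(target, *regressors):
    """Least-squares coefficients, intercept first, then one per regressor."""
    design = np.column_stack([np.ones(len(target)), *regressors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Short regression: z omitted, so the slope on x absorbs part of z's effect.
short_fit = ols(y, x)

# Long regression: z observed and included as a control.
long_fit = ols(y, x, z)

# Two-stage version from the question: x_hat is fitted from z alone,
# so it is an exact linear function of z.
gamma = ols(x, z)
x_hat = gamma[0] + gamma[1] * z
second_stage_design = np.column_stack([np.ones(n), x_hat, z])
rank = np.linalg.matrix_rank(second_stage_design)

# Theoretical short-regression slope: beta1 + beta2 * Cov(x, z) / Var(x).
expected_short = beta1 + beta2 * slope_xz / (slope_xz**2 + 1.0)

print("short regression slope on x:", short_fit[1])
print("omitted-variable formula   :", expected_short)
print("long regression (x, z)     :", long_fit[1], long_fit[2])
print("second-stage design rank   :", rank, "of", second_stage_design.shape[1])
```

At least in this sketch, the long regression recovers both coefficients, while the short regression slope lines up with the omitted-variable formula. The second-stage design matrix has rank 2 rather than 3, because $\hat x$ is built from $z$ alone and is therefore perfectly collinear with $z$ and the intercept. That makes me suspect the two-stage version as written cannot separate $\beta_1$ from $\beta_2$.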