I wish to run a version of the following regression: $$ w_{i}=\beta_{1}HS_{i}+\beta_{2}College_{i}+\epsilon_{i} $$ In the above, $HS_{i}$ is an indicator variable that takes the value $1$ if the individual is high school educated and $0$ if college educated. $College_{i}$ takes the value $1$ if the individual is college educated and $0$ otherwise. $w_{i}$ represents wages. An OLS regression produces $\hat{\beta}_{1}$ and $\hat{\beta}_{2}$ as the conditional mean wages of high school educated and college educated individuals, respectively.

However, imagine that when an individual is college educated, both $HS_{i}$ and $College_{i}$ take the value $1$. After all, an individual who is college educated is necessarily high school educated as well. In that sense, if $College_{i}=1$, then $HS_{i}$ must be $1$ too. Is this model conceptually identical to the first one above?

EDIT: I simulated some data, and I found the two regressions to be equivalent in fit. In particular, in the first case (mutually exclusive indicators), we have that: $$ \hat{\beta}_{1}=\mathbb{E}\left(w_{i}|HS_{i}=1\right) $$ and $$ \hat{\beta}_{2}=\mathbb{E}\left(w_{i}|College_{i}=1\right) $$ In the second case (both indicators equal to 1 for the college educated), we still have that $$ \hat{\beta}_{1}=\mathbb{E}\left(w_{i}|HS_{i}=1,College_{i}=0\right) $$ but $$ \hat{\beta}_{2}=\mathbb{E}\left(w_{i}|HS_{i}=1,College_{i}=1\right)-\mathbb{E}\left(w_{i}|HS_{i}=1,College_{i}=0\right) $$ or $$ \hat{\beta}_{2}=\mathbb{E}\left(w_{i}|HS_{i}=1,College_{i}=1\right)-\hat{\beta}_{1} $$
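A simulation along these lines can be sketched as follows. This is not the original simulation: the wage process (a mean of 10 for high school, 15 for college, unit-variance noise), the sample size, and all variable names are assumptions made for illustration. It fits both codings by least squares with NumPy and compares the residual sum of squares and the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Assumed data-generating process: 1 = college educated, 0 = high school only.
college = rng.integers(0, 2, size=n).astype(float)
w = 10.0 + 5.0 * college + rng.normal(0.0, 1.0, size=n)

# Coding A: mutually exclusive indicators (HS = 1 - College), no intercept.
X_a = np.column_stack([1.0 - college, college])

# Coding B: nested indicators (HS = 1 for everyone), no separate intercept.
X_b = np.column_stack([np.ones(n), college])


def ols(X, y):
    """Return least-squares coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss


beta_a, rss_a = ols(X_a, w)
beta_b, rss_b = ols(X_b, w)

print("coding A:", beta_a, "RSS:", rss_a)
print("coding B:", beta_b, "RSS:", rss_b)
```

In this sketch the two RSS values agree up to floating-point error, the first coefficient is the same in both codings, and the second coefficient under coding B equals the difference of the two coefficients under coding A.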

However, what boggles the mind is how the effect of college is being identified. The variable college always takes on a value of 1 (there is no variation in it!). Any guidance is much appreciated.
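One way to look at the identification question, sketched under the assumption that every individual in the sample is either high school only or college educated. The symbols $X$, $\tilde{X}$, $A$ and $\gamma$ are notation introduced here, not part of the original model: $X$ is the design matrix under the mutually exclusive coding, and $\tilde{X}$ is the one under the nested coding, with coefficients $\gamma$. The nested columns are linear combinations of the exclusive ones: $$ \widetilde{HS}_{i}=HS_{i}+College_{i}=1,\qquad\widetilde{College}_{i}=College_{i} $$ so that $$ \tilde{X}=XA,\qquad A=\begin{pmatrix}1 & 0\\ 1 & 1\end{pmatrix},\qquad A^{-1}=\begin{pmatrix}1 & 0\\ -1 & 1\end{pmatrix} $$ Because $A$ is invertible, $X$ and $\tilde{X}$ span the same column space, so the fitted values and the RSS coincide, and $$ \hat{\gamma}=\left(A'X'XA\right)^{-1}A'X'w=A^{-1}\hat{\beta}\quad\Longrightarrow\quad\hat{\gamma}_{1}=\hat{\beta}_{1},\qquad\hat{\gamma}_{2}=\hat{\beta}_{2}-\hat{\beta}_{1} $$ On this reading, it is $\widetilde{HS}$ that has no variation and plays the role of the intercept, while $\widetilde{College}$ still varies across individuals, which is what identifies its coefficient as a difference in means.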

  • I'm not sure it does. I am looking for a mathematical explanation of the equivalence (or lack thereof) of the two models. I simulated some data and found that the model fit (RSS, ESS, etc.) was independent of the coding type, but the coefficient on high school, for instance, differs depending on how I code it. – Kwame Brown Oct 06 '22 at 11:13
  • If I understand correctly, in the second approach, one of the two indicators simply is the constant term. – Christoph Hanck Oct 06 '22 at 12:14
  • @ChristophHanck That was a tremendous oversight on my part! Thank you so much. However, is the problem a more general one? In other words, imagine that not every student is college or high school educated (and that this is captured in the constant), but whenever college=1, HS is necessarily 1. Is that model then equivalent to one which codes college=1 and HS=0? – Kwame Brown Oct 06 '22 at 12:17
  • If you have more than two categories, then you should indeed have a constant and two dummies, or three dummies and no constant. – Christoph Hanck Oct 06 '22 at 13:14
  • @ChristophHanck Understood. The question, however, is whether a model in which college=1 and HS=1 when a student is college educated is fundamentally different, in terms of estimation, from one in which college=1 and HS=0. – Kwame Brown Oct 06 '22 at 13:27
  • There's no difference at all. In fact, provided you code the two categories with distinct numerical values--use $-\pi$ and $\exp(1)$ if you wish--you will still get the same estimates. – whuber Oct 06 '22 at 14:19
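The three-category case raised in the comments can be checked numerically in the same spirit. The sketch below is only an illustration under assumed values: the group means (8, 10, 15), the sample size, and the variable names are hypothetical. It fits a constant plus two dummies under both the mutually exclusive coding and the nested coding (HS=1 whenever college=1) and compares fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1500

# Assumed categories: 0 = neither, 1 = high school only, 2 = college.
edu = rng.integers(0, 3, size=n)
w = np.array([8.0, 10.0, 15.0])[edu] + rng.normal(0.0, 1.0, size=n)

const = np.ones(n)
hs_only = (edu == 1).astype(float)   # exclusive coding: HS = 0 for college
hs_any = (edu >= 1).astype(float)    # nested coding: HS = 1 for college too
college = (edu == 2).astype(float)

X_excl = np.column_stack([const, hs_only, college])
X_nest = np.column_stack([const, hs_any, college])

b_excl, *_ = np.linalg.lstsq(X_excl, w, rcond=None)
b_nest, *_ = np.linalg.lstsq(X_nest, w, rcond=None)

fitted_excl = X_excl @ b_excl
fitted_nest = X_nest @ b_nest

print("exclusive coding:", b_excl)
print("nested coding:   ", b_nest)
print("max fitted-value gap:", np.max(np.abs(fitted_excl - fitted_nest)))
```

Here the fitted values agree up to floating-point error. Only the interpretation of the college coefficient changes: under the exclusive coding it is the gap to the omitted group, under the nested coding it is the gap to the high school group.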
