Suppose that we are interested in the following model: $$y_i=\beta_1+\beta_2x_{i2}+\beta_3x_{i3}+u_i$$
Suppose also that we observe a dummy variable $d_i$.
I am wondering whether the following estimators are equivalent:
[OLS using only the observations with $d_i=1$] versus [OLS of $d_iy_i$ on $d_i\cdot 1,\;d_ix_{i2},\;d_ix_{i3}$]
That is, using the subset versus using the dummy-interacted variables.
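A quick sketch of why the two should coincide (my own reasoning, writing $x_i=(1,x_{i2},x_{i3})'$ and using $d_i^2=d_i$): the interacted regression's normal equations reduce to those of the subset regression,

$$\hat\beta=\Big(\sum_i (d_ix_i)(d_ix_i)'\Big)^{-1}\sum_i (d_ix_i)(d_iy_i)=\Big(\sum_{i:\,d_i=1}x_ix_i'\Big)^{-1}\sum_{i:\,d_i=1}x_iy_i,$$

provided the intercept is interacted as well. If a plain intercept (a column of ones over all observations) is used instead, the $d_i=0$ rows enter the normal equations and the equivalence breaks.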
When I run both, the former yields a consistent estimator but the latter does not (even though the dummy is randomly assigned).
The example R code is as follows:
x2 <- rnorm(100000, 2, 1)
x3 <- rnorm(100000, 1.5, 1)
x4 <- rbinom(100000, 1, 0.5)
y <- 1+2*x2+2*x3+rnorm(100000)
dt <- data.frame(y = y, x2 = x2, x3 = x3, x4 = x4)
est <- lm(y~x2+x3, data= dt, subset = (x4 == 1))
summary(est)
nobs(est)
dt4 <- data.frame(y = y*x4, x2 = x2*x4, x3 = x3*x4, x4 = x4)
est4 <- lm(y~x2+x3, data= dt4)
summary(est4)
nobs(est4)
Is there a way to do the same estimation without throwing away some of the data?
Why do they give different results, and why does the latter yield worse results despite the large number of observations?
If you check what dt4 actually contains, you will find a number of rows filled with zeros. That is what is causing the difference. – Richard Hardy May 13 '22 at 14:00
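For what it's worth, the equivalence does hold once the intercept is interacted too and the ordinary intercept is dropped, so no observations need to be thrown away. A minimal sketch (variable names are my own; `0 +` in the formula removes R's default intercept):

```r
set.seed(1)
n  <- 100000
x2 <- rnorm(n, 2, 1)
x3 <- rnorm(n, 1.5, 1)
d  <- rbinom(n, 1, 0.5)           # randomly assigned dummy
y  <- 1 + 2*x2 + 2*x3 + rnorm(n)

# Estimator 1: OLS on the d == 1 subset only
est_sub <- lm(y ~ x2 + x3, subset = (d == 1))

# Estimator 2: fully interacted regression, intercept interacted as well.
# The d == 0 rows are all-zero and contribute nothing to the normal equations.
est_int <- lm(I(d*y) ~ 0 + d + I(d*x2) + I(d*x3))

# Coefficients agree to numerical precision (d plays the role of the intercept):
max(abs(unname(coef(est_sub)) - unname(coef(est_int))))
```

The point estimates are identical; only the reported degrees of freedom (and hence standard errors) differ, because the interacted fit counts the all-zero rows as observations.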