
Suppose that we are interested in the following model: $$y_i=\beta_1+\beta_2x_{i2}+\beta_3x_{i3}+u_i$$

Suppose also that there is a dummy variable $d_i$.

I am wondering whether the following estimators are equivalent:

[OLS using only the observations with $d_i=1$] versus [OLS of $d_iy_i$ on $d_i,\;d_ix_{i2},\;d_ix_{i3}$]

That is, using the subset versus using the dummy-interacted variables.

When I run both cases, the former yields a consistent estimator but the latter does not.

(even though the values of the dummy are randomly assigned)

The example R code is as follows:

x2 <- rnorm(100000, 2, 1)
x3 <- rnorm(100000, 1.5, 1)
x4 <- rbinom(100000, 1, 0.5)   # randomly assigned dummy
y <- 1 + 2*x2 + 2*x3 + rnorm(100000)
dt <- data.frame(y = y, x2 = x2, x3 = x3, x4 = x4)

# (1) OLS using only the observations with x4 == 1
est <- lm(y ~ x2 + x3, data = dt, subset = (x4 == 1))
summary(est)
nobs(est)

# (2) OLS on the dummy-interacted variables
dt4 <- data.frame(y = y*x4, x2 = x2*x4, x3 = x3*x4, x4 = x4)
est4 <- lm(y ~ x2 + x3, data = dt4)
summary(est4)
nobs(est4)

Is there a way to do the same estimation without throwing away some of the data?

Why do they have different results?

Why does the latter yield worse results despite the large number of observations?

M.C. Park
    Please provide a reproducible example so that we can see what's going on. Please do that via the code {} tool on the toolbar. – EdM May 13 '22 at 13:49
  • Sorry about the omitted part. I added the code! – M.C. Park May 13 '22 at 13:57
  • Why would you expect the two approaches to return similar results? If you look at what dt4 actually contains, you will find a number of rows filled with zeros. That is what is causing the difference. – Richard Hardy May 13 '22 at 14:00
  • @RichardHardy The reason is that in a textbook, I found a sentence that seems to imply the equivalence of the two estimations. But, I also think that there is no specific reason for that. – M.C. Park May 13 '22 at 15:32
  • I see. The textbook may contain a poorly formulated statement, or you might have misread it. – Richard Hardy May 13 '22 at 15:40
  • There are many similar questions here; see the links at https://stats.stackexchange.com/questions/574854/separating-datasets-vs-one-dataset-with-extra-categorical-feature#comment1061323_574854 – kjetil b halvorsen May 13 '22 at 15:47

2 Answers


The way that you structured dt4, you effectively included only the interaction (product) terms between the binary x4 and the original x2 and x3 predictors, while omitting the "main effects." That's generally poor practice, except in very limited circumstances. See this page for extensive discussion. If you structure that regression properly, e.g.:

est2 <- lm(y~(x2+x3)*x4, data= dt)

then all will make sense.
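
For reference, the fit on the x4 == 1 subset can then be read off from est2 without discarding any rows: each coefficient for that group is the sum of the corresponding main effect and interaction term. A minimal check, assuming dt is built as in the question (with x4 included):

b <- coef(est2)
# coefficients for the x4 == 1 group: main effect + interaction term
b["(Intercept)"] + b["x4"]   # intercept
b["x2"] + b["x2:x4"]         # slope on x2
b["x3"] + b["x3:x4"]         # slope on x3
# these equal coef(lm(y ~ x2 + x3, data = dt, subset = x4 == 1))

The point estimates match the subset regression exactly; the standard errors differ slightly because the interaction model pools the residual variance across both groups.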

EdM

The answer: the two regressions should be the same because $$\left[\sum_{i:d_i=1} x_ix_i'\right]^{-1}\left[\sum_{i:d_i=1}x_iy_i\right],$$ the OLS estimator using only the subsample with $d_i=1$, is equivalent to $$\left[\sum_{i=1}^N d_ix_ix_i'\right]^{-1}\left[\sum_{i=1}^N d_ix_iy_i\right],$$ the OLS estimator on the dummy-interacted variables (using $d_i^2=d_i$). The reason for the different results in the R code above is that lm does not interact the dummy with the intercept term: it always adds its own column of ones rather than using $d_i$ as the interacted intercept regressor.

See the following code; the two computations should give the same result.

one <- rep(1, 1000)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
d <- sample(c(1,0), 1000,replace = T)
u <- rnorm(1000)
y <- one+2*x1+x2+u

Using the subsample:

dt <- data.frame(y = y, x1 = x1, x2 = x2, d = d)
est1 <- lm(y ~ x1 + x2, data = dt, subset = (d == 1))

Using the interacted variables:

X <- d*cbind(one,x1,x2)

Results:

est1$coefficients
solve(t(X)%*%X)%*%(t(X)%*%y)

> est1$coefficients
(Intercept)          x1          x2 
  1.0077684   2.0358979   0.9949592 
> solve(t(X)%*%X)%*%(t(X)%*%y)
         [,1]
one 0.9547948
x1  2.0125214
x2  0.9677016
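
The same fix can also be made inside lm by dropping its automatic intercept and letting d serve as the interacted intercept column. A minimal sketch using the variables defined above (est_int is just an illustrative name):

# -1 removes lm's automatic intercept; d plays the role of the interacted
# intercept, and I(d*x1), I(d*x2) are the interacted slopes
est_int <- lm(y ~ -1 + d + I(d*x1) + I(d*x2), data = dt)
coef(est_int)   # should reproduce est1$coefficients and the matrix result above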

M.C. Park