Regression coefficients do not match conditional means

Question

In a nutshell, I want the regression coefficients of a model to match several differences in conditional means.

You can download the data from this repo.

I have a data set that has a dependent variable (Y) and three binary columns (T, X1 and X2).

	Y	CONST	T	X1	X1T	X2	X2T
0	2.31252	1	1	0	0	1	1
1	-0.836074	1	1	1	1	1	1
2	-0.797183	1	0	0	0	1	0

I want to calculate the difference in the mean of Y for observations with T == 1 and those with T == 0 for each of the four possible combinations of X1 and X2:

Case 1: mean difference given X1 == 0 and X2 == 0
Case 2: mean difference given X1 == 0 and X2 == 1
Case 3: mean difference given X1 == 1 and X2 == 0
Case 4: mean difference given X1 == 1 and X2 == 1
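
The four differences can be computed by hand along these lines. This is only a sketch: the data frame `d` is simulated as a stand-in for the question's file (which is not reproduced here), and it only borrows the column names Y, T, X1 and X2.

```r
# Simulated stand-in for the question's data (hypothetical, not the real file)
set.seed(42)
n <- 400
d <- data.frame(
  T  = rbinom(n, 1, 0.5),
  X1 = rbinom(n, 1, 0.5),
  X2 = rbinom(n, 1, 0.5)
)
d$Y <- rnorm(n)

# Mean of Y in each of the eight (T, X1, X2) cells
cell_means <- aggregate(Y ~ T + X1 + X2, data = d, FUN = mean)

# Put treated and control means side by side, one row per (X1, X2) combination
wide <- merge(
  cell_means[cell_means$T == 1, c("X1", "X2", "Y")],
  cell_means[cell_means$T == 0, c("X1", "X2", "Y")],
  by = c("X1", "X2"), suffixes = c("_treated", "_control")
)

# Difference in conditional means: rows are sorted by X1, then X2
wide$diff <- wide$Y_treated - wide$Y_control
wide
```

Because `merge()` sorts on X1 and then X2, the four rows of `wide` line up with the four combinations in the order listed above.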

I did this manually, but I cannot get the following model to match my results: $$Y = \beta_0 + \beta_1 T + \beta_2 X_1 + \beta_3 X_1 T + \beta_4 X_2 + \beta_5 X_2 T + U$$

As per this post:

$\hat{\beta}_1$ should match case 1
$\hat{\beta}_1 + \hat{\beta}_5$ should match case 2
$\hat{\beta}_1 + \hat{\beta}_3$ should match case 3
$\hat{\beta}_1 + \hat{\beta}_3 + \hat{\beta}_5$ should match case 4

As can be seen in this Jupyter notebook, I cannot get the two methods to match.

How come the linear regression results do not match the differences in conditional means?

Hi: I don't use Python but, going by R, my guess would be that the default contrasts in the output are not the ones you are looking for. There are various types of contrasts (Helmert, etc.). Maybe they can be specified in your regression call, as they can in R. — mlofton, Apr 30 '22 at 10:26
Hi! I am not familiar with this. Could you elaborate? I'm comfortable with R, so either language is fine. — Arturo Sbr, Apr 30 '22 at 12:22
Do you have John Fox's text? I think it has the words "applied regression" in the title. It explains contrasts quite nicely. I can't do it justice, but the idea is that there are different kinds of contrasts, so the one R outputs might not be the one you are interested in. I don't know what the bibles are these days as far as applied statistics and regression go, but I highly recommend John's text. I'll send you a link in a follow-up comment. — mlofton, May 01 '22 at 03:40
It looks like it's been updated since the last time I looked at it, which was close to ten years ago. I remember using it both for understanding contrasts and for understanding generalized linear models. https://www.amazon.com/Applied-Regression-Analysis-Generalized-Linear/dp/1452205663/ref=sr_1_2?crid=3HQV7FNJR6S27&keywords=john+fox+applied+regression&qid=1651376474&sprefix=john+fox+applied+regression%2Caps%2C77&sr=8-2 — mlofton, May 01 '22 at 03:43

dipetkov · Accepted Answer · 2022-05-01T07:07:26.913

It's not straightforward to estimate effects by hand. The math works out nicely only in "simple" cases.

One simple case is a saturated model. A saturated model includes all main effects and possible interactions, so it has a parameter for each unique combination of the predictors.

Your model is not saturated; it's missing a three-way interaction T * X1 * X2 as well as a two-way interaction between the covariates X1 * X2. By omitting interactions you impose (implicit) constraints on the remaining parameters, so the estimate of $\beta_1$ is a function of not only the sample means for groups {T=1, X1=0, X2=0} and {T=0, X1=0, X2=0} but all group means.
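
A small simulation can illustrate that dependence on all group means. This is a sketch on made-up data (the variable names `t`, `x1`, `x2`, `y` and the shift of 5 are assumptions chosen for illustration): it changes Y only in the cell {T=1, X1=1, X2=1} and checks what happens to the estimate of $\beta_1$ in the six-parameter model.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 400
d <- data.frame(t = rbinom(n, 1, 0.5), x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.5))
d$y <- rnorm(n)

# Case 1 by hand: treated minus control mean in the cell x1 = 0, x2 = 0
base <- d$x1 == 0 & d$x2 == 0
diff_case1 <- mean(d$y[base & d$t == 1]) - mean(d$y[base & d$t == 0])

# The question's six-parameter model (no x1:x2 and no t:x1:x2 terms)
f <- y ~ t + x1 + t:x1 + x2 + t:x2
beta1 <- unname(coef(lm(f, data = d))["t"])

# Shift y only in a different cell (t = 1, x1 = 1, x2 = 1) and refit
d2 <- d
far <- d2$t == 1 & d2$x1 == 1 & d2$x2 == 1
d2$y[far] <- d2$y[far] + 5
beta1_shifted <- unname(coef(lm(f, data = d2))["t"])

# The by-hand difference for case 1 is untouched by the shift, but beta1 moves
c(by_hand = diff_case1, beta1 = beta1, beta1_after_shift = beta1_shifted)
```

The observations in the cells {T=1, X1=0, X2=0} and {T=0, X1=0, X2=0} are identical in `d` and `d2`, so any change in the coefficient on `t` comes from the other cells.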

This is easier to understand in a model with one covariate in addition to the treatment:

$$ Y = \beta_0 + \beta_1 T + \beta_2 X $$

In this model $\beta_1$ is the treatment effect for both group X = 0 and group X = 1 as omitting the T * X interaction imposes the constraint that the two groups have the same treatment effect. To estimate $\beta_1$ the regression takes into account all observations, not only those with X = 0. Your model is more complex, so the implicit constraints are harder to conceptualize. (For example, combine cases 1&4 and cases 2&3 to show that the average treatment effect is the same when X1 = X2 and when X1 ≠ X2.)
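
The constraint in the parenthesis can be written out explicitly. As a sketch, write $\tau(x_1, x_2)$ (notation introduced here, not used in the question) for the treatment effect that the six-parameter model implies when $X_1 = x_1$ and $X_2 = x_2$:

$$\tau(x_1, x_2) = \beta_1 + \beta_3 x_1 + \beta_5 x_2$$

Adding cases 1 and 4, and cases 2 and 3, gives the same sum:

$$\tau(0,0) + \tau(1,1) = 2\beta_1 + \beta_3 + \beta_5 = \tau(0,1) + \tau(1,0)$$

so the model forces

$$\tau(1,1) - \tau(1,0) - \tau(0,1) + \tau(0,0) = 0$$

The four differences in sample means generally do not satisfy this equality, so no choice of the six coefficients can reproduce all four of them at once.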

My original answer considered a different special case altogether. I include it for completeness.

The sample means for the treatment and control groups (t = 1 and t = 0, respectively) are the same as the marginal means (the population means) from a linear regression adjusted for covariates only if the data is balanced. A balanced dataset has the same number of observations for each unique combination of the predictors.

set.seed(1234)
n <- 50
y <- rnorm(2 * n)
# balanced data
t <- rep(c(0, 1), each = n)
x <- rep(c(0, 1), times = n)
table(t, x)
#>    x
#> t    0  1
#>   0 25 25
#>   1 25 25
coef(lm(y ~ t + x))["t"]
#>         t 
#> 0.5925825
mean(y[t == 1]) - mean(y[t == 0])
#> [1] 0.5925825
# imbalanced data
t <- sample(c(0, 1), 2 * n, replace = TRUE)
x <- sample(c(0, 1), 2 * n, replace = TRUE)
table(t, x)
#>    x
#> t    0  1
#>   0 22 33
#>   1 26 19
coef(lm(y ~ t + x))["t"]
#>         t 
#> 0.1194514
mean(y[t == 1]) - mean(y[t == 0])
#> [1] 0.1402377
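
For the imbalanced case, the coefficient on `t` can still be recovered by hand, just not as a raw difference in means. The sketch below relies on a standard least-squares identity for a binary treatment and a single binary covariate: the coefficient is a weighted average of the within-`x` differences in means, with weights $n_x p_x (1 - p_x)$, where $n_x$ is the number of observations with that value of `x` and $p_x$ is the share of them that are treated. The helper names `diff_x`, `w` and `weighted_avg` are introduced here for illustration.

```r
set.seed(1234)
n <- 50
y <- rnorm(2 * n)
t <- sample(c(0, 1), 2 * n, replace = TRUE)
x <- sample(c(0, 1), 2 * n, replace = TRUE)

# Difference in means (treated minus control) within each level of x
diff_x <- sapply(c(0, 1), function(k) {
  mean(y[t == 1 & x == k]) - mean(y[t == 0 & x == k])
})

# Weight for each level of x: n_x * p_x * (1 - p_x),
# where p_x is the treated share among observations with x == k
w <- sapply(c(0, 1), function(k) {
  p <- mean(t[x == k])
  sum(x == k) * p * (1 - p)
})

weighted_avg <- sum(w * diff_x) / sum(w)
beta_t <- unname(coef(lm(y ~ t + x))["t"])

# Compare the hand-computed weighted average with the regression coefficient
all.equal(weighted_avg, beta_t)
```

With balanced data every stratum gets the same weight and the same treated share, which is why the simple difference in means works there.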

I had not considered this! Have an upvote! However, my problem was due to the omission of two additional parameters (corresponding to two additional interactions of the independent variables) in the regression model. — Arturo Sbr, Apr 30 '22 at 21:36
Yes, my first answer was wrong or at least fixated on another special case. — dipetkov, Apr 30 '22 at 22:06

score 2 · Answer 2 · answered Apr 30 '22 at 23:32

This is related to an answer to this question: Why is the intercept in multiple regression changing when including/excluding regressors?

In this image you see how a fitted curve does not have to correspond to the actual conditional means. While the data points are spread around 30 for $x=0$, the intercept from the model does not equal 30.

The differences arise from random variation and from bias in the fitted model.

score 1 · Answer 3 · answered Apr 30 '22 at 21:34

The first comment from this post helped me understand my problem.

There are eight conditional means, from which I calculated four differences.

My regression model had six parameters. In order for the regression coefficients to match the differences in conditional means, I had to add two additional interaction terms: $X_1 \times X_2$ and $X_1 \times X_2 \times T$.

That is: $$Y = \beta_0 + \beta_1 T + \beta_2 X_1 + \beta_3 X_1 T + \beta_4 X_2 + \beta_5 X_2 T + \beta_6 X_1 X_2 + \beta_7 X_1 X_2 T + \varepsilon$$
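
A minimal check of this on simulated data (hypothetical data and variable names, not the data from the question) could look as follows. Note that in the eight-parameter model, case 4 also picks up the three-way coefficient, so it corresponds to $\hat{\beta}_1 + \hat{\beta}_3 + \hat{\beta}_5 + \hat{\beta}_7$.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(7)
n <- 400
d <- data.frame(t = rbinom(n, 1, 0.5), x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.5))
d$y <- rnorm(n)

# t * x1 * x2 expands to all main effects and interactions (8 parameters)
b <- coef(lm(y ~ t * x1 * x2, data = d))

# Treatment effect implied by the coefficients for a given (x1, x2)
effect <- function(x1, x2) {
  unname(b["t"] + b["t:x1"] * x1 + b["t:x2"] * x2 + b["t:x1:x2"] * x1 * x2)
}

# The same quantity as a difference in conditional means
by_hand <- function(x1, x2) {
  cell <- d$x1 == x1 & d$x2 == x2
  mean(d$y[cell & d$t == 1]) - mean(d$y[cell & d$t == 0])
}

# Rows in the order (0,0), (0,1), (1,0), (1,1)
cases <- expand.grid(x2 = 0:1, x1 = 0:1)[, c("x1", "x2")]
cases$regression <- mapply(effect, cases$x1, cases$x2)
cases$by_hand <- mapply(by_hand, cases$x1, cases$x2)
cases
```

The `regression` and `by_hand` columns should agree up to floating-point error for all four rows.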
