
One way of presenting the intuition behind using controls in a regression to estimate the effect of a treatment on an outcome is in terms of stratification: when one controls for Z, one is looking at the relationship between X and Y among observations with similar levels of Z, as here:

The easiest, and one you came up with, is to stratify your data so you have sub-groups with similar characteristics - there are then methods to pool those results together to get a single "answer". This works if you have a very small number of variables you want to control for, but as you've rightly discovered, this rapidly falls apart as you split your data into smaller and smaller chunks... A more common approach is to include the variables you want to control for in a regression model... The estimate you will get... will be the effect of Impatience within levels of the other covariates - regression allows you to essentially smooth over places where you don't have much data (the problem with the stratification approach), though this should be done with caution.

I've been struggling to devise an example in R, using simulated data, that shows this equivalence between stratification and regression: that is, that you get the same answer either by controlling for Z, or by fitting a model between X and Y separately among Z = 0 and among Z = 1 (an added problem: this yields two estimates; how should they be combined?).

Thanks in advance.

FlacoT
  • They're not exactly equivalent. See, inter alia, https://stats.stackexchange.com/questions/13112. But when $Z$ truly is binary (and not just a grouping of an original variable in terms of "similar levels"), then you simply include an interaction term of $Z$ with all the other explanatory variables: you can find very many examples of that here on CV! Your last question is puzzling: what exactly would you be combining? – whuber Jan 29 '24 at 21:05
  • @whuber, they mean: how would you combine the beta-hat for X given Z=0 and the beta-hat for X given Z=1 to get the beta-hat for X you would get if you controlled for Z as a covariate. – gung - Reinstate Monica Jan 29 '24 at 21:08

1 Answer


I will look at something a bit more general: $Y$ is continuous, $X$ is binary, and $Z$ has $K$ levels. (update: I don't actually use the assumption that $X$ is binary, but (a) I didn't know I wouldn't and (b) it avoids questions about correct model specification)

Start with what I will call the adjusted model $$E[Y]=\beta X+\sum_{k=1}^K \gamma_k I(Z=k)$$

By a standard result for linear models, the Frisch-Waugh-Lovell theorem, you can fit this model in two steps. First, regress $Y$ on $Z$ and $X$ on $Z$, then regress the residuals from the $Y$ model on the residuals from the $X$ model.

The residuals $r_Y$ from the $Y$ model are $Y$ centred within each level of $Z$; i.e., $Y$ minus the mean of $Y$ for the corresponding level of $Z$. The residuals $r_X$ from the $X$ model are likewise $X$ centred within each level of $Z$. Econometricians would call the regression of $r_Y$ on $r_X$ the within estimator, since it only looks at the relationship of $X$ and $Y$ within levels of $Z$.

We have an explicit form for the within estimator of $\hat\beta$: $$\hat\beta = \frac{\mathrm{cov}(r_Y,r_X) }{\mathrm{var}(r_X)}$$
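Here is a quick numerical check of these identities in R with simulated data (a minimal sketch: the data-generating mechanism, the choice of $K=5$ levels, and the variable names are just illustrative choices of mine):

set.seed(42)
n <- 500
Z <- sample(1:5, n, replace = TRUE)     # stratifying variable with K = 5 levels
X <- rbinom(n, 1, 0.2 + 0.1 * Z)        # treatment depends on Z, so Z is a confounder
Y <- 2 * X + 0.5 * Z + rnorm(n)         # true beta = 2

# the adjusted model
beta_adj <- coef(lm(Y ~ X + factor(Z)))["X"]

# Frisch-Waugh-Lovell: residualise Y and X on Z, then regress residual on residual
rY <- resid(lm(Y ~ factor(Z)))
rX <- resid(lm(X ~ factor(Z)))
beta_fwl <- coef(lm(rY ~ rX))["rX"]

# the explicit within estimator
beta_within <- cov(rY, rX) / var(rX)

c(beta_adj, beta_fwl, beta_within)      # all three are identical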

Next, consider the stratified models $E[Y]=\alpha_k+\beta_k X$, fitted separately within each stratum. If we center $Y$ and $X$ within the strata we will change $\alpha_k$ (to zero) but will not change $\beta_k$. Writing $\mathrm{cov}_k$ and $\mathrm{var}_k$ for the covariance and variance computed within stratum $k$, we have $$\hat\beta_k=\frac{\mathrm{cov}_k(r_Y,r_X)}{\mathrm{var}_k(r_X)}$$

Finally, combining the $\hat\beta_k$. We want a precision-weighted average; that is, we want to weight each stratum-specific $\hat\beta_k$ by the reciprocal of its variance. Here is the only place where we cheat: the variance of $\hat\beta_k$ is actually $\sigma^2_k/((n_k-1)\mathrm{var}_k(r_X))$, where $\sigma^2_k$ is the residual variance and $n_k$ the size of stratum $k$. We will instead use variances proportional to $1/((n_k-1)\mathrm{var}_k(r_X))$, which is equivalent to assuming that the residual variance is the same in each stratum.

Write $w_k=(n_k-1)\mathrm{var}_k(r_X)$ for the precision weights. Since $\hat\beta_k=\mathrm{cov}_k(r_Y,r_X)/\mathrm{var}_k(r_X)$, in the numerator we have $$\sum_k w_k\hat\beta_k = \sum_k (n_k-1)\mathrm{cov}_k(r_Y,r_X)$$ and in the denominator $$\sum_k w_k = \sum_k (n_k-1)\mathrm{var}_k(r_X)$$ Because the residuals have mean zero within every stratum, these sums are exactly $(n-1)\mathrm{cov}(r_Y,r_X)$ and $(n-1)\mathrm{var}(r_X)$, so, lo and behold, $$\frac{\sum_k w_k\hat\beta_k}{\sum_k w_k}=\frac{\sum_k (n_k-1)\mathrm{cov}_k(r_Y,r_X)}{\sum_k (n_k-1)\mathrm{var}_k(r_X)}=\frac{\mathrm{cov}(r_Y,r_X)}{\mathrm{var}(r_X)}$$ The adjusted $\hat\beta$ is the same as the within-stratum $\hat\beta$, which is the same as the precision-weighted stratified $\hat\beta$.
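And here is the stratified half of the argument on the same simulated data (a sketch again; the simulation is repeated so the snippet stands alone):

set.seed(42)
n <- 500
Z <- sample(1:5, n, replace = TRUE)
X <- rbinom(n, 1, 0.2 + 0.1 * Z)
Y <- 2 * X + 0.5 * Z + rnorm(n)

# stratum-specific slopes beta_k
beta_k <- sapply(1:5, function(k) coef(lm(Y ~ X, subset = Z == k))["X"])

# precision weights w_k = (n_k - 1) * var_k(r_X); centring X within a stratum
# doesn't change its variance, so var(X[Z == k]) is var_k(r_X)
w_k <- sapply(1:5, function(k) (sum(Z == k) - 1) * var(X[Z == k]))

# the precision-weighted average of the beta_k reproduces the adjusted beta-hat
weighted.mean(beta_k, w_k)
coef(lm(Y ~ X + factor(Z)))["X"]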

Now, an example.

This is an old biostatistics dataset, looking at lung function ($FEV_1$) and smoking in some children in Boston in the mid-1970s. We're interested in the relationship with smoking:

> fev<-read.table("https://raw.githubusercontent.com/GTPB/PSLS20/master/data/fev.txt", header=TRUE)
> lm(fev~smoking,data=fev)

Call:
lm(formula = fev ~ smoking, data = fev)

Coefficients:
(Intercept)      smoking
     2.6346       0.6054

That's higher $FEV_1$ in kids who smoke, which is ... surprising?

The answer is confounding by age:

# FEV against (jittered) age, with smokers in purple and non-smokers in orange
plot(fev ~ jitter(age), data = fev, pch = 19,
     col = ifelse(smoking == 1, "purple", "orange"))

[Figure: FEV plotted against jittered age, with smokers in purple and non-smokers in orange]

We might do an adjusted analysis:

> lm(fev~smoking+factor(age),data=fev)

Call:
lm(formula = fev ~ smoking + factor(age), data = fev)

Coefficients:
  (Intercept)        smoking   factor(age)7   factor(age)8   factor(age)9
       1.6580        -0.1757         0.2115         0.4573         0.7761
factor(age)10  factor(age)11  factor(age)12  factor(age)13  factor(age)14
       1.0430         1.3989         1.5870         1.8784         1.9735
factor(age)15  factor(age)16  factor(age)17
       1.9193         2.1055         2.6824

or a stratified analysis:

> # stratum-specific slopes beta_k
> betas <- sapply(unique(fev$age), function(age_i)
+     coef(lm(fev ~ smoking, subset = age == age_i, data = fev))["smoking"])
> # precision weights w_k = (n_k - 1) * var_k(smoking)
> weights <- sapply(unique(fev$age), function(age_i)
+     var(fev$smoking[fev$age == age_i]) * (sum(fev$age == age_i) - 1))
> weighted.mean(betas, weights)
[1] -0.1756593

Note that this works even though some of the strata don't have enough smokers to estimate $\beta_k$. Those strata get a weight of zero (and weighted.mean() drops zero-weight terms, so their NA estimates do no harm); they don't contribute to estimating $\beta$ in the adjusted model, either.

In this example, the confounding relationship is quite close to being linear and additive, so you get a point estimate that's quite close by adjusting for a linear term in age:

> lm(fev~smoking+age,data=fev)

Call:
lm(formula = fev ~ smoking + age, data = fev)

Coefficients:
(Intercept)      smoking          age
     0.2040      -0.2358       0.2479

This analysis is importantly different: it uses all the data, including ages where $\beta_k$ can't be estimated directly, so the estimator relies on extrapolating the linear FEV:age relationship to hypothetical younger smokers. Realistically, if you want a linear term in age, you should restrict to age >= 10 or even further, since that's where you have data on smokers. One of the advantages of thinking in terms of stratification is that it motivates this sort of restriction.
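For instance (a sketch using lm's standard subset argument; output omitted):

> lm(fev ~ smoking + age, data = fev, subset = age >= 10)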

Finally, generalisations.

Lin & Zeng have a couple of papers showing this is (approximately) true very generally, with the goal of showing that meta-analysis of study-specific estimates has (nearly) full efficiency compared to pooled analysis of individual data.

Thomas Lumley
  • +1. Where do you use the assumption that $X$ is binary? – whuber Jan 30 '24 at 12:22
  • This is wonderful. Could you devise an example in R using fake data, as the question requests? – FlacoT Jan 30 '24 at 15:58
  • Just to add: why do we want a precision-weighted average? I see that that's what makes these two equivalent. But what is the motivation for precision-weighting in general? – FlacoT Feb 15 '24 at 18:57
  • If the effects are truly the same then the Gauss-Markov theorem says precision weighting is optimal. More generally, of all the weighted averages, the precision-weighted average has the smallest variance. – Thomas Lumley Feb 15 '24 at 20:19
  • @ThomasLumley this is fantastic. Last (hopefully) question: why don't we use the standard errors of the stratum-specific regressions to define the weights? That would have been my intuition: we want to weight the betas by how precisely they are estimated, and the standard error from each stratum tells us that. – FlacoT Feb 21 '24 at 23:52
  • Because we don't know the standard errors, just estimates of them, and those estimates are really bad when the strata are small. With large strata you might well use the estimated standard errors. – Thomas Lumley Feb 22 '24 at 01:27