Let's say we have data from a small 2-arm pilot trial with baseline imbalance. I wish to compare two approaches to the analysis:
1. Regression of the endline (post-treatment) outcome on an indicator of study arm, controlling for baseline values of the outcome of interest
2. Difference-in-differences
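To make the comparison concrete, here is a sketch of the two models as I understand them, written to match the `lm()` calls further down. The notation is my own shorthand and an assumption, not something taken from a package: $T_i$ is the `trt` indicator, $P_t$ is the `end3mo` indicator, and $\bar{Y}$ denotes a cell mean.

$$
\begin{aligned}
\text{(1) Baseline-adjusted regression:}\quad & Y_i^{\text{end}} = \beta_0 + \beta_1 T_i + \beta_2 Y_i^{\text{base}} + \varepsilon_i \\
\text{(2) Difference-in-differences:}\quad & Y_{it} = \gamma_0 + \gamma_1 T_i + \gamma_2 P_t + \gamma_3 (T_i \times P_t) + \varepsilon_{it} \\
& \gamma_3 = \left(\bar{Y}_{T,\text{end}} - \bar{Y}_{T,\text{base}}\right) - \left(\bar{Y}_{C,\text{end}} - \bar{Y}_{C,\text{base}}\right)
\end{aligned}
$$

The treatment effect is read from $\beta_1$ in the first model and from $\gamma_3$ in the second.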
Here's some toy data that mimics a real example.
# function for simulating data with fixed parameters
# https://stackoverflow.com/a/19343398/841405
mysamp <- function(n, m, s, lwr, upr, nnorm) {
  samp <- rnorm(nnorm, m, s)
  samp <- samp[samp >= lwr & samp <= upr]
  if (length(samp) >= n) {
    return(sample(samp, n))
  }
  stop(simpleError("Not enough values to sample from. Try increasing nnorm."))
}
# load packages
library(tidyverse)
# simulate baseline and endline data based on real-world example
# some loss-to-followup at endline
set.seed(42)
base.t <- mysamp(n=32, m=3.03, s=0.46, lwr=0, upr=24, nnorm=1000)
base.c <- mysamp(n=32, m=4.53, s=0.52, lwr=0, upr=24, nnorm=1000)
end.t <- mysamp(n=23, m=2.21, s=0.63, lwr=0, upr=24, nnorm=1000)
end.c <- mysamp(n=22, m=2.23, s=0.39, lwr=0, upr=24, nnorm=1000)
# create long data
dat <- data.frame(id=c(1:32,             # control, baseline
                       1:22,             # control, 3 month
                       33:64,            # treatment, baseline
                       33:55),           # treatment, 3 month
                  trt=c(rep(0, 32+22),   # control
                        rep(1, 32+23)),  # treatment
                  end3mo=c(rep(0, 32),   # control, baseline
                           rep(1, 22),   # control, 3 month
                           rep(0, 32),   # treatment, baseline
                           rep(1, 23)),  # treatment, 3 month
                  score=c(base.c,        # control, baseline
                          end.c,         # control, 3 month
                          base.t,        # treatment, baseline
                          end.t))        # treatment, 3 month
# reshape wide
# one row per participant; end3mo is NA for those lost to follow-up
datw <-
  dat %>%
  mutate(end3mo = case_when(end3mo==1 ~ "end3mo",
                            TRUE ~ "baseline")) %>%
  pivot_wider(names_from = end3mo, values_from = score)
Here's the result for the first approach, regressing endline outcome scores on an indicator of study assignment and controlling for baseline data.
# controlling for baseline
summary(lm(end3mo ~ trt + baseline, data=datw))
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  2.69332    0.87332   3.084   0.0036 **
#trt         -0.19323    0.31856  -0.607   0.5474
#baseline    -0.09929    0.19330  -0.514   0.6102
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here's the result of the second approach, difference-in-differences:
# difference in differences estimate (interaction)
summary(lm(score ~ trt*end3mo, data=dat))
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  4.55646    0.09029  50.467  < 2e-16 ***
#trt         -1.46102    0.12768 -11.443  < 2e-16 ***
#end3mo      -2.30747    0.14145 -16.313  < 2e-16 ***
#trt:end3mo   1.40694    0.19875   7.079  1.7e-10 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
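As a sanity check on how to read the interaction term, here is a minimal sketch that rebuilds the four arm-by-time cell means from the coefficients printed above and recomputes the difference-in-differences by hand. The numbers are copied from the output; the variable names are just my own labels.

```r
# Coefficients copied from the lm(score ~ trt*end3mo) output above
b <- c(intercept = 4.55646, trt = -1.46102, end3mo = -2.30747, interaction = 1.40694)

# Mean score implied for each arm-by-time cell
ctrl_base <- b[["intercept"]]
trt_base  <- b[["intercept"]] + b[["trt"]]
ctrl_end  <- b[["intercept"]] + b[["end3mo"]]
trt_end   <- sum(b)

# Change within each arm, then the difference between those changes
change_trt  <- trt_end - trt_base
change_ctrl <- ctrl_end - ctrl_base
did <- change_trt - change_ctrl

print(round(c(change_trt = change_trt, change_ctrl = change_ctrl, did = did), 5))
```

By this arithmetic the treatment arm falls by about 0.90 and the control arm by about 2.31, so the positive interaction reflects a smaller drop in the treatment arm, which started from a lower baseline.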
Results:
- Coefficient on trt is -0.19323 (Not shown: LOCF for missing data at endline gives a trt coefficient of 0.39223)
- DID estimate (trt:end3mo) is 1.4