I am trying to analyze the data with the pre-post-control design in the context of RNA-seq analysis.
I have read Best practice when analyzing pre-post treatment-control designs, but I am still confused because the method recommended by this manual(page 34) is not mentioned in the above post.
To briefly describe my data, the response is gene expression measurement and explanatory variables contain 3 factors(treatment, patient, and time): there are 2 levels(drug & control) in treatment, total 10 patients, and two levels(pre and post) in time.
The goal is to figure out if the drug has any effect on the gene expression while controlling for heterogeneity among patients.
The software manual suggests using interaction terms to control for individual effects, and use the following model:
$$expression \sim treatment + treatment:patient + treatment :time$$
and the code snippet below describes the idea:
require(dplyr)
dat <- structure(list(treatment = c("control", "control", "control",
"control", "control", "control", "control", "control", "control",
"control", "drug", "drug", "drug", "drug", "drug", "drug", "drug",
"drug", "drug", "drug"), time = c("pre", "post", "pre", "post",
"pre", "post", "pre", "post", "pre", "post", "pre", "post", "pre",
"post", "pre", "post", "pre", "post", "pre", "post"), patient = c(1,
1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L), .Names = c("treatment",
"time", "patient"))
# prepare for the design matrix
dat <- dat %>% group_by(treatment) %>%
mutate(patient=paste0("p",rep(c(1:5),each=2))) %>%
ungroup() %>%
mutate(time=factor(time, levels = c("pre","post")))
treatment time patient
1 control pre p1
2 control post p1
3 control pre p2
4 control post p2
5 control pre p3
6 control post p3
7 control pre p4
8 control post p4
9 control pre p5
10 control post p5
11 drug pre p1
12 drug post p1
13 drug pre p2
14 drug post p2
15 drug pre p3
16 drug post p3
17 drug pre p4
18 drug post p4
19 drug pre p5
20 drug post p5
# construct a model matrix
mmat <- model.matrix(~treatment + treatment:patient + treatment:time,data = dat)
# terms that will be in my linear model
colnames(mmat)
[1] "(Intercept)" "treatmentdrug"
[3] "treatmentcontrol:patientp2" "treatmentdrug:patientp2"
[5] "treatmentcontrol:patientp3" "treatmentdrug:patientp3"
[7] "treatmentcontrol:patientp4" "treatmentdrug:patientp4"
[9] "treatmentcontrol:patientp5" "treatmentdrug:patientp5"
[11] "treatmentcontrol:timepost" "treatmentdrug:timepost"
To see the effect of the drug, the manual suggests check if the coefficient "treatmentdrug:timepost" is significant by computing log likelihood ratio between the model without that term and the model with that term and see if the nested model is sufficiently different from the full model.
Now my question is :
- what's weakness of this approach of using interaction terms to control for patient effects? In other words, what assumptions am I breaking(if any)?
- It seems that random effects models are used in this context, but what are the benefits?
- If you were me, how would you have tackled this problem?