How should I address batch effects in my experiment?

Question

Let's say I have an RNA-Seq experiment in which I'm interested in the significantly differentially expressed genes between pre-treatment and post-treatment conditions. "rep" == biological replicate.

Sample PreA (3 reps **all** in 2017) vs Sample PostA (3 reps **all** in 2018) 
Sample PreB (3 reps **all** in 2017) vs Sample PostB (3 reps **all** in 2018) 
Sample PreC (3 reps **all** in 2017) vs Sample PostC (3 reps **all** in 2018)

I also have a sample:

Sample PreD (3 reps all in 2017) and Sample PreD (3 reps again in 2018)

It is safe to assume the samples sequenced in 2017 were processed differently from those sequenced in 2018.

Q1: Is there a problem with batch effects in my study? When I run a test, is it true that there is no way to separate variation due to batch effects from biological variation?

Q2: If the answer to Q1 is yes, how should I reduce the batch effects?

Can I add a batch factor to my model in DESeq2/edgeR?
Should I use my PreD control samples to derive normalisation factors, as in RUV? (assuming PreD is a negative control)
Should I use the sva package?

EDIT

Devon suggested adding a batch variable, but how? This is what I'm doing now (ignoring batch effects):

dds <- DESeqDataSetFromTximport(txi, data, ~Sample)

data holds a data frame with a Sample column. Sample would be "PreA" and "PostA" in the first test. Very simple.

I couldn't add Date to it, like:

dds <- DESeqDataSetFromTximport(txi, data, ~Date + Sample)

because the code failed, complaining about the 2017/2018 problem discussed above. I'm also not sure how I should add PreD to the design.

Devon Ryan · Answer 1 · 2018-11-28T09:31:30.483

4

Yes, though thankfully your 2018 PreD samples will help you resolve this.
Simply add the batch effect to the design (~Batch + Treatment) and DESeq2 (or edgeR or limma) will handle this for you.

You do not need SVA or RUV, thankfully, since you quite cleverly sequenced one group in both batches.
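To see why the PreD replicates matter, here is a minimal base R sketch. The sample sheets are hypothetical (made up to mirror the question) and the object names are mine. It only checks whether the design matrix for ~Batch + Treatment is full rank with and without a group that was sequenced in both years.

```r
# Hypothetical sample sheet for A only: every Pre is 2017, every Post is 2018,
# so Batch and Treatment are perfectly confounded.
a_only <- data.frame(
  Treatment = factor(rep(c("Pre", "Post"), each = 3), levels = c("Pre", "Post")),
  Batch     = factor(rep(c("2017", "2018"), each = 3))
)
mm_a <- model.matrix(~ Batch + Treatment, a_only)
# Fewer independent columns than coefficients: the two effects cannot be separated.
c(rank = qr(mm_a)$rank, coefficients = ncol(mm_a))

# Hypothetical PreD replicates: untreated, but sequenced in both years.
pre_d <- data.frame(
  Treatment = factor(rep("Pre", 6), levels = c("Pre", "Post")),
  Batch     = factor(rep(c("2017", "2018"), each = 3))
)
with_d <- rbind(a_only, pre_d)
mm_d <- model.matrix(~ Batch + Treatment, with_d)
# Now the batch effect is estimable from PreD alone, so the matrix is full rank.
c(rank = qr(mm_d)$rank, coefficients = ncol(mm_d))
```

The first matrix has a batch column identical to the treatment column, which is presumably what your ~Date + Sample attempt ran into. The second does not, because PreD varies in batch while staying untreated.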

To clarify, your coldata will be something like:

Group    Time    Batch
A        Pre     A
A        Pre     A
A        Pre     A
A        Post    B
A        Post    B
A        Post    B
...
D        Pre     A
D        Pre     A
D        Pre     A
D        Pre     B
D        Pre     B
D        Pre     B

And the design is actually ~Batch + Group + Time. You can probably add an interaction term as well, provided you then modify the model matrix manually to ensure it's full rank.
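As a rough sketch of that in base R, assuming the coldata layout above with three replicates per row block (the object names coldata, mm_add, mm_int and mm_full are placeholders of mine): build the sample sheet, confirm the additive design is full rank, then look at what the interaction adds.

```r
# Hypothetical coldata matching the table above: A, B, C are Pre in batch A and
# Post in batch B; D is Pre in both batches. Three replicates per block.
coldata <- data.frame(
  Group = factor(rep(c("A", "A", "B", "B", "C", "C", "D", "D"), each = 3)),
  Time  = factor(rep(c("Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Pre"),
                     each = 3), levels = c("Pre", "Post")),
  Batch = factor(rep(c("A", "B", "A", "B", "A", "B", "A", "B"), each = 3))
)

# Additive design: one coefficient per batch, group and time effect.
mm_add <- model.matrix(~ Batch + Group + Time, coldata)
qr(mm_add)$rank == ncol(mm_add)   # full rank if TRUE

# With a Group:Time interaction, any combination that was never sequenced
# (here D at Post) produces a column of zeros, which breaks full rank.
mm_int <- model.matrix(~ Batch + Group * Time, coldata)
all_zero <- colSums(mm_int != 0) == 0
names(which(all_zero))

# Drop the empty column(s) by hand and re-check the rank.
mm_full <- mm_int[, !all_zero]
qr(mm_full)$rank == ncol(mm_full)

# Assumption: with your own txi object, the edited matrix could then be handed
# to DESeq2 via the `full` argument, along the lines of
#   dds <- DESeqDataSetFromTximport(txi, coldata, ~ Batch + Group + Time)
#   dds <- DESeq(dds, full = mm_full)
```

In the additive model the Time coefficient is a single Pre/Post effect shared by A, B and C. The interaction version is what you would want if each group should get its own Pre vs Post estimate, at the cost of editing the matrix yourself.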

edited Nov 28 '18 at 09:31

answered Nov 28 '18 at 09:19

Devon Ryan


Thanks. The PreA vs PostA comparison involves no "PreD" samples at all. How should I address that? "PreD" is a sample name, not a factor like as.factor(c("2017", "2018")). – SmallChess Nov 28 '18 at 09:28
You're really interested in the A:Post interaction. If you phrase it and think of it that way the model matrix is easier to construct (see my update to get you started). – Devon Ryan Nov 28 '18 at 09:32
