How to design DESeq2 LRT model with individuals nested in 2 levels?

Question

We have a complicated experimental design that we would like to perform LRT analysis for. Our main goal is to discover significant genes for the "Injection:Social" interaction term across the entire dataset by removing it from the LRT reduced model, and as a bonus we are also interested in discovering significant genes for that interaction term for each respective brain region.

Sample  Injection   Social  Region  Individual  ind.n
HY06    L   ISO HY  S06 S1
NST6    L   ISO NS  S06 S1
TN06    L   ISO TN  S06 S1
HY08    L   ISO HY  S08 S2
NST8    L   ISO NS  S08 S2
TN08    L   ISO TN  S08 S2
HY30    L   KF  HY  S30 S1
NST30   L   KF  NS  S30 S1
TN30    L   KF  TN  S30 S1
HY32    L   KF  HY  S32 S2
NST32   L   KF  NS  S32 S2
TN32    L   KF  TN  S32 S2
HY64    L   KFC HY  S64 S1
NST64   L   KFC NS  S64 S1
TN64    L   KFC TN  S64 S1
HY65    L   KFC HY  S65 S2
NST65   L   KFC NS  S65 S2
TN65    L   KFC TN  S65 S2
HY19    L   NF  HY  S19 S1
NST19   L   NF  NS  S19 S1
TN19    L   NF  TN  S19 S1
HY24    L   NF  HY  S24 S2
NST24   L   NF  NS  S24 S2
TN24    L   NF  TN  S24 S2
HY05    S   ISO HY  S05 S1
NST5    S   ISO NS  S05 S1
TN05    S   ISO TN  S05 S1
HY12    S   ISO HY  S12 S2
NST12   S   ISO NS  S12 S2
TN12    S   ISO TN  S12 S2
HY31    S   KF  HY  S31 S1
NST31   S   KF  NS  S31 S1
TN31    S   KF  TN  S31 S1
HY34    S   KF  HY  S34 S2
NST34   S   KF  NS  S34 S2
TN34    S   KF  TN  S34 S2
HY62    S   KFC HY  S62 S1
NST62   S   KFC NS  S62 S1
TN62    S   KFC TN  S62 S1
HY63    S   KFC HY  S63 S2
NST63   S   KFC NS  S63 S2
TN63    S   KFC TN  S63 S2
HY04    S   NF  HY  S04 S1
NST4    S   NF  NS  S04 S1
TN04    S   NF  TN  S04 S1
HY20    S   NF  HY  S20 S2
NST20   S   NF  NS  S20 S2
TN20    S   NF  TN  S20 S2

My first attempt was to build simple full (m1) and reduced (m2) models that get directly at our question of interest but don't control for nested individuals.

m1 <- model.matrix(~ Region + Social * Injection, colData_filt)
m2 <- model.matrix(~ Region + Social + Injection, colData_filt)

We want to control for individual/batch effects. Individual is nested within both "Injection" and "Social" but not within "Region", as we have three brain regions per individual. I followed the example in the DESeq2 manual for creating a term (ind.n) distinguishing individuals nested within groups, but now I'm not sure how to create the full and reduced models given that I have one more level than the example.
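
For reference, a within-group label like ind.n can be derived from the Individual column rather than typed by hand. The sketch below is only an illustration in base R: the `animals` data frame is a hypothetical one-row-per-animal copy of the sample table above, and the helper variables `group`, `id_code` and `within_rank` are made-up names.

```r
# One row per animal, copied from the sample table above (hypothetical object name).
animals <- data.frame(
  Injection  = rep(c("L", "S"), each = 8),
  Social     = rep(rep(c("ISO", "KF", "KFC", "NF"), each = 2), times = 2),
  Individual = c("S06", "S08", "S30", "S32", "S64", "S65", "S19", "S24",
                 "S05", "S12", "S31", "S34", "S62", "S63", "S04", "S20"),
  stringsAsFactors = FALSE
)

# Each Injection x Social combination forms one group of animals.
group <- interaction(animals$Injection, animals$Social, drop = TRUE)

# Integer code per animal, then its position (1, 2, ...) within its own group.
id_code     <- as.integer(factor(animals$Individual))
within_rank <- ave(id_code, group, FUN = function(x) match(x, sort(unique(x))))

# Reuse the same labels (S1, S2) in every group, as in the ind.n column.
animals$ind.n <- factor(paste0("S", within_rank))
animals
```

The same steps work on the full per-sample table, because all three region samples of an animal share one Individual value and therefore receive the same label.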

I've tried a really elaborate full model (m1) with the interaction term of interest (Injection:Social) removed for the reduced model (m2), but I'm not sure this is correct based on our design.

m1 <- model.matrix(~ Injection + Injection:ind.n + Injection:Social + Injection:Region + Social + Social:ind.n + Social:Region + Region, colData_filt)
m2 <- model.matrix(~ Injection + Injection:ind.n + Injection:Region + Social + Social:ind.n + Social:Region + Region, colData_filt)

I'm assuming this is wrong, but even if this was by some miracle the correct formulation, would there be a way to extract genes that explain the "Injection:Social" interaction term for separate brain regions?

As a work-around, I subsetted the data by region and ran three separate LRT analyses for each subset and compared the results. While this simplified the model to look like the first example above, I worry that we lose some power by ignoring the fact we have multiple brain region samples from single individuals across the dataset.
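
A per-region workaround of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the code actually used: the helper name `run_region_lrt` and its arguments are hypothetical, and it assumes a count matrix whose columns are in the same order as the rows of the sample table. Within a single region every animal contributes exactly one sample, so no individual term is included.

```r
# Hypothetical helper: LRT for the Injection:Social interaction within one region.
# Assumes columns of `counts` line up with rows of `coldata`.
run_region_lrt <- function(counts, coldata, region) {
  keep <- coldata$Region == region

  # Keep only this region's samples and drop unused factor levels.
  sub_coldata <- droplevels(coldata[keep, , drop = FALSE])
  sub_counts  <- counts[, keep, drop = FALSE]

  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = sub_counts,
    colData   = sub_coldata,
    design    = ~ Injection + Social + Injection:Social
  )

  # The reduced model drops only the interaction, so the LRT tests that term.
  dds <- DESeq2::DESeq(dds, test = "LRT", reduced = ~ Injection + Social)
  DESeq2::results(dds)
}

# Example call (needs DESeq2 installed and real data; counts_filt is a made-up name):
# res_hy <- run_region_lrt(counts_filt, colData_filt, "HY")
```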

Any guidance is much appreciated. Thanks in advance

StupidWolf · Accepted Answer · 2020-06-23T18:40:13.377

From what I can gather, you want to account for the effect of individual, nested within region. That is, after accounting for these, you want to see whether there is a consistent Injection:Social effect across all conditions.

So you set up the model like this:

m1 <- model.matrix(~ ind.n*Region + Injection + Social + Injection:Social,data=..)

The term of interest is the last one, Injection:Social, and you can just use the Wald test (the default in DESeq2) for it.

What do the terms do? ind.n*Region is equivalent to ind.n + Region + ind.n:Region, and with this you effectively get an effect for every region in every individual.
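
The expansion of the `*` operator can be checked with base R alone. This is a small sketch on a made-up `toy` data frame that holds only the two factors involved, not the real sample table:

```r
# Hypothetical stand-in for the ind.n and Region columns of the sample table.
toy <- expand.grid(
  Region = factor(c("HY", "NS", "TN")),
  ind.n  = factor(c("S1", "S2"))
)

# Shorthand and written-out versions of the same terms.
short <- model.matrix(~ ind.n * Region, data = toy)
long  <- model.matrix(~ ind.n + Region + ind.n:Region, data = toy)

# Same columns in the same order, and the same entries.
colnames(short)
identical(colnames(short), colnames(long))
all(short == long)
```

With two ind.n labels and three regions, the six columns give one coefficient for each label-by-region cell.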

Why don't we need the Injection:ind.n, Social:ind.n or Social:Region terms? These terms allow the effect of Injection or Social to vary by individual or by region. That most likely introduces too many parameters when you are interested in a common effect. Also, you do not have the replicates or samples to distinguish these effects from region or other effects.

Since you provided an example, we can run DESeq2 (here on simulated counts) and you can see what the results look like:

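library(DESeq2)
# df is the sample table from the question, with the design columns as factors
# simulated counts: 1000 genes x 48 samples
mat = counts(makeExampleDESeqDataSet(n=1000,m=48))
dds = DESeqDataSetFromMatrix(mat,df,~ ind.n*Region + Injection + Social + Injection:Social)
dds = DESeq(dds)
resultsNames(dds)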
 [1] "Intercept"            "ind.n_S2_vs_S1"       "Region_NS_vs_HY"
 [4] "Region_TN_vs_HY"      "Injection_S_vs_L"     "Social_KF_vs_ISO"
 [7] "Social_KFC_vs_ISO"    "Social_NF_vs_ISO"     "ind.nS2.RegionNS"
[10] "ind.nS2.RegionTN"     "InjectionS.SocialKF"  "InjectionS.SocialKFC"
[13] "InjectionS.SocialNF"

The terms you need are "InjectionS.SocialKF","InjectionS.SocialKFC", "InjectionS.SocialNF", and you can look at each of them:

head(results(dds,name="InjectionS.SocialNF"))
log2 fold change (MLE): InjectionS.SocialNF 
Wald test p-value: InjectionS.SocialNF 
DataFrame with 6 rows and 6 columns
              baseMean     log2FoldChange             lfcSE               stat
             <numeric>          <numeric>         <numeric>          <numeric>
gene1  9.9811166787259   1.25304112986447 0.819806376919295   1.52845984752303
gene2 30.3449455820337 0.0329442893152027 0.705199688255367 0.0467162562092241
gene3 3.83223545055379    1.0281136369045  1.64095596190233  0.626533350543196
gene4  11.232305747171  0.595738624408923   0.8243883031544  0.722643227868976
gene5 6.70950627004097  0.756449993378065   1.0631622863378  0.711509430967263
gene6 26.1431134888287 -0.854784518963918 0.625714541243558  -1.36609342219393
                 pvalue              padj
              <numeric>         <numeric>
gene1 0.126398405431826 0.978671002658464
gene2 0.962739373909937 0.999897888026606
gene3 0.530965168963838 0.978671002658464
gene4 0.469899103018657 0.978671002658464
gene5 0.476768608734069 0.978671002658464
gene6 0.171909642630577 0.978671002658464

As mentioned, you can do an LRT if you want to test all the Injection:Social interaction terms in one go; the null hypothesis is that all of them are zero:

dds = nbinomLRT(dds,reduced=~ ind.n*Region + Injection + Social)
results(dds)
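
Before fitting, it can help to confirm which coefficients the reduced formula removes and that both designs are full rank. The following is a rough base-R sketch; `coldata` is a hypothetical reconstruction of the factor layout in the question (every Injection x Social x ind.n x Region combination once), not the real sample table.

```r
# Hypothetical reconstruction of the 48-sample layout from the question.
coldata <- expand.grid(
  Region    = c("HY", "NS", "TN"),
  ind.n     = c("S1", "S2"),
  Social    = c("ISO", "KF", "KFC", "NF"),
  Injection = c("L", "S"),
  stringsAsFactors = TRUE
)

full    <- model.matrix(~ ind.n * Region + Injection + Social + Injection:Social, coldata)
reduced <- model.matrix(~ ind.n * Region + Injection + Social, coldata)

# A design is full rank when its rank equals its number of columns.
c(full = qr(full)$rank == ncol(full), reduced = qr(reduced)$rank == ncol(reduced))

# Columns found only in the full model: the coefficients the LRT sets to zero.
setdiff(colnames(full), colnames(reduced))
```

The columns listed by `setdiff` are the three Injection:Social coefficients (DESeq2 reports them with dots instead of colons), so the LRT compares the two models on three degrees of freedom.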

Usually the individual terms are more intuitive to interpret than a test that all of them are zero, but you might have a special need for this.

Thank you, I will try this soon. Just to clarify, an LRT would still be appropriate for the larger question of finding the effect of the "Injection:Social" term, and you suggest Wald for determining the region-specific impacts? Would it also be valid to include a term like "Injection:Social:Region" in the suggested Wald model formulation in an attempt to explore the interactive effects of "Injection:Social" by region, or would that require more 3-way interaction terms with "ind.n"? — jfaberha, Jun 23 '20 at 17:30
@jfaberha, thanks for providing the data. I have updated my answer; hopefully it's clearer now. You can include "Injection:Social:Region" only if you include Region:Social. My point is this: do you think there is such an effect? — StupidWolf, Jun 23 '20 at 18:41
I would visualize this with a pca or correlation matrix to see where the effects are coming from before applying them onto the model. No need to have a super complicated model... — StupidWolf, Jun 23 '20 at 18:42
To answer your question regarding the "Injection:Social:Region" term, I will do some more research to see if this has an appreciable effect in the dataset. One question posed by the PI of this study is: what are the genes responsive to the "Injection:Social" interaction term in each of the different brain regions, and are they different? I was guessing that maybe a 3-way interaction term could address this in the binomial Wald analysis, but I agree that adds excessive complexity to the model. That said, I'm not sure how to get region-specific interactions without running separate analyses. — jfaberha, Jun 23 '20 at 19:49
Hmm, what is the main question? It sounds like everything under the sun. So if it is indeed Injection:Social within different brain regions, you are better off running a separate analysis for each region. It is easier for you to explain, and honestly you don't gain power by analyzing everything together — StupidWolf, Jun 23 '20 at 20:16
The main question is regarding the Injection:Social term across the dataset, with the brain region-specific interactions being less important follow-up. That said, I will move forward with LRT for the primary question and separate analyses for the region-specific analyses. Thank you for all the help! — jfaberha, Jun 23 '20 at 20:37
Hi @jfaberha remember to upvote SW, LRTs are powerful statistics but never trivial to implement and that is a good answer. — M__, Jun 23 '20 at 20:47
In an LRT, can we use the same reference level that I use in pairwise comparisons? I have made a similar query, since I am not sure which level would be my reference in the reduced model if I use an LRT. — kcm, Mar 28 '22 at 10:06
