Overview: In my model building process, I fit both GLMs and GLMMs. I noticed that the GLMs suggested overdispersion in the data, while the GLMMs suggested underdispersion. How can I make sense of this, and what are reasonable next steps for further model assessment and fitting?
I have been working on fitting a model to data with a 'count' outcome of the total number of cells exhibiting a phenotype. Notably, the data setup includes repeated measures, so my end goal is to fit a GLMM. A general idea of the data:
There is a hierarchical structure: each subject (SubID) can contribute more than one sample of different types (which is represented by a predictor variable 'SampleType'). And, for each sample (cellID), we can observe the count multiple times (at least once, at most twice; which is represented by another predictor variable 'compartment'). There is no time difference between measurements from the same SubID/cellID, just the fact they can come from the same SubID/cellID.
The outcome includes zeros.
> table(dfmod$tfcount == 0)
FALSE TRUE
681 152
A visualization:
hist(dfmod$tfcount, breaks = seq(0,max(dfmod$tfcount) + 100, 50))
A bit more detail on the lower counts:
dfclose <- dfmod %>% filter(tfcount < 500)
hist(dfclose$tfcount, breaks = seq(0, max(dfclose$tfcount) + 100, 5))
Where we have a small number of counts that are very large as compared to the rest.
table(dfmod$tfcount > 2000)
FALSE TRUE
814 19
I include the above for context as my question pertains to over/underdispersion. For all models, I include an offset for the total number of cells.
The first basic model I tried was a Poisson GLM, which made me think there was overdispersion based off the ratio of residual deviance/df.
modpois <- glm(tfcount ~ SampleType*compartment,
family = "poisson",
offset = log(totalcompcount),
data = dfmod)
summary(modpois)
AIC(modpois)
Residual deviance: 204533 on 825 degrees of freedom
AIC: 208035
Fitting a negative binomial using the MASS package appears to help, and we can see the theta parameter is less than 1 and AIC is greatly reduced. Due to the parameterization of the negative binomial in glm.nb, I interpret the theta < 1 as suggesting overdispersion.
library(MASS)
modnb <- glm.nb(tfcount ~ SampleType*compartment + offset(log(totalcompcount)),
data = dfmod)
summary(modnb)
AIC(modnb)
Theta: 0.5139
Std. Err.: 0.0251
AIC: 7370.856
Now that I had this in mind, I went on to fit a GLMM using package "glmmTMB". Based off my data structure, I thought I should fit a model like this:
lme.modnb <- glmmTMB(tfcount ~ SampleType*compartment + offset(log(totalcompcount)) +
(SampleType|SubID/cellID) + (compartment|SubID/cellID),
data = dfmod,
family = nbinom2(link = "log"))
but it does not converge.
However, the one below does, so I will use it for illustration:
lme.modnb <- glmmTMB(tfcount ~ SampleType*compartment + offset(log(totalcompcount)) +
(SampleType|SubID/cellID),
data = dfmod,
family = nbinom2(link = "log"))
Conditional model:
Groups Name Variance Std.Dev. Corr
cellID:SubID (Intercept) 0.3362 0.5798
SampleTypeT 0.7811 0.8838 -0.07
SampleTypeTLN 0.5508 0.7422 -0.22 0.02
SampleTypeNLLN 0.8298 0.9109 -0.04 0.00 0.01
SubID (Intercept) 1.2418 1.1143
SampleTypeT 2.7074 1.6454 -0.55
SampleTypeTLN 2.0518 1.4324 -0.11 0.27
SampleTypeNLLN 3.6174 1.9019 -0.37 0.31 0.46
Number of obs: 833, groups: cellID:SubID, 426; SubID, 71
Dispersion parameter for nbinom2 family (): 16
AIC: 6524
Notably, the dispersion parameter is much greater than 1 now. Per the glmmTMB pdf
"nbinom2 Negative binomial distribution: quadratic parameterization (Hardin & Hilbe 2007). V = μ(1 + μ/φ) = μ + μ2/φ"
which makes me believe this model is suggesting underdispersion. Based off the lower AIC (admittedly, just one measure of fit), I would lean towards this model.
To restate my question posed in the overview: Is there an explanation for this discrepancy in the GLMM model suggesting underdispersion and the GLM model (no random effects) suggesting overdispersion? What are ideas for next steps for delineating this issue or alternative models to fit the data?
My thoughts: My random effects specification is incorrect, causing this suggestion of underdispersion.
Alternatively, after allowing random intercepts, these random intercepts "accounted" for the overdispersion observed in the GLM.
The nature of the outcome (i.e., the number of zeros as well as the small number of counts > 2000) is making it hard to fit these models.
Or maybe a zero-inflated model is needed... (presenting (1|cellID) as a more complex random effects structures would not converge.
zero.lme.modnb <- glmmTMB(tfcount ~ SampleType*compartment + offset(log(totalcompcount)) +
(1|cellID),
ziformula = ~1,
data = dfmod,
family = nbinom2(link = "log"))
summary(zero.lme.modnb)
Zero-inflation model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.15 1820.56 -0.012 0.991
Dispersion parameter for nbinom2 family (): 15.9
AIC: 6647.1
Interestingly, the zero-inflation model estimate is extremely negative, but a large p-value.
