Why would a model indicate overdispersion without random effects but underdispersion with random effects? (and how to handle)

Question

Overview: In my model building process, I fit both GLMs and GLMMs. I noticed that the GLMs suggested overdispersion in the data, while the GLMMs suggested underdispersion. How can I make sense of this, and what are reasonable next steps for further model assessment and fitting?

I have been working on fitting a model to data with a 'count' outcome of the total number of cells exhibiting a phenotype. Notably, the data setup includes repeated measures, so my end goal is to fit a GLMM. A general idea of the data:

There is a hierarchical structure: each subject (SubID) can contribute more than one sample of different types (which is represented by a predictor variable 'SampleType'). And, for each sample (cellID), we can observe the count multiple times (at least once, at most twice; which is represented by another predictor variable 'compartment'). There is no time difference between measurements from the same SubID/cellID, just the fact they can come from the same SubID/cellID.

The outcome includes zeros.

> table(dfmod$tfcount == 0)
FALSE  TRUE 
  681   152

A visualization:

hist(dfmod$tfcount, breaks = seq(0,max(dfmod$tfcount) + 100, 50))

A bit more detail on the lower counts:

dfclose <- dfmod %>% filter(tfcount < 500)
hist(dfclose$tfcount, breaks = seq(0, max(dfclose$tfcount) + 100, 5))

Where we have a small number of counts that are very large as compared to the rest.

table(dfmod$tfcount > 2000)
FALSE  TRUE 
  814    19

I include the above for context as my question pertains to over/underdispersion. For all models, I include an offset for the total number of cells.

The first basic model I tried was a Poisson GLM, which made me think there was overdispersion based off the ratio of residual deviance/df.

modpois <- glm(tfcount ~ SampleType*compartment,
            family = "poisson",
            offset = log(totalcompcount),
            data = dfmod)
summary(modpois)
AIC(modpois)
Residual deviance: 204533  on 825  degrees of freedom
AIC: 208035

Fitting a negative binomial using the MASS package appears to help, and we can see the theta parameter is less than 1 and AIC is greatly reduced. Due to the parameterization of the negative binomial in glm.nb, I interpret the theta < 1 as suggesting overdispersion.

library(MASS)
modnb <- glm.nb(tfcount ~ SampleType*compartment + offset(log(totalcompcount)),
                data = dfmod)
summary(modnb)
AIC(modnb)
Theta:  0.5139 
Std. Err.:  0.0251
AIC: 7370.856

Now that I had this in mind, I went on to fit a GLMM using package "glmmTMB". Based off my data structure, I thought I should fit a model like this:

lme.modnb <- glmmTMB(tfcount ~ SampleType*compartment + offset(log(totalcompcount)) +
                       (SampleType|SubID/cellID) + (compartment|SubID/cellID),
                     data = dfmod,
                     family = nbinom2(link = "log"))

but it does not converge.

However, the one below does, so I will use it for illustration:

lme.modnb <- glmmTMB(tfcount ~ SampleType*compartment + offset(log(totalcompcount)) +
                       (SampleType|SubID/cellID),
                     data = dfmod,
                     family = nbinom2(link = "log"))
Conditional model:
 Groups       Name           Variance Std.Dev. Corr

 cellID:SubID (Intercept)    0.3362   0.5798

              SampleTypeT    0.7811   0.8838   -0.07

              SampleTypeTLN  0.5508   0.7422   -0.22  0.02

              SampleTypeNLLN 0.8298   0.9109   -0.04  0.00  0.01 
 SubID        (Intercept)    1.2418   1.1143

              SampleTypeT    2.7074   1.6454   -0.55

              SampleTypeTLN  2.0518   1.4324   -0.11  0.27

              SampleTypeNLLN 3.6174   1.9019   -0.37  0.31  0.46 
Number of obs: 833, groups:  cellID:SubID, 426; SubID, 71
Dispersion parameter for nbinom2 family ():   16
AIC: 6524

Notably, the dispersion parameter is much greater than 1 now. Per the glmmTMB pdf

"nbinom2 Negative binomial distribution: quadratic parameterization (Hardin & Hilbe 2007). V = μ(1 + μ/φ) = μ + μ2/φ"

which makes me believe this model is suggesting underdispersion. Based off the lower AIC (admittedly, just one measure of fit), I would lean towards this model.

To restate my question posed in the overview: Is there an explanation for this discrepancy in the GLMM model suggesting underdispersion and the GLM model (no random effects) suggesting overdispersion? What are ideas for next steps for delineating this issue or alternative models to fit the data?

My thoughts: My random effects specification is incorrect, causing this suggestion of underdispersion.

Alternatively, after allowing random intercepts, these random intercepts "accounted" for the overdispersion observed in the GLM.

The nature of the outcome (i.e., the number of zeros as well as the small number of counts > 2000) is making it hard to fit these models.

Or maybe a zero-inflated model is needed... (presenting (1|cellID) as a more complex random effects structures would not converge.

zero.lme.modnb <- glmmTMB(tfcount ~ SampleType*compartment + offset(log(totalcompcount)) +
                            (1|cellID),
                   ziformula = ~1,
                   data = dfmod,
                   family = nbinom2(link = "log"))
summary(zero.lme.modnb)
Zero-inflation model:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -21.15    1820.56  -0.012    0.991
Dispersion parameter for nbinom2 family (): 15.9
AIC: 6647.1

Interestingly, the zero-inflation model estimate is extremely negative, but a large p-value.

I think you need to choose whether to include random effects based on your study design, not the output from analysis. It sounds like you have multiple readings for some, but not all people. This can be problematic. It's surely a violation of independence to include all observations, but it throws away information to only include one observation per person. But I think that is what you may have to do, if the multlievel model won't converge. But I am not positive about that. — Peter Flom, Sep 08 '23 at 20:01
@PeterFlom Thanks for the comment. I agree that random effects should be based on study design, and I think my study design certainly implies random effects should be included. To be clear, the GLM (no REs) models were just exploratory. But considering them and the estimated \theta made me question if the discrepancy with the estimated \theta in the GLMMs (with REs) are a cause for concern, and wonder why this is happening. — Reid, Sep 08 '23 at 20:07

Why would a model indicate overdispersion without random effects but underdispersion with random effects? (and how to handle)

0 Answers0