
I'm new to the world of statistical modeling, but I was wondering if anyone had any input on how to handle overdispersed negative binomial data. I'm modeling bat activity as a response variable against a variety of insect, vegetation, and environmental variables. My objective is to see which explanatory variables (insect, vegetation, and/or environmental) have the greatest impact on bat activity.

My response variable is bat activity (count data), with an offset for the number of survey nights the acoustic detectors ran, and it is seemingly quite overdispersed. I've run Poisson models, all with the conclusion that they are overdispersed (a sketch of this check is given after the str() output below), so I've moved on to NB2 models using the glmmTMB package; all predictor variables are scaled and centered. Below is the str() of a few of the explanatory variables:

 $ Year            : Factor w/ 2 levels "2017","2018": 1 1 1 1 1 1 1 1 1 1 1
 $ Habitat         : Factor w/ 4 levels "MCF","MM","MMF",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ Site            : Factor w/ 63 levels "MCF_001","MCF_002",..: 1 2 3 4 5 6 8 9 17 19 ...
 $ Bats            : int  4 1 47 61 5 14 7 84 6 3 ...
 $ Mylu            : int  3 0 38 13 0 1 0 6 4 0 ...
 $ Myse            : int  0 0 3 5 3 3 0 16 0 0 ...
 $ Survey.Nights   : int  4 5 6 4 4 4 5 4 4 5 ...
 $ Avg.Biomass     : num  -0.381 -0.481 0.908 -0.574 0.943 ...
 $ Shannon.Weaver  : num  -0.6412 0.0586 -0.2082 0.7039 0.7002 ...
 $ Num.Orders      : num  0.0711 -1.8912 0.0711 -1.8912 1.0522 ...
 $ Avg.Snags       : num  -0.851 1.837 0.224 0.493 -0.851 ...
 $ Avg.Understory  : num  -0.00711 -0.94428 3.51112 3.58282 0.55621 ...
 $ Avg.Midstory    : num  -0.35 0.255 -0.461 -0.589 -0.295 ...
 $ Avg.Canopy      : num  -1.056 0.692 1.129 1.129 0.911 ...
 $ Avg.Canopy.Cover: num  -0.822 0.514 1.182 0.982 1.182 ...
 $ Perc.Dec.Dom    : num  -0.491 -1.091 -1.942 -1.546 0.61 ...
 $ Avg.Bat.Date    : num  -0.7704 -0.9971 -0.2208 -0.2208 -0.0834 ...
 $ Avg.Bat.Night.Hr: num  -0.843 -0.951 -0.407 -0.429 -0.299 ...
 $ Avg.Bat.Temp    : num  0.5214 -0.5578 -1.0893 -0.2349 -0.0632 ...
 $ Bat.Dist.Edge   : num  -0.879 -0.432 -0.179 1.544 0.616 ...
 $ Bat.Elevation   : num  -0.741 -0.575 -0.12 -0.171 0.356 ...
 $ Bat.Moon        : num  0.667 -0.279 0.794 0.857 0.352 ...
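A minimal sketch of the Poisson overdispersion check mentioned above, assuming the insect.data data frame and the variables just listed, and using only a couple of the predictors for illustration:

library(glmmTMB)

# Poisson fit with the survey-night offset (illustrative, abbreviated predictor set)
pois <- glmmTMB(Bats ~ Avg.Biomass + Num.Orders + offset(log(Survey.Nights)) + (1 | Site),
                data = insect.data, family = poisson)

# Pearson chi-square divided by residual df; values well above 1 suggest overdispersion
sum(residuals(pois, type = "pearson")^2) / df.residual(pois)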
# NB2 mixed model: survey-night offset and a random intercept for Site
nbin <- glmmTMB(Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + Avg.Midstory + 
    Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + Avg.Bat.Date + Avg.Bat.Temp +
    Bat.Elevation + Bat.Moon + Bat.Water.Feat + Avg.Biomass + Num.Orders + 
    Avg.Bat.Temp*Avg.Bat.Date + Avg.Biomass*Year + Year + Habitat + 
    offset(log(Survey.Nights)) + (1|Site), 
    data = insect.data, 
    ziformula = ~0,       # no zero-inflation component
    family = nbinom2)     # NB2: variance = mu + mu^2/theta

summary(nbin)

Family: nbinom2  ( log )
Formula: Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + Avg.Midstory +
    Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + Avg.Bat.Date + Avg.Bat.Temp +
    Bat.Elevation + Bat.Moon + Bat.Water.Feat + Avg.Biomass + Num.Orders +
    Avg.Bat.Temp * Avg.Bat.Date + Avg.Biomass * Year + Year + Habitat +
    offset(log(Survey.Nights)) + (1 | Site)
Data: insect.data

 AIC      BIC   logLik deviance df.resid 
 539      588     -247      495       47 

Random effects:

Conditional model:
 Groups Name        Variance Std.Dev.
 Site   (Intercept) 2.44e-09 4.94e-05
Number of obs: 69, groups:  Site, 36

Overdispersion parameter for nbinom2 family (): 2.47

Conditional model:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  0.526      0.572    0.92  0.35763    
Avg.Biomass                 -1.866      0.390   -4.78  1.7e-06 ***
Num.Orders                   0.876      0.136    6.44  1.2e-10 ***
Avg.Understory               0.431      0.120    3.58  0.00034 ***
Avg.Midstory                -2.148      0.319   -6.72  1.8e-11 ***
Avg.Canopy.Cover             0.465      0.190    2.45  0.01420 *  
Perc.Dec.Dom                 0.498      0.181    2.74  0.00606 ** 
Avg.Snags                    0.694      0.142    4.88  1.1e-06 ***
Avg.Bat.Date                 0.110      0.169    0.65  0.51553    
Avg.Bat.Temp                -0.197      0.205   -0.96  0.33524    
Bat.Elevation               -0.360      0.126   -2.86  0.00429 ** 
Bat.Moon                     0.541      0.111    4.85  1.2e-06 ***
Bat.Water.FeatRiver         -0.315      0.559   -0.56  0.57312    
Bat.Water.FeatStream         7.018      1.330    5.28  1.3e-07 ***
Year2018                     0.169      0.312    0.54  0.58789    
HabitatMM                    0.185      0.383    0.48  0.62982    
HabitatMMF                   0.146      0.348    0.42  0.67448    
HabitatREGEN                 1.121      0.356    3.15  0.00164 ** 
Avg.Bat.Date:Avg.Bat.Temp   -0.392      0.196   -2.00  0.04514 *  
Avg.Biomass:Year2018         1.500      0.375    4.00  6.2e-05 ***


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


library(DHARMa)   # simulation-based residual diagnostics

res <- simulateResiduals(nbin)   # scaled (quantile) residuals from model simulations
plot(res, rank = TRUE)           # QQ plot and residuals vs. rank-transformed predictions

testResiduals(res)               # uniformity, dispersion, and outlier tests

> testResiduals(res)

$uniformity

	One-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.05, p-value = 1
alternative hypothesis: two-sided

$dispersion

	DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
ratioObsSim = 0.7, p-value = 0.4
alternative hypothesis: two.sided

$outliers

	DHARMa outlier test based on exact binomial test

data:  simulationOutput
outLow = 0e+00, outHigh = 1e+00, nobs = 7e+01, freqH0 = 4e-03, p-value = 0.5
alternative hypothesis: two.sided


Then I wanted to manually check dispersion, and here is where I ran into some concerns:

# Pearson chi-square divided by residual df as a rough dispersion estimate
m1 <- nbin
dispfun <- function(m) {
    r   <- residuals(m, type = "pearson")   # Pearson residuals
    n   <- df.residual(m)                   # residual degrees of freedom
    dsq <- sum(r^2)                         # Pearson chi-square statistic
    c(dsq = dsq, n = n, disp = dsq / n)
}
options(digits = 2)
dispfun(m1)

 dsq    n disp 
76.1 47.0  1.6 

This seems to indicate overdispersion in my model. However, I've already added covariates (as you can see, the model is quite complex, and this is after dropping non-significant factors) and interactions, following Hilbe's (2011) suggestions. On the other hand, the DHARMa residuals look fairly decent. Which should I trust? Does anyone have suggestions on how to handle this?

I reran with GLMMadaptive and got the following output with a different dispersion parameter:

Call:
mixed_model(fixed = Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + 
    Avg.Midstory + Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + 
    Avg.Bat.Date + Avg.Bat.Temp + Bat.Elevation + Bat.Moon + 
    Bat.Water.Feat + Avg.Biomass + Num.Orders + Avg.Bat.Temp * 
    Avg.Bat.Date + Avg.Biomass * Yr + Num.Orders * Yr + Avg.Bat.Date * 
    Bat.Moon + Yr + Habitat + offset(log(Survey.Nights)), random = (~1 | 
    Site), data = insect.data2, family = negative.binomial(), 
    iter_EM = 300)

Data Descriptives:
Number of Observations: 67
Number of Groups: 36

Model:
 family: negative binomial
 link: log

Fit statistics:
   log.Lik      AIC      BIC
 -230.2856 508.5711 546.5756

Random effects covariance matrix:
               StdDev
(Intercept) 0.0514579

Fixed effects:
                          Estimate Std.Err z-value    p-value
(Intercept)                 0.7447  0.5482  1.3584 0.17434114
Avg.Biomass                -1.5392  0.3861 -3.9871    < 1e-04
Num.Orders                  0.4840  0.1862  2.5987 0.00935661
Avg.Understory              0.2471  0.1299  1.9023 0.05713095
Avg.Midstory               -2.3953  0.3624 -6.6098    < 1e-04
Avg.Canopy.Cover            0.6657  0.1879  3.5422 0.00039685
Perc.Dec.Dom                0.5743  0.1737  3.3059 0.00094668
Avg.Snags                   0.5411  0.1494  3.6217 0.00029270
Avg.Bat.Date               -0.0040  0.1860 -0.0217 0.98266247
Avg.Bat.Temp               -0.7496  0.2795 -2.6818 0.00732270
Bat.Elevation              -0.3307  0.1270 -2.6032 0.00923670
Bat.Moon                    0.5336  0.1206  4.4251    < 1e-04
Bat.Water.FeatRiver        -0.7486  0.5586 -1.3402 0.18017727
Bat.Water.FeatStream        7.1474  1.4996  4.7663    < 1e-04
Yr2018                      0.4797  0.3066  1.5643 0.11774826
HabitatMM                  -0.0861  0.3768 -0.2285 0.81928969
HabitatMMF                 -0.3509  0.3605 -0.9735 0.33030629
HabitatREGEN                1.0362  0.3399  3.0486 0.00229947
Avg.Bat.Date:Avg.Bat.Temp  -0.6803  0.2172 -3.1324 0.00173393
Avg.Biomass:Yr2018          1.1956  0.3758  3.1815 0.00146534
Num.Orders:Yr2018           0.6276  0.2661  2.3584 0.01835350
Avg.Bat.Date:Bat.Moon       0.3587  0.1782  2.0130 0.04411454

log(dispersion) parameter:
  Estimate Std.Err
    1.0421  0.2256

Integration:
method: adaptive Gauss-Hermite quadrature rule
quadrature points: 11

Optimization:
method: hybrid EM and quasi-Newton
converged: TRUE

My main objective is to see which explanatory variables could be driving bat activity in my study area and whether bat activity can be predicted based on insect activity, different vegetative characteristics, and/or environmental factors.

dwash7
  • What is your main objective ? – Alexandre C-L Jul 16 '19 at 16:47
  • @AlexandreCazenave-Lacroutz My main objective is to see which explanatory variables could be driving bat activity in my study area and whether bat activity can be predicted based off of insect activity, different vegetative characteristics, and/or environmental factors. – dwash7 Jul 16 '19 at 17:46

2 Answers


A couple of points:

  • The variance of the random effect for Site is extremely low. This could mean either that there is no correlation in bat activity within a site, or it could be an artefact of the Laplace approximation that glmmTMB() uses to approximate the integrals over the random effects. You could also try fitting the same model with the GLMMadaptive package, which approximates the same integrals with an adaptive Gaussian quadrature procedure that can be more accurate (see the sketch after this list). You can find examples here and here.
  • It would be better to check the fit of the model, and any remaining over-dispersion, using the scaled simulated residuals of the DHARMa package. An example of using this package to check the fit of a negative binomial model can be found here.
  • It would be better to define the variables as factors beforehand rather than inside the formula. Moreover, are you sure that you need all of these interaction terms?
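A minimal sketch of the first suggestion, assuming the insect.data data frame and the variable names from the question, with the fixed-effects part abbreviated; the arguments mirror the mixed_model() call shown in the question's edit:

library(GLMMadaptive)

# Same NB mixed model, but the random-effect integrals are approximated with
# adaptive Gauss-Hermite quadrature instead of the Laplace approximation
nbin_agq <- mixed_model(
    fixed  = Bats ~ Avg.Biomass + Num.Orders + Avg.Bat.Temp * Avg.Bat.Date +
                 offset(log(Survey.Nights)),   # abbreviated fixed-effects part
    random = ~ 1 | Site,
    data   = insect.data,
    family = negative.binomial(),
    iter_EM = 300)                             # as in the question's call

summary(nbin_agq)

If the random-intercept standard deviation stays essentially at zero under adaptive quadrature as well, that is some evidence the near-zero estimate is not just a Laplace artefact.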
Dimitris Rizopoulos
  • Thanks for the reply. I've been trying to install DHARMa and have run into countless install issues, so I have not been able to try that yet. Regarding your first bullet, I have Site incorporated so that I do not violate independence. See the edit for the output from rerunning with GLMMadaptive. – dwash7 Jul 17 '19 at 13:40
  • @Dimitris Rizopoulos Also, I added the interactions in an effort to fix the apparent overdispersion (using a suggestion from Hilbe 2011). – dwash7 Jul 17 '19 at 14:04
  • @dwash7 The model has not converged, because it says converged: FALSE in the output. Try setting mixed_model(..., iter_EM = 0). – Dimitris Rizopoulos Jul 17 '19 at 17:59
  • @Dimitris, I updated the above post with your suggestion and it still has not converged. – dwash7 Jul 17 '19 at 18:13
  • @dwash7 it is a bit difficult without having the data to try, but check whether iter_EM = 300 helps. – Dimitris Rizopoulos Jul 17 '19 at 18:22
  • thanks for your patience! That did the trick to fix the convergence issues (updated above) – dwash7 Jul 18 '19 at 11:10

If you are interested only in the respective impact of each variable, you may keep the Poisson specification (with robust standard errors), despite any overdispersion.

Indeed, when the assumption $E(Y|X)=e^{X\beta}$ is reasonable (which both the Poisson and the negative binomial suppose), the Poisson estimates of the $\beta$ are consistent whatever the underlying dispersion, because there is in fact no need for the hypothesis that the variance equals the mean; see Wooldridge, J. M. (1999). Quasi-likelihood methods for count data. Handbook of Applied Econometrics, 2; https://www.researchgate.net/publication/247320048_Quasi-Likelihood_Methods_for_Count_Data

A blog entry discussing this is https://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/ (please read the section beginning "Finally, I would like to tell you that everyone" and the following sections).
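A minimal sketch of this approach in R, assuming the data frame and variable names from the question and an abbreviated predictor set; the random intercept is dropped here because the sandwich estimator below applies to plain glm() fits (the sandwich and lmtest packages supply the robust covariance matrix and the corresponding tests):

library(sandwich)   # heteroskedasticity-robust (sandwich) covariance estimators
library(lmtest)     # coeftest() for z-tests with a user-supplied vcov

# Poisson quasi-ML fit: the coefficient estimates are consistent whenever
# E(Y|X) = exp(X beta) holds, even if the variance does not equal the mean
pois_fit <- glm(Bats ~ Avg.Biomass + Num.Orders + Avg.Bat.Temp * Avg.Bat.Date +
                    offset(log(Survey.Nights)),
                family = poisson, data = insect.data)

# Robust standard errors that do not rely on the Poisson variance assumption
coeftest(pois_fit, vcov = vcovHC(pois_fit, type = "HC0"))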

  • I've been using Zuur et al. (2009) and Hilbe (2011), both of which suggest NB as a solution for overdispersed count data. I've run a few quasi-Poisson models, but the newer packages don't support quasi-families, so I assumed they were inferior to using NB. Am I misunderstanding something? Even the blog seemed to conclude that NB does fine with overdispersed data. Please correct me if I am misunderstanding. – dwash7 Jul 17 '19 at 13:52
  • "the newer package" ?
  • In my understanding:
  • Poisson is theoretically a better choice to get consistent estimates of the $\beta$. The provided blog entry gets that NB does not do a bad job in one specific example. But that this question was raised by the author of the blog shows that the common sens: "Oh, there is overdispersion, I should use a NB rather than a Poisson" is not adequate. If you consider a specification of the form $E(Y|X)=e(X\beta)$ (as for NB and for Poisson) in order to get the $\beta$, your first choice should be Poisson. (...)

    – Alexandre C-L Jul 17 '19 at 14:55
  • (...) Overdispersion is not what should make you choose NB rather than Poisson for your purpose.
    • As I have the feeling (some pages are missing on Google Book) that Hilbe (2011) does not quote "Wooldridge, J. M. (1999). Distribution-free estimation of some nonlinear panel data models. Journal of Econometrics, 90(1), 77-97." in a whole book on NB models, whereas it states in the introduction that Poisson is part of NB model, I would not advice it on these issues.
    – Alexandre C-L Jul 17 '19 at 14:55