
I'm new to the world of statistical modeling, but I was wondering if anyone had any input on how to handle overdispersed negative binomial data. I'm modeling bat activity as a response variable against a variety of insect, vegetation, and environmental variables. My objective is to see which explanatory variables (insect, vegetation, and/or environmental) have the greatest impact on bat activity.

My response variable is bat activity (count data), with an offset for the number of survey nights the acoustic detectors ran, and it is seemingly quite overdispersed. I've run Poisson models, all with the conclusion that they are overdispersed (a sketch of this check is given after the str() output below), so I've moved on to NB2 models using the glmmTMB package; all predictor variables are scaled and centered. Below is the str() of a few of the explanatory variables:

 $ Year            : Factor w/ 2 levels "2017","2018": 1 1 1 1 1 1 1 1 1 1 1
 $ Habitat         : Factor w/ 4 levels "MCF","MM","MMF",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ Site            : Factor w/ 63 levels "MCF_001","MCF_002",..: 1 2 3 4 5 6 8 9 17 19 ...
 $ Bats            : int  4 1 47 61 5 14 7 84 6 3 ...
 $ Mylu            : int  3 0 38 13 0 1 0 6 4 0 ...
 $ Myse            : int  0 0 3 5 3 3 0 16 0 0 ...
 $ Survey.Nights   : int  4 5 6 4 4 4 5 4 4 5 ...
 $ Avg.Biomass     : num  -0.381 -0.481 0.908 -0.574 0.943 ...
 $ Shannon.Weaver  : num  -0.6412 0.0586 -0.2082 0.7039 0.7002 ...
 $ Num.Orders      : num  0.0711 -1.8912 0.0711 -1.8912 1.0522 ...
 $ Avg.Snags       : num  -0.851 1.837 0.224 0.493 -0.851 ...
 $ Avg.Understory  : num  -0.00711 -0.94428 3.51112 3.58282 0.55621 ...
 $ Avg.Midstory    : num  -0.35 0.255 -0.461 -0.589 -0.295 ...
 $ Avg.Canopy      : num  -1.056 0.692 1.129 1.129 0.911 ...
 $ Avg.Canopy.Cover: num  -0.822 0.514 1.182 0.982 1.182 ...
 $ Perc.Dec.Dom    : num  -0.491 -1.091 -1.942 -1.546 0.61 ...
 $ Avg.Bat.Date    : num  -0.7704 -0.9971 -0.2208 -0.2208 -0.0834 ...
 $ Avg.Bat.Night.Hr: num  -0.843 -0.951 -0.407 -0.429 -0.299 ...
 $ Avg.Bat.Temp    : num  0.5214 -0.5578 -1.0893 -0.2349 -0.0632 ...
 $ Bat.Dist.Edge   : num  -0.879 -0.432 -0.179 1.544 0.616 ...
 $ Bat.Elevation   : num  -0.741 -0.575 -0.12 -0.171 0.356 ...
 $ Bat.Moon        : num  0.667 -0.279 0.794 0.857 0.352 ...
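A minimal sketch of the Poisson overdispersion check mentioned above, assuming the insect.data data frame and the variables just listed, and using only a couple of the predictors for illustration:

library(glmmTMB)

# Poisson fit with the survey-night offset (illustrative, abbreviated predictor set)
pois <- glmmTMB(Bats ~ Avg.Biomass + Num.Orders + offset(log(Survey.Nights)) + (1 | Site),
                data = insect.data, family = poisson)

# Pearson chi-square divided by residual df; values well above 1 suggest overdispersion
sum(residuals(pois, type = "pearson")^2) / df.residual(pois)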
# NB2 mixed model: survey-night offset and a random intercept for Site
nbin <- glmmTMB(Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + Avg.Midstory + 
    Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + Avg.Bat.Date + Avg.Bat.Temp +
    Bat.Elevation + Bat.Moon + Bat.Water.Feat + Avg.Biomass + Num.Orders + 
    Avg.Bat.Temp*Avg.Bat.Date + Avg.Biomass*Year + Year + Habitat + 
    offset(log(Survey.Nights)) + (1|Site), 
    data = insect.data, 
    ziformula = ~0,       # no zero-inflation component
    family = nbinom2)     # NB2: variance = mu + mu^2/theta

summary(nbin)

Family: nbinom2  ( log )
Formula: Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + Avg.Midstory +
    Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + Avg.Bat.Date + Avg.Bat.Temp +
    Bat.Elevation + Bat.Moon + Bat.Water.Feat + Avg.Biomass + Num.Orders +
    Avg.Bat.Temp * Avg.Bat.Date + Avg.Biomass * Year + Year + Habitat +
    offset(log(Survey.Nights)) + (1 | Site)
Data: insect.data

 AIC      BIC   logLik deviance df.resid 
 539      588     -247      495       47 

Random effects:

Conditional model:
 Groups Name        Variance Std.Dev.
 Site   (Intercept) 2.44e-09 4.94e-05
Number of obs: 69, groups:  Site, 36

Overdispersion parameter for nbinom2 family (): 2.47

Conditional model:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  0.526      0.572    0.92  0.35763    
Avg.Biomass                 -1.866      0.390   -4.78  1.7e-06 ***
Num.Orders                   0.876      0.136    6.44  1.2e-10 ***
Avg.Understory               0.431      0.120    3.58  0.00034 ***
Avg.Midstory                -2.148      0.319   -6.72  1.8e-11 ***
Avg.Canopy.Cover             0.465      0.190    2.45  0.01420 *  
Perc.Dec.Dom                 0.498      0.181    2.74  0.00606 ** 
Avg.Snags                    0.694      0.142    4.88  1.1e-06 ***
Avg.Bat.Date                 0.110      0.169    0.65  0.51553    
Avg.Bat.Temp                -0.197      0.205   -0.96  0.33524    
Bat.Elevation               -0.360      0.126   -2.86  0.00429 ** 
Bat.Moon                     0.541      0.111    4.85  1.2e-06 ***
Bat.Water.FeatRiver         -0.315      0.559   -0.56  0.57312    
Bat.Water.FeatStream         7.018      1.330    5.28  1.3e-07 ***
Year2018                     0.169      0.312    0.54  0.58789    
HabitatMM                    0.185      0.383    0.48  0.62982    
HabitatMMF                   0.146      0.348    0.42  0.67448    
HabitatREGEN                 1.121      0.356    3.15  0.00164 ** 
Avg.Bat.Date:Avg.Bat.Temp   -0.392      0.196   -2.00  0.04514 *  
Avg.Biomass:Year2018         1.500      0.375    4.00  6.2e-05 ***


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


library(DHARMa)   # simulation-based residual diagnostics

res <- simulateResiduals(nbin)   # scaled (quantile) residuals from model simulations
plot(res, rank = TRUE)           # QQ plot and residuals vs. rank-transformed predictions

testResiduals(res)               # uniformity, dispersion, and outlier tests

> testResiduals(res)

$uniformity

	One-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.05, p-value = 1
alternative hypothesis: two-sided

$dispersion

	DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
ratioObsSim = 0.7, p-value = 0.4
alternative hypothesis: two.sided

$outliers

	DHARMa outlier test based on exact binomial test

data:  simulationOutput
outLow = 0e+00, outHigh = 1e+00, nobs = 7e+01, freqH0 = 4e-03, p-value = 0.5
alternative hypothesis: two.sided


Then I wanted to manually check dispersion, and here is where I ran into some concerns:

# Pearson chi-square divided by residual df as a rough dispersion estimate
m1 <- nbin
dispfun <- function(m) {
    r   <- residuals(m, type = "pearson")   # Pearson residuals
    n   <- df.residual(m)                   # residual degrees of freedom
    dsq <- sum(r^2)                         # Pearson chi-square statistic
    c(dsq = dsq, n = n, disp = dsq / n)
}
options(digits = 2)
dispfun(m1)

 dsq    n disp 
76.1 47.0  1.6 

This seems to indicate overdispersion in my model. However, I've already added covariates (as you can see, the model is quite complex, and this is after dropping non-significant factors) and interactions, following Hilbe's (2011) suggestions. On the other hand, the DHARMa residuals look fairly decent. Which should I trust? Does anyone have suggestions on how to handle this?

I reran with GLMMadaptive and got the following output with a different dispersion parameter:

Call:
mixed_model(fixed = Bats ~ Avg.Biomass + Num.Orders + Avg.Understory + 
    Avg.Midstory + Avg.Canopy.Cover + Perc.Dec.Dom + Avg.Snags + 
    Avg.Bat.Date + Avg.Bat.Temp + Bat.Elevation + Bat.Moon + 
    Bat.Water.Feat + Avg.Biomass + Num.Orders + Avg.Bat.Temp * 
    Avg.Bat.Date + Avg.Biomass * Yr + Num.Orders * Yr + Avg.Bat.Date * 
    Bat.Moon + Yr + Habitat + offset(log(Survey.Nights)), random = (~1 | 
    Site), data = insect.data2, family = negative.binomial(), 
    iter_EM = 300)

Data Descriptives:
Number of Observations: 67
Number of Groups: 36

Model:
 family: negative binomial
 link: log

Fit statistics:
   log.Lik      AIC      BIC
 -230.2856 508.5711 546.5756

Random effects covariance matrix:
               StdDev
(Intercept) 0.0514579

Fixed effects:
                          Estimate Std.Err z-value    p-value
(Intercept)                 0.7447  0.5482  1.3584 0.17434114
Avg.Biomass                -1.5392  0.3861 -3.9871    < 1e-04
Num.Orders                  0.4840  0.1862  2.5987 0.00935661
Avg.Understory              0.2471  0.1299  1.9023 0.05713095
Avg.Midstory               -2.3953  0.3624 -6.6098    < 1e-04
Avg.Canopy.Cover            0.6657  0.1879  3.5422 0.00039685
Perc.Dec.Dom                0.5743  0.1737  3.3059 0.00094668
Avg.Snags                   0.5411  0.1494  3.6217 0.00029270
Avg.Bat.Date               -0.0040  0.1860 -0.0217 0.98266247
Avg.Bat.Temp               -0.7496  0.2795 -2.6818 0.00732270
Bat.Elevation              -0.3307  0.1270 -2.6032 0.00923670
Bat.Moon                    0.5336  0.1206  4.4251    < 1e-04
Bat.Water.FeatRiver        -0.7486  0.5586 -1.3402 0.18017727
Bat.Water.FeatStream        7.1474  1.4996  4.7663    < 1e-04
Yr2018                      0.4797  0.3066  1.5643 0.11774826
HabitatMM                  -0.0861  0.3768 -0.2285 0.81928969
HabitatMMF                 -0.3509  0.3605 -0.9735 0.33030629
HabitatREGEN                1.0362  0.3399  3.0486 0.00229947
Avg.Bat.Date:Avg.Bat.Temp  -0.6803  0.2172 -3.1324 0.00173393
Avg.Biomass:Yr2018          1.1956  0.3758  3.1815 0.00146534
Num.Orders:Yr2018           0.6276  0.2661  2.3584 0.01835350
Avg.Bat.Date:Bat.Moon       0.3587  0.1782  2.0130 0.04411454

log(dispersion) parameter:
  Estimate Std.Err
    1.0421  0.2256

Integration:
method: adaptive Gauss-Hermite quadrature rule
quadrature points: 11

Optimization:
method: hybrid EM and quasi-Newton
converged: TRUE

My main objective is to see which explanatory variables could be driving bat activity in my study area and whether bat activity can be predicted based on insect activity, different vegetative characteristics, and/or environmental factors.

dwash7
  • What is your main objective ? – Alexandre C-L Jul 16 '19 at 16:47
  • @AlexandreCazenave-Lacroutz My main objective is to see which explanatory variables could be driving bat activity in my study area and whether bat activity can be predicted based off of insect activity, different vegetative characteristics, and/or environmental factors. – dwash7 Jul 16 '19 at 17:46

2 Answers


A couple of points:

  • The variance of the random effect for Site is extremely low. This could mean either that there is no correlation in bat activity within a site, or it could be an artefact of the Laplace approximation that glmmTMB() uses to approximate the integrals over the random effects. You could also try fitting the same model with the GLMMadaptive package, which approximates the same integrals with an adaptive Gaussian quadrature procedure that can be more accurate (see the sketch after this list). You can find examples here and here.
  • It would be better to check the fit of the model, and any remaining over-dispersion, using the scaled simulated residuals of the DHARMa package. An example of using this package to check the fit of a negative binomial model can be found here.
  • It would be better to define the variables as factors beforehand rather than inside the formula. Moreover, are you sure that you need all of these interaction terms?
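A minimal sketch of the first suggestion, assuming the insect.data data frame and the variable names from the question, with the fixed-effects part abbreviated; the arguments mirror the mixed_model() call shown in the question's edit:

library(GLMMadaptive)

# Same NB mixed model, but the random-effect integrals are approximated with
# adaptive Gauss-Hermite quadrature instead of the Laplace approximation
nbin_agq <- mixed_model(
    fixed  = Bats ~ Avg.Biomass + Num.Orders + Avg.Bat.Temp * Avg.Bat.Date +
                 offset(log(Survey.Nights)),   # abbreviated fixed-effects part
    random = ~ 1 | Site,
    data   = insect.data,
    family = negative.binomial(),
    iter_EM = 300)                             # as in the question's call

summary(nbin_agq)

If the random-intercept standard deviation stays essentially at zero under adaptive quadrature as well, that is some evidence the near-zero estimate is not just a Laplace artefact.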
Dimitris Rizopoulos
  • Thanks for the reply. I've been trying to install DHARMa and have run into countless install issues, so I have not been able to try that yet. Regarding your first bullet, I have Site incorporated so that I do not violate independence. See the edit for the output from rerunning with GLMMadaptive. – dwash7 Jul 17 '19 at 13:40
  • @Dimitris Rizopoulos Also, I added the interactions in an effort to fix the apparent overdispersion (using a suggestion from Hilbe 2011). – dwash7 Jul 17 '19 at 14:04
  • @dwash7 The model has not converged, because it says converged: FALSE in the output. Try setting mixed_model(..., iter_EM = 0). – Dimitris Rizopoulos Jul 17 '19 at 17:59
  • @Dimitris, I updated the above post with your suggestion and it still has not converged. – dwash7 Jul 17 '19 at 18:13
  • @dwash7 it is a bit difficult without having the data to try, but check whether iter_EM = 300 helps. – Dimitris Rizopoulos Jul 17 '19 at 18:22
  • thanks for your patience! That did the trick to fix the convergence issues (updated above) – dwash7 Jul 18 '19 at 11:10

If you are interested only in the respective impact of each variable, you may keep the Poisson specification (with robust standard errors), despite any overdispersion.

Indeed, when the assumption $E(Y|X)=e^{X\beta}$ is reasonable (which both the Poisson and the negative binomial suppose), the Poisson estimates of the $\beta$ are consistent whatever the underlying dispersion, because there is in fact no need for the hypothesis that the variance equals the mean; see Wooldridge, J. M. (1999). Quasi-likelihood methods for count data. Handbook of Applied Econometrics, 2; https://www.researchgate.net/publication/247320048_Quasi-Likelihood_Methods_for_Count_Data

A blog entry discussing this is https://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/ (please read the section beginning "Finally, I would like to tell you that everyone" and the following sections).
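A minimal sketch of this approach in R, assuming the data frame and variable names from the question and an abbreviated predictor set; the random intercept is dropped here because the sandwich estimator below applies to plain glm() fits (the sandwich and lmtest packages supply the robust covariance matrix and the corresponding tests):

library(sandwich)   # heteroskedasticity-robust (sandwich) covariance estimators
library(lmtest)     # coeftest() for z-tests with a user-supplied vcov

# Poisson quasi-ML fit: the coefficient estimates are consistent whenever
# E(Y|X) = exp(X beta) holds, even if the variance does not equal the mean
pois_fit <- glm(Bats ~ Avg.Biomass + Num.Orders + Avg.Bat.Temp * Avg.Bat.Date +
                    offset(log(Survey.Nights)),
                family = poisson, data = insect.data)

# Robust standard errors that do not rely on the Poisson variance assumption
coeftest(pois_fit, vcov = vcovHC(pois_fit, type = "HC0"))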

  • I've been using Zuur et al. (2009) and Hilbe (2011), both of which suggest NB as a solution for overdispersed count data. I've run a few quasi-Poisson models, but the newer packages don't support quasi-families, so I assumed they were inferior to using NB. Am I misunderstanding something? Even the blog seemed to conclude that NB does fine with overdispersed data. Please correct me if I am misunderstanding. – dwash7 Jul 17 '19 at 13:52
  • "the newer package" ?
  • In my understanding:
  • Poisson is theoretically a better choice to get consistent estimates of the $\beta$. The provided blog entry gets that NB does not do a bad job in one specific example. But that this question was raised by the author of the blog shows that the common sens: "Oh, there is overdispersion, I should use a NB rather than a Poisson" is not adequate. If you consider a specification of the form $E(Y|X)=e(X\beta)$ (as for NB and for Poisson) in order to get the $\beta$, your first choice should be Poisson. (...)

    – Alexandre C-L Jul 17 '19 at 14:55
  • (...) Overdispersion is not what should make you choose NB rather than Poisson for your purpose.
    • As I have the feeling (some pages are missing on Google Book) that Hilbe (2011) does not quote "Wooldridge, J. M. (1999). Distribution-free estimation of some nonlinear panel data models. Journal of Econometrics, 90(1), 77-97." in a whole book on NB models, whereas it states in the introduction that Poisson is part of NB model, I would not advice it on these issues.
    – Alexandre C-L Jul 17 '19 at 14:55