Confused on how to interpret ZINB and Hurdle models

Question

Below is a set of results from a zero-inflated model (AIC = 64992.15; BIC = 65280.78) and a hurdle model (AIC = 65141.73; BIC = 65430.36) fitted to a dataset (>18000 obs) describing the number of illicit firearms purchased in one state and recovered by law enforcement in another. Using variables for each state (e.g. GDP, restrictive firearm laws, the distance between states, whether the states share a border, and the length of that border), I want to test whether positive counts of illicit weapon flows (F) and zero F are affected by these variables, and how strong that influence is in each year between 2010 and 2017.

From what I've read so far, it seems I should interpret the ZINB as saying that, when F = 0, for every increase in distance between states the odds of seeing zero F decrease by 61%. This part makes me think I'm interpreting it completely wrong: I was originally assuming that a greater distance would cause F to drop. I now need some guidance on how to read the model.

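library(pscl)

# Zero-inflated negative binomial. With no "|" in the formula, the same
# regressors are used in the count and the zero-inflation components.
zinb <- zeroinfl(Y ~ lCTD + BL + lGDPi + lGDPj + LIi + LIj + SE1i + SE1j + SE2i + SE2j + yr,
                 data = modeldata,
                 dist = "negbin")

# Negative binomial hurdle model with a binomial (logit) zero hurdle.
hdnb <- hurdle(Y ~ lCTD + BL + lGDPi + lGDPj + LIi + LIj + SE1i + SE1j + SE2i + SE2j + yr,
               dist = "negbin",
               zero.dist = "binomial",
               data = modeldata)

# Compare the two fits by information criteria.
AIC(zinb, hdnb)
BIC(zinb, hdnb)

# Full coefficient tables (count and zero components).
summary(zinb)
summary(hdnb)

# Count-component coefficients side by side: on the log scale, then exponentiated.
fm <- list("ZINB" = zinb, "Hurdle-NB" = hdnb)
t(sapply(fm[1:2], function(x) round(x$coefficients$count, digits = 3)))
t(sapply(fm[1:2], function(x) round(exp(x$coefficients$count), digits = 3)))

# Rootograms and Q-Q plots of quantile residuals for both models.
library(countreg)
rootogram(zinb, main = "Zero-Inflated Negative Binomial", ylim = c(-15, 100), max = 25)
rootogram(hdnb, main = "Negative Binomial Hurdle", ylim = c(-15, 100), max = 25)
qqrplot(zinb, main = "Zero-Inflated Negative Binomial")
qqrplot(hdnb, main = "Negative Binomial Hurdle")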

What is the goodness of fit for these models? What is their AIC? Rootograms? What's going on with the "hurdle/zero-inflation" part of the model? Are they very similar? For the application described, zero-inflation appears more appropriate to me at first look, mostly on the assumption that the zeros correspond to situations where we wanted to measure the flow but failed to do so. — usεr11852, Mar 21 '20 at 14:10
I was thinking the ZINB was more appropriate too, but I initially thought it would be useful for comparative analysis to include both. They do appear similar; I have included Q-Q plots of quantile residuals (countreg::qqrplot) along with the count and zero components of the models. — James R., Mar 21 '20 at 15:00
Thank you for these additions. Both the QQ plots and the rootograms appear qualitatively similar to me. Maybe the Hurdle one is slightly "better" but realistically not by much. That said, why do you question your interpretation about distance? I can definitely believe that the further apart two states are, the less likely they are to have a strong flow. I think that there might be some confusion regarding the interpretation of the zero model; it is effectively a binomial model of having flow (i.e. F > 0). — usεr11852, Mar 21 '20 at 17:06
Just as an example: data("bioChemists", package = "pscl"); summary(hurdle(art ~ phd | fem + mar, data = bioChemists)); summary(glm(family = binomial, art>0 ~fem + mar, data = bioChemists)) where we can see that the zero hurdle model coefficients are effectively the same as those we get from a standard binomial GLM. — usεr11852, Mar 21 '20 at 17:08
In my mind, I thought the model was saying that it was less likely to see no flows between states as the distance between them increased. — James R., Mar 21 '20 at 17:30
Sorry for any further confusion, but with that interpretation, wouldn't that mean an increase in Border length would lead to a decrease in border flow by 98%? — James R., Mar 21 '20 at 17:33
Nope. It suggests that the odds became just marginally smaller: from there being no difference whatsoever (1) to being a tad smaller (0.98). — usεr11852, Mar 21 '20 at 17:48
Excellent, thank you so much for taking the time to explain that to me. — James R., Mar 21 '20 at 17:56
Cool. I am always happy to help. I will combine these comments into an answer later tonight so we have a coherent answer. — usεr11852, Mar 21 '20 at 18:03
Please see my answer below! I think we got confused: I started talking about the hurdle model, you referenced the ZI model, and I didn't notice it. I fully clarify things in my answer. — usεr11852, Mar 21 '20 at 22:26

usεr11852 · Accepted Answer · 2020-03-22T00:11:33.487

The two models presented appear to have QQ plots and rootograms that are qualitatively similar. The Hurdle one is slightly "better" but realistically not by much. I believe that either is "fine" and it is more a question of which fits the research question at hand better. CV.SE has a good thread on the matter: "What is the difference between zero-inflated and hurdle models?" Succinctly: in hurdle models we model the probability of having a zero versus a non-zero outcome, while in zero-inflated models we model the probability that a zero outcome is not part of the original, uninflated Negative Binomial distribution.
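Written out, the distinction might look as follows. This is only a sketch in notation that does not appear in the thread: $f(y)$ is assumed to be the Negative Binomial probability mass function, $p$ the probability of a non-zero outcome in the hurdle model, $\pi$ the probability of an excess zero in the zero-inflated model, and $z^\top\gamma$ the linear predictor of the respective zero component.

$$
\text{Hurdle:}\quad P(Y=0) = 1-p, \qquad P(Y=y) = p\,\frac{f(y)}{1-f(0)} \;\; (y>0), \qquad \operatorname{logit}(p) = z^\top\gamma
$$

$$
\text{ZI:}\quad P(Y=0) = \pi + (1-\pi)\,f(0), \qquad P(Y=y) = (1-\pi)\,f(y) \;\; (y>0), \qquad \operatorname{logit}(\pi) = z^\top\gamma
$$

Under this parameterisation a positive coefficient in the hurdle zero component raises the odds of a non-zero outcome, whereas a positive coefficient in the zero-inflation component raises the odds of an excess zero, so the two sets of coefficients are not directly comparable in sign.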

OK, so now to interpret the coefficients. In both cases the zero-inflation/hurdle coefficients come from a binomial GLM with logit link, so exponentiating a coefficient gives an odds ratio, where $1$ means no difference whatsoever.

Focusing on the border length (BL) as in the comments: in the ZI model, as BL increases it becomes marginally less likely that an observed zero is an excess zero, i.e. one that is not generated by the NB that models the counts; the odds ratio is $0.98$. In the Hurdle model, on the other hand, as BL increases it becomes marginally more likely that we have a non-zero outcome; the odds ratio is $1.005$.

Finally, to clarify the question about distance in the ZI model: as the distance between two states increases, the odds that a zero outcome is an excess zero rather than part of the NB count distribution decrease by $0.61$; indeed this seems a bit counter-intuitive to me too. I would note, though, that the ZI model effectively ties its zero-inflation baseline to the two states' GDP, so there is probably a strong confounding effect from that. More straightforwardly, the Hurdle model suggests that as the distance increases it is less likely that we have a non-zero outcome; I find this easier to explain (i.e. the probability of inter-state flow is inversely related to the distance between the states).
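As a rough illustration of reading the zero components as odds ratios, here is a minimal sketch using the bioChemists data that ships with pscl (the example from the comments), not the firearms data from the question. The object names `hd`, `zi`, `bin`, `or_hd`, `or_zi` and `pct_hd` are made up for this example, and the Poisson count distribution is simply the pscl default here.

```r
library(pscl)

data("bioChemists", package = "pscl")

# Formula layout: response ~ count regressors | zero-component regressors
hd <- hurdle(art ~ phd | fem + mar, data = bioChemists)
zi <- zeroinfl(art ~ phd | fem + mar, data = bioChemists)

# Zero-component coefficients are on the logit scale; exp() gives odds ratios.
# Hurdle: odds of a non-zero count. Zero-inflation: odds of an excess zero.
or_hd <- exp(coef(hd, model = "zero"))
or_zi <- exp(coef(zi, model = "zero"))

# Percentage change in the odds per unit increase of a regressor.
# An odds ratio of 0.98 is a 2% decrease, not a 98% decrease.
pct_hd <- 100 * (or_hd - 1)

# The hurdle zero component should match a plain binomial GLM for art > 0.
bin <- glm(art > 0 ~ fem + mar, family = binomial, data = bioChemists)

print(round(cbind(hurdle = coef(hd, model = "zero"), glm = coef(bin)), 3))
print(round(cbind(OR_hurdle = or_hd, OR_zeroinfl = or_zi), 3))
```

The same `coef(model, model = "zero")` call should work on the `zinb` and `hdnb` objects from the question, since they come from the same pscl functions. The signs in the two odds-ratio columns need not agree, because the two zero components model different events.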

Brilliant, I think that explanation is as clear as it can be given the confusing results of the distance variable in the ZINB. Thank you again. — James R., Mar 21 '20 at 23:10
