
I want to model counts of an event in a pre-post design. A sample dataset could look like this:

dat <- tibble::tibble(
  day = 1:20,
  event = c(0,0,2,5,0,10,3,0,0,0,1,3,4,0,5,0,0,2,0,10),
  group = c(rep("pre", 10), rep("post", 10))
)

In my real data there are definitely too many zeros for a Poisson process. Thus I am leaning towards fitting four models (Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial), performing model comparison, and then doing inference on the best model. However, I am not sure if my approach is valid.

This is what I would like to do:

# fit Poisson model
m1 <- glm(event ~ group, family = "poisson", data = dat)

# get AIC for m1
m1.aic <- AIC(m1)

# fit negative binomial model
m2 <- MASS::glm.nb(event ~ group, data = dat)

# get AIC for m2
m2.aic <- AIC(m2)

# fit zero-inflated negative binomial model
m3 <- pscl::zeroinfl(event ~ group | group, dist = "negbin", data = dat)

# get AIC for m3
m3.aic <- AIC(m3)

# fit zero-inflated Poisson model
m4 <- pscl::zeroinfl(event ~ group | group, dist = "poisson", data = dat)

# get AIC for m4
m4.aic <- AIC(m4)
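
# one way to do the comparison (a sketch): collect the AICs in one vector
# and sort them, with lower AIC indicating the preferred model
aics <- c(poisson = m1.aic, negbin = m2.aic, zinb = m3.aic, zip = m4.aic)
sort(aics)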

# summary of the lowest-AIC model
summary(m2)

# conclusion: no significant effect of group on the number of events

# check the predicted mean number of events per group
emmeans::emmeans(m2, ~ group, type = "response")

Do you see any problems with this? What would you do differently?

  • I think your question deals more with the question I've just posted than what I wrote in my answer, but what I wrote applies no matter how you do your model selection. – Dave Jul 26 '22 at 14:59
  • Hi, I'm just here to say that just because you have a lot of zeros doesn't mean you need a zero-inflated model; there's no such thing as "too many zeros for a Poisson process", because a Poisson process will happily spit out as many zeros as you want so long as its intensity function is suitably small. Of course, if you cannot explain the zeros with your fixed effects, then a zero-inflated model may be appropriate. – John Madden Jul 26 '22 at 19:04

2 Answers


By doing model selection before doing inference on the selected model, you are distorting your final inferences. Any inference you do on the final model, say calculating a confidence interval for a model parameter, assumes that you specified that model ahead of time and fit it, not that you cherry-picked it from several candidates and then ran the inference. If you're familiar with the issues with stepwise regression (especially 2, 3, 4, and 7), the same ideas apply.
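
A minimal simulation sketch of this idea (the data-generating choices below are arbitrary and only illustrative): simulate counts with no group effect, let AIC choose between a Poisson and a negative binomial fit, then test group in whichever model wins, and see how often that post-selection test rejects.

# simulate counts with NO group effect, let AIC pick Poisson vs negative
# binomial, then test "group" in the winning model; over many replications
# the rejection rate of that post-selection test can be compared with 5%
set.seed(1)
pvals <- replicate(500, {
  sim <- data.frame(
    group = rep(c("pre", "post"), each = 10),
    event = rnbinom(20, size = 0.5, mu = 2)  # overdispersed, same mean in both groups
  )
  tryCatch({
    mp <- glm(event ~ group, family = "poisson", data = sim)
    mn <- MASS::glm.nb(event ~ group, data = sim)
    best <- if (AIC(mp) <= AIC(mn)) mp else mn
    summary(best)$coefficients["grouppre", 4]  # p-value for the group term
  }, error = function(e) NA_real_)
})
mean(pvals < 0.05, na.rm = TRUE)  # compare with the nominal 0.05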

Dave
  • Thanks for your input! Perhaps you're interested in this Twitter discussion sparked by the same question (check replies and replies to quote tweets) – Vasco Brazao Jul 26 '22 at 14:16
  • @Dave I agree with your answer, but would you be able to propose a solution to this problem? How can OP deal with it, since he certainly needs to compare the models to select the one which has the best fit to his data? – FP0 Jul 27 '22 at 08:41
    Is there a (probably crude) method to adjust the estimates for having tested (/cherry-picked) several different models first before calculating the estimates? – user2296603 Jul 27 '22 at 11:26
  • @user2296603 That would be a wonderful topic for a new question to post. – Dave Jul 27 '22 at 11:35

In principle this is fine. The one issue is that many packages drop constants from their likelihood computations, which means the likelihoods (and hence the AICs) are no longer comparable across packages. So calculate the likelihoods by hand from the density formula to check.
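
As a sketch of that check for the Poisson and negative binomial fits above (using the m1 and m2 objects from the question):

# recompute the log-likelihoods by hand (including all constants) and
# compare them with what logLik() reports for the fitted models
ll_pois <- sum(dpois(dat$event, lambda = fitted(m1), log = TRUE))
ll_nb   <- sum(dnbinom(dat$event, size = m2$theta, mu = fitted(m2), log = TRUE))
c(by_hand = ll_pois, reported = as.numeric(logLik(m1)))
c(by_hand = ll_nb,   reported = as.numeric(logLik(m2)))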

I believe that glmmTMB and brms report full likelihoods/AIC/LOOIC for each of these families, so you could just use one of those packages rather than a collection of different ones.
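
For example, a sketch with glmmTMB (assuming its usual interface, where nbinom2 is the negative binomial family and ziformula specifies the zero-inflation part):

library(glmmTMB)

# the same four candidate models, all fit by one package so the AICs are
# computed on a common footing
f1 <- glmmTMB(event ~ group, family = poisson, data = dat)
f2 <- glmmTMB(event ~ group, family = nbinom2, data = dat)
f3 <- glmmTMB(event ~ group, ziformula = ~ group, family = poisson, data = dat)
f4 <- glmmTMB(event ~ group, ziformula = ~ group, family = nbinom2, data = dat)
AIC(f1, f2, f3, f4)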