
I want to model counts of an event in a pre-post design. A sample dataset could look like this:

dat <- tibble::tibble(
  day = 1:20,
  event = c(0,0,2,5,0,10,3,0,0,0,1,3,4,0,5,0,0,2,0,10),
  group = c(rep("pre", 10), rep("post", 10))
)

In my real data there are definitely too many zeros for a Poisson process. Thus I am leaning towards fitting four models (Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial), performing model comparison, and then doing inference on the best model. However, I am not sure if my approach is valid.

This is what I would like to do:

# fit Poisson model
m1 <- glm(event ~ group, family = "poisson", data = dat)

# get AIC for m1
m1.aic <- AIC(m1)

# fit negative binomial model
m2 <- MASS::glm.nb(event ~ group, data = dat)

# get AIC for m2
m2.aic <- AIC(m2)

# fit zero-inflated negative binomial model
m3 <- pscl::zeroinfl(event ~ group | group, dist = "negbin", data = dat)

# get AIC for m3
m3.aic <- AIC(m3)

# fit zero-inflated Poisson model
m4 <- pscl::zeroinfl(event ~ group | group, dist = "poisson", data = dat)

# get AIC for m4
m4.aic <- AIC(m4)
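
# one way to do the comparison (a sketch): collect the AICs in one vector
# and sort them, with lower AIC indicating the preferred model
aics <- c(poisson = m1.aic, negbin = m2.aic, zinb = m3.aic, zip = m4.aic)
sort(aics)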

# summary of the lowest-AIC model
summary(m2)

# conclusion: no significant effect of group on the number of events

# check the predicted mean number of events per group
emmeans::emmeans(m2, ~ group, type = "response")

Do you see any problems with this? What would you do differently?

  • I think your question deals more with the question I've just posted than what I wrote in my answer, but what I wrote applies no matter how you do your model selection. – Dave Jul 26 '22 at 14:59
  • Hi, I'm just here to say that just because you have a lot of zeros doesn't mean you need a zero-inflated model; there's no such thing as "too many zeros for a Poisson process", because a Poisson process will happily spit out as many zeros as you want so long as its intensity function is suitably small. Of course, if you cannot explain the zeros with your fixed effects, then a zero-inflated model may be appropriate. – John Madden Jul 26 '22 at 19:04

2 Answers


By doing model selection before doing inference on the selected model, you are distorting your final inferences. Any inference you do on the final model, say calculating a confidence interval for a model parameter, assumes that you specified that model ahead of time and fit it, not that you cherry-picked it from several candidates and then ran the inference. If you're familiar with the issues with stepwise regression (especially 2, 3, 4, and 7), the same ideas apply.
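
A minimal simulation sketch of this idea (the data-generating choices below are arbitrary and only illustrative): simulate counts with no group effect, let AIC choose between a Poisson and a negative binomial fit, then test group in whichever model wins, and see how often that post-selection test rejects.

# simulate counts with NO group effect, let AIC pick Poisson vs negative
# binomial, then test "group" in the winning model; over many replications
# the rejection rate of that post-selection test can be compared with 5%
set.seed(1)
pvals <- replicate(500, {
  sim <- data.frame(
    group = rep(c("pre", "post"), each = 10),
    event = rnbinom(20, size = 0.5, mu = 2)  # overdispersed, same mean in both groups
  )
  tryCatch({
    mp <- glm(event ~ group, family = "poisson", data = sim)
    mn <- MASS::glm.nb(event ~ group, data = sim)
    best <- if (AIC(mp) <= AIC(mn)) mp else mn
    summary(best)$coefficients["grouppre", 4]  # p-value for the group term
  }, error = function(e) NA_real_)
})
mean(pvals < 0.05, na.rm = TRUE)  # compare with the nominal 0.05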

Dave
  • Thanks for your input! Perhaps you're interested in this Twitter discussion sparked by the same question (check replies and replies to quote tweets) – Vasco Brazao Jul 26 '22 at 14:16
  • @Dave I agree with your answer, but would you be able to propose a solution to this problem? How can OP deal with it, since he certainly needs to compare the models to select the one which has the best fit to his data? – FP0 Jul 27 '22 at 08:41
    Is there a (probably crude) method to adjust the estimates for having tested (/cherry-picked) several different models first before calculating the estimates? – user2296603 Jul 27 '22 at 11:26
  • @user2296603 That would be a wonderful topic for a new question to post. – Dave Jul 27 '22 at 11:35

In principle this is fine. The one issue is that many packages drop constants from their likelihood computations, which means the likelihoods (and hence the AICs) are no longer comparable across packages. So calculate the likelihoods by hand from the density formula to check.
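
As a sketch of that check for the Poisson and negative binomial fits above (using the m1 and m2 objects from the question):

# recompute the log-likelihoods by hand (including all constants) and
# compare them with what logLik() reports for the fitted models
ll_pois <- sum(dpois(dat$event, lambda = fitted(m1), log = TRUE))
ll_nb   <- sum(dnbinom(dat$event, size = m2$theta, mu = fitted(m2), log = TRUE))
c(by_hand = ll_pois, reported = as.numeric(logLik(m1)))
c(by_hand = ll_nb,   reported = as.numeric(logLik(m2)))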

I believe that glmmTMB and brms report full likelihoods/AIC/LOOIC for each of these families, so you could just use one of those packages rather than a collection of different ones.
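
For example, a sketch with glmmTMB (assuming its usual interface, where nbinom2 is the negative binomial family and ziformula specifies the zero-inflation part):

library(glmmTMB)

# the same four candidate models, all fit by one package so the AICs are
# computed on a common footing
f1 <- glmmTMB(event ~ group, family = poisson, data = dat)
f2 <- glmmTMB(event ~ group, family = nbinom2, data = dat)
f3 <- glmmTMB(event ~ group, ziformula = ~ group, family = poisson, data = dat)
f4 <- glmmTMB(event ~ group, ziformula = ~ group, family = nbinom2, data = dat)
AIC(f1, f2, f3, f4)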