Theoretical justification for using a zero-inflated count model

Question

I have a theoretical question regarding the use of zero-inflated models. There are similar questions here and here, but neither answer set seems to deal with the theoretical question I am asking.

I understand that the theoretical use of zero-inflated models is that when you have a bunch of zeros in count data, it is sometimes useful to conceive of those zeros coming from two different generating processes: (a) one where there just "happened to be" zeros generated as a function of the normal count process, and (b) one where there was nothing in the count process that could have produced a count, i.e. it is impossible to have a count. The example Paul Allison gives here: "Of course, there are certainly situations where a zero-inflated model makes sense from the point of view of theory or common sense. For example, if the dependent variable is number of children ever born to a sample of $50$-year-old women, it is reasonable to suppose that some women are biologically sterile."
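
For reference, the two-process idea above is usually written as a mixture. A sketch for the Poisson case, where the symbols $\pi$ (probability that an observation comes from the "impossible" process) and $\lambda$ (mean of the ordinary count process) are notation assumed here rather than taken from any of the linked posts:

$$
P(Y = 0) = \pi + (1 - \pi)\,e^{-\lambda}, \qquad
P(Y = k) = (1 - \pi)\,\frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 1, 2, \dots
$$

Process (b) contributes the $\pi$ term to the probability of a zero, and process (a) contributes the $(1 - \pi)\,e^{-\lambda}$ term.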

My trouble with this common explanation is that I can't identify when to draw the line and use a not-zero-inflated model even when there are many zeros. For any given count variable, I can imagine a scenario likely to have been present in the data where it would be functionally impossible to get a count. With a big enough dataset, it is likely that you're going to have observations where the conditions were such that it was functionally impossible to have counts. In ecological studies where you're predicting the number of a given species found in a given location, is it not always the case that a zero might be a function of ecological conditions that make it impossible for the species to have been observed? In sociological studies where you're predicting the number of crimes someone has committed, is it not always the case that a zero could be a function of someone not having the socioeconomic or psychological conditions that would make crime functionally possible?

I'm particularly getting mixed up because theoretically, in these studies, we're predicting the presence or abundance of some phenomenon, but deciding on the model based on whether we think the phenomenon is sometimes going to be basically impossible for reasons outside of the predictors we've collected. But if I really think about it, we can probably come up with a not-so-contrived example of how counts would be impossible for a great number of count distributions with a lot of zeros in them, which seems like it would lead us to saying we should always be theoretically justified in using zero-inflated models for count data with a lot of zeros, right? Any resources or clarifying responses would be greatly appreciated.

EDIT: Re-reading everyone's responses, I'm realizing my cognitive block is that I don't conceptualize the excess-zero generating process as being usefully separated from the count generating process. For the majority of use cases I can think of, the conditions producing an excess zero aren't necessarily deterministic: even if they dramatically reduce the probability of a non-zero count, they don't actually eliminate the possibility of a count. And for zeros that were generated by the count process, I don't know where to draw the line conceptually to say that an observation with value 0, where the probability of it being 0 is really high, is not an "excess" zero. But I take @Adrian's overall feedback that there is some epistemological fun here, and that it's not a matter of "correct" as much as it's about how we want to model the phenomena.

I would frame this instead as: both models are wrong (e.g., the Poisson and the zero-inflated Poisson; neither one is exactly correct outside of a theoretical or simulated setting), one model is more complex than the other, and which model is preferable will depend on both your objective and on the data. If your objective is to make a probabilistic forecast, use a test set to choose between the two models. If your goal is parameter inference, that's another way to guide your choice. Don't get hung up on the fact that both models are wrong, even if the more complex model seems "less wrong." – Adrian, Nov 24 '23 at 19:39
Thank you for this framing: I appreciate what it does for me emotionally on this question. Could you expand on what you mean in the case when the goal is parameter inference? I'm not a data scientist, but an applied statistician, so my objective is generally trying to make the most theoretically useful choice for being able to explain a generating process, and I guess what I'm trying to work through is if my intuition that a zero-inflated model will generally feel more theoretically consistent is off-base since I'm used to theoretical dilemmas that feel more debatable than this one. — RickyB, Nov 24 '23 at 20:05
It's not the kind of "theoretical" justification you have in mind, but one could just compare the fits and see whether zero-inflation fits clearly better. https://stats.stackexchange.com/questions/118322/how-to-test-for-zero-inflation-in-a-dataset — Christian Hennig, Nov 25 '23 at 01:42
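
The suggestion in the comments above (choose between the two models on a test set, or compare the fits) can be sketched as a held-out log-likelihood comparison. This is only a minimal illustration under stated assumptions: the data are simulated, the models are intercept-only, and the helper names (`poisson_draw`, `fit_zip_em`, and so on) are hypothetical, not from any library.

```python
import math
import random

def poisson_draw(lam, rng):
    """One Poisson(lam) draw via Knuth's method (adequate for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def poisson_logpmf(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def zip_logpmf(y, pi, lam):
    """Log-probability under a zero-inflated Poisson with mixing weight pi."""
    if y == 0:
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    return math.log(1.0 - pi) + poisson_logpmf(y, lam)

def fit_zip_em(counts, iters=200):
    """Intercept-only zero-inflated Poisson fit by EM; returns (pi_hat, lam_hat)."""
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    pi_hat, lam_hat = 0.5, max(total / n, 1e-9)
    for _ in range(iters):
        # E-step: probability that an observed zero is a structural zero.
        w = pi_hat / (pi_hat + (1.0 - pi_hat) * math.exp(-lam_hat))
        # M-step: update the mixing weight and the count-process mean.
        pi_hat = n_zero * w / n
        lam_hat = total / (n - n_zero * w)
    return pi_hat, lam_hat

rng = random.Random(42)
# Assumed, simulated data: 30% structural zeros, Poisson(2) otherwise.
data = [0 if rng.random() < 0.3 else poisson_draw(2.0, rng) for _ in range(3000)]
train, test = data[:2000], data[2000:]

lam_pois = sum(train) / len(train)      # plain Poisson MLE
pi_zip, lam_zip = fit_zip_em(train)     # zero-inflated Poisson MLE

ll_pois = sum(poisson_logpmf(y, lam_pois) for y in test)
ll_zip = sum(zip_logpmf(y, pi_zip, lam_zip) for y in test)
print(f"held-out log-likelihood  Poisson: {ll_pois:.1f}  ZIP: {ll_zip:.1f}")
```

A higher held-out log-likelihood favours that model for probabilistic forecasting; it says nothing by itself about which story of the zeros is "true".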

Shawn Hemelstrand · Answer 1 · 2023-11-25T00:32:29.617

The key distinction here between using a zero-inflated model or not is already in this part of your question:

It is sometimes useful to conceive of those zeros coming from two different generating processes: (a) one where there just "happened to be" zeros generated as a function of the normal count process, and (b) one where there was nothing in the count process that could have produced a count, i.e. it is impossible to have a count.

As Adrian noted, there is no automatic way to determine whether this is "right" to do; it comes down to your best judgement about what aligns your statistical model with your theoretical one. I can give a useful example here from my own experience.

Over the past 2.5 years I have collected data on my daily habits (hours worked, time started work, etc.). One of the relationships I was interested in early on was the relationship between coffee consumption and minutes of work I generated in a day. The relationship in my data looks like this:

Ignoring some of the issues related to this dataset (measurement precision of "cups", etc.), one can see that both variables here have several zeroes, but why that is the case is quite clear from my personal experience. Having coffee each day is a fairly random draw of data. Sometimes I drink a little coffee, other times I have a lot, and sometimes I have none. There is really nothing driving that data-generating process beyond how much I feel I want to drink that day. However, the minutes of work I produce are generated by two distinct processes:

Days that I choose not to work.
Days that I choose to work.

Here there is a hard-line distinction. How I act when I'm working is not the same as when I'm on vacation or simply enjoying my weekend. As an example, I'm more likely to be around a coffee pot when I'm in the office; its proximity may prompt me to drink coffee, which in turn may motivate my work. That really isn't the case on days that I don't work.

In terms of statistical procedures for detecting heavily zero-inflated data, a common one is to compare the number of zeroes in your data to the number of zeroes predicted by your regression model. When zeroes are extremely common in your data but rarely predicted by your model, that may be good justification for using zero-inflated regression.
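
A minimal sketch of that check, under simplifying assumptions: the data are simulated rather than taken from my habit log, the "regression model" is an intercept-only Poisson, and `poisson_draw` and `zero_check` are hypothetical helper names.

```python
import math
import random

def poisson_draw(lam, rng):
    """One Poisson(lam) draw via Knuth's method (adequate for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def zero_check(counts):
    """Return (observed zeros, zeros expected under an intercept-only Poisson fit)."""
    n = len(counts)
    lam_hat = sum(counts) / n                  # Poisson MLE of the mean
    observed = sum(1 for y in counts if y == 0)
    expected = n * math.exp(-lam_hat)          # n * P(Y = 0 | lam_hat)
    return observed, expected

rng = random.Random(0)
# Assumed, simulated data: 40% structural zeros mixed with Poisson(3) counts.
inflated = [0 if rng.random() < 0.4 else poisson_draw(3.0, rng) for _ in range(2000)]
# Comparison data with no structural zeros at all.
plain = [poisson_draw(3.0, rng) for _ in range(2000)]

for name, counts in [("zero-inflated", inflated), ("plain Poisson", plain)]:
    observed, expected = zero_check(counts)
    print(f"{name}: observed zeros = {observed}, expected under Poisson = {expected:.1f}")
```

With covariates, the expected number of zeroes would instead be the sum of exp(-mu_i) over the fitted means mu_i of the individual observations.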

As for a reference, Hilbe's book Negative Binomial Regression has a sizeable chapter (Chapter 11) on the theoretical and practical underpinnings of the zero-inflated model.

Thank you for the personal example. I think what I'm getting tripped up on is how to consider the third option, which is basically all the times you maybe didn't make a strong choice or maybe intended to work but for some random reason, it just didn't happen for you. Even in the (arguably unhealthy) event that you are a person who wakes up every day intending to work, will there not always be some circumstances where independent of coffee consumption, it is functionally impossible for you to work? — RickyB, Nov 25 '23 at 18:41
I think the argument would still stand even if there wasn't a strong decision being made to work or not. Those days are still fundamentally different from days I work. The mere fact that I am not being exposed to a work environment on those days may in many ways change the nature of what I do on that day. For example, I may hate some coworker. Could not seeing them that day, chosen or not, affect how I act on a given day? Not being exposed to work pressures could influence many facets of that day, and so I still believe it's a different functional process. – Shawn Hemelstrand, Nov 25 '23 at 23:25

score 5 · Answer 2 · edited Nov 25 '23 at 00:09

I think it’s more a matter of what the scientist is trying to uncover. There’s an example in the book Statistical Rethinking of monks at a monastery, who principally make beers while working. If you fit a vanilla Poisson distribution, you might infer how many beers get produced in one week, and the number of days the monks have off is baked into this estimate. The zero-inflated Poisson approach assumes the monks have zero productivity on some days (off) and Poisson-distributed productivity on other days (working), which can still occasionally come out as zero.

If the monks will always have a certain number of days off and this is not subject to change, then use the vanilla model out of simplicity.

However, suppose you’re interested in understanding how many beers will be produced with one more day of working or one less day of working. Now the zero-inflated approach is of use.
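
To make the contrast concrete, here is a rough sketch on simulated data. The numbers (two days off per week, five beers per working day on average) and the helper names `poisson_draw` and `fit_zip_em` are assumptions for illustration only, not taken from the book.

```python
import math
import random

def poisson_draw(lam, rng):
    """One Poisson(lam) draw via Knuth's method (adequate for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def fit_zip_em(counts, iters=200):
    """Intercept-only zero-inflated Poisson fit by EM; returns (pi_hat, lam_hat)."""
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    pi_hat, lam_hat = 0.5, max(total / n, 1e-9)
    for _ in range(iters):
        # E-step: probability that an observed zero is a day off.
        w = pi_hat / (pi_hat + (1.0 - pi_hat) * math.exp(-lam_hat))
        # M-step: update the off-day probability and the working-day rate.
        pi_hat = n_zero * w / n
        lam_hat = total / (n - n_zero * w)
    return pi_hat, lam_hat

rng = random.Random(1)
# Assumed daily output: off with probability 2/7, otherwise Poisson(5) beers.
days = [0 if rng.random() < 2 / 7 else poisson_draw(5.0, rng) for _ in range(3000)]

# Vanilla Poisson: a single daily mean with the days off averaged in.
poisson_mean = sum(days) / len(days)

# Zero-inflated Poisson: off-day probability and working-day rate kept apart.
pi_hat, lam_hat = fit_zip_em(days)

weekly_now = 7 * (1 - pi_hat) * lam_hat
weekly_one_more_day = weekly_now + lam_hat   # one fewer day off per week

print(f"Poisson daily mean: {poisson_mean:.2f}")
print(f"ZIP off-day probability: {pi_hat:.2f}, working-day rate: {lam_hat:.2f}")
print(f"Expected weekly output: {weekly_now:.1f} now, "
      f"{weekly_one_more_day:.1f} with one more working day")
```

Both fits imply the same overall daily mean; only the zero-inflated fit exposes the working-day rate needed to answer the "one more day" question.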

So it’s more about context / goals than a binary right/wrong answer.

answered Nov 24 '23 at 23:56 by jbuddy_13 · edited Nov 25 '23 at 00:09 by Shawn Hemelstrand

That's strangely close to the example I gave. I'm having trouble finding that book. Do you mean the book Statistical Rethinking by McElreath? – Shawn Hemelstrand Nov 25 '23 at 00:08
Exactly! See ch 12, monsters and mixtures – jbuddy_13 Nov 25 '23 at 00:18
I do have this book but have to find it in a book, so feel free to tell me if this question is addressed there: Your suggestion to use the vanilla model in the case when the days off is consistent is my question. Even in that case, I can imagine other time-varying reasons why no beer would have been produced (e.g., broken machinery). Then, wouldn't I want that to be modeled in the zero-inflation? – RickyB Nov 25 '23 at 18:36
(typo above, meant to say "in a box") – RickyB Nov 25 '23 at 18:47
@RickyB, certainly! Anytime you want to decouple on/off from rate, a zero-inflated model is of interest. In your use case, presumably you’d want to understand how many x could be produced if you could reduce off time. – jbuddy_13 Nov 25 '23 at 22:01
Thank you for this comment, it has helped me reframe my thoughts. I'd edited the post. – RickyB Nov 29 '23 at 15:36
