I'm trying to fit count data containing many zeros with Poisson, negative binomial (NB), zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models, and to discuss which model is preferable. When I calculated the zero proportion implied by each fitted model, I found that the ZIP model reproduces the observed zero proportion exactly. So my questions are:

  1. Is there a reason for this? I would assume the ZIP model is overfitting, since its AIC is not the smallest, but I'm not sure.

  2. If the ZIP model is overfitting, is there a way I can show it? I don't think I can use a test dataset, since I only have an intercept.

    library(MASS)  # glm.nb
    library(pscl)  # zeroinfl

    arr = c(rep(0,2387), rep(1,273), rep(2,36), rep(3,3), rep(4,3))

    # Poisson
    po = glm(arr ~ 1, family = 'poisson')
    mu_po = exp(coef(po))

    # Negative binomial
    nb = glm.nb(arr ~ 1)
    mu_nb = exp(coef(nb))
    alpha_nb = 1/nb$theta

    # Zero-inflated Poisson: coef [1] is the count intercept, [2] the zero-inflation intercept
    zpo = zeroinfl(as.numeric(arr) ~ 1|1, dist = "poisson")
    mu_zpo = exp(coef(zpo)[1])
    pi_zpo = exp(coef(zpo)[2])/(1+exp(coef(zpo)[2]))

    # Zero-inflated negative binomial
    znb = zeroinfl(arr ~ 1|1, dist = "negbin")
    mu_znb = exp(coef(znb)[1])
    pi_znb = exp(coef(znb)[2])/(1+exp(coef(znb)[2]))
    alpha_znb = 1/znb$theta

    # zero proportion
    data.frame(true = sum(arr==0)/length(arr),
               poisson = exp(-mu_po),
               NB = (1+alpha_nb*mu_nb)^(-1/alpha_nb),
               ZIP = pi_zpo+(1-pi_zpo)*exp(-mu_zpo),
               ZINB = pi_znb+(1-pi_znb)*(1+alpha_znb*mu_znb)^(-1/alpha_znb))

    # AIC
    data.frame(poisson = AIC(po), NB = AIC(nb), ZIP = AIC(zpo), ZINB = AIC(znb))

1 Answer


I think it's expected that the zero-inflated Poisson model fits the original zero proportion exactly, because the $\pi$ parameter exists solely to inflate $\mathbb{P}(X=0)$ to the desired value.

The process of fitting a ZIP model can be thought of like this:

  1. Ignoring all zero observations, find the value of $\lambda$ for which a Poisson distribution truncated at zero best fits the rest of your data.
  2. Choose $\pi$ to solve $\pi + (1-\pi)e^{-\lambda} = \frac{z}{n}$, where $z$ is the number of observed zeros and $n$ is the sample size.

Step 2 will exactly fit the observed zero proportion, which matches what you saw.

This fails if the value of $\lambda$ you got from step 1 is low enough that even if you set $\pi = 0$, the model's $\mathbb{P}(X=0) = e^{-\lambda}$ is already greater than $\frac{z}{n}$, because step 2 would then require a negative $\pi$. But if that were the case, you wouldn't have observed enough zeros to justify using a zero-inflated Poisson model.
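The two-step description above can be sketched in base R on the data from the question. This is only an illustration of the heuristic, not what `pscl` does internally. It assumes that "best fits" in step 1 means the maximum likelihood fit of a zero-truncated Poisson, whose estimate solves $\lambda / (1 - e^{-\lambda}) = $ mean of the non-zero observations. The names `pos_mean`, `lambda_hat`, `pi_hat` and `p0_model` are made up for this example.

```r
# Data from the question
arr <- c(rep(0, 2387), rep(1, 273), rep(2, 36), rep(3, 3), rep(4, 3))

n <- length(arr)
z <- sum(arr == 0)               # number of observed zeros
pos_mean <- mean(arr[arr > 0])   # mean of the non-zero observations

# Step 1 (assumption: zero-truncated Poisson MLE).
# Solve lambda / (1 - exp(-lambda)) = pos_mean for lambda.
# -expm1(-lambda) is a numerically stable form of 1 - exp(-lambda).
f <- function(lambda) lambda / (-expm1(-lambda)) - pos_mean
lambda_hat <- uniroot(f, interval = c(1e-8, 50), tol = 1e-12)$root

# The failure case described above: the plain Poisson already
# predicts more zeros than were observed, so pi would be negative.
if (exp(-lambda_hat) > z / n) stop("no valid pi: too few zeros for a ZIP model")

# Step 2: choose pi so that pi + (1 - pi) * exp(-lambda) = z / n
pi_hat <- (z / n - exp(-lambda_hat)) / (1 - exp(-lambda_hat))

# Zero proportion implied by the fitted model
p0_model <- pi_hat + (1 - pi_hat) * exp(-lambda_hat)

c(lambda = lambda_hat, pi = pi_hat, model_zero = p0_model, observed_zero = z / n)
```

By construction `model_zero` and `observed_zero` agree up to floating-point error. If the heuristic matches the maximum likelihood fit, `lambda_hat` and `pi_hat` should be close to `mu_zpo` and `pi_zpo` from the question's `zeroinfl` fit.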

We can check the algebra directly for maximum likelihood estimators. Wikipedia gives equations satisfied by the maximum likelihood estimators for $\pi$ and $\lambda$:

$$\hat{\pi}_{ml} = 1 - \frac{m}{\hat{\lambda}_{ml}}$$

$$m\left(1 - e^{-\hat{\lambda}_{ml}}\right) = \hat{\lambda}_{ml}\left(1 - \frac{n_0}{n}\right)$$

where $m$ is the sample mean and $\frac{n_0}{n}$ is the observed proportion of zeros.

If you plug these into the PMF at zero, $P(X = 0) = \pi + (1-\pi)e^{-\lambda}$, you get $\frac{n_0}{n}$, i.e. the maximum likelihood estimators exactly reproduce the observed proportion of zeros, as expected.
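Written out, the substitution goes as follows. This uses only the two estimating equations quoted above: the first gives $1 - \hat{\pi}_{ml} = m / \hat{\lambda}_{ml}$, and the second, divided by $\hat{\lambda}_{ml}$, gives the last step.

$$\begin{aligned}
\hat{\pi}_{ml} + (1 - \hat{\pi}_{ml})\,e^{-\hat{\lambda}_{ml}}
&= 1 - (1 - \hat{\pi}_{ml})\left(1 - e^{-\hat{\lambda}_{ml}}\right) \\
&= 1 - \frac{m}{\hat{\lambda}_{ml}}\left(1 - e^{-\hat{\lambda}_{ml}}\right) \\
&= 1 - \left(1 - \frac{n_0}{n}\right) = \frac{n_0}{n}.
\end{aligned}$$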

See also An Illustrated Guide to the Zero Inflated Poisson Regression Model.

fblundun
  • Thank you so much for the answer. I have a follow-up question. I understand your explanation for ZIP, and I assume fitting a ZINB model would also follow the two steps: 1. fit NB to the non-zero values, 2. fit the zero proportion. However, ZINB is a little worse at "predicting" the zero proportion, or at least it does not match the original exactly. I'm confused about this. Also, I think ZINB should be more flexible than ZIP, so shouldn't it perform better than ZIP at modeling the zero proportion? Thanks in advance! – Kathleen Dec 17 '20 at 20:22
  • I think I just figured it out! For ZINB, the NB part will account for some of the zeros, so there is still a difference between the zero proportion the ZINB model gives and the original one. Thanks again! – Kathleen Dec 17 '20 at 22:19
  • Are you sure that explains it? The non-zero-inflated Poisson distribution also has some zeros of its own. I do find it surprising that the ZINB model doesn't exactly predict the sample zero proportion. – fblundun Dec 17 '20 at 22:29
  • As far as I can tell the relevant code is here, if you wanted to dig in to what pscl is doing. – fblundun Dec 17 '20 at 22:34
  • I'm not sure actually, but this is the only reason I can think of. Also, NB is used to model data with a lot of 0s: Do We Really Need Zero-Inflated Models?. But I'll also check the code. Thanks again! – Kathleen Dec 17 '20 at 22:56
  • @Kathleen I made an edit showing that the maximum likelihood estimators for $\pi$ and $\lambda$ are expected to exactly fit the observed zero proportion. Is it possible that your ZINB model isn't using maximum likelihood estimation? – fblundun Dec 21 '20 at 23:04
  • Do you mind showing me the derivation? I think both ZIP and ZINB are using the same algorithm. I still feel it's the NB part that generates the extra zeros. Here's a small simulation I did that I think may prove this. – Kathleen Dec 22 '20 at 03:30