5

I have two zero-inflated datasets such as,

dt1= 0, 0.1, 0.125, 0, 0, 1.25... 
dt2= 1.01, 0, 0, 0.25, 0,...

I want to test whether they differ, for instance with something like t.test. How can I compare these two datasets?

Stephan Kolassa
  • 123,354
  • 1
    A minor point about terminology: I would argue that data themselves are not zero-inflated in the same way that data are not "non-parametric". Zero-inflation is a term that has meaning with respect to a certain model such as the Poisson. – COOLSerdash Jul 29 '20 at 09:26

2 Answers

4

You can probably use the standard $t$ test to compare the means of zero inflated datasets. Unless you have a specific reason to assume equal variances, I would use a $t$ test that does not assume them (Welch's $t$ test).
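A minimal sketch of such a comparison, assuming two simulated groups rather than the asker's actual data. The helper `r_zero_gamma`, the zero probabilities, and the gamma parameters are all illustrative assumptions. In R, `t.test()` already defaults to `var.equal = FALSE`; it is written out here only to make the choice visible.

```r
set.seed(1)
n <- 100

# Hypothetical helper: zero with probability p_zero,
# otherwise a draw from a gamma distribution
r_zero_gamma <- function(n, p_zero, shape, rate) {
  is_zero <- runif(n) < p_zero
  ifelse(is_zero, 0, rgamma(n, shape = shape, rate = rate))
}

# Two made-up groups that differ only in their share of zeros
dt1 <- r_zero_gamma(n, p_zero = 0.8, shape = 2, rate = 2)
dt2 <- r_zero_gamma(n, p_zero = 0.7, shape = 2, rate = 2)

# Welch's t test: does not assume equal variances
welch <- t.test(dt1, dt2, var.equal = FALSE)
print(welch)
```

The printed result reports the Welch-adjusted degrees of freedom along with the estimated group means and a confidence interval for their difference.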

As an illustration, let's simulate some zero inflated data, where $X=0$ with probability $0.8$ and $X\sim\Gamma(2,2)$ otherwise, like this:

[Figure: histogram of a sample zero inflated dataset]

Even with such a high amount of zero inflation, the mean of a sample of size $n=100$ is approximately normally distributed, which is what the $t$ test requires:

[Figure: histogram of the simulated sample means]

You may want to bootstrap the means within each group, plot them, and check by eye whether the histogram looks approximately normal.
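One way the bootstrap check could look for a single group is sketched below. The data `x` are simulated stand-ins (80% zeros, gamma otherwise), and the number of resamples is an arbitrary choice; with real data you would replace `x` by each group in turn.

```r
set.seed(42)

# Stand-in data (an assumption): 80% zeros, Gamma(shape = 2, rate = 2) otherwise
x <- ifelse(runif(100) < 0.8, 0, rgamma(100, shape = 2, rate = 2))

n_boot <- 2000
# Resample the group with replacement and record the mean of each resample
boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

# Visual checks: a histogram and a normal Q-Q plot of the bootstrapped means
hist(boot_means, main = "Bootstrap distribution of the mean", xlab = "")
qqnorm(boot_means)
qqline(boot_means)
```

A clearly skewed histogram, or a Q-Q plot that bends away from the line, would suggest that the sample is too small for the normal approximation to be comfortable.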

R code:

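n_sims <- 1e5      # number of simulated datasets
n_sample <- 100    # observations per dataset
means <- rep(NA_real_, n_sims)
for ( ii in seq_len(n_sims) ) {
    set.seed(ii)    # for reproducibility
    zeros <- runif(n_sample) < 0.8    # TRUE marks a zero, with probability 0.8
    # nonzero entries come from a gamma distribution with shape 2 and rate 2
    foo <- c(rep(0, sum(zeros)), rgamma(sum(!zeros), 2, 2))
    means[ii] <- mean(foo)
}
# foo now holds the last simulated dataset
hist(foo, main="Sample zero inflated dataset", xlab="")
hist(means, xlab="")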

Whether such a comparison of means is useful and informative in the context of zero inflation is a different question. Consider also comparing the proportion of zeros, or fitting appropriate mixture models and comparing the respective components.
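As a rough illustration of looking at the components separately, the sketch below summarises each group by its share of zeros and by the location of its nonzero values, then compares the nonzero parts with Welch's $t$ test. This is a crude stand-in for a fitted mixture model, not a replacement for one. The data and the helper `two_part_summary` are hypothetical.

```r
set.seed(2)

# Hypothetical groups (assumed for illustration only)
dt1 <- ifelse(runif(200) < 0.8, 0, rgamma(200, shape = 2, rate = 2))
dt2 <- ifelse(runif(200) < 0.6, 0, rgamma(200, shape = 2, rate = 1))

# Describe a group by its two components: the zeros and the nonzero values
two_part_summary <- function(x) {
  nonzero <- x[x != 0]
  c(prop_zero      = mean(x == 0),
    mean_nonzero   = mean(nonzero),
    median_nonzero = median(nonzero))
}

summary_table <- rbind(dt1 = two_part_summary(dt1),
                       dt2 = two_part_summary(dt2))
print(round(summary_table, 3))

# Compare only the nonzero components (Welch's t test)
nonzero_test <- t.test(dt1[dt1 != 0], dt2[dt2 != 0])
print(nonzero_test)
```

Note that the nonzero subsamples can be small when zeros dominate, so the normal approximation deserves a second look there.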

Stephan Kolassa
  • 123,354
  • 2
    Would the downvoter be so kind as to explain what about my answer is not useful? – Stephan Kolassa Jul 29 '20 at 10:09
  • Hi, Stephan. Could you explain why mean comparison wouldn't be useful and informative in this context? What would you propose instead? – Parseval Dec 02 '21 at 12:12
  • @Parseval: it depends on what question you are interested in. If your zero inflated dataset is 90% zeros, then the overall mean is very much dominated by this, and you may be more interested in the mean of the nonzero entries. Or in quantiles. For instance, your data may be responses to some marketing campaign, where most targets do not respond at all, and you are more interested in the actual responses (= nonzero purchases, clicks or whatever). – Stephan Kolassa Dec 02 '21 at 12:17
  • 1
    Indeed that is my case. An A/B test between two groups. Group A has been exposed to a marketing campaign and group B has not. I want to find if there is any difference in their total spending as a result of the campaign that lasted a given period of time. Obviously both groups contain lots of zeros since the majority of the customers place zero orders (many are one time purchasers). – Parseval Dec 02 '21 at 12:21
  • @Parseval: exactly. So you might also be interested in comparing the proportion of zeros (non-responders) between the two groups. – Stephan Kolassa Dec 02 '21 at 12:56
  • What is the name of such a test? Or do I only do a permutation test and construct a distribution of the proportions of 0 and check where the observed distributions land? – Parseval Dec 02 '21 at 13:39
  • @Parseval: I don't know of a specific test on the proportions of zeros. You could do a simple $\chi^2$ test on a table of zeros vs. non-zeros: https://en.wikipedia.org/wiki/Pearson's_chi-squared_test. A permutation test would be a reasonable alternative. – Stephan Kolassa Dec 02 '21 at 15:05
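Both options from the last comment could be sketched roughly as follows: a $\chi^2$ test on the 2 x 2 table of zeros vs. non-zeros, and a permutation test on the difference in the proportion of zeros. The data, the number of permutations, and the helper `diff_prop` are assumptions made for illustration.

```r
set.seed(3)

# Hypothetical groups (not the data discussed in the comments)
dt1 <- ifelse(runif(200) < 0.8, 0, rgamma(200, shape = 2, rate = 2))
dt2 <- ifelse(runif(200) < 0.7, 0, rgamma(200, shape = 2, rate = 2))

# Chi-squared test on the 2 x 2 table of zeros vs. non-zeros
zero_table <- rbind(dt1 = c(zero = sum(dt1 == 0), nonzero = sum(dt1 != 0)),
                    dt2 = c(zero = sum(dt2 == 0), nonzero = sum(dt2 != 0)))
chisq_result <- chisq.test(zero_table)

# Permutation test: shuffle the group labels and recompute
# the difference in the proportion of zeros each time
is_zero <- c(dt1, dt2) == 0
group <- rep(c("dt1", "dt2"), times = c(length(dt1), length(dt2)))
diff_prop <- function(z, g) mean(z[g == "dt1"]) - mean(z[g == "dt2"])

observed <- diff_prop(is_zero, group)
n_perm <- 5000
permuted <- replicate(n_perm, diff_prop(is_zero, sample(group)))

# Two-sided p-value; the +1 keeps it from being exactly zero
perm_p <- (sum(abs(permuted) >= abs(observed)) + 1) / (n_perm + 1)

print(chisq_result$p.value)
print(perm_p)
```

With samples of this size the two p-values will usually be close; the permutation version is mainly attractive when the counts in the table are small.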
0

Instead of getting the distribution of the mean (from bootstrap samples), would it be more appropriate to treat the skewed distribution in two parts, and get the distribution of P(data > 0) * median(data after removing zeros)?
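For concreteness, the statistic proposed here could be bootstrapped along the following lines. This is only a sketch of the idea, not a claim that it is the better choice. The data are simulated, `two_part_stat` is a hypothetical helper, and returning 0 for a resample with no positive values is a convention assumed here.

```r
set.seed(4)

# Stand-in data (an assumption): 80% zeros, Gamma(shape = 2, rate = 2) otherwise
x <- ifelse(runif(100) < 0.8, 0, rgamma(100, shape = 2, rate = 2))

# Proposed statistic: share of positive values times the median of the positive values
two_part_stat <- function(x) {
  positive <- x[x > 0]
  if (length(positive) == 0) return(0)  # assumed convention for all-zero resamples
  mean(x > 0) * median(positive)
}

# Bootstrap distribution of the statistic
boot_stat <- replicate(2000, two_part_stat(sample(x, replace = TRUE)))

hist(boot_stat, main = "Bootstrap distribution of P(x > 0) * median(x | x > 0)", xlab = "")
print(quantile(boot_stat, c(0.025, 0.975)))  # percentile interval
```

Computing the same bootstrap distribution for each group would let you compare the two, for example via the overlap of the percentile intervals.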

  • This seems like it is better left as a comment than an answer (as the solution to their problem here isn't clear and could perhaps use some elaboration/citations to support your point). Of course you do not have enough reputation yet to do that, but regardless this is not adequate (in my opinion) as an answer to the query. – Shawn Hemelstrand Jan 25 '24 at 03:25