When is it appropriate to keep/remove zero values from a dataset (one-way ANOVA)?

Question

It has been a few years since I've had to use the stats I learned in school and I'm not sure what the right approach to this is anymore.

In my data set I have 4 different groups (tomato plants that receive different products). I would like to look at the number of fruit set between the groups using a one-way ANOVA. Fruit is harvested once a week when ripe, so this means some plants have "0" fruits in a given week. The zero does not necessarily mean the plant has no fruit at all, just that there were no ripe tomatoes ready to be picked that day.

Is it okay for me to remove these zero values (meaning each group might end up with a different sample size), or is it better to keep them in? Including zeros, N = 50 for each of the 4 treatments, if that is important. Thank you!

It sounds like you are modeling ripe tomatoes and not the presence of fruit then. — StatsStudent, Aug 30 '20 at 23:39
You are right, thank you for pointing that out. In that case, all zero values should be kept, correct? — Elizabeth, Aug 31 '20 at 00:09
What information do you want to gain from the ANOVA? What descriptive analyses have you already conducted for your data set? Testing such a hypothesis is usually surprisingly simple with the right permutation test. No need for complicated models with zillions of assumptions. I think you should not delete zeros, but I might be wrong on this point. — Michael M, Aug 31 '20 at 17:10

Stefan · Answer 1 · 2020-08-31T15:16:58.827

Since you are counting the number of fruits, ANOVA is not the best way to analyze this, as it assumes the data are normally distributed with group mean $\mu_i$ and constant variance $\sigma^2$, that is, $y_i \sim N(\mu_i,\sigma^2)$. Your counts, however, are non-negative (zeros included) and discrete (i.e. integers). The normal distribution supports continuous data from $-\infty$ to $+\infty$, which clearly isn't the best fit for your data. Furthermore, another assumption of ANOVA is that the data represent independent observations. Since you are measuring each tomato plant weekly, there is a clear dependency structure in your dataset.
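To make the mismatch concrete, here is a minimal sketch using only Python's standard library. The counts are made up for illustration (they are not your data); the point is only to show how much probability a normal distribution with the same mean and standard deviation would place on impossible negative counts.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical weekly ripe-fruit counts for one treatment (illustrative only).
counts = [0, 0, 1, 0, 2, 3, 0, 1, 0, 4, 1, 0, 2, 0, 1]

mu = mean(counts)
sigma = stdev(counts)  # sample standard deviation

# Probability mass that a normal distribution with this mean and SD
# assigns to values below zero, which a count can never take.
p_negative = NormalDist(mu, sigma).cdf(0)

print(f"mean = {mu:.2f}, sd = {sigma:.2f}, P(Y < 0) under normal = {p_negative:.3f}")
```

With low counts and many zeros, a noticeable share of the fitted normal distribution sits below zero, which is one sign that the normal model describes such data poorly.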

My suggestion would be a generalized linear mixed model (GLMM) using a Poisson or negative binomial distribution. The former is a good approach for count data that also include some zeros; the latter could be a choice if your data contain more zeros (or more variability in general) than the Poisson distribution can account for.
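A rough way to get a feel for whether the Poisson distribution is plausible is to compare the observed share of zeros with the share a Poisson distribution with the same mean would imply, $e^{-\bar{y}}$, and to look at the variance-to-mean ratio (about 1 under Poisson). The sketch below uses only Python's standard library; the function name `poisson_zero_check` and the counts are hypothetical, and the check pools all observations while ignoring treatment and plant effects, so treat it as a first descriptive look rather than a model diagnostic.

```python
import math
from statistics import mean, variance

def poisson_zero_check(counts):
    """Compare observed zeros and dispersion with what a Poisson model implies."""
    m = mean(counts)
    if m == 0:
        raise ValueError("all counts are zero; nothing to compare")
    observed_zero_frac = sum(1 for c in counts if c == 0) / len(counts)
    expected_zero_frac = math.exp(-m)   # P(Y = 0) for a Poisson with mean m
    dispersion = variance(counts) / m   # roughly 1 if the data are Poisson-like
    return observed_zero_frac, expected_zero_frac, dispersion

# Hypothetical weekly counts (illustrative only).
counts = [0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 4, 7, 0, 0, 5, 0, 6, 0, 0, 3]

obs, exp, disp = poisson_zero_check(counts)
print(f"observed zeros: {obs:.2f}, Poisson-implied zeros: {exp:.2f}, "
      f"variance/mean: {disp:.2f}")
```

If the observed share of zeros is far above the Poisson-implied share, or the variance-to-mean ratio is well above 1, that points toward a negative binomial (or similar) model instead of a plain Poisson.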

Using a GLMM might feel a bit overwhelming at first but there is a lot of information on such models. One helpful paper may be Bolker et al. (2009): Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology and Evolution, 24, 127-135.

Another good intro to mixed modeling can be found here (using the R statistical software package): https://ourcodingclub.github.io/tutorials/mixed-models/

Here is another practical example (using R) of a generalized linear model comparing Poisson and negative binomial fits (note that it is not a generalized linear mixed model).

Thank you for your in-depth reply and attached resources. I appreciate this a lot and will look into it. — Elizabeth, Aug 31 '20 at 02:42
Poisson and negative binomial regression also have zero-inflated variants @Elizabeth — Dave, Jun 15 '21 at 19:12
