
Initially, this was going to be a simple analysis of variance (ANOVA) with a categorical predictor (4 levels) and a non-negative continuous response variable (number of aphids on leaves from four branches of each treated plant).

The data are highly zero-inflated. The zeros are true data points (not missing or NAs). Although the frequency histogram makes the data look like integers, they are not counts and range from 0 to 15.89.

I am interested in determining whether the treatment (3 different insecticides and 1 untreated control) has an effect on the number of aphids per leaf, i.e., whether the means of the different treatment groups are equal to each other or not.

Are there any suggestions for appropriate transformations? Ideally, a zero-inflated beta regression would solve the issue, but for my current timeline and experience with statistical analyses and R, it's not feasible at this time.

[Figure: highly zero-inflated frequency histogram of the response variable]

I have tried many transformations. [Figure: histogram of log1p (i.e., log(x + 1)) of the response variable]

Since the response variable, aphids per leaf (#aphids/#leaves), is calculated from count data (total number of aphids and total number of leaves observed), I also tried a Poisson regression GLM with aphid number as the response variable and the leaf count as an offset term, using the following code:

library(car)  # for Anova()

fma <- glm(aphidnumber ~ insecticide + offset(log(leaves)),
           data = aphid, family = poisson(link = "log"))
summary(fma)
Anova(fma)
E10
  • Welcome to CV, E10. What is your research question? Answering that might help prospective answerers zero in on an analysis that works for you. – Alexis Jul 01 '22 at 02:10
  • Thank you! I'm working on learning how to write a good question on here. I hope the additional information is helpful. – E10 Jul 01 '22 at 02:28
  • What's the unit of measurement? Money, volume, ...? I'd suggest a log1p transform to check if this is marginally zero-inflated lognormal. Last but not least, what question are you trying to answer? – Georg M. Goerg Jul 01 '22 at 02:50
  • Hi! The measurement is aphids per leaf. The sampling method was randomly selecting 4 branches on each tree, counting the number of leaves and counting the number of aphids on those leaves. (So the response variable being analyzed is calculated from count data.) I'm trying to determine the efficacy of the insecticide treatments against this specific aphid. Just to confirm, log1p() is the natural log transformation in R? – E10 Jul 01 '22 at 03:49
  • Hi @E10 -- I slightly edited your question but may not have quite got the sampling part correct (it may not matter for your question: I think the key point is that you've got some version of multiple leaves per treated plant, and you're averaging those to get per-plant estimates?) – James Stanley Jul 01 '22 at 04:42
  • Hi James, thanks for your edits. They were not averaged; I hope my edits to your edits are clear. For example: 4 branches are randomly selected, there were 68 leaves total, and a total of 13 aphids were counted on those 68 leaves. The final data point for that tree would be total aphids/total leaves (13/68 = 0.191 aphids per leaf) – E10 Jul 01 '22 at 04:56
  • @Georg 'model1<-lm(log1p(aphidleaf) ~ insecticide, data = aphid.d1) summary(model1)' – E10 Jul 01 '22 at 06:23
  • Log1p is just log(1+x) so that 0 maps to 0 again. Can you show a histogram of that? Also the summary(model) results don't show in comments. Can you add to original post? – Georg M. Goerg Jul 02 '22 at 00:01
  • @GeorgM.Goerg, I tried a bunch of transformations, even log(smallest y/2 + x). Histograms and Shapiro-Wilk tests were nowhere close to normal. I'll add a histogram example to the post (thanks for the suggestion). I'm now looking into a nonparametric test. I know Kruskal-Wallis will be appropriate for my question, but I did happen to find a zero-inflated Kruskal-Wallis: https://github.com/chvlyl/ZIR – E10 Jul 02 '22 at 01:27
  • Actually I was looking for a log1p plot of the original data, not the residuals. If that histogram looks like zero-inflated normal (after log1p), then I recommend https://github.com/google/lifetime_value – Georg M. Goerg Jul 02 '22 at 04:00
  • @GeorgM.Goerg edits made according to your comments. logp1<-log((aphidleaf)+1) hist(logp1) – E10 Jul 02 '22 at 04:45

1 Answer


Your idea to work with logs here is heading in the right direction, but there's a better way to do it.

It's generally best to work as close as possible to the original data when modeling. Your original data are counts, so you should model them as counts with an appropriate generalized linear model: e.g., Poisson, quasi-Poisson, or negative binomial. A log link is typically used for count data.

You account for the number of leaves with an offset() term in your model for the count on each plant; with a log link that's offset(log(numberOfLeaves)). That enforces a regression coefficient of 1 for the offset. Similar offset terms are used when counts are collected over different periods of time or over different extents of area.
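As a minimal sketch (assuming a data frame `aphid` with the `aphidnumber`, `insecticide`, and `leaves` columns from your question), you can see the difference between fixing that coefficient at 1 via the offset and letting it be estimated freely as an ordinary covariate:

# Offset form: the coefficient on log(leaves) is fixed at 1
fit_offset <- glm(aphidnumber ~ insecticide + offset(log(leaves)),
                  data = aphid, family = poisson(link = "log"))

# Covariate form: the slope for log(leaves) is estimated freely
fit_free <- glm(aphidnumber ~ insecticide + log(leaves),
                data = aphid, family = poisson(link = "log"))

coef(fit_offset)  # no coefficient reported for the offset term
coef(fit_free)    # includes an estimated slope for log(leaves)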

Adapting part of the above-linked answer to your situation, with $\lambda$ being the mean predicted counts, a log link (so you are predicting $\ln(\lambda)$), and the above offset, your model is fundamentally based on the following linear predictor:

$$ \ln(\lambda) = \beta_0 + \beta_1 X + 1\times \ln(\text{numberOfLeaves}) $$

Here, $\beta_0$ is the intercept for the reference "treatment", which we will take to be the control; $\beta_1$ is a 3-element coefficient vector for differences in outcome from the reference for the treatments; and $X$ is the corresponding vector of treatment dummy (indicator) variables.

To get the predicted infestation per leaf, note that when numberOfLeaves = 1, then ln(1) = 0. The intercept $\beta_0$ is thus $\ln(\lambda)$ for 1 leaf at the reference (control) condition (all $X_i$ = 0), and the other coefficients are the differences from control in $\ln(\lambda)$ for 1 leaf under each of the treatments. Thus those coefficients are the values you desire per leaf, in the log scale of the model.
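As a sketch of how you might get back to aphids per leaf from such a fit (using the `fma` model from your question, and assuming `insecticide` is stored as a factor):

# Exponentiating the coefficients gives the control rate per leaf
# (intercept) and rate ratios for each insecticide vs. control
exp(coef(fma))

# Or predict expected aphids per leaf directly by setting leaves = 1
newdat <- data.frame(insecticide = levels(aphid$insecticide), leaves = 1)
predict(fma, newdata = newdat, type = "response")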

These are very standard statistical methods, with many pages discussing them on this Cross Validated web site and with worked-through examples using several statistical software packages on this UCLA web site.

Thoughts on zero inflation

You might be over-thinking the zero inflation problem. Having a lot of zeros does not necessarily mean you have zero inflation if the count numbers are small. If the treatments are effective, then treated plants might well have many leaves with 0 aphids. The zero inflation needs to be evaluated in the context of each treatment condition individually; the plots you show seem to combine all treatments and control together.

Don't forget that with a zero-inflated model you need to specify models for both the zero-inflation part and the count-based part. It's not clear to me how you would properly model the zero-inflation part here, or even that you need such a model. Over-dispersion, due for example to un-modeled effects on infestation, might need to be handled with a quasi-Poisson or negative binomial model; over-dispersion itself can contribute to apparent zero inflation.
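One hedged way to check this in R (reusing the `fma` Poisson fit from your question; `MASS` ships with standard R installations) is to compare observed zeros against those expected under the Poisson fit and to look at the estimated dispersion:

library(MASS)  # for glm.nb()

# Observed zeros vs. zeros expected under the fitted Poisson model
mu <- fitted(fma)
c(observed = sum(aphid$aphidnumber == 0),
  expected = sum(dpois(0, lambda = mu)))

# Quasi-Poisson: same mean structure, dispersion estimated instead of fixed at 1
fqp <- glm(aphidnumber ~ insecticide + offset(log(leaves)),
           data = aphid, family = quasipoisson(link = "log"))
summary(fqp)$dispersion  # values well above 1 suggest over-dispersion

# Negative binomial alternative with the same offset
fnb <- glm.nb(aphidnumber ~ insecticide + offset(log(leaves)), data = aphid)

If the observed and expected zero counts are similar within each treatment group, a plain Poisson or negative binomial model may be all you need.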

Thoughts on multiple measurement dates

If you have multiple measurements on the same plants over time, then your model needs to take the within-plant correlations into account. The observations are then not all independent.

A mixed model treating the plants as random effects can do that and allow for estimated differences among plants in baseline susceptibility to infestation. Such differences among plants, with some plants inherently resistant, might account for apparent "zero inflation" when you pool all the values together in your plot.

If you go back to the same branches or leaves on each measurement date, then you should set up a hierarchical model that accounts for leaves within branches within plants.

Doing a separate model for each date is not an efficient use of your data. That also can run into issues with multiple comparisons. You can include the measurement date as a categorical predictor, or use a regression spline to model smooth changes over time. To account for differences over time among treatments/control, you probably should include interaction terms between treatments and measurement date.
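As a minimal sketch of such a model with `lme4` (the column names `plant` and `date` are placeholders for whatever identifies each tree and each sample date in your data, both treated as factors here):

library(lme4)

# Poisson GLMM: treatment-by-date fixed effects, a random intercept for each
# plant, and the same log(leaves) offset as before
fmm <- glmer(aphidnumber ~ insecticide * date + offset(log(leaves)) +
               (1 | plant),
             data = aphid, family = poisson(link = "log"))
summary(fmm)

If over-dispersion remains a problem, glmer.nb() in the same package fits the negative binomial analog.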

EdM
  • I think your approach is very helpful. Would there be any issues with switching to a nonparametric test given all the hurdles, and using zero-inflated Kruskal-Wallis instead? https://github.com/chvlyl/ZIR – E10 Jul 02 '22 at 03:53
  • Additionally, I gave this a try: fma <- glm(aphidnumber ~ insecticide + offset(log(leaves)), data = aphid, family = poisson(link = "log")); summary(fma); Anova(fma); autoplot(fma, which = 1:2). I'll post the Q-Q plot in the original post. There is a zero-inflated Poisson (ZIP) for lots of zeros; that would probably be better. – E10 Jul 02 '22 at 06:58
  • @E10 the normal Q-Q plot you produced with autoplot() isn't appropriate for count data, as the residuals aren't expected to be normally distributed. The UCLA web page linked at the end of the answer illustrates negative binomial and zero-inflated Poisson or negative binomial models. You could probably get by with a non-parametric test. You also might consider a mixed model, as you have a nesting structure of leaf within branch within plant. – EdM Jul 02 '22 at 12:06
  • I used the UCLA page and worked through all of those different models, and followed up to test over-dispersion for each of the model examples. Is there any way to check my coding? I feel like I'm leaving something out. – E10 Jul 02 '22 at 15:37
  • @E10 the examples there are oriented toward technical implementation (e.g., in R) and might not be good examples of real-life over-dispersion. Beyond that, much depends on what you mean by "I feel like I'm leaving something out." Strictly coding-specific questions are off-topic on this site and are best posted to Stack Overflow. See this search on that site, for example. Whether a particular coding captures the statistical analysis you wish to perform can be on-topic here. – EdM Jul 02 '22 at 16:00
  • I'm just not quite certain how to get back to the response variable aphids per leaf. Is this because the offset creates the ratio? I'm just not entirely sure what is happening "behind the scenes" with the code. I've also tried ZINB: M4 <- zeroinfl(aphidnumber ~ insecticide + offset(log(leaves)), dist = 'negbin', data = aphid). – E10 Jul 02 '22 at 17:14
  • I'm concerned with the interpretation, because these GLM methods I have tried use the entire data set. Within the data, there are 8 sample dates. I'm not interested in the difference in the response variable over time (as in repeated measures), but was hoping to compare the treatment groups within each sample date (i.e., running a one-way ANOVA for each sample date, before knowing the distribution of the data, or a Kruskal-Wallis test for each date). – E10 Jul 02 '22 at 17:18
  • @E10 I've edited the answer to address your last 2 comments and to say a bit about the possibility of zero inflation. – EdM Jul 02 '22 at 18:38
  • Thank you so much. I'm definitely stressing about the zeros since I've tried so many things over the last month. I think I'd like to go with the nonparametric Kruskal-Wallis test and report the findings, then consider more robust models for the published manuscript. I just don't have much experience with these situations. – E10 Jul 02 '22 at 18:59