
I have the following dataset:

df <- structure(list(
  id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L,
                   4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L, 2L, 2L),
                 .Label = c("A", "B", "C", "D", "E"), class = "factor"),
  var1 = c(0.251249939, 0.243318364, 0.11978431, 0.174523607, 0.108485416,
           0.296143562, 0.429010987, 0.013022964, 0.017849743, 0.023047687,
           0.288872927, 0.352767229, 0.206665024, 0.260082066, 0.303004652,
           0.318304062, 0.283759445, 0.649184883, 0.555413425, 0.593164206,
           0.216715872, 0.189841628, 0.452209592, 0.184777394, 0.183090359,
           0.25703001),
  var2 = c(0.10967546, 0.045433734, 0.03156868, 0.065526947, 0.024798298,
           0.114657596, 0.345043868, 0.0018319, 0, 0, 0.036639825,
           0.092710786, 0.072756045, 0.056428008, 0.096712127, 0.084123246,
           0.067709602, 0.042034484, 0.060061432, 0.165271029, 0.039985284,
           0.041120779, 0.04379772, 0.068814874, 0.056523118, 0.125201181),
  var3 = c(5.840212345, 4.281783104, 1.757990956, 2.934711695, 2.12902379,
           3.956520319, 7.857996941, 0.667544842, 0.781137168, 0.910673857,
           5.603694439, 6.384156227, 3.304251909, 4.735065937, 5.320442677,
           5.131484509, 6.590584278, 14.44481945, 7.413087368, 9.481782913,
           4.40541172, 3.100687265, 7.982603073, 3.759469509, 3.771872759,
           5.209916592)),
  .Names = c("id", "var1", "var2", "var3"),
  class = "data.frame", row.names = c(NA, -26L))

These are mass fractions (in percent) of certain organic compounds found in soils. The data have some problems: the design is not well balanced, one ID (ID = C) contains only three samples, and some of its values are very low or even 0, i.e. below the machine's detection limit (var2, var3). I want to know whether there are differences between IDs for the given compounds (var1, var2, var3).

require(tidyr)
require(ggplot2)

df %>% gather(key = "key", value = "value", -id) -> df_g

ggplot(df_g, aes(id, value)) +
  geom_point() +
  facet_wrap(~key, scales = "free") +
  labs(x = "ID", y = "Mass fraction %")

[Figure: dot plots of var1, var2, and var3 by ID (x = ID, y = mass fraction %)]

So given the data there are a few approaches I can come up with:

# OLS model
m1 <- lm(var1 ~ id, data = df)
# summary(m1)
plot(resid(m1, type = "pearson") ~ df$id)

[Figure: Pearson residuals of m1 by ID]

# Log-transformed Y in OLS model
m1_log <- lm(log(var1) ~ id, data = df)
# summary(m1_log)
plot(resid(m1_log, type = "pearson") ~ df$id)

[Figure: Pearson residuals of m1_log by ID]

# Generalized least squares, modeling the within-group heteroscedasticity
# structure by specifying the weights argument
require(nlme)
m2 <- gls(var1 ~ id, weights = varIdent(form = ~1 | id), data = df)
# summary(m2)
plot(resid(m2, type = "pearson") ~ df$id)

[Figure: Pearson residuals of m2 by ID]

# Beta regression -- the most appropriate in my view given the data
require(betareg)
m3 <- betareg(var1 ~ id, data = df)
# summary(m3)
plot(resid(m3, type = "pearson") ~ df$id)

[Figure: Pearson residuals of m3 by ID]

While beta regression seems to be the best approach given the data, the residual variance still differs clearly between IDs. I was wondering if someone has a suggestion for how to go forward with this analysis.

Also, if beta regression is the way to go, how can I deal with the zeros, and how do I get Y into the interval (0, 1) in the first place (see var2, var3)?

Stefan

  • (1) The zeros aren't true zeros: they are placeholders for left-censored data. Treating them in procedures as if they were zero can lead to problems. There are many options, but the good ones all start by including the detection limits in the dataset. (2) A generalized ILR can work nicely for visualizing and testing compositional data. See https://stats.stackexchange.com/search?tab=votes&q=ILR%20transformation. – whuber Sep 27 '17 at 20:28
  • The observations are clearly rather heteroscedastic so that a single precision parameter across all five samples does not work well. Have a look at: betareg(formula = var1 ~ id | id, data = df) where both location $\mu$ and precision $\phi$ depend on the id. And as only two observations are censored in var2 one could probably use the ad hoc scaling suggested in Smithson & Verkuilen (see betareg vignette). In var3 the results might be more sensitive to the scaling but still worth a try. – Achim Zeileis Sep 27 '17 at 23:19
  • Thank you @AchimZeileis! The residual plot looks much better! I just started with beta regression and have to learn more about the precision parameter. One other question: how can I figure out when data (var3) might be too sensitive to scaling? And what does it mean for my analysis? – Stefan Sep 28 '17 at 16:12
  • Sensitivity: If results from your regression change qualitatively depending on how exactly you scale the data, then scaling is probably not the way to go. Alternative: Use a two-part model: (1) Binary model for 1 vs. lower. (2) Beta regression for those observations in (0, 1). – Achim Zeileis Sep 28 '17 at 23:40 (a sketch of this two-part approach follows the comments)
  • @AchimZeileis would you mind converting your first comment into an answer? I think otherwise people may miss this important point of how to account for heteroscedasticity in the data using betareg and how that changes the residual plot above. Also, you actually answered my question regarding potential zeros in the data by referring to the ad hoc scaling procedure suggested in Smithson & Verkuilen. – Stefan Oct 13 '18 at 21:39
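A minimal sketch of the two-part model suggested in the comments, adapted to the zeros in var2 and assuming they mark non-detects (the object names detected, pos, part1, and part2 are mine):

require(betareg)

# Part 1: binary model for whether the compound was detected at all
# (zero vs. positive var2)
df$detected <- as.integer(df$var2 > 0)
part1 <- glm(detected ~ id, family = binomial, data = df)

# Part 2: beta regression on the strictly positive observations only,
# rescaled from percent to proportions in (0, 1); a single precision
# parameter is used here because ID C retains just one positive value
pos <- subset(df, var2 > 0)
part2 <- betareg(I(var2 / 100) ~ id, data = pos)

With only two zeros in these data, part 1 is illustrative at best (the logistic fit is close to separation), but the structure carries over to larger datasets.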

1 Answer


My question hasn't received an official answer yet, but @AchimZeileis pretty much answered it in the comments, so I decided to post his suggestion as an answer so that people don't miss it:

The observations are clearly rather heteroscedastic so that a single precision parameter across all five samples does not work well. Have a look at: betareg(formula = var1 ~ id | id, data = df) where both location $\mu$ and precision $\phi$ depend on the id. And as only two observations are censored in var2 one could probably use the ad hoc scaling suggested in Smithson & Verkuilen (see betareg vignette). In var3 the results might be more sensitive to the scaling but still worth a try.
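In code, the suggested model looks like this (the object name m3b is mine; the residual plot is produced the same way as for m3 above):

require(betareg)

# Beta regression in which the part after "|" lets the precision (phi)
# vary with id, in addition to the location (mu)
m3b <- betareg(var1 ~ id | id, data = df)
# summary(m3b)

plot(resid(m3b, type = "pearson") ~ df$id)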

And here is the plot where both the location $\mu$ and the precision $\phi$ depend on id (compare with m3 above):

[Figure: Pearson residuals of the variable-precision beta regression by ID]
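For the zeros and the percent scale (var2, var3), here is a minimal sketch of the ad hoc scaling of Smithson & Verkuilen (2006) mentioned in the comments and described in the betareg vignette (the helper name squeeze01 and the model names m4 and m5 are mine):

# Smithson & Verkuilen (2006): squeeze y from [0, 1] into (0, 1) via
# (y * (n - 1) + 0.5) / n, so that exact zeros become admissible
squeeze01 <- function(y) {
  n <- length(y)
  (y * (n - 1) + 0.5) / n
}

# var2 and var3 are percentages, so divide by 100 first; var2 contains
# the two censored zeros, and var3 exceeds 1 on the percent scale
m4 <- betareg(squeeze01(var2 / 100) ~ id | id, data = df)
m5 <- betareg(squeeze01(var3 / 100) ~ id | id, data = df)

As noted in the comments, if the conclusions for var3 change qualitatively under different scalings, the two-part model sketched after the comments is the safer route.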

Stefan