I have the following dataset:
df <-
structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L,
2L, 2L), .Label = c("A", "B", "C", "D", "E"), class = "factor"),
var1 = c(0.251249939, 0.243318364, 0.11978431, 0.174523607,
0.108485416, 0.296143562, 0.429010987, 0.013022964, 0.017849743,
0.023047687, 0.288872927, 0.352767229, 0.206665024, 0.260082066,
0.303004652, 0.318304062, 0.283759445, 0.649184883, 0.555413425,
0.593164206, 0.216715872, 0.189841628, 0.452209592, 0.184777394,
0.183090359, 0.25703001), var2 = c(0.10967546, 0.045433734,
0.03156868, 0.065526947, 0.024798298, 0.114657596, 0.345043868,
0.0018319, 0, 0, 0.036639825, 0.092710786, 0.072756045, 0.056428008,
0.096712127, 0.084123246, 0.067709602, 0.042034484, 0.060061432,
0.165271029, 0.039985284, 0.041120779, 0.04379772, 0.068814874,
0.056523118, 0.125201181), var3 = c(5.840212345, 4.281783104,
1.757990956, 2.934711695, 2.12902379, 3.956520319, 7.857996941,
0.667544842, 0.781137168, 0.910673857, 5.603694439, 6.384156227,
3.304251909, 4.735065937, 5.320442677, 5.131484509, 6.590584278,
14.44481945, 7.413087368, 9.481782913, 4.40541172, 3.100687265,
7.982603073, 3.759469509, 3.771872759, 5.209916592)), .Names = c("id",
"var1", "var2", "var3"), class = "data.frame", row.names = c(NA,
-26L))
These are mass fractions (in percent) of certain organic compounds found in soils. The data have some problems: the design is not well balanced, and one ID (ID = C) contains only three samples whose values are very low, sometimes even 0, i.e. below the detection limit of the machine (var2, var3). I want to know whether there are differences between IDs for the given compounds (var1, var2, var3).
require(tidyr)
require(dplyr)    # for the pipe %>%
require(ggplot2)

df_g <- df %>% gather(key = "key", value = "value", -id)

ggplot(df_g, aes(id, value)) +
  geom_point() +
  facet_wrap(~ key, scales = "free") +
  labs(x = "ID", y = "Mass fraction %")
So, given the data, there are a few approaches I can come up with:
# OLS model
m1 <- lm(var1 ~ id, data = df)
# summary(m1)
plot(resid(m1, type = "pearson") ~ df$id)
# Log-transformed Y in OLS model
m1_log <- lm(log(var1) ~ id, data = df)
# summary(m1_log)
plot(resid(m1_log, type = "pearson") ~ df$id)
# Generalized least squares, modeling the within-group heteroscedasticity
# structure via the weights argument
require(nlme)
m2 <- gls(var1 ~ id, weights = varIdent(form = ~ 1 | id), data = df)
# summary(m2)
plot(resid(m2, type = "pearson") ~ df$id)
# Beta regression. This seems the most appropriate approach given the data
require(betareg)
m3 <- betareg(var1 ~ id, data = df)
# summary(m3)
plot(resid(m3, type = "pearson") ~ df$id)
While beta regression seems to be the best approach for these data, the residuals still show clearly different variability across IDs. I was wondering if someone has a suggestion for how to proceed with this analysis.
Also, if beta regression is the way to go, how can I deal with zeros, and how do I get Y into the open interval (0, 1) (see var2, var3)?
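Edit: a minimal sketch of the ad hoc scaling suggested in the comments below (Smithson & Verkuilen 2006, described in the betareg vignette), which pulls boundary values into the open interval (0, 1). Dividing the percentages by 100 first is my assumption about the right rescaling for var2 and var3:
# Smithson & Verkuilen (2006) transformation: y* = (y * (n - 1) + 0.5) / n,
# where n is the sample size; exact 0s (and 1s) move into (0, 1)
n <- nrow(df)
sv_scale <- function(y) (y * (n - 1) + 0.5) / n

# Assumption: percent -> proportion via division by 100
df$var2_s <- sv_scale(df$var2 / 100)
df$var3_s <- sv_scale(df$var3 / 100)

m_var2 <- betareg(var2_s ~ id, data = df)
m_var3 <- betareg(var3_s ~ id, data = df)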
Comments:

betareg(formula = var1 ~ id | id, data = df), where both location $\mu$ and precision $\phi$ depend on the id. And as only two observations are censored in var2, one could probably use the ad hoc scaling suggested in Smithson & Verkuilen (see the betareg vignette). In var3 the results might be more sensitive to the scaling, but still worth a try. – Achim Zeileis Sep 27 '17 at 23:19

[…] betareg and how that changes the residual plot above. Also, you actually answered my question regarding potential zeros in the data by referring to the ad hoc scaling procedure suggested in Smithson & Verkuilen. – Stefan Oct 13 '18 at 21:39
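For completeness, a minimal sketch of the variable-precision model suggested in the first comment, where id enters both the location ($\mu$) and the precision ($\phi$) submodel:
# Precision submodel after "|": both mu and phi vary with id
m3_vp <- betareg(var1 ~ id | id, data = df)
# summary(m3_vp)
plot(resid(m3_vp, type = "pearson") ~ df$id)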