Which link function could be used for a glm where the response is per cent (0 - 100%)?

Question

I am thinking about building a model (glm) where the response variable (y) is the cover (in per cent) of a plant species in a defined area, dependant of environmental variables. However, I don't think it is a binomial link, as the output is not a probability between 0 and 1. What would you suggest to use as a link function?

Sometimes theory will suggest a link. See https://stats.stackexchange.com/a/64039/919 for an example. — whuber, Feb 05 '21 at 17:20
@whuber: Bill, we are like two peas in the same pod - we naturally gravitate towards the same questions! — Isabella Ghement, Feb 05 '21 at 18:07

Isabella Ghement · Accepted Answer · 2021-02-11T19:49:14.233

Nice question!

You can start your modelling efforts by converting your cover variable so that it is expressed as a proportion rather than a percentage (i.e., simply divide the values of your original cover variable by 100).

Once you get the converted variable, look to see how many 0 and 1 values you have, if any, in addition to values falling in the (0,1) interval.

If you don't have any 0 or 1 values, you can use beta regression modelling. A beta regression model is not a glm model but it has similarities to a glm model. The beta regression model has a logit link function; this means that you are modelling the logit-transformed mean cover as a function of various predictor variables. (Recall that cover is now expressed as a proportion.)

If you have a substantial number of 0 values but no 1 values, you will need to use a zero-inflated beta regression model for your modelling.

If you have a substantial number of 1 values but no 0 values, you will need to use a one-inflated beta regression model.

If you have a substantial number of 0 values and a substantial number of 1 values, you will need to use a zero-and-one-inflated beta regression model.

Do you use R? If yes, the gamlss package is your best bet for implementing these kinds of models - see, for instance, page 107 of this document:

https://www.gamlss.com/wp-content/uploads/2013/01/book-2010-Athens1.pdf.

The inflated versions of the beta regression models have multiple links because they simultaneously model multiple parameters. Some of these links can be logit links, some can be log links, etc. In gamlss, the family distribution for these models can be one of:

BE for beta regression;
BEINF0 for zero-inflated beta regression;
BEINF1 for one-inflated beta regression;
BEINF for zero-and-one inflated beta regression.

Addendum

Both betareg() and gamlss() afford the ability to fit a beta regression model to your data and then use the model for prediction purposes. The R code below provides an example for how you can do this. Note that there are some differences in output between the model summaries produced by the two functions from the same dataset.

#===========================================================
# Load Gasoline data into R 
#===========================================================
library(betareg)
data("GasolineYield", package = "betareg")
#===========================================================
Fit beta regression model to Gasoline data using betareg()
#===========================================================
model_betareg <- betareg(yield ~ temp, 
                         link = "logit", 
                         link.phi = "log", 
                         data = GasolineYield)
summary(model_betareg)
#===========================================================
Fit beta regression model to Gasoline data using gamlss()
#===========================================================
library(gamlss)
model_gamlss <- gamlss(yield ~ temp, 
                       family = BE(mu.link = "logit", 
                                   sigma.link = "log"), 
                       data = GasolineYield)
summary(model_gamlss)
#===========================================================
Predict from model_betareg for a new temp value
#===========================================================
predict_betareg <- predict(model_betareg, 
                           newdata = data.frame(temp = 349), 
                           type="response")
predict_betareg
#===========================================================
Predict from model_gamlss for a new temp value
#===========================================================
predict_gamlss <- predict(model_gamlss, 
                          newdata = data.frame(temp = 349), 
                          type="response")
predict_gamlss

The yield value predicted by model_betareg is 0.2046289 and the yield value produced by model_gamlss is 0.204634. These are very close, as expected.

Note that the prediction is performed on the scale of the response variable yield, so that the predicted values are expressed as proportions in the interval (0,1).

The newdata must be a dataframe which lists the values of all of the predictor variables included in the right-hand side of the beta regression model. The type of these values must match the type of the predictors variables as seen by R via the str() command:

str(Gasoline)

Here's the output for this str() command:

str(Gasoline)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':      
32 obs. of  6 variables:
$ yield   : num  6.9 14.4 7.4 8.5 8 2.8 5 12.2 10 15.2 ...
$ endpoint: num  235 307 212 365 218 235 285 205 267 300 ...
$ Sample  : Ord.factor w/ 10 levels "1"<"2"<"7"<"9"<..: 8 9 5 1 3 4 7 10 6 8 
...
$ API     : num  38.4 40.3 40 31.8 40.8 41.3 38.1 50.8 32.2 38.4 ...
$ vapor   : num  6.1 4.8 6.1 0.2 3.5 1.8 1.2 8.6 5.2 6.1 ...
$ ASTM    : num  220 231 217 316 210 267 274 190 236 220 ...

If temp were listed as a factor in your dataset (e.g., "High", "Medium", "Low"), you would specify it as such in your newdata:

predict_gamlss <- predict(model_gamlss, 
                          newdata = data.frame(temp = c("High","Low")), 
                          type="response")
predict_gamlss

Thank you very much! I am not an expert in statistical analysis and maybe that's the reason I have not come across beta regression in vegetation ecological studies. I am afraid to be faced with incomprehension when I come up with beta regression models. In a blog (http://plantecology.syr.edu/fridley/bio793/glm.html), I read about arsine square root transformation for the dependent variable and using linear models. They cited Crawley. I tried this out, however, I also get negative values in some cases, what is nonsense, because it is not possible to have a negative cover. — parallax, Feb 05 '21 at 07:52
Maybe you didn’t know where to look? Here’s one reference on Estimating plant abundance using inflated beta distributions: Applied learnings from a lichen–caribou ecosystem by Keim et al. that will come in handy: https://onlinelibrary.wiley.com/doi/pdf/10.1002/ece3.2625. — Isabella Ghement, Feb 05 '21 at 12:47
If beta regression is overly complex for your target audience, then you can indeed use a transformation of your response variable (cover, expressed as a proportion), though that comes with its own interpretation challenges. The transformation should definitely NOT be an arcsine transformation. The article The arcsine is asinine: the analysis of proportions in ecology by Warton and Hui explains why an arcsine transformation is a bad idea: “For non‐binomial data, the arcsine transform is undesirable on the grounds of interpretability, and because it can produce nonsensical predictions.” — Isabella Ghement, Feb 05 '21 at 12:53
The Warton and Hui article is available at: https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/10-0340.1. The article advocates using the logit transformation instead of the ‘asinine’ arcsine transformation. But how applicable the logit transformation is to your (proportion) cover data really depends on the extent of 0 and 1 values present in that data. — Isabella Ghement, Feb 05 '21 at 12:58
If you are going to use the logit transformation, you might as well consider the beta regression instead (presuming you havr no 0 and no 1 values in your proportion cover data). At least with beta regression you are modelling what you want - mean proportion cover, logit-transformed. (You can apply the inverse logit transformation to get the mean proportion cover, which is exactly what you want.) With the logit transformation, you are modelling the mean value of the logit-transformed proportion cover, which is a heck of a lot more difficult to interpret and not really what you want. — Isabella Ghement, Feb 05 '21 at 13:07
The logit transform cannot be applied to proportions cover that are 0 or 1; if you have a substantial number of 0 and/or 1 values in your proportion cover data, you are going to have to “fudge” those to move them slightly away from 0 and 1, respectively. The “fudging” will unfortunately introduce bias in your modelling - the more “fudging”, the more bias. It’s not even clear what the “fudging” constant that you add to 0 or subtract from 1 should be. For all these reasons, beta regression is a much more principled option for analyzing your type of data. — Isabella Ghement, Feb 05 '21 at 13:14
This is an interesting article and some quite convincing points, thank you again! I'll give beta regression a try;-) — parallax, Feb 05 '21 at 13:49
Cool! If you run into any problems, come back here and leave a comment. — Isabella Ghement, Feb 05 '21 at 17:50
I've tried beta regression using gamlss, however, I also need to predict to a new data set but this seems not yet implemented for gamlss. With the betareg package ist is possible. You write that the inverse logit transformation would be suitable for my purpose. Do you mean to transform (y/100) and then do the modelling, or am I wrong? I think it can't be done via a link function. — parallax, Feb 11 '21 at 10:59
@parallax: You are mistaken that gamlss can't accommodate model predictions. Please see the Addendum to my original answer for R code that proves my point. — Isabella Ghement, Feb 11 '21 at 19:54
Thank you again! I found this information here: https://stackoverflow.com/questions/42742461/error-when-predicting-new-fitted-values-from-r-gamlss-object obviously wrong. — parallax, Feb 12 '21 at 09:17
Yep! Obviously wrong - it seems the original poster in that thread just didn’t know how to specify the newdata option of predict() for gamlss models. The help file at http://finzi.psych.upenn.edu/R/library/gamlss/html/predict.gamlss.html makes it clear that gamlss models can accommodate prediction for new data. Note the option se.fit = TRUE for predict() - it will come in handy for computing standard errors associated with the predicted values. — Isabella Ghement, Feb 12 '21 at 10:04
A good rule of thumb for trusting answers on this forum is that they have to come from people who know their stuff - either top experts in their field (whose name you will recognize from your favourite articles, books, blogs, etc.) or applied practitioners (whose name you won’t recognize) who have garnered a high number of points answering questions on this site. For anyone else, a healthy dose of caution is in good order. — Isabella Ghement, Feb 12 '21 at 10:12

Which link function could be used for a glm where the response is per cent (0 - 100%)?

1 Answers1

Fit beta regression model to Gasoline data using betareg()

Fit beta regression model to Gasoline data using gamlss()

Predict from model_betareg for a new temp value

Predict from model_gamlss for a new temp value

Linked