
I have proportion data which include 0s and 1s (the ri column below, calculated as ud/days_lib). I have found a few threads related to modelling this issue, a few of which (e.g., this thread) suggest transforming the data using the formula (y * (n − 1) + 0.5) / n from Smithson & Verkuilen (2006), where n is the sample size.

I'm not entirely sure how to apply this, however. Does it get applied to all proportions, or just to the 0s and 1s? I'm also unsure what the sample size actually means: is it the number of rows in the data set or the denominator used to create each proportion?

I have included a sample of my dataset below and code for a reproducible example.

In the example below, would I use 6 as the sample size, or would I use days_lib (31, 30 or 28) for each row to transform ri?
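
In code terms, the two options I'm unsure between would look something like this (RS_rI is my full data frame, and ri_sq_rows / ri_sq_days are just placeholder column names for illustration):

    # Option A: n = the number of rows in the data set
    n_rows <- nrow(RS_rI)
    RS_rI$ri_sq_rows <- (RS_rI$ri * (n_rows - 1) + 0.5) / n_rows

    # Option B: n = the denominator used to build each proportion (days_lib)
    RS_rI$ri_sq_days <- (RS_rI$ri * (RS_rI$days_lib - 1) + 0.5) / RS_rI$days_lib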

If there have been any further updates on how to model these data without transformation I would be keen to hear about them, ideally with a package that can be used with dredge in MuMIn, as I will be using an information-theoretic approach on these data.

        station monthyear ud days_lib     ri         species   SE_score     
372     PB03   2016/01    3      31    0.0968   Silvertip Shark  0.35     
2054 SAUWM01   2018/01    27     31    0.8710   Grey Reef Shark  0.15     
1054    PB26   2014/11    22     30    0.7333   Grey Reef Shark  0.17     
1847    SA06   2015/02    28     28    1.0000   Silvertip Shark  0.30     
1055    PB26   2014/11    24     30    0.8000   Silvertip Shark  0.17     
316     PB02   2016/01     2     31    0.0645   Grey Reef Shark  0.54     

Data set below

structure(list(station = structure(c(13L, 53L, 35L, 50L, 35L, 
12L), levels = c("BE01", "BE02", "BEUWM01", "BL01", "BL02", "GCB01", 
"GCB02", "GCB03", "NI01", "NI01b", "PB01", "PB02", "PB03", "PB04", 
"PB05", "PB06", "PB07", "PB09", "PB10", "PB11", "PB12", "PB13", 
"PB14", "PB15", "PB16", "PB17", "PB18", "PB19", "PB20", "PB21", 
"PB22", "PB23", "PB24", "PB25", "PB26", "PB27", "PB28", "PB29", 
"PB30", "PB4G01", "PB4G02", "PBUWM01", "PBUWM02", "SA01", "SA02", 
"SA02b", "SA03", "SA04", "SA05", "SA06", "SA07", "SA11", "SAUWM01", 
"SB01", "SB02/AR02", "SB04/AR06", "VB01", "VB02", "VB03", "VB04"
), class = "factor"), monthyear = structure(c(25L, 49L, 11L, 
14L, 11L, 25L), levels = c("2014/01", "2014/02", "2014/03", "2014/04", 
"2014/05", "2014/06", "2014/07", "2014/08", "2014/09", "2014/10", 
"2014/11", "2014/12", "2015/01", "2015/02", "2015/03", "2015/04", 
"2015/05", "2015/06", "2015/07", "2015/08", "2015/09", "2015/10", 
"2015/11", "2015/12", "2016/01", "2016/02", "2016/03", "2016/04", 
"2016/05", "2016/06", "2016/07", "2016/08", "2016/09", "2016/10", 
"2016/11", "2016/12", "2017/01", "2017/02", "2017/03", "2017/04", 
"2017/05", "2017/06", "2017/07", "2017/08", "2017/09", "2017/10", 
"2017/11", "2017/12", "2018/01", "2018/02", "2018/03", "2018/04", 
"2018/05", "2018/06", "2018/07", "2018/08", "2018/09", "2018/10", 
"2018/11", "2018/12"), class = "factor"), ud = c(3L, 27L, 22L, 
28L, 24L, 2L), days_lib = c(31, 31, 30, 28, 30, 31), ri = c(0.0968, 
0.871, 0.7333, 1, 0.8, 0.0645), species = structure(c(2L, 1L, 
1L, 2L, 2L, 1L), levels = c("Grey Reef Shark", "Silvertip Shark"
), class = "factor"), SE_score = c(0.35, 0.15, 0.17, 0.3, 0.17, 
0.54), region = structure(c(5L, 6L, 5L, 6L, 5L, 5L), levels = c("Benares", 
"Blenheim", "Grand Chagos Bank", "Nelsons Island", "Peros Banhos", 
"Saloman", "Speakers Bank", "Victory Bank"), class = "factor"), 
    month = c(1, 1, 11, 2, 11, 1), season = structure(c(2L, 2L, 
    2L, 2L, 2L, 2L), levels = c("dry.season", "wet.season"), class = "factor"), 
    year = structure(c(3L, 5L, 1L, 2L, 1L, 3L), levels = c("2014", 
    "2015", "2016", "2017", "2018"), class = "factor")), row.names = c(372L, 
2054L, 1054L, 1847L, 1055L, 316L), class = "data.frame")
  • I'm not clear on what you are asking here. Where are the 1s and 0s that you mention? You haven't explained what you are trying to get proportions of from your data. Could you perhaps edit to clarify? Thanks. – Allan Cameron Sep 07 '22 at 17:06
  • $n$ refers to the number of trials per observation, not the number of observations. – Demetri Pananos Sep 07 '22 at 17:40
  • In the example my proportion data are in the 'ri' column, so I have proportions with values between 0 and 1 as well as exact 0s and 1s. If I understand correctly, n is the denominator of the proportion (the days_lib column) for the transformation? So to transform my proportion data the formula would be (ri * (days_lib - 1) + 0.5) / days_lib? How does this transform it if there is a 0 value? – mikejwilliamson Sep 13 '22 at 13:09
  • Is there any reason you can't use a binomial likelihood, i.e. logistic regression? – John Madden Sep 13 '22 at 13:11
  • I initially tried a binomial but was getting a warning (as in this thread - https://stats.stackexchange.com/questions/233366/how-to-fit-a-mixed-model-with-response-variable-between-0-and-1), which led me to believe a binomial wasn't the best distribution for this data. Beta regression is better suited to proportion data, but has issues if the data contain 0s and 1s, so I am looking for a way to deal with that, and one method is transformation. – mikejwilliamson Sep 13 '22 at 13:24
  • Is the error you got "non-integer number of successes"? You don't give the logistic regression the proportions, you give it each 0 or 1 individually (and you probably do want random effects to account for station effects). Regarding "Beta regression is better for proportion data": it's not clear to me why this would be true; do you have a reference you can share? – John Madden Sep 13 '22 at 13:28
  • Thanks. This is a calculation of how resident an animal is each month, and we need monthly data to compare with our explanatory variables, which are monthly averages (SST, chl-a etc.). So I don't think I can code it as 0 or 1 per day, as we don't have daily data for the explanatory variables. As such I'm using proportion data, which from this link https://www.theanalysisfactor.com/zero-one-inflated-beta-models-for-proportion-data/ led me to believe beta regression might be a better option. However, I've just found there's a weights argument for a binomial GLMM I can apply. – mikejwilliamson Sep 13 '22 at 15:04

2 Answers


If you cannot code your data as individual 1s and 0s for a glmm, then rather than transforming the proportions, I found three threads (one, two, three) that suggest you can still use a binomial glmm. In my case I coded the response variable as ud/days_lib to get the proportion, rather than using the proportion column directly, and set the weights argument to the number of samples per observation, in my case days_lib.

library(lme4)  # glmer() comes from the lme4 package
glmer(ud/days_lib ~ SE_score + species + season + year +
            (1|station) + (1|monthyear), weights = days_lib,
            family = "binomial", data = RS_rI, na.action = na.fail)

This also removes the warning message below, which I was getting when using the proportion data directly (the 'ri' column in this dataset).

Warning message:
In eval(family$initialize, rho) : non-integer #successes in a binomial glm!
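
For what it's worth, I believe an equivalent way to specify the same binomial GLMM in lme4 is the two-column (successes, failures) response via cbind(), which should also avoid this warning:

    glmer(cbind(ud, days_lib - ud) ~ SE_score + species + season + year +
                (1|station) + (1|monthyear),
          family = "binomial", data = RS_rI, na.action = na.fail)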
  • (+1) As often happens, you've found that the best answer to your original question (about beta regression) is to use a different modeling approach (binomial regression). Be very, very cautious in using the automated model selection provided by functions like dredge() as you proceed in your work, however. See this page for many reasons why. – EdM Sep 13 '22 at 16:23

If you know the $n$s, then you are better off treating the data as binomial ($y$ out of $n$) rather than using the calculated proportion $y/n$ in beta-regression. The reason is that you otherwise ignore how well you've estimated each proportion (0/1 and 0/1000 are very different things). That becomes even more important when some of the denominators are very different between records.
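
As a quick illustration of the point about 0/1 versus 0/1000 (a minimal sketch in base R): the same observed proportion of 0 comes with very different uncertainty depending on the denominator, which you can see from the exact binomial confidence intervals.

    binom.test(0, 1)$conf.int     # 95% CI roughly [0, 0.975]
    binom.test(0, 1000)$conf.int  # 95% CI roughly [0, 0.0037]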

Björn