
I perform a lot of bioassays in which I score mortality not on individuals, but on groups of individuals as a proportion (the denominator, i.e., number of trials, is known):

Sample Data:

library(tidyverse)
library(betareg)
dataset <- data.frame(treatment = rep(c("a", "b", "c", "d"), each = 5),
                      success = c(6,6,9,6,5,10,10,9,10,10,6,8,9,9,10,7,8,6,8,7),
                      fail = c(4,3,1,4,5,0,0,0,0,0,4,2,1,1,0,3,2,4,2,3)) %>%
  mutate(n = success+fail,
         treatment = factor(treatment, levels = c("a", "b", "c", "d")))

Generally I would run this as a GLM with a quasibinomial family, but in some instances (like the one provided) I have a situation where all replications in one treatment were 0% or 100%. In that situation the Std. Error and CI estimates are extremely large for that treatment. I assume this is something similar (or identical) to complete separation, although I have only ever seen complete separation discussed for binomial response data consisting of 0's and 1's.

Regular GLM:

data.list <- list(success.fail = cbind(dataset$success, dataset$fail), treatment = dataset$treatment)
binary.glm <- glm(success.fail ~ treatment, data = data.list, family = quasibinomial)
summary(binary.glm)

> summary(binary.glm)

Call:
glm(formula = success.fail ~ treatment, family = quasibinomial, 
    data = data.list)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.81457  -0.33811   0.00012   0.54942   1.86737  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.6325     0.2622   2.412   0.0282 *
treatmentb   20.4676  2890.5027   0.007   0.9944  
treatmentc    1.0257     0.4271   2.402   0.0288 *
treatmentd    0.3119     0.3801   0.821   0.4239  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.7634611)

    Null deviance: 43.36  on 19  degrees of freedom
Residual deviance: 13.40  on 16  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 18

I've read about dealing with separation in this post, but I haven't found a way to use Firth's penalized likelihood with proportion data like mine. Instead I turned to beta regression as an option. Is it reasonable to use my n values (number of observations) for each proportion as a weighting value in the weights argument of betareg()? I transformed my proportions to fit within the (0,1) interval using equation (4) of Smithson & Verkuilen, "A Better Lemon Squeezer? Maximum-Likelihood Regression With Beta-Distributed Dependent Variables" (direct link to PDF).
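
For reference, the transformation from that paper (as implemented in lemonsqueeze.fun() below, with length(y.prime) playing the role of the sample size $N$) is $y'' = \frac{y'(N - 1) + 0.5}{N}$, which pulls exact 0s and 1s just inside the open interval (0,1).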

Beta Regression:

lemonsqueeze.fun <- function(df, success, fail) {
  new <- df %>%
    mutate(prop.naught = {{success}}/({{success}}+{{fail}}),
           n = {{success}}+{{fail}},
           y.prime = (prop.naught-0)/(1-0),
           y.doubleprime = ((y.prime*(length(y.prime)-1))+0.5)/length(y.prime))
  return(new)
}

transformed.data <- lemonsqueeze.fun(dataset, success, fail)
beta.mod <- betareg(y.doubleprime ~ treatment, data = transformed.data, weights = n)
summary(beta.mod)

> summary(beta.mod)

Call:
betareg(formula = y.doubleprime ~ treatment, data = transformed.data, 
    weights = n)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-6.8894 -1.9274  0.0000  0.5899  8.4217 

Coefficients (mean model with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.63875    0.07291   8.761   <2e-16 ***
treatmentb   2.30484    0.14846  15.525   <2e-16 ***
treatmentc   1.04830    0.11600   9.037   <2e-16 ***
treatmentd   0.21245    0.10408   2.041   0.0412 *  

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   15.765      1.602   9.841   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type of estimator: ML (maximum likelihood)
Log-likelihood: 229.5 on 5 Df
Pseudo R-squared: 0.7562
Number of iterations: 21 (BFGS) + 2 (Fisher scoring)

In all cases the Std. Error is much smaller and we no longer have a crazy estimate for treatment b. Are there problems with this? Assumptions I should check? Better ways to skin this cat? Alas, I'm not a statistician, merely a biologist with barely enough statistical knowledge to be dangerous.

RegalPlatypus
  • Please consider using base R, & commenting it extensively, when illustrating posts here with R code. Not everyone who will come to this page will be familiar with R, & not all of those will be able to read tidy-code. This is a Q&A site for statistics, not R. – gung - Reinstate Monica Mar 24 '22 at 18:56
  • I don't mean to be defensive, but this is more of a statistical question than an R question. I want to know if the statistical tools I am using are appropriate for the data. This isn't a question about coding. – RegalPlatypus Mar 24 '22 at 19:16
  • I recognize that, @RegalPlatypus, it's a stock comment. The point is the tidyverse is even further from generic pseudocode than base R, & many people may wonder about this question, but be unable to read this & understand what's happening. It's best to use simpler code & comment it. This isn't a site for people who use R. Many readers won't know R at all. – gung - Reinstate Monica Mar 24 '22 at 19:20
  • @RegalPlatypus did one of the answers below solve the question you asked? If so, please accept the answer (by clicking on the check mark on the left of the answers), then it is flagged as resolved here on StackExchange. If not, please make your question more precise so that it can be resolved. – Achim Zeileis Mar 28 '22 at 06:33

3 Answers


The data you have are really a classic binomial setting and you can use binomial GLMs to model the data, either using the standard maximum likelihood (ML) estimator or using a bias-reduced (BR) estimator (Firth's penalized estimator). I wouldn't use beta regression here.

For treatment b you have 49 successes and 0 failures.

subset(dataset, treatment == "b")
##    treatment success fail  n
## 6          b      10    0 10
## 7          b      10    0 10
## 8          b       9    0  9
## 9          b      10    0 10
## 10         b      10    0 10

Hence there is quasi-complete separation leading to an infinite ML estimator while the BR estimator is guaranteed to be finite. Note that the BR estimator provides a principled solution for the separation while the trimming of the proportion data is rather ad hoc.
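
To make the divergence visible (a small illustrative sketch added here, not required for the analysis below): under quasi-complete separation the reported ML coefficient for treatment b is just the point where the iterative fitting stopped, and tightening the convergence tolerance lets it drift further towards infinity.

fit.default <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial)
fit.tight   <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial,
                   control = glm.control(epsilon = 1e-14, maxit = 1000))
## The estimate grows as the tolerance is tightened; it has no finite limit.
c(default = unname(coef(fit.default)["treatmentb"]),
  tight   = unname(coef(fit.tight)["treatmentb"]))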

For fitting the binomial GLM with ML and BR you can use the glm() function and combine it with the brglm2 package (Kosmidis & Firth, 2020, Biometrika, doi:10.1093/biomet/asaa052). An easier way to specify the binomial response is to use a matrix where the first column contains the successes and the second column the failures:

ml <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial)
summary(ml)
## Call:
## glm(formula = cbind(success, fail) ~ treatment, family = binomial, 
##     data = dataset)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.81457  -0.33811   0.00012   0.54942   1.86737  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    0.6325     0.3001   2.108   0.0351 *
## treatmentb    20.4676  3308.1100   0.006   0.9951  
## treatmentc     1.0257     0.4888   2.099   0.0359 *
## treatmentd     0.3119     0.4351   0.717   0.4734  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.36  on 19  degrees of freedom
## Residual deviance: 13.40  on 16  degrees of freedom
## AIC: 56.022
## 
## Number of Fisher Scoring iterations: 18

The corresponding BR estimator can be obtained by using the "brglmFit" method (provided in brglm2) instead of the default "glm.fit" method:

library("brglm2")
br <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial,
  method = "brglmFit")
summary(br)
## Call:
## glm(formula = cbind(success, fail) ~ treatment, family = binomial, 
##     data = dataset, method = "brglmFit")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7498  -0.2890   0.4368   0.6030   1.9096  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.6190     0.2995   2.067  0.03875 * 
## treatmentb    3.9761     1.4667   2.711  0.00671 **
## treatmentc    0.9904     0.4834   2.049  0.04049 * 
## treatmentd    0.3041     0.4336   0.701  0.48304   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.362  on 19  degrees of freedom
## Residual deviance: 14.407  on 16  degrees of freedom
## AIC:  57.029
## 
## Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
## Number of Fisher Scoring iterations: 1

Note that the coefficient estimates, standard errors, and z statistics for the non-separated treatments only change very slightly. The main difference is the finite estimate for treatment b.
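
To see this at a glance, the two coefficient vectors can be placed side by side (a small convenience snippet added here, using the ml and br objects fitted above):

cbind(ML = coef(ml), BR = coef(br))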

Because the BR adjustment is so slight, you can still employ the usual inference (Wald, likelihood ratio, etc.) and information criteria etc. In R with brglm2 this is particularly easy because the br model fitted above still inherits from glm. As an illustration we can assess the overall significance of the treatment factor:

br0 <- update(br, . ~ 1)
AIC(br0, br)
##     df      AIC
## br0  1 79.98449
## br   4 57.02927
library("lmtest")
lrtest(br0, br)
## Likelihood ratio test
## 
## Model 1: cbind(success, fail) ~ 1
## Model 2: cbind(success, fail) ~ treatment
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   1 -38.992                         
## 2   4 -24.515  3 28.955  2.288e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Or we can look at all pairwise contrasts (aka Tukey contrasts) of the four treatments:

library("multomp")
summary(glht(br, linfct = mcp(treatment = "Tukey")))
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## Fit: glm(formula = cbind(success, fail) ~ treatment, family = binomial, 
##     data = dataset, method = "brglmFit")
## 
## Linear Hypotheses:
##            Estimate Std. Error z value Pr(>|z|)  
## b - a == 0   3.9761     1.4667   2.711   0.0286 *
## c - a == 0   0.9904     0.4834   2.049   0.1513  
## d - a == 0   0.3041     0.4336   0.701   0.8864  
## c - b == 0  -2.9857     1.4851  -2.010   0.1641  
## d - b == 0  -3.6720     1.4696  -2.499   0.0517 .
## d - c == 0  -0.6863     0.4922  -1.394   0.4743  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Finally, for the standard ML estimator you can also use the quasibinomial family as you did in your post, although in this data set you don't have overdispersion (rather a little bit of underdispersion). The brglm2 package does not support the quasibinomial family at the moment, but given that there is no evidence for overdispersion I would not be concerned about that here.
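
If you want to check this yourself, the dispersion estimate from the quasibinomial fit is a quick diagnostic (values near 1 indicate no overdispersion; the value below is the 0.7634611 reported in the question, i.e. mild underdispersion if anything):

quasi <- glm(cbind(success, fail) ~ treatment, data = dataset, family = quasibinomial)
summary(quasi)$dispersion
## [1] 0.7634611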

Achim Zeileis
  • Can you get Tukey-style contrasts from that? Once you have br, can you just pass it to glht(), eg? – gung - Reinstate Monica Mar 25 '22 at 19:19
  • Sure, no problem: glht(br, linfct = mcp(treatment = "Tukey")) as usual. – Achim Zeileis Mar 26 '22 at 07:48
  • Conceptually, all inference can still be done as before, i.e., use information criteria or likelihood ratio tests, or Wald tests based on the central limit theorem (as glht does). Technically, all standard functions (from multcomp or lmtest or car etc.) you use for glm objects can be applied as well because br is an object of class brglmFit but that still inherits from glm (and lm). – Achim Zeileis Mar 26 '22 at 07:49
  • I wonder if it's worth adding that to the answer, because it was possible to perform a likelihood ratio test of the model as a whole with ml (in fact, the tests of ml & br are essentially identical), & there's nothing to this model except the comparisons amongst the levels. – gung - Reinstate Monica Mar 26 '22 at 11:57
  • Sure, added some more details and examples about inference with brglm2. – Achim Zeileis Mar 26 '22 at 23:04
  • @AchimZeileis Thank you so much for the amount of time and thoughtfulness that you put into your answer. I have one addition that might muddy the waters a bit. What if in my original data, each proportion is an experimental replication? In my mind by converting every observation in the treatments to 0's and 1's I lose those replication groupings and thus any measure of variance within treatments? – RegalPlatypus Mar 28 '22 at 14:21
  • It doesn't make any difference whether you split up the outcome of treatment a, e.g., across 5 rows (as in dataset), or across 49 rows with binary outcomes, or put everything in 1 row with success = 32 and fail = 17. In a binomial experiment all of the 49 binary events are treated as independent given the treatment. The main reason for splitting up the rows is typically that you might have further covariates (apart from the treatment) that influence the outcome. But if you have no such covariates, it makes no difference whether you see this as one proportion or many binary outcomes. – Achim Zeileis Mar 29 '22 at 07:57
  • If you want to see this in practice, aggregate the data via: dataset2 <- aggregate(cbind(success, fail) ~ treatment, data = dataset, FUN = sum) and then fit models ml2 and br2 with the same code as before but using dataset2. They will lead to the same coefficients and standard errors etc. Only the constant in the deviance/likelihood will be scaled differently but this doesn't affect any tests or model selection. Also the optimizer will stop at a slightly different number for the infinite ML estimate of the separated treatment b. – Achim Zeileis Mar 29 '22 at 08:00
  • @AchimZeileis: This whole ordeal has spurred me to talk to my boss about enrolling myself and a few others in my department in some online statistics courses. In the meantime, I have one more question related to this. The dataset I provided is what we would get from one replication. For one replication, everything can be converted to 1's and 0's. Everything is independent. That makes sense. But our replications are always across different days. Day becomes a new covariate with real effects on the mortality within treatments, but we aren't interested in the effects of any given particular day. – RegalPlatypus Mar 30 '22 at 16:44
  • @AchimZeileis Given this would it be better to look at day as a random effect in a mixed effects model, or perhaps include it is a fixed effect and look at the estimated marginal means for the treatments holding the main effects of day constant? – RegalPlatypus Mar 30 '22 at 16:48
  • If there are only a few days then a fixed effect is probably fine. And if this would lead to too many coefficients, then a random effect for day would be the natural alternative. – Achim Zeileis Mar 30 '22 at 18:29

It isn't that beta regression on its own solved the problem. It's that you adjusted the data with a line of code:

y.doubleprime = ((y.prime*(length(y.prime)-1))+0.5)/length(y.prime)

so that the beta regression software didn't have to deal with the exact 0 or 1 proportions that it (like logistic regression) can't handle directly. A vignette on the betareg package says that's "a useful transformation in practice ... where n is the sample size." I'm not sure whether that adjustment based on n should be done for each set of observations as you seem to (I'm not fluent in tidyverse) or if that should be based on the entire set of observations. Presumably you followed the recommendation of the paper you linked.

In this thread one of the package authors describes the proper use of weights in the betareg() function: "make sure that your weights are scaled such that sum(weights) corresponds to the number of independent observations." That seems to be what you did.
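
A small sketch of that scaling check (my illustration, not from the linked thread): with weights = n the weights sum to the total number of individuals; if you instead regard each of the 20 group-level proportions as one independent observation, you would rescale the weights to sum to 20.

sum(transformed.data$n)                                      # weights as used above
w.scaled <- with(transformed.data, n * length(n) / sum(n))   # rescaled version
sum(w.scaled)                                                # sums to the number of rows (20)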

As you "score mortality not on individuals, but on groups of individuals as a proportion (the denominator, i.e., number of trials, is known)," there's an alternative that would allow the Firth penalization. Just put your data into a longer form with one row per alive/dead (0/1) observation.

For example, you would expand your first batch of treatment a observations into 6 rows with outcome 0 and 4 with outcome 1. Then the outcome would be a set of 0/1 values as the logistf package expects. If you want to keep track of which batch of a treatment was involved (your example suggests 5 separate batches for each of the 4 treatments) you could annotate that in each row, too.
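
A minimal base-R sketch of that reshaping (the column names, the batch label, and the choice to code one outcome as 1 are illustrative; flipping the 0/1 coding only changes the signs of the coefficients):

## Expand each row of 'dataset' into one row per individual (0/1 outcome).
long <- do.call(rbind, lapply(seq_len(nrow(dataset)), function(i) {
  with(dataset[i, ],
       data.frame(treatment = treatment,
                  batch     = i,                      # which replicate row it came from
                  y         = rep(c(1, 0), c(success, fail))))
}))
table(long$treatment, long$y)

## A Firth-penalized fit on the expanded data would then look like:
## library(logistf)
## firth.fit <- logistf(y ~ treatment, data = long)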

Added in response to another answer:

I wasn't aware of the brglm2 package recommended in another answer and agree that is probably the simplest and most appropriate way to proceed. Make sure you understand the nature of the penalization you use, however. For example, the original Firth method penalizes both the intercept and the regression coefficients. Thus probability estimates from such a model might be biased even though you get better coefficient estimates. Modified approaches are now available in the logistf and brglm2 packages.
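
As a small illustration of that caveat (my addition, reusing the dataset object from the question and the br fit from the other answer): the penalized fit pulls the estimated probabilities away from the raw observed proportions, most visibly for the separated treatment b, which is estimated below 100% even though no failures were observed.

cbind(raw.prop  = with(dataset, success / (success + fail)),
      fitted.br = fitted(br))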

EdM

There's already a good answer from Achim that points out that this is really not a good scenario for beta-regression, because you seem to be in a binomial sampling situation and beta-regression will simply be an approximation to that.

If one uses a binomial likelihood, the only question is how to appropriately deal with treatments that are all 0s or all 1s. Software for standard logistic regression is not really an option, because it will not converge (the maximum likelihood estimate for some log-odds ratios will be $\pm \infty$, which leads to the algorithm stopping with some really large estimates and standard errors without converging). There are several alternative options:

  1. If your scenario is really as simple as your example and you just have a single proportion to estimate per group and there are no covariates or experimental structure (like randomization, several treatments occurring together within different experiments etc.), you can of course just work with Clopper-Pearson confidence intervals and median-unbiased estimates for each proportion on its own.

  2. I assume there's some more complexity to your real problem, where some kind of regression approach is needed. In that case, exact logistic regression can give you exact confidence intervals (of course sometimes the upper or lower limit of those will be $\pm \infty$) and median-unbiased estimates. There are fewer software options for that than for other approaches (e.g. PROC LOGISTIC in SAS covers it, as well as StatXact, while R - as far as I know - only has a close approximation to it through an MCMC approach via the elrm package*). In the absence of covariates and anything else to take into account, exact logistic regression will end up doing the same as option (1).

  3. Firth penalized likelihood method (corresponding to Bayesian maximum-a-posteriori estimation with Jeffreys prior on all coefficients incl. the intercept), which at times can result in some weird estimates (but pretty sensible confidence intervals).

  4. Bayesian logistic regression with some suitable priors. This can often behave slightly better than options (2) and (3), gives you more flexibility and even lets you reflect any particular prior knowledge (but doesn't force you to; you can keep your priors pretty vague or weakly informative).

  5. Hierarchical models (whether Bayesian or not) that assume that information should be borrowed between some units (i.e. shrinkage towards common parameters).

There's plenty of good software for (4) and (5), e.g. in R there's the brms and rstanarm packages.
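
As a rough sketch of what options (1) and (4) can look like in R (illustrative only; in particular the rstanarm priors below are just one reasonable weakly-informative choice, not a recommendation):

## Option (1): exact Clopper-Pearson interval for a single treatment,
## pooling its replicates; binom.test() reports the Clopper-Pearson CI.
with(subset(dataset, treatment == "b"),
     binom.test(sum(success), sum(success) + sum(fail)))

## Option (4): Bayesian binomial regression with weakly informative priors via
## rstanarm; stan_glm() accepts the same cbind() response as glm().
library(rstanarm)
bayes.fit <- stan_glm(cbind(success, fail) ~ treatment, data = dataset,
                      family = binomial(),
                      prior = normal(0, 2.5), prior_intercept = normal(0, 5))
posterior_interval(bayes.fit, prob = 0.95)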

* Regarding the elrm package that "claims" to do exact logistic regression via an MCMC approach: I have tried it in the past and was not sure that it approximates the exact solution really well. E.g. here is a simple example where it provides numerically rather different results than the SAS implementation, which I would tend to trust a lot:

library(elrm) 
elrmfit1 = elrm(y/n ~ treatment, 
                interest = ~treatment, 
                r=2, iter = 100000, burnIn = 500, 
                dataset = data.frame(y=c(0, 10), 
                                     n=c(10,10), 
                                     treatment=factor(c(0,1)))) 
summary(elrmfit1)

That produces an estimate of the log-odds ratio of 3.9839 with 95% CI from 2.119672 to Inf. In contrast, PROC LOGISTIC in SAS produces a log-odds ratio of 4.7723 with 95% CI 2.7564 to Infinity.

Björn
  • Doesn't Achim's answer cover some of this already (Firth regression)? For what it's worth, the elrm package does exact logistic regression via MCMC. There's also arm::bayesglm() for option 3. – Ben Bolker Mar 25 '22 at 16:12
  • Yes, it covers some of this. I know it is claimed that elrm does exact logistic regression, but I've never been sure that it's reliable. E.g. it mismatches the SAS results, which I implicitly trust a lot, in this example: elrmfit1 = elrm(y/n ~ treatment, interest = ~treatment, r=2, iter = 100000, burnIn = 500, dataset = data.frame(y=c(0, 10), n=c(10,10), treatment=factor(c(0,1)))); summary(elrmfit1) – Björn Mar 25 '22 at 16:44
  • elrm estimate 3.9839 with 95% Confidence Intervals for Parameters from 2.119672 to Inf. SAS: estimate 4.7723 with CI 2.7564 to Infinity. – Björn Mar 25 '22 at 16:45