Beta regression with success and failure raw data

Question

I am analyzing data from cancer patients who underwent surgery for cancer removal. During surgery, the surgeon checked a variable number of lymph nodes to see how many had cancer in them. This is reported as the number of cancer-positive lymph nodes ($a$) and cancer-negative lymph nodes ($b$), and can be expressed as the lymph node ratio ($\frac{a}{a+b}$), i.e. the proportion of lymph nodes positive for cancer in that patient. It is important to note that the total number of lymph nodes checked for each patient varies significantly (from 1 to 109).

I want to correlate the lymph node ratio with a molecular feature like gene mutation. Each patient either has or does not have the gene mutation. The idea is that higher lymph node ratio is an indicator of poor prognosis and we want to see if the gene mutation is associated with more aggressive tumors.

My initial thought was to use a t-test or Wilcoxon test, but based on my (limited) knowledge of the beta distribution, I think simply using the ratio throws away information about the ratio's uncertainty for each patient that is inherent to the count data. This makes me think beta regression is possibly the better approach.

I have found the beta regression library in R (betareg) that performs beta regression, but this accepts the outcome variable as a ratio. I can't tell how to incorporate the total number of lymph nodes into the estimation procedure or if this is even possible with beta regression. I wasn't sure if it would be appropriate to set the weights parameter to be the total number of lymph nodes in a similar fashion to how logistic regression with aggregate data is performed with glm.

Can the total number of successes and failures be incorporated into beta regression, and if so how?

Note that beta regression, at least in its usual implementation, does not handle cases in which 0 nodes or all nodes are positive. Binomial regression, as suggested in an answer, does. Do consider, however, that the number of lymph nodes sampled also can be a measure of disease severity, as a surgeon will generally dissect more nodes from an individual suspected of having more advanced disease. I'm not sure that only using the ratio of positive/total nodes is a good measure of severity here; you might be better off considering the actual number of positive nodes. — EdM, Dec 14 '23 at 16:44
This is a good idea - I'll make sure to investigate this further! — Tomas Bencomo, Dec 18 '23 at 00:30

PBulls · Accepted Answer · 2023-12-14T15:59:33.917

I do not think the previous - now removed - answer is correct.

The 'two-part' formula additionally models the scale parameter, i.e. heteroscedasticity. This has very little to do with the weight (foreshadowing) that you assign to each observation.

The larger issue is that, while an offset can technically be added to many GL(M)M implementations (it is just a constant in the linear predictor, after all), its interpretation only really makes sense under a log link. Recall that the offset $t$ usually enters your model like so:

$$ g(\mu)=X\beta+h(t) $$

Here, $g$ is the link function for the response and $h$ the link function for the offset. When both of these are logarithms this allows you to turn your absolute response into a rate:

$$ \log(\mu)=X\beta+\log(t) \Leftrightarrow \log(\mu/t)=X\beta $$

This only works because that's how logarithms subtract (becoming division in the original scale). If $g$ is the logit $\log(\mu/(1-\mu))$ there isn't really a function $h$ that makes this interpretation make sense as far as I know, especially not for your use case. See several other questions that discuss this as well.
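As a rough sketch of why the rate interpretation breaks down, here is the same manipulation under the logit link, keeping $h(t)$ as an unspecified transformation of the offset (same symbols as above):

$$ \log\frac{\mu}{1-\mu}=X\beta+h(t) \Leftrightarrow \frac{\mu}{1-\mu}=e^{h(t)}\,e^{X\beta} $$

Whatever $h$ is chosen, the offset multiplies the odds by a fixed factor; it does not divide $\mu$ by $t$, so no quantity like $\mu/t$ appears on the left-hand side.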

What to do then?

I would argue first and foremost that you don't really need beta regression: this is really just a collection of binomial experiments. You can model this perfectly fine using a binomial GLM, providing as the response a two-column matrix in which each row holds the number of 'successes' $x$ and 'failures' $y$ (positive/negative nodes) for one subject.

What this will do internally is calculate the row-wise proportion of success $x/(x+y)$, and then assign a weight to each observation based on $x+y$ (the total number of trials in that observation). You can do exactly the same in betareg by just providing the total number of nodes per subject in the weights argument. A simulation example:

```r
set.seed(1)
n <- 100
group <- sample(0:1, size=n, replace=TRUE)      # binary covariate
nodes <- sample(10:110, size=n, replace=TRUE)   # total nodes examined per subject
resp <- 1 + rbinom(n, nodes, .1+group*.3) ## Prevent prop == 0
prop <- resp/nodes
respmat <- cbind(resp, nodes-resp)              # columns: successes, failures

(fit1 <- betareg::betareg(prop ~ group, weights = nodes))
#> Coefficients (mean model with logit link):
#> (Intercept)        group
#>      -2.033        1.684

(fit2 <- glm(respmat ~ group, family = binomial))
#> Coefficients:
#> (Intercept)        group
#>      -2.041        1.692
```

Obviously the results do not match exactly because different likelihoods are being fit, but you can see that the predicted probabilities ($0.11$, $0.41$) match the theoretical ones ($0.10$, $0.40$) quite well. The discrepancy is further driven by the fact that I truncated the response to be >0, as the beta distribution does not support exactly $0$ or $1$ as an outcome (not a problem in the binomial model).
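The statement that the two-column response amounts to a proportion plus a weight of $x+y$ can be checked in base R. The following is a minimal sketch on simulated data, not the asker's data: the variable names (`pos`, `fit_counts`, `fit_weighted`, `p_hat`) are made up for illustration, and `group` stands in for the mutation indicator. Zeros are left in, since the binomial model handles them.

```r
set.seed(1)
n <- 100
group <- sample(0:1, size = n, replace = TRUE)     # hypothetical mutation indicator
nodes <- sample(10:110, size = n, replace = TRUE)  # total nodes examined
pos   <- rbinom(n, nodes, 0.1 + group * 0.3)       # positive nodes (zeros allowed)

# Two ways of specifying the same binomial GLM:
# (1) two-column response of successes and failures
fit_counts <- glm(cbind(pos, nodes - pos) ~ group, family = binomial)
# (2) proportion as response, number of trials as prior weights
fit_weighted <- glm(pos / nodes ~ group, family = binomial, weights = nodes)

# Both specifications should give the same coefficient estimates
all.equal(coef(fit_counts), coef(fit_weighted))

# Back-transform the logit-scale coefficients to group-wise probabilities
p_hat <- plogis(c(group0 = unname(coef(fit_counts)[1]),
                  group1 = sum(coef(fit_counts))))
p_hat
```

With real data, `pos` and `nodes - pos` would be replaced by the per-patient counts of positive and negative nodes ($a$ and $b$ in the question).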
