Modelling a proportion

Question

My outcome variable, 'sensitivity', is a continuous proportion ranging from 0 to 1, inclusive. For example, it indicates the percentage of instances in which my gold detector correctly identified the presence of gold when it was present. I do not have access to the original count data, but I know that each proportion comes from 100 instances originally.

Would doing the below be inappropriate?

Replace sensitivity values of exactly 0 or 1 with 0.001 and 0.999, respectively, so that the logit transformation can be applied.
Fit a linear regression model with logit(sensitivity) as the outcome variable, that is: lm(car::logit(sensitivity) ~ some predictors)

If this is fine, how can I interpret the coefficient? If this is not fine, why is that and what should be done instead?

Stack Exchange posts that ask similar questions but have no clear answer include:

When do coefficients estimated by logistic and logit-linear regression differ?
Dealing with 0,1 values in a beta regression
Regression with percentage response variable (ratio of two counts but the counts themselves are not available) in R (not clear on the interpretation for the 2nd solution)

I have my doubts that $4/8$ should be treated the same as $50000/100000$, even though both equal $1/2$. Excluding the size could lose a lot of information. — Dave, Jan 23 '24 at 18:27
Please edit the question to say more about how the (outcome) proportions are measured and what they represent. That will make a difference in the solutions that might be reasonable. — EdM, Jan 23 '24 at 18:34
@EdM I have added an example of what the proportion may represent — CyG, Jan 23 '24 at 18:55
Use a binary regression model with counts of successes and failures. The R documentation for glm explains how to do this. — Sycorax, Jan 23 '24 at 19:04
Why not just do a logistic regression with the raw success/failure observations? Aggregating the observations into a sensitivity measurement loses information, distorts the observations near the 0 and 1 boundaries, and just generally makes your life harder. — Nobody, Jan 23 '24 at 19:04
I edited the post to inform that I do not have access to the initial count data. However, I have tens of thousands of observations from the aggregated data. — CyG, Jan 23 '24 at 19:11
You have tens of thousands of aggregations, is that right? Do you have any sense of how many instances compose each aggregation? Is it reasonable to assume that they all have the same number of instances, whatever that is? — gung - Reinstate Monica, Jan 23 '24 at 19:24
@gung yes that's correct. Each aggregation is composed of 100 instances. I assume that each observation should have the same weight. — CyG, Jan 23 '24 at 19:28
If you have the sample proportions (the aggregations), and you know that each aggregation is composed of 100 instances, then you can immediately infer the counts! — Sycorax, Jan 23 '24 at 19:40

gung - Reinstate Monica · Accepted Answer · 2024-01-23T19:54:11.447

You don't actually have a continuous proportion. You have a discrete proportion: each value is a count of successes out of 100 trials. You are fortunate that you know the number of trials for each value (and that they are all the same). You should recover the underlying counts by multiplying the proportions by 100. Some rounding may be needed, but it is unlikely to have much impact, especially with a lot of data. From there, you can run a logistic regression with the counts of successes and failures. You can see a little of how this is done in my answers to "Difference in output between SAS's proc genmod and R's glm" or perhaps "Test logistic regression model using residual deviance and degrees of freedom".
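
The approach described above can be sketched in R roughly as follows. This is only an illustration: the predictor `x`, the simulated proportions, and the object names are hypothetical stand-ins, and the only fact taken from the question is that every proportion is based on 100 instances. The two-column response and the proportion-plus-weights form are both documented in `?glm` and should give the same fit.

```r
# Hypothetical data standing in for the real aggregations (100 instances each).
set.seed(42)
n_trials <- 100
x <- rnorm(1000)
sensitivity <- rbinom(1000, size = n_trials, prob = plogis(2 + 1 * x)) / n_trials

# Recover the counts; round() guards against floating-point noise.
successes <- round(sensitivity * n_trials)
failures  <- n_trials - successes

# Binomial GLM with a logit link: the response is a two-column matrix
# of successes and failures.
fit_counts <- glm(cbind(successes, failures) ~ x, family = binomial)

# Equivalent form: proportion as the response, number of trials as prior weights.
fit_weights <- glm(sensitivity ~ x, family = binomial,
                   weights = rep(n_trials, length(sensitivity)))

# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
coef(fit_counts)
exp(coef(fit_counts))
```

On interpretation: each slope is the change in the log odds of a correct detection per one-unit increase in the predictor, and its exponential is the corresponding odds ratio. Rows with 0/100 or 100/100 need no adjustment in either form.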

edited Jan 23 '24 at 19:54

answered Jan 23 '24 at 19:47

gung - Reinstate Monica

(+1) And to close the loop regarding OP’s kludge to avoid log(0), one notes that the regression will be sensitive to the choice of the small number chosen to avoid log(0). Regression on the counts avoids this source of bias. – Sycorax Jan 23 '24 at 22:11
Could you explain why? It seems that the matrix of successes and failures is just another way of looking at proportions? What's wrong with the logit transformation? What's right with using the matrix of successes and failures (for N = 100) with a logistic regression (glm)? – CyG Jan 24 '24 at 13:40
I tell you that I have a coin that shows heads 100% of the time. Do you believe me? What if I told you I flipped it one million times? Would you change your answer if I told you that I only flipped the coin once? One million out of one million has the same proportion as one out of one, but the precision of the estimator is wildly different. – Sycorax Jan 24 '24 at 14:50
Sure, however here don't we have the same weights for every observation (i.e. the coin was flipped 100 times each), so our outcome is essentially : Y = 10/100, 11/100, 0/100, 100/100, 95/100 etc...? – CyG Jan 24 '24 at 15:20
The logits of 100% and 0% are infinity and negative infinity, @CyG. To get out of this problem, you want to use some ad-hoc adjustment to just those two proportions, but leave all the others as they are. I don't know of a justification for this, nor do I know how to determine what the 'correct' adjustment should be. Why not just use the correct analysis and sidestep all of that? You have all of the information you need, you just need to perform one simple computation first. – gung - Reinstate Monica Jan 24 '24 at 18:11
@CyG Oh, I thought you were asking a different question. To see the problem with your kludge, scroll up to my first comment. – Sycorax Jan 24 '24 at 19:49
@CyG, note that my comment / reason is in addition to the also correct reason given by Sycorax that the SEs, CIs & p-values would be incorrect if each aggregation were treated as a single observation rather than comprised of 100 individual trials, even if you had no 100%s or 0%s in your dataset to worry about. – gung - Reinstate Monica Jan 24 '24 at 19:53
Thank you both for your comments. @Sycorax I am interested in understanding the difference between the two approaches. I understand that you are both recommending glm (binomial with logit link) on the success/failure counts. However, given that we have the same number of trials for each observation, what makes the glm approach correct and the lm approach inappropriate? Also, how come the glm using a logit link function can handle the 0/100 and 100/100? – CyG Jan 25 '24 at 23:43
The logit link is applied to the probability estimates, not the sample proportions, so there's no log(0). All of the other answers to these questions are in this comment thread. – Sycorax Jan 26 '24 at 00:03
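
The points raised in this comment thread can be checked with a small simulation. The sketch below uses made-up data (the predictor `x`, the true coefficients, and the helper `clamp` are all hypothetical) and compares the logit-then-lm approach under two arbitrary boundary adjustments with a binomial GLM on the counts. It also illustrates the coin example: the same observed proportion carries very different precision depending on the number of trials.

```r
# Hypothetical data: 100 instances per aggregation, as in the question.
set.seed(7)
n_trials <- 100
x <- rnorm(2000)
successes <- rbinom(2000, size = n_trials, prob = plogis(3 + 1.5 * x))
p <- successes / n_trials

# Logit-then-lm with two different ad hoc boundary adjustments.
# Only exact 0s and 1s are altered, because every other value lies in [0.01, 0.99].
logit <- function(q) log(q / (1 - q))
clamp <- function(q, eps) pmin(pmax(q, eps), 1 - eps)

fit_lm_a <- lm(logit(clamp(p, 0.001))  ~ x)   # 0 -> 0.001,  1 -> 0.999
fit_lm_b <- lm(logit(clamp(p, 0.0001)) ~ x)   # 0 -> 0.0001, 1 -> 0.9999

# The binomial GLM needs no adjustment: the logit link acts on the fitted
# probabilities, which stay strictly inside (0, 1), not on the raw proportions.
fit_glm <- glm(cbind(successes, n_trials - successes) ~ x, family = binomial)

# Compare the estimates; the two lm rows depend on the chosen adjustment.
rbind(lm_eps_0.001  = coef(fit_lm_a),
      lm_eps_0.0001 = coef(fit_lm_b),
      glm_counts    = coef(fit_glm))

# Coin example: identical proportions, very different precision.
ci_one     <- binom.test(1, 1)$conf.int
ci_million <- binom.test(1e6, 1e6)$conf.int
```

The `lm` coefficients shift when the replacement value changes because logit(0.999) and logit(0.9999) are far apart, and nothing in the data says which one to use.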
