
I am currently conducting a meta-analysis and have pooled the prevalence of a certain disease. I would like to check for any association between risk factors, such as gender, ethnicity, and disease classification (all of which I have input as proportions), and this prevalence. May I ask what the best way forward is?

Aaron Mai

1 Answer


Suppose we are interested in a model such as

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $y$ and $x$ are measurements taken on a number of samples. Now, if we introduce a third variable $z$, such as the number of subjects in each sample or the size of each population, and we form another model so that we are dealing with proportions, we could have

$$\frac{y}{z} = \beta_0 + \beta_1 \frac{x}{z} + \varepsilon$$

It should now be obvious that, since $z$ appears in the denominator on both sides, the two sides are "coupled"; hence the term mathematical coupling. A sample with a small $z$ inflates both $y/z$ and $x/z$ simultaneously, so the two ratios can appear correlated even when $x$ and $y$ are independent.

A simple example in R shows this. For simplicity, we simulate the variables independently from a standard normal distribution, starting with $x$ and $y$:

> set.seed(1)
> x <- rnorm(100)
> y <- rnorm(100)
> cor(x,y)
[1] -0.0009943199

...so the correlation is close to zero. Or in linear regression:

> summary(lm(y ~ x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8768 -0.6138 -0.1395  0.5394  2.3462 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.03769    0.09699  -0.389    0.698
x            -0.00106    0.10773  -0.010    0.992

Residual standard error: 0.9628 on 98 degrees of freedom
Multiple R-squared:  9.887e-07,  Adjusted R-squared:  -0.0102
F-statistic: 9.689e-05 on 1 and 98 DF,  p-value: 0.9922

...so the estimates are close to zero, and so is $R^2$.

Now we introduce a third variable:

> z <- rnorm(100)
> cor(x/z, y/z)
[1] 0.9168795

and suddenly the correlation is above 0.9. Or in regression:

> summary(lm(I(y/z) ~ I(x/z)))

Call:
lm(formula = I(y/z) ~ I(x/z))

Residuals:
    Min      1Q  Median      3Q     Max 
-45.996  -4.733  -2.784  -1.524 214.929 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.74090    2.53884    1.08    0.283    
I(x/z)       1.44965    0.06375   22.74   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 25.35 on 98 degrees of freedom
Multiple R-squared:  0.8407,  Adjusted R-squared:  0.839
F-statistic: 517.1 on 1 and 98 DF,  p-value: < 2.2e-16

...and the estimate for the slope is far from zero with a very small p-value, and $R^2$ is 0.8407, which is $0.9168795^2$, the square of the correlation computed above.
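
Since this is a simple regression with a single predictor, the multiple $R^2$ is just the squared Pearson correlation, so the two results can be checked against each other directly:

> cor(x/z, y/z)^2   # ≈ 0.8407, matching the Multiple R-squared above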

It is worth noting that this example is rather extreme, because all the variables are standard normal: with a mean-zero denominator, $1/z$ takes extreme values whenever $z$ is near zero, and these dominate both ratios, so the coupling effect is close to its worst case. When the variables are on different scales, have different variances, are of different types, or are correlated with each other, the effect of mathematical coupling is less pronounced, but it is still present.
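
As a minimal sketch of that point (not part of the original example; `z2` is a new variable introduced just for this check), give the shared denominator a mean far from zero, so that its relative variability is small; dividing by it then distorts $x$ and $y$ much less:

> z2 <- rnorm(100, mean = 10, sd = 1)  # denominator kept well away from zero
> cor(x/z2, y/z2)                      # typically close to cor(x, y), i.e. near zero

Because $1/z2$ is now nearly constant, the shared-denominator effect almost disappears.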

So extreme caution is advised when analysing associations between proportions or ratios that share a denominator.

Robert Long