
I have a data set of continuous proportions which depend on a fixed-effect factor, e.g.:

df <- data.frame(
   x = c(0.1,0.2,0.12,0.3,0.2,0.35,0.5,0.3,0.45,0.6,0.54,0.77),
   f = as.factor(c(rep("A", 6), rep("B", 6)))
)

I'm looking for a method to remove the effect of the factor, e.g. if the residuals were normally distributed, you could achieve this with:

fit <- lm(x ~ f, df)
# add the intercept (the estimated mean for level A) back onto the residuals
fit$residuals + fit$coef[1]

What would be the appropriate method for continuous proportion data? I would like the output (after removal of the estimated effect of the factor) to be on the same scale (0-1).

waferthin
  • How do the probability values arise? Are they counts divided by total counts? Or are they continuous proportions? – Glen_b Apr 21 '14 at 11:11
  • continuous proportions. – waferthin Apr 21 '14 at 11:45
  • There is no assumption in linear regression about the "data" being normally distributed - the assumption refers to the model residuals. Dummy variables are never going to be normally distributed. – wcampbell Apr 21 '14 at 13:55
  • My mistake - I've edited the question – waferthin Apr 21 '14 at 13:57
  • Your example just gives the values the data would have if they had all been in category A. Is that what you want? I'm not sure I would say it is the same as "remove[ing] the effect of the factor". – gung - Reinstate Monica Apr 21 '14 at 16:10
  • That wasn't clear, but yes the intention is to return a set of values as if they had all been in the first level of the factor. – waferthin Apr 21 '14 at 16:24
  • Why would you fit a linear, constant variance model to data restricted to be between 0 and 1? The bounds make the assumptions untenable, except in very restricted circumstances. – Glen_b Apr 21 '14 at 17:08
  • @Glen_b, with so few data & only 2 means, it's not clear it will make much difference in practice. An OLS regression just fits the 2 means, as would a LR w/ the right numbers of trials to make sense, & a beta regression does the same as well. – gung - Reinstate Monica Apr 21 '14 at 17:56
  • @gung Fair point, though it might make a difference when trying to satisfy the conditions of the final sentence in the question. – Glen_b Apr 21 '14 at 18:05
  • Just to be clear, the data is meant to serve as a simple example. In practice there will be more data points (hundreds not thousands though). The salient point to me is that a change of 0.01 to 0.02, or of 0.98 to 0.99, is modelled as more "significant" than e.g. 0.55 to 0.56 (if that makes sense). – waferthin Apr 21 '14 at 21:00
  • One possibility is to fit a beta regression model and add the B residuals (after rescaling for group A) to the predicted value for A (which is the mean proportion in condition A). After having fiddled with this for a while, I can get this to work for the constant precision model, but not for a beta regression model with varying precision. – gung - Reinstate Monica Apr 21 '14 at 21:41
  • Thanks @gung. I would be interested in seeing the code anyway. Since I'm not actually interested in estimating the effect of the factor per se, is it legitimate to logit transform the data, fit a standard linear model, then reverse-logit transform the residuals? – waferthin Apr 22 '14 at 08:01
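
A quick sketch of the logit-transform idea raised in the last comment, using base R's qlogis/plogis for the logit and inverse logit; whether this is legitimate is exactly what the question asks, so it is shown only as an illustration, not as a recommended method:

# logit-transform, fit the linear model, shift residuals to level A, back-transform
lfit <- lm(qlogis(x) ~ f, data = df)
plogis(residuals(lfit) + coef(lfit)[1])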

1 Answer


The big issue, I think, is that you need a model of the data to work with. This could be a regression model or simply a distribution fitted to your data. Your data could theoretically be consistent with a number of models / distributions (such as the normal distribution, proportions of successes arising from a binomial distribution with the appropriate total numbers of trials, or a beta distribution). Since you say that these are fake data just to illustrate the problem, and that you have continuous proportions, I will simply go with the idea that these proportions come from a beta distribution.

In R, you can model beta distributed data using the betareg package. You should read the vignette (pdf), which is very helpful for understanding the ideas behind beta regression and how to use the package. Note in particular that it parameterizes the beta distribution differently from the typical parameterization. Instead of using $\alpha$ and $\beta$ (which are, e.g., the shape1 and shape2 arguments in the standard ?pbeta, etc., functions), it uses the mean proportion, $\mu$, and the precision, $\phi$, which are defined as follows:

\begin{align}
\mu &= \frac{\alpha}{\alpha + \beta}, & \phi &= \alpha + \beta, \\
\text{and thus,}\qquad \alpha &= \mu\times\phi, & \beta &= \phi - \alpha.
\end{align}

We can use these equations to move between the two parameterizations and therefore the two sets of functions. Then we can link the modeling in betareg to the distribution and quantile functions. The operative transformation will be based on an analogy to how qq-plots can be used to compare two distributions.
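
To make the correspondence concrete, here is a minimal pair of helper functions implementing these conversions (the names muphi_to_shapes and shapes_to_muphi are mine, purely for illustration, and are not part of betareg):

# convert between the (mu, phi) parameterization used by betareg and the
# (shape1, shape2) parameterization used by pbeta()/qbeta()
muphi_to_shapes = function(mu, phi){
  alpha = mu * phi
  beta  = phi - alpha
  c(shape1 = alpha, shape2 = beta)
}
shapes_to_muphi = function(alpha, beta){
  c(mu = alpha / (alpha + beta), phi = alpha + beta)
}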

First, let's model these data with betareg and determine if the precision is constant between the groups:

library(betareg)
library(lmtest)
  # test of constant precision based on the beta distribution:
cp = betareg(x~f, df)     # constant precision across groups
vp = betareg(x~f|f, df)   # precision allowed to vary with f
lrtest(cp, vp)
# Likelihood ratio test
# 
# Model 1: x ~ f
# Model 2: x ~ f | f
#   #Df LogLik Df  Chisq Pr(>Chisq)
# 1   3 9.2136                     
# 2   4 9.4793  1 0.5314      0.466

For these (fake) data, there is little reason to believe that the precision differs, but for your real data, they might. So I will demonstrate the more complicated version, which you could simplify if you want. At any rate, the next step is to determine the estimated $\alpha$ (shape1) and $\beta$ (shape2) parameters for A and B by converting the model's coefficients to the distribution parameters (don't forget the link functions!) and using the above formulas to convert between the two parameterizations:

summary(vp)
# ... 
# Coefficients (mean model with logit link):
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)  -1.3153     0.2165  -6.074 1.25e-09 ***
# fB            1.4260     0.3178   4.487 7.23e-06 ***
#   
# Phi coefficients (precision model with log link):
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   3.0094     0.5695   5.284 1.26e-07 ***
# fB           -0.5848     0.7943  -0.736    0.462    
# ... 
lo.to.prop = function(x){  # inverse logit: convert log-odds to a proportion
  o = exp(x)
  p = o / (o+1)
  return(p)
}
  # alpha = mu*phi and beta = phi - alpha, computed on the response scale
  # (inverse logit for the mean model, exp for the log-link precision model):
alpha.A = lo.to.prop(coef(vp)[1]            ) * exp(coef(vp)[3]            )
alpha.B = lo.to.prop(coef(vp)[2]+coef(vp)[1]) * exp(coef(vp)[4]+coef(vp)[3])
beta.A  = exp(       coef(vp)[3]            ) - alpha.A
beta.B  = exp(       coef(vp)[4]+coef(vp)[3]) - alpha.B

Due to the complexity of this, we might do a common-sense check on the final values. (This also suggests another, simpler way to get these parameters, if you don't need to model them and/or don't need to test whether the precisions are the same.)

cbind(c(alpha.A, beta.A), c(alpha.B, beta.B))
#                        [,1]     [,2]
# (Intercept)        4.290073 5.960932
# (phi)_(Intercept) 15.984533 5.336616

  # check: fit each group directly by maximum likelihood, in the standard (shape1, shape2) parameterization:
library(fitdistrplus)
fitdist(with(df, x[f=="A"]), "beta")
# Fitting of the distribution ' beta ' by maximum likelihood 
# Parameters:
#         estimate Std. Error
# shape1  4.290883   2.389871
# shape2 15.987505   9.295580
fitdist(with(df, x[f=="B"]), "beta")
# Fitting of the distribution ' beta ' by maximum likelihood 
# Parameters:
#        estimate Std. Error
# shape1 5.960404   3.379447
# shape2 5.335963   3.010541

Now, we convert the raw proportions in B to their corresponding cumulative probabilities under B's fitted beta distribution (that is, how far through their own beta distribution they lie), and then convert those probabilities to the corresponding quantiles of A's beta distribution:

p.B      = pbeta(df$x[df$f=="B"], shape1=alpha.B, shape2=beta.B)  # CDF position within B's distribution
q.B.as.A = qbeta(p.B,             shape1=alpha.A, shape2=beta.A)  # corresponding quantiles of A's distribution

For a final step, let's take a look at the new values:

cbind(df$x[df$f=="B"], q.B.as.A)
#             q.B.as.A
# [1,] 0.50 0.18502881
# [2,] 0.30 0.08784683
# [3,] 0.45 0.15784496
# [4,] 0.60 0.24648412
# [5,] 0.54 0.20838604
# [6,] 0.77 0.38246863
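
One way to see what this mapping does is to plot the original B values against their re-expressed counterparts; this is only a minimal base-graphics sketch, not part of the original code:

# original B proportions vs. their values mapped onto A's distribution
plot(df$x[df$f=="B"], q.B.as.A,
     xlab="original proportion (group B)", ylab="mapped onto A's distribution",
     xlim=c(0,1), ylim=c(0,1))
abline(0, 1, lty=2)   # 45-degree reference line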
