I have a categorical factor with 100 levels and 100 different proportions. I would like to test (a) whether these proportions differ from 50%, and (b) if any of the levels in particular differ more from 50% than others.

I was thinking I could use a binomial generalized linear model to predict the proportions. The intercept in the regression model would tell me whether the proportions differ from 50% (an intercept of 0 on the logit scale), right?

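set.seed(1)                  # make the simulated proportions reproducible
props <- runif(148, 0, 1)    # one simulated proportion per level
x <- 1:148                   # level identifier
df <- data.frame(x, props)
m <- glm(props ~ 1, data = df, family = binomial)  # intercept-only model
summary(m)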

The intercept above is on the log-odds scale, so testing it against 0 would indicate whether the average proportion differs from 50%.
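To make the scale of the intercept concrete, here is a minimal sketch on simulated data like that in the question. It swaps in `family = quasibinomial` (an assumption, not the code above) only to avoid the non-integer warning that `binomial` gives for fractional outcomes. The names `b0` and `p_hat` are illustrative.

```r
# Simulated proportions, mirroring the setup in the question
set.seed(1)
props <- runif(148, 0, 1)
df <- data.frame(x = 1:148, props)

# Intercept-only model; quasibinomial is an assumption made here so that
# fractional outcomes do not trigger the "non-integer #successes" warning
m <- glm(props ~ 1, data = df, family = quasibinomial)

b0 <- unname(coef(m)[1])   # intercept on the log-odds (logit) scale
p_hat <- plogis(b0)        # back-transformed to the proportion scale

# A log-odds of 0 corresponds to a proportion of 0.5
plogis(0)

# The reported test of the intercept is against log-odds 0, i.e. proportion 0.5
summary(m)
```

In an intercept-only model, `p_hat` is simply the mean of the observed proportions, so this tests the average against 50% and says nothing about individual levels.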

But how would I be able to tell whether any of the proportions in particular differ more or less from 50% than others? I was considering an effects coded categorical factor but that seems potentially odd since the categorical indicator/level is different for every single row. It also provides a ton of different coefficients.

df$group <- as.factor(df$x)                # one level per row
contrasts(df$group) <- contr.sum(148)      # effects (sum-to-zero) coding
m <- glm(props ~ group, data = df, family = binomial)
summary(m)
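One way to see why this model feels odd: with one observation per level, it has as many coefficients as rows, so it is saturated and leaves no residual degrees of freedom for judging any single level. A minimal sketch, assuming the simulated data from the question and treating `x` as the group identifier (`m_sat`, `n_coef`, and `res_df` are illustrative names):

```r
# Simulated data: one proportion per group, as in the question
set.seed(1)
props <- runif(148, 0, 1)
df <- data.frame(x = 1:148, props)

# Assumption: the group identifier is the row index x
df$group <- factor(df$x)
contrasts(df$group) <- contr.sum(nlevels(df$group))  # effects coding

# binomial with fractional outcomes warns about non-integer successes;
# the warning is suppressed here only to keep the sketch quiet
m_sat <- suppressWarnings(glm(props ~ group, data = df, family = binomial))

n_coef <- length(coef(m_sat))   # intercept + 147 group effects = 148
res_df <- df.residual(m_sat)    # observations minus coefficients = 0
c(coefficients = n_coef, residual_df = res_df)
```

With zero residual degrees of freedom the fit just reproduces the observed proportions, so the per-group coefficients carry no usable information about sampling variability.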

For each group, I only have one observation. This may be odd, but this was originally multilevel data with one instance of each group within each cluster. I wanted to focus on the groups and on whether the average count (i.e., proportion) per group differs from 50%, so I collapsed the counts/binaries across clusters, giving me these proportions.
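The collapsing step described here can be sketched as follows. Everything in it is hypothetical: the number of clusters (`n_clusters`), the simulated binary outcome `y`, and the column names. The sketch keeps the counts behind each proportion (successes `k` out of `n`), which a binomial model can use directly, and it ignores any cluster-level dependence in the original multilevel data.

```r
# Hypothetical long-format data: one binary outcome per group within each cluster
set.seed(1)
n_groups <- 148
n_clusters <- 30   # assumed; not stated in the question
long <- expand.grid(cluster = seq_len(n_clusters), group = seq_len(n_groups))
long$y <- rbinom(nrow(long), size = 1, prob = 0.5)

# Collapse across clusters: successes k out of n trials per group
agg <- aggregate(y ~ group, data = long, FUN = sum)
names(agg)[2] <- "k"
agg$n <- n_clusters
agg$prop <- agg$k / agg$n

# (a) Overall test against 50%: intercept-only binomial model on the counts
m <- glm(cbind(k, n - k) ~ 1, data = agg, family = binomial)

# (b) Per-group exact binomial tests against 0.5, Holm-adjusted for multiplicity
p_raw <- mapply(function(k, n) binom.test(k, n, p = 0.5)$p.value, agg$k, agg$n)
p_holm <- p.adjust(p_raw, method = "holm")
head(agg[order(p_holm), ])
```

Whether per-group tests on collapsed counts are appropriate depends on how strongly outcomes are correlated within clusters, which this sketch does not model.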

Is there a better way to model this? I was considering a χ² goodness-of-fit test, but the expected proportions there need to sum to 1, which would not be the case if the expected proportions are all equal to .50.
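On the sum-to-1 concern, one possible reading is that the constraint applies within each group rather than across groups: each group has two outcome categories (success, failure) with expected proportions 0.5 and 0.5, which do sum to 1. A minimal sketch under that reading, assuming hypothetical counts `k` out of `n` per group and independent groups:

```r
# Hypothetical counts: k successes out of n trials in each of G groups
set.seed(1)
G <- 148
n <- rep(30, G)              # assumed number of trials per group
k <- rbinom(G, n, 0.5)

# Per-group Pearson statistic with expected counts n/2 (success) and n/2 (failure)
x2_g <- (k - n / 2)^2 / (n / 2) + ((n - k) - n / 2)^2 / (n / 2)

# The same statistic for the first group via chisq.test, for comparison
x2_first <- unname(chisq.test(c(k[1], n[1] - k[1]), p = c(0.5, 0.5))$statistic)

# Summed over independent groups: approximately chi-squared with G df under H0
x2_total <- sum(x2_g)
p_total <- pchisq(x2_total, df = G, lower.tail = FALSE)
c(statistic = x2_total, df = G, p.value = p_total)
```

The summed statistic addresses whether the proportions as a set are compatible with 50%, while the largest per-group contributions in `x2_g` point to the levels that deviate most.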

JElder
  • What form of data do you have: continuous proportions, or 0s and 1s? The way you simulate the data, you seem to have continuous proportions, in which case a binomial model will not work. Also, for each group you have only one observation. Is that the case with your real data? – kjetil b halvorsen May 05 '22 at 14:34
  • Hi, thanks for the response. I have values between 0 and 1. I was actually not sure what the best way to model proportions as a DV would be. I found multiple resources online indicating that a (quasi-)binomial or Poisson model would work. Is that not the case, and would you recommend a different model for proportions? – JElder May 05 '22 at 18:43
  • For each group, I indeed only have one observation. I know this is odd. This was originally multilevel data with one instance of each group within each cluster. I wanted to focus on the groups and on whether the average count (i.e., proportion) differs from 50%, so I collapsed across clusters. Do you have a different recommendation? – JElder May 05 '22 at 18:45
  • Links suggesting binomial reg for proportion outcome: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860877/#:~:text=The%20logistic%20regression%20model%20is,of%20a%20logit%20link%20function.

    https://stats.oarc.ucla.edu/stata/faq/how-does-one-do-regression-when-the-dependent-variable-is-a-proportion/

    https://stats.stackexchange.com/questions/89734/glm-for-proportion-data-in-r

    – JElder May 05 '22 at 18:58
  • Yes, in principle you can use logistic regression with proportion data (called a fractional response model); a better link is https://stats.stackexchange.com/questions/216122/what-is-the-difference-between-logistic-regression-and-fractional-response-regre. But I don't think that is a good idea in your case, with 1 obs per group ... Please edit your post with all the new info you have given in comments, and more. What is your goal? – kjetil b halvorsen May 06 '22 at 03:12

0 Answers