
What would be the best way to analyse proportions of multiple classes?

The classes are mutually exclusive and unordered: each observation falls into exactly one of the classes (i.e. no other class is possible).

For example, consider the following data, where observations within the same experiment are dependent (i.e. for the same Exp, A + B + C = 100% in each condition):

Percentage  Counts  Class  Group    Exp
        80    2475      A  Control    1
        81    2906      A  Control    2
        76    5102      A  Control    3
        11     351      B  Control    1
        11     405      B  Control    2
        14     938      B  Control    3
         9     280      C  Control    1
         8     282      C  Control    2
        10     639      C  Control    3
        75    2380      A  Treated    1
        76    2584      A  Treated    2
        69    2210      A  Treated    3
        15     480      B  Treated    1
        15     510      B  Treated    2
        16     530      B  Treated    3
        10     310      C  Treated    1
         9     290      C  Treated    2
        15     480      C  Treated    3
prop_data <- data.frame(
  Percentage = c(80, 81, 76,
                 11, 11, 14,
                 9, 8, 10,
                 75, 76, 69,
                 15, 15, 16,
                 10, 9, 15),
  Counts = c(2475, 2906, 5102,
             351, 405, 938,
             280, 282, 639,
             2380, 2584, 2210,
             480, 510, 530,
             310, 290, 480),
  Class = factor(rep(rep(c("A", "B", "C"), each = 3), times = 2)),
  Group = factor(rep(c("Control", "Treated"), each = 9)),
  Exp = rep(1:3, times = 6)
)

I would like to see whether the proportions of A, B and C depend on the Group.

I thought of doing something like

model <- glm(Percentage ~ Class * Group, data = prop_data, family = poisson)

I am not, however, completely sure whether this is correct. For example, I am not dealing with counts but rather with percentages (so, not integers) here. Also, should I include the experiments as a random factor (I tried using glmer but I get a singular fit)?

I have also considered using a multinomial model but am not quite sure how to set it up.

EDIT:

I have added raw counts to the R code, as suggested in the comments

nico
    How is the percentage calculated? Do you have the raw counts or data from which this was computed? – Demetri Pananos Aug 22 '22 at 14:35
  • Yes, I can get raw counts – nico Aug 22 '22 at 19:05
    If you could post those instead, I think that would help a lot more. – Demetri Pananos Aug 22 '22 at 19:08
  • OK, I'll post them tomorrow (don't have them with me right now!) – nico Aug 22 '22 at 19:10
  • The Class labels in your original table don't agree with the annotations of the Percentage lines in your construction of prop_data, but they do agree with the way the Class values are set up in Class=factor(rep(c("A", "B", "C"), 3)). To line up instead with the annotations of the Percentage values, you would need Class=factor(rep(c("A", "B", "C"), each=3)). Which do you want? Makes a big difference in the results. – EdM Sep 18 '22 at 17:13
  • @EdM oh, sorry, yes forget about the comments my bad. This is not the real data anyway, I just want to get a general answer, don't care which way the results go in this particular example. – nico Sep 19 '22 at 07:43
  • There is still something that doesn't make sense in the sample data. You say "for the same Exp A+B+C = 100% in each condition" in the question, but what you have in your table is the sum of A+A+A = B+B+B = C+C+C =100% within each Group and Exp. I think that what you intend is for Class to be a multinomial outcome, but your example data aren't consistent with that. Why not show the real data? – EdM Sep 19 '22 at 12:31
  • @EdM Because the real data has other complications which are irrelevant to the question and I just want a theoretical answer on how to approach this type of question. Now the data is fixed, apologies for the mixup, I edited it in a hurry and did the wrong thing. – nico Sep 19 '22 at 13:50
    Is there an inherent ordering to the Class values "A," "B," and "C," (like one is preferable to the next etc.) or are they simply 3 potential outcome classes without any order? As the answer depends on that, please edit the question to provide that information; comments are easy to overlook and can be deleted. – EdM Sep 19 '22 at 14:24
  • @EdM: added to the question, no specific ordering and each element is in one of the 3 classes (there is no other possible options, nor missing values) – nico Sep 20 '22 at 14:03

1 Answer


With "A," "B," and "C" as mutually exclusive and unordered outcome classes in the trials of each combination of experiment (Exp) and treatment Group, this is a standard application of multinomial logistic regression.

One Class is taken as the reference, and the regression coefficients then express predictor-associated differences in log-odds of each of the other classes with respect to that reference class. This differs from a set of separate binomial regressions against the reference class, as the probabilities among all classes are constrained to sum to 1. The principles are nicely outlined on this UCLA web page, which recommends the multinom() function in the R nnet package.
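To make the mapping from log-odds to probabilities concrete, here is a small sketch (my illustration, not from the linked page). It plugs in the reference-cell intercepts that the fitted model reports further down; the softmax transform recovers the class probabilities:

```r
# Softmax sketch: with "A" as reference, eta_A is fixed at 0 and the
# intercepts give eta_B and eta_C at the reference predictor values
# (Group "Control", Exp 1). Intercepts taken from the fitted model below.
eta  <- c(A = 0, B = -1.953216, C = -2.179205)
prob <- exp(eta) / sum(exp(eta))
round(prob, 3)
# A 0.797, B 0.113, C 0.090 -- the observed Control proportions in Exp 1
```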

As your data are already aggregated, you can use the Counts as weights for the outcomes. Treat Exp as a fixed effect, after making sure it is interpreted as a factor rather than as numeric. The following allows the effect of the treatment Group to differ among experiments.

library(nnet)
prop_data$Exp <- factor(prop_data$Exp)
mnMod <- multinom(Class ~ Group * Exp, data = prop_data, weights = Counts)
## some output about convergence is omitted
summary(mnMod)$coefficients
##   (Intercept) GroupTreated        Exp2      Exp3
## B   -1.953216    0.3521456 -0.01742925 0.2595691
## C   -2.179205    0.1409320 -0.15352828 0.1017433
##   GroupTreated:Exp2 GroupTreated:Exp3
## B      -0.004179719       -0.08637392
## C       0.004549808        0.40957042

The overall reference is for Class "A" under Group "Control" in Exp 1. Intercepts for "B" and "C" are log-odds differences from "A" at those reference predictor values. Regression coefficients are further log-odds differences associated with Group, Exp, and the interactions between them.
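As a quick sanity check (my addition): the full Group * Exp model is saturated for these six cells, so the "B" intercept should equal the observed log-odds of "B" versus "A" in the reference cell (Control, Exp 1):

```r
# Observed log-odds of "B" vs "A" for Control in Exp 1 (counts 351 vs 2475)
log(351 / 2475)
# about -1.953, matching the "B" intercept in the coefficient table
```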

For these sample data you would also need to pay attention to systematic differences among the experiments, as some Exp and Group:Exp (interaction) coefficients are statistically significant:

summary(mnMod)$coefficients/summary(mnMod)$standard.errors
##   (Intercept) GroupTreated       Exp2     Exp3
## B   -34.24556     4.641310 -0.2237779 3.862921
## C   -34.56243     1.614329 -1.7310543 1.343344
##   GroupTreated:Exp2 GroupTreated:Exp3
## B       -0.04000224        -0.8928776
## C        0.03672621         3.7516681

The coefficient estimates are assumed to have a limiting normal distribution, so these ratios to their standard errors are z-scores. A ratio with absolute value > 1.96 would be considered significant at p < 0.05.
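For completeness (not in the original answer), the corresponding two-sided p-values can be computed with pnorm():

```r
# Two-sided p-value from a z-score under the normal approximation
2 * pnorm(-abs(1.96))      # about 0.05, the conventional cutoff
2 * pnorm(-abs(4.641310))  # GroupTreated z-score for "B": far below 0.05
```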

You might wonder what happened to all of the information about Class "A." That's implicit in the constraint that the probabilities among all classes sum to 1.

Post-modeling software like the emmeans package allows you to extract that information and perform appropriate post-hoc tests. For example, if one could ignore the experiment-specific outcomes and average over the experiments, you would get the following report of Class probabilities and standard errors as a function of Group:

library(emmeans)
emmeans(mnMod,~Class|Group)
## Group = Control:
##  Class   prob      SE df lower.CL upper.CL
##  A     0.7898 0.00368 12   0.7818   0.7979
##  B     0.1221 0.00295 12   0.1156   0.1285
##  C     0.0881 0.00257 12   0.0825   0.0937
## 
## Group = Treated:
##  Class   prob      SE df lower.CL upper.CL
##  A     0.7336 0.00446 12   0.7239   0.7433
##  B     0.1556 0.00367 12   0.1476   0.1636
##  C     0.1109 0.00317 12   0.1039   0.1178
## 
## Results are averaged over the levels of: Exp 
## Confidence level used: 0.95 

That suggests that "Treated" leads to an increase in probabilities of "B" and "C" with a corresponding decrease in probability of "A."
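A direct way to test the original question (do the Class proportions depend on Group at all?), not shown above, is a likelihood-ratio test between nested multinom() fits; anova() on multinom objects performs this. A self-contained sketch, rebuilding the question's aggregated data:

```r
library(nnet)

# Rebuild the aggregated data from the question so this runs standalone
prop_data <- data.frame(
  Counts = c(2475, 2906, 5102, 351, 405, 938, 280, 282, 639,
             2380, 2584, 2210, 480, 510, 530, 310, 290, 480),
  Class  = factor(rep(rep(c("A", "B", "C"), each = 3), times = 2)),
  Group  = factor(rep(c("Control", "Treated"), each = 9)),
  Exp    = factor(rep(1:3, times = 6))
)

# Null model: proportions vary by experiment only; full model adds Group
fit0 <- multinom(Class ~ Exp, data = prop_data, weights = Counts, trace = FALSE)
fit1 <- multinom(Class ~ Group * Exp, data = prop_data, weights = Counts, trace = FALSE)

anova(fit0, fit1)  # likelihood-ratio chi-square test for the Group terms
```

The deviance difference is referred to a chi-square distribution with 6 degrees of freedom here (2 non-reference classes times 3 dropped Group terms).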

EdM
  • Thank you for the extensive answer @EdM. I am wondering, though, why you would choose to treat the experiment as a fixed effect rather than a random effect, given the nested nature of the data (and also the fact that the experiment is not really a predictor, but rather a nuisance factor)? – nico Sep 21 '22 at 08:16
  • @nico First, one typically only uses random effects when there are about 6 or more clusters. The sample data only had 3 clusters (experiments). Second, I don't think that multinom() in nnet allows for random effects. I thought it best to use the function explained by the UCLA web page, to get at the main point of the question--how to deal with a multinomial outcome. – EdM Sep 21 '22 at 13:35
  • Thanks EdM, I think I will have to dig a bit more into glmer for the random effects bit. I am not quite sure I agree with using random effects only if there are 6 or more clusters. I would expect the effects of pseudoreplication to be more problematic when there are fewer groups as there is likely to be a large inflating of degrees of freedom. – nico Sep 22 '22 at 13:06
  • @nico see this page and its links. The difference is whether you fit intercepts/slopes individually per cluster (fixed effects) or as a group with Gaussian distributions (random effects). Random effects will be poorly estimated with few clusters. To avoid pseudoreplication problems, set up statistical tests properly in a way that respects the hierarchy of the experimental design. For example, if different clusters receive different treatments, the treatment effects need to be compared against the among-cluster variance, not within-cluster. – EdM Sep 22 '22 at 14:19