
I've been trying to analyze the results of my experiment, but since I'm new to statistics, I'm struggling at every step, including interpreting the results.

I have 4 groups of subjects, and each subject made a choice among 3 options (A, B, or No choice). I want to show that being in Group 4 changes the choice from B to A. To test this hypothesis, I tried logistic regression.

Since my dependent variable has three outcomes (A, B, or NoChoice), that is, more than two, I used multinomial logistic regression. And since my independent variable is also nominal (Groups 1 to 4), I used one-hot encoding as shown below.

Group1 Group2 Group3 Group4 Choice
1 0 0 0 A
0 1 0 0 A
0 1 0 0 B
0 0 0 1 A
0 0 1 0 No choice
1 0 0 0 A
0 1 0 0 A

Because I used one-hot encoding, I dropped the Group1 column and proceeded with Group2, Group3 and Group4 to avoid multicollinearity when fitting the regression model.

Using R, I got the result below (I used choice "B" as the reference level):
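As an aside, R builds this kind of coding automatically when the predictor is a factor. A minimal sketch, assuming the groups are stored in a single hypothetical vector `group` rather than in four indicator columns (the labels below mirror the small table above and are for illustration only):

```r
# Hypothetical group labels for seven subjects
group <- factor(c("1", "2", "2", "4", "3", "1", "2"))

# model.matrix() expands the factor into an intercept plus one indicator
# column per non-reference level; level "1" is the reference by default
X <- model.matrix(~ group)
colnames(X)

# relevel() changes the reference level, e.g. to make Group 4 the baseline
group_ref4 <- relevel(group, ref = "4")
colnames(model.matrix(~ group_ref4))
```

With this approach the dropped column is simply the reference level of the factor, so there is no need to create or remove indicator columns by hand.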

Call:
multinom(formula = Choice ~ Group2 + Group3 + 
    Group4, data = mydata)

Coefficients:
         (Intercept)     Group2     Group3    Group4
A          0.1004105 -0.3031104 -0.3030879 -1.198999
NoChoice   7.7997855 -6.6829425 -7.2990311 -7.029537

Std. Errors:
         (Intercept)      Group2      Group3      Group4
A          0.8790567   0.5607085   0.6935957   0.5222399
NoChoice 111.4725793 111.4715781 111.4716982 111.4717650

Residual Deviance: 96.76822
AIC: 112.7682

Then, to check statistical significance, I calculated Wald z-statistics to get the p-values below:

> zvalues <- summary(test)$coefficients /summary(test)$standard.errors
> pvalues <- pnorm(abs(zvalues), lower.tail = FALSE) * 2
> pvalues
         (Intercept)        Group2      Group3      Group4
A          0.9090592     0.5887939   0.6621254   0.02168292
NoChoice   0.9442172     0.9521939   0.9477928   0.94971781
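The Group4 entry in row A can be reproduced by hand from the printed coefficient and standard error, which is a useful sanity check on the formula. The sketch below uses only the numbers shown in the output above; the variable names are made up for illustration.

```r
# Coefficient and standard error for Group4 in row A, copied from the output
coef_A_group4 <- -1.198999
se_A_group4   <- 0.5222399

# Wald z statistic and two-sided p-value
z <- coef_A_group4 / se_A_group4
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
p

# Exponentiating the coefficient gives the odds of choosing A rather than
# the reference choice B in Group 4, relative to the dropped Group 1
odds_ratio <- exp(coef_A_group4)
odds_ratio
```

The negative coefficient corresponds to an odds ratio below 1, i.e. lower odds of A relative to B in Group 4 than in Group 1.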

But I'm confused by the result because I haven't yet found another example where both the x and y variables are nominal. So here are my questions:

  1. When I report this result, can I say anything about the effect of being in Group 1?

  2. Since the p-value is below 0.05 only in the upper row (A) for Group 4, is it correct to say that being in Group 4 significantly affects the shift in choice from B to A, whereas being in Group 2 or 3 has no significant effect?

  3. How do I interpret the Intercept? Can it be interpreted as the result for Group 1?

  • I found out that one-hot encoding plus dropping one column is called "dummy coding", and that the coefficients for groups 2, 3 and 4 are relative to group 1. So question 1 is solved. – Roas Clack May 13 '22 at 01:35
  • You say that your hypothesis is "Being in Group4 changes choice from B to A". But each subject belongs to one group and makes one choice. Then how does anyone change their choice? I'm confused by your language about a shift when actually all you can compare are the probabilities to choose B (as the first and only choice) across different groups. – dipetkov May 14 '22 at 13:36
  • This sounds like it is more a contingency table analysis where you would show association between group and choice. You would need counts in the cells – Ralph Winters May 14 '22 at 17:40
  • @Ralph Winters It's true that a chi-square test can test the hypothesis that probability of B is the same in all groups. The difficulty is going beyond a test for independence to estimate the probabilities in each group and the differences between groups. It's also true that estimation might be more than what the question is asking. – dipetkov May 14 '22 at 19:01
  • @dipetkov Yes, this may or may not be a regression problem. I'm a little unclear on what the OP is saying. I am just offering association as a way to look at the relationship between groups and choices as an exploratory step – Ralph Winters May 14 '22 at 23:44

1 Answer


Comment: As @Ralph Winters points out, the question "Is the probability of choosing B the same in all four groups?" can be answered by performing a chi-squared test of independence. This analysis can stand on its own and might be all the OP needs to complete their study. On the other hand, we might want to know how the probability of B differs across groups, e.g. in what direction (less/more) and by how much. To answer such questions, we estimate the probabilities in each group and compare them.
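A minimal sketch of such a test, using made-up counts (with real data the table would come from cross-tabulating group against choice, e.g. with `table()`):

```r
# Hypothetical 4 x 3 table of counts: rows are groups, columns are choices
counts <- matrix(
  c(25,  9, 6,
    24, 10, 6,
    23, 11, 6,
    10, 24, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(Group  = c("1", "2", "3", "4"),
                  Choice = c("A", "B", "No choice"))
)

# Chi-squared test of independence between Group and Choice
test <- chisq.test(counts)
test$statistic
test$parameter  # degrees of freedom: (4 - 1) * (3 - 1)
test$p.value

# Expected counts under independence; very small values would make
# the chi-squared approximation unreliable
test$expected
```

A small p-value here only says that choice and group are associated; it does not say which group differs or by how much, which is what the contrasts below address.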


You want to compare the levels of a categorical variable in terms of their effect on the response. It's easier to make such comparisons in terms of contrasts than in terms of regression coefficients.

Specifically, you want to compare one of four groups, Group 4, to the other groups, Groups 1 to 3, in terms of the probability that a participant chooses "B" given the choices "A", "B" and "Neither".

You already know that, since the outcome is one of three pre-determined categories, an appropriate model is multinomial regression. So I focus on how to estimate the contrasts between Group 4 and the other three groups.

In fact, I formulate two comparisons:

[Q1] Does a participant from Group 4 choose B with a higher probability than participants in groups 1, 2 and 3?
[Q2] Does a participant from Group 4 choose B with a higher probability than the average probability for groups 1, 2 and 3?

The answers to these questions could be different if the probability of choosing B, $P_g(B)$, varies among groups $g \in \{1,2,3\}$.

I use the emmeans package to answer questions Q1 and Q2 in terms of contrasts. For even more fun I create an uneven sample: Groups 1, 2, 3 and 4 account for 12.5%, 12.5%, 50% and 25% of the data, respectively. I specify the following probabilities for choosing "A", "B" and "Neither" in each group.

sizes <- n * c(.125, .125, 0.5, 0.25)

probs_Group1 <- c(0.8, 0.2, 0.1) # Probabilities of A, B, Neither
probs_Group2 <- c(0.8, 0.2, 0.1)
probs_Group3 <- c(0.4, 0.5, 0.1)
probs_Group4 <- c(0.3, 0.6, 0.1)
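One detail worth checking: the vectors for Groups 1 and 2 sum to 1.1 rather than 1. `rmultinom()` normalizes its `prob` argument internally, so the simulation still runs, but the effective probabilities differ slightly from the nominal ones. A small sketch of that normalization (the names on the vector are added here for readability):

```r
# Nominal probabilities for Group 1 (and Group 2), as specified above
probs_Group1 <- c(A = 0.8, B = 0.2, Neither = 0.1)
sum(probs_Group1)  # not equal to 1

# What rmultinom() effectively uses after normalizing to sum to 1
effective <- probs_Group1 / sum(probs_Group1)
round(effective, 3)
```

So the effective probability of "B" in Groups 1 and 2 is 0.2 / 1.1, about 0.18, rather than exactly 0.2; the vectors for Groups 3 and 4 already sum to 1 and are unaffected.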

The code to generate data is attached below

data
#> # A tibble: 10,000 × 3
#>    Group Choice Group4
#>    <chr> <chr>  <chr>
#>  1 1     A      Other
#>  2 1     A      Other
#>  3 1     A      Other
#>  4 1     A      Other
#>  5 1     B      Other
#>  6 1     A      Other
#>  7 1     A      Other
#>  8 1     A      Other
#>  9 1     A      Other
#> 10 1     A      Other
#> # … with 9,990 more rows

Now let's answer question Q1. To do this we compare Group 4 to the other groups by modeling choice as a function of the indicator variable Group4 defined as "If participant is in Group 4, then 4 else Other". I name the levels 4 and Other rather than 1 and 0 so that the results are easier to interpret.

# Fit multinomial model for Choice by Factor
model1 <- multinom(Choice ~ Group4, data = data)

# Make pairwise comparisons between factor levels with pairs()
# for each choice, by = "Choice".
# Do a one-sided "greater than" hypothesis test.
pairs(
  emmeans(model1, ~ Choice | Group4, mode = "prob"),
  by = "Choice", side = ">"
)

See the emmeans documentation to learn about the formula syntax.

#> Choice = B:
#>  contrast  estimate      SE df t.ratio p.value
#>  4 - Other  0.20719 0.01128  4  18.372  <.0001
#>
#> P values are right-tailed

A "contrast" is a statistical term for a comparison. In this case the probability to choose "B" is estimated to be .21 higher in Group4 than in the Other groups.

Since we simulated the data, we know the true probabilities, so we can check whether the estimate agrees with the true difference. Other consists of 16.7% Group 1 with $P_1(B)$ = 0.2, 16.7% Group 2 with $P_2(B)$ = 0.2 and 66.7% Group 3 with $P_3(B)$ = 0.5. Therefore, the probability of choosing B in Other is the weighted probability 0.167 × 0.2 + 0.167 × 0.2 + 0.667 × 0.5 = 0.4. So the true difference between Group 4 and Other is 0.6 - 0.4 = 0.2, while the estimated difference is 0.21.

However, there are four times more participants from Group 3 than from Groups 1 and 2 each. So the probability of choosing "B" in the Other group is biased towards $P_3(B)$ which in the simulation is higher than both $P_1(B)$ and $P_2(B)$.

Question Q2 asks to compare the probability of B in Group 4 to the average probability for groups 1, 2 and 3. In this case the three groups contribute equally to the average even though they are sampled unevenly. So the probability of choosing "B" is the unweighted average (0.2 + 0.2 + 0.5) / 3 = 0.3. And the difference with Group 4 is 0.6 - 0.3 = 0.3.
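The two target quantities can be computed side by side. This sketch uses the nominal probabilities and sampling fractions quoted in the text; the variable names are illustrative.

```r
# Nominal P(B) in groups 1, 2, 3 and 4, as stated in the text
p_B <- c(0.2, 0.2, 0.5, 0.6)

# Sampling fractions of the four groups
frac <- c(0.125, 0.125, 0.5, 0.25)

# Q1: "Other" pools groups 1-3, so P(B) is weighted by group size
w <- frac[1:3] / sum(frac[1:3])
p_other_pooled <- sum(w * p_B[1:3])
diff_q1 <- p_B[4] - p_other_pooled

# Q2: groups 1-3 contribute equally, so P(B) is a plain average
p_other_avg <- mean(p_B[1:3])
diff_q2 <- p_B[4] - p_other_avg

c(Q1 = diff_q1, Q2 = diff_q2)
```

The gap between the two answers comes entirely from the weights: pooling lets the large Group 3 dominate, while the unweighted average treats the three groups as equally important.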

Now let's answer question Q2. To do this we nest groups 1, 2 and 3 into a grouping factor Other.

# Fit multinomial model for Choice by Factor
model2 <- multinom(Choice ~ Group, data = data)

# Group factor levels into higher-level nested categories.
grid <- ref_grid(model2)
grid_grouped <- add_grouping(
  grid, "Nested", "Group",
  c("Other", "Other", "Other", "4")
)

# Make pairwise comparisons between the nested categories.
pairs(
  emmeans(grid_grouped, ~ Choice | Nested),
  by = "Choice", side = ">"
)

See the emmeans documentation to learn about grouping factors.

#> Choice = B:
#>  contrast  estimate      SE df t.ratio p.value
#>  4 - Other   0.3146 0.01131  8  27.821  <.0001
#>
#> Results are averaged over the levels of: Group
#> P values are right-tailed

The estimate of the contrast 0.31 is close to the true difference 0.3, which hopefully is some evidence that the intended comparison is estimated correctly.


R code to simulate a multinomial dataset for a study of four groups of participants and three choices.

library("nnet")
library("emmeans")
library("tidyverse")

set.seed(1234)

choices <- c("A", "B", "No choice")

# Helper function to draw n samples from the multinomial distribution
# over the three choices with probabilities given by the vector prob
randomize_choice <- function(n, prob) {
  r <- rmultinom(n, 1, prob)
  choices[seq_along(choices) %*% r]
}

# number of participants
n <- 10000

sizes <- n * c(.125, .125, 0.5, 0.25)

probs_Group1 <- c(0.8, 0.2, 0.1)
probs_Group2 <- c(0.8, 0.2, 0.1)
probs_Group3 <- c(0.4, 0.5, 0.1)
probs_Group4 <- c(0.3, 0.6, 0.1)

data <- tibble(
  Group = rep(c("1", "2", "3", "4"), times = sizes),
  Choice = case_when(
    Group == "1" ~ randomize_choice(n, probs_Group1),
    Group == "2" ~ randomize_choice(n, probs_Group2),
    Group == "3" ~ randomize_choice(n, probs_Group3),
    Group == "4" ~ randomize_choice(n, probs_Group4),
    TRUE ~ NA_character_
  ),
  Group4 = if_else(Group == "4", "4", "Other")
)

dipetkov
  • Thanks for the comment. I've never heard of emmeans before and I'll check it out. – Roas Clack May 23 '22 at 02:43
  • But there are a few things that I did not write in the question. Actually, group 1 is the 'control group' or 'baseline', so I wanted to check whether groups 2, 3 and 4 all had the same effect on choice or only group 4 had the effect. – Roas Clack May 23 '22 at 02:46
  • "There are few things I didn't write in the question." is not a great idea. I interpreted your questions as best as I can. However, emmeans is pretty flexible and you can formulate all kinds of contrasts (i.e. comparisons) with it. – dipetkov May 23 '22 at 05:39