
I am analyzing an experimental data set. The data consist of paired observations of treatment type and a binary outcome:

Treatment    Outcome
A            1
B            0
C            0
D            1
A            0
...

In the outcome column, 1 denotes a success and 0 denotes a failure. I'd like to determine whether the treatment significantly affects the outcome. There are 4 different treatments, and each experiment was repeated a large number of times (2000 per treatment).

My question is: can I analyze the binary outcome using ANOVA, or should I be using a chi-square test on the binomial data? It seems like the chi-square test assumes the proportions would be evenly split, which isn't the case here. Another idea would be to summarize the data as the proportion of successes versus failures for each treatment and then use a proportion test.
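A minimal sketch of that last idea in R, with hypothetical success counts standing in for the real data:

successes <- c(1100, 1150, 1120, 1180)  # hypothetical successes out of 2000 per treatment
trials <- rep(2000, 4)
prop.test(successes, trials)  # chi-square test that all 4 proportions are equal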

I'm curious to hear your recommendations for tests that make sense for these sorts of binomial success/failure experiments.

3 Answers


No to ANOVA, which assumes a normally distributed outcome variable (among other things). There are "old school" transformations to consider, but I would prefer logistic regression (equivalent to a chi-square test when there is only one independent variable, as in your case). The advantage of logistic regression over a chi-square test is that, if the overall (type 3) test is significant, you can easily use linear contrasts to compare specific levels of the treatment, for example A versus B, B versus C, etc.

Update (added for clarity):

Using the data at hand (the postdoc data set from Allison) and the variable cits, here is my point:

postdocData$citsBin <- ifelse(postdocData$cits > 2, 3, postdocData$cits)  # collapse counts above 2
postdocData$citsBin <- factor(postdocData$citsBin, levels = c("0", "1", "2", "3"))
contrasts(postdocData$citsBin) <- contr.treatment(4, base = 4)  # set 4th level as reference
contrasts(postdocData$citsBin)
     #   1 2 3
     # 0 1 0 0
     # 1 0 1 0
     # 2 0 0 1
     # 3 0 0 0

Fit the univariate logistic regression model:

model.1 <- glm(pdoc~citsBin, data=postdocData, family=binomial(link="logit"))

library(car)  # John Fox's car package
car::Anova(model.1, test="LR", type="III")  # type 3 analysis (SAS verbiage)

# Response: pdoc
#         LR Chisq Df Pr(>Chisq)
# citsBin   1.7977  3     0.6154

chisq.test(table(postdocData$citsBin, postdocData$pdoc)) # X-squared = 1.7957, df = 3, p-value = 0.6159

You can then test differences between levels, for example:

Contrast: cits = 0 minus cits = 1, i.e. $H_0: \beta_1 - \beta_2 = 0$:

cVec <- c(0, 1, -1, 0)  # intercept, beta_1, beta_2, beta_3
car::linearHypothesis(model.1, cVec, verbose=TRUE)

B_Miner

  • @user2040: I don't understand how you would do the "type 3" test. Is it something SAS-related? (Sorry, my SAS knowledge is very limited.) I would have done a logistic regression as you suggested, but with 2 dummy variables. Also, if I understand correctly, when you do logistic regression, testing whether some or all of the coefficients are 0 is done via the deviance (or likelihood ratio), and it IS asymptotically chi-square (not necessarily with df = 1). – suncoolsu Jan 03 '11 at 22:52
  • @suncoolsu: Yes, practically speaking you should get the same conclusion. I should not have said "equivalent" (I work with big data, so they end up the same). I added some code to the answer to help clarify. – B_Miner Jan 03 '11 at 23:42

Maybe some consider it old-fashioned, but if you only want to test the null hypothesis that all groups have equal success probability, you can proceed as follows. Define $X_k$ as the number of successes in group $k$ and $n_k$ as the number of trials in group $k$, so that the estimated probability in group $k$ is $\hat{p}_k = X_k/n_k$. Then use the variance-stabilizing transformation for the binomial, $$ g(p) = \arcsin \sqrt{p}, $$ which makes the transformed estimates approximately normal with variance $1/(4 n_k)$, whatever the value of $p$. Such an approach was (at times) good enough for Fisher, so it can be useful today as well!
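As a concrete illustration, here is a minimal sketch of the resulting chi-square test in R; the success counts are hypothetical, not from the question. Under the null, $\sum_k 4 n_k \, (g(\hat{p}_k) - \bar{g})^2$ is approximately chi-square with $k - 1$ degrees of freedom, where $\bar{g}$ is the weighted mean of the transformed proportions.

x <- c(1100, 1150, 1120, 1180)  # hypothetical success counts, 4 treatments
n <- rep(2000, 4)               # 2000 trials per treatment

g <- asin(sqrt(x / n))          # variance-stabilizing transform; Var(g) is approx. 1/(4 * n)
w <- 4 * n                      # inverse-variance weights
gbar <- sum(w * g) / sum(w)     # weighted grand mean under the null

stat <- sum(w * (g - gbar)^2)   # approx. chi-square with k - 1 df under the null
pchisq(stat, df = length(x) - 1, lower.tail = FALSE)  # p-value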

However, some modern authors are quite sceptical of the arcsine transformation; see for example "The arcsine is asinine: the analysis of proportions in ecology" by David I. Warton and Francis K. C. Hui.

But these authors are concerned with problems such as prediction, where they show the arcsine can lead to trouble. If you are only concerned with hypothesis testing, it should be OK. A more modern approach would use logistic regression.


I would like to differ from what you think about the chi-square test. It is applicable even if the data are not binomial. It is based on the asymptotic normality of the MLE (in most cases).

I would do a logistic regression like this:

$$\log \frac {\hat{\pi}} {1-\hat{\pi}} = \beta_0 + \beta_1 \times D_1 + \beta_2 \times D_2$$

where

$D_1$ and $D_2$ are dummy variables: $D_1 = D_2 = 0 \implies A$; $D_1 = 1, D_2 = 0 \implies B$; $D_1 = 1, D_2 = 1 \implies C$. (This covers three treatments; a fourth treatment would need a third dummy $D_3$.)

$$H_o : \beta_0 = \beta_1 = \beta_2 = 0$$

is the ANOVA-equivalent overall test of whether there is a relation or not.

$$H_o : \beta_0 = 0$$

tests whether A has some effect.

$$H_o : \beta_1 - \beta_0 = 0$$

tests whether B has some effect.

$$H_o : \beta_2 - (\frac {\beta_0+\beta_1} {2}) = 0$$

tests whether C has some effect.

Now you can do further contrasts to find out what you are interested in. It is still a chi-square test, but with different degrees of freedom (3, 1, 1, and 1, respectively).
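For concreteness, a minimal sketch of this model in R. The data frame dat with columns Treatment (levels A, B, C) and Outcome (0/1) is an assumption for illustration, not from the question; the contrasts matrix reproduces the dummy coding above.

dat$Treatment <- factor(dat$Treatment, levels = c("A", "B", "C"))
# reproduce the coding above: A -> (0,0), B -> (1,0), C -> (1,1)
contrasts(dat$Treatment) <- matrix(c(0, 1, 1,
                                     0, 0, 1), nrow = 3,
                                   dimnames = list(c("A", "B", "C"), c("D1", "D2")))
model <- glm(Outcome ~ Treatment, data = dat, family = binomial(link = "logit"))
summary(model)               # Wald z-tests for the individual coefficients
anova(model, test = "Chisq") # likelihood-ratio test of the overall treatment effect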

suncoolsu