Differing p-values car::Anova, anova & post hoc in logistic regression

Question

I am currently working with a logistic model in R and I have a question regarding the p-values that I get from different types of tests.

Since I have separation in my data (caused by X) I use the brglm2 package and my models look like this:

model1 = glm(Y ~ X + D + S, 
             family = binomial(link = "logit"),
             method = "brglm_fit",
             data = data )
model2 = glm(Y ~ X + D, 
             family = binomial(link = "logit"),
             method = "brglm_fit",
             data = data )

Where:

Y: 0 or 1
S: factor with 3 levels
D: factor with 2 levels

To test for example for the significance of S, I use:

anova(model1, model2, test="Chisq")

Result:

Analysis of Deviance Table
Model 1: Y ~ X + D
Model 2: Y ~ X + S + D
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       265    17.0218                       
2       263     8.1608  2    8.861  0.01191 *

As I understand this is the two models significantly differ from each other and that is the effect of S (or from leaving it out).

If I use the Anova function of the car package (type II or type III), I get slightly different results (depending on the model they can be totally different)

Anova(model1)
Analysis of Deviance Table (Type II tests)
Response: Y
       LR Chisq Df Pr(>Chisq)    
X       306.718  1    < 2e-16 ***
S         8.085  2    0.01756 *  
D         5.814  1    0.01590 *

If I now want to know the which levels of S are different from each other, I run a post-hoc test using the multcomp or the lsmeans package (both return the same results).

summary( glht(model1, linfct=mcp(S = "Tukey")) )

Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts

Fit: glm(formula = Y ~ X + S + D, family = binomial, 
         data = data, control = list(maxit = 10000), method = "brglm_fit")

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)
K - A == 0  -3.1136     1.9280  -1.615    0.239
R - A == 0  -3.7331     1.9983  -1.868    0.148
R - K == 0  -0.6195     1.7794  -0.348    0.935
(Adjusted p values reported -- single-step method)

I understand that the anova() function tests the difference between the models and that Anova() tests for the effects of the variables on the dependent variable with different types of SS and that the results from the two can therefore differ. I usually rely rather on the results obtained from the model comparison. But how does it happen that the post-hoc test is so far off? Is this not the appropriate way to test for the effect of the different levels of a factor in logistic regression?

I don't understand why you are testing for the significance of S using your current model2. Can you elaborate? — Isabella Ghement, May 18 '18 at 15:28
Oh, I made a mistake. These are the models I want to test: model1 = glm(Y~X + D + S, ....) model2 = glm(Y~X + D, ...) Sorry! — mela, May 18 '18 at 18:07
Can you confirm that you edited the anova and post-hoc testing outputs to reflect your updated models? — Isabella Ghement, May 18 '18 at 18:15
Can you post the code for the car:Anova test? It may also help to post your data (or more generally, any reproducible example), if possible. I see no reason the Tukey tests should look like the former tests, but the 1st 2 should be similar. — gung - Reinstate Monica, May 18 '18 at 18:28
I could reproduce the problem now. The car:Anova does not correctly implement the brglmFit function. Effectively it compares model1 with brglmfit to reduced models like model 2 *without brglmfit*. So regarding this part, it is a buggy implementation of the brglmfit procedure into the glm function and how related methods in R interpret this (anova just picks the deviance scores from glm, on the other hand Anova has to recalculate the glm for the different reduced models and forgets to pass the method=brglmfit stuff). — Sextus Empiricus, May 20 '18 at 10:03
Regarding the Tukey test, this is a whole different (much more interesting) question. I suggest to have this split up as it deserves a question on it's own (also for the sake of search-ability for people that may have the same question in the future). In short: The main point with this seems to be that by splitting up the (grouped) anova test into single t-tests may reduce the power in specific cases (especially if the differences are shared among groups) .... continue in next comment — Sextus Empiricus, May 20 '18 at 10:07
..... (continued) Imagine if you would have ten different groups. Five of them cluster around the intercept at zero. Five of them cluster around the effect of one. In an anova this might generate a considerable influence. You will find many contrasts to be slightly significant. On an individual basis slightly significant (you got p=0.2 and p=0.1) might be rejected. But if you have many of them then something more might be going one (this is what anova tells). In some procedures like the Holm method these cases are better dealt with. — Sextus Empiricus, May 20 '18 at 10:15
Note that for a certain maximal spread between groups, the between-group-variance (related to anova and likelihood) is much higher when the groups cluster towards the ends of the spread in comparison to when the groups are spread among the entire range. So splitting up a grouped test into multiple individual tests may lead to different decision boundaries when you have either this clustering or even spreading. — Sextus Empiricus, May 20 '18 at 10:26
In fairness to car::Anova, it doesn't claim to be able to handle model objects from brglm2. This is something that R users need to be cautious about. Another example with car:Anova is model objects from the ordinal package. Last I checked, it gives strange results unless the RVAideMemoire package is used with it. As a further note, I looked quickly at the documentation and vignettes for the brglm2 package. I didn't see any use of anova or comparison of models, so you might be cautious using anova as well. — Sal Mangiafico, May 20 '18 at 13:58
@SalMangiafico brglm2 is an option passed on to glm (providing an alternative brglmFit function used in the fitting, instead of glm.fit) it does not return an object independently by itself. — Sextus Empiricus, May 20 '18 at 14:25
@MartijnWeterings, either way, if brglm2 doesn't mention Anova, and Anova doesn't mention brglm2, it's a gamble that they will work correctly in conjunction --- as we see --- unless the user examines the code or does some testing. It would be a service for the authors of brglm2 to address this question directly. (As I said, in a quick look I didn't see anything in the documentation using anova). Personally I would like to see car::Anova updated to handle more types of models correctly, as it is so valuable. I love how emmeans can handle so many model types and lists them. — Sal Mangiafico, May 20 '18 at 14:45
The package brglm documentation actually mentions that methods for model comparison, such as anova, are to be considered experimental. Yet, this relates more to the theoretical side. Or, at least this issue with different p-values (which is practical and not a theoretical issue) leads to a discussion split into two issues 1) behavior of model comparison for brglm2 is not described theoretically (although this does not explain the difference, but is just general piece of advise) 2) The car::Anova package ignores the method parameter, and only works with vanilla (old-fasioned) glm. — Sextus Empiricus, May 20 '18 at 16:21
Actually, the author of the brglm/brglm2 package told me that the anova() should work with brglm2. So, if I only have 3 levels of a factor, can such clustering still be a problem for the Tukey test? Are there any alternatives I could try out? — mela, May 22 '18 at 09:41
Is there any advice you can give me for reporting my results? — mela, May 22 '18 at 09:46

score 2 · Answer 1 · edited Jun 11 '20 at 14:32

p-values brglm2

The car::Anova package does not support the use of the method=brglmFit (or any other alternative method) and ignores this method parameter (which has only been fully introduced recently). The car::Anova function will copy the information from the full model created by your glm-call, which you calculated implementing the method. But it will calculate by itself the reduced models and it does not copy the method variable in the calls to the glm function.

model1 = glm(Y ~ X + S + D, 
             family = binomial(link = "logit"),
             method = "brglmFit", 
             data = data)
model2 = glm(Y ~ X + D, 
             family = binomial(link = "logit"),
             method = "brglmFit", 
             data = data )
model2a = glm(Y ~ X + D, 
             family = binomial(link = "logit"),
             data = data )
anova(model1,model2,test="Chisq")
anova(model1,model2a,test="Chisq")
Anova(model1)

The Anova(model1) will correspond to anova(model1,model2a,test="Chisq"), but not to anova(model1,model2,test="Chisq")

p-values anova vs post-hoc

Regarding the Tukey test. The main point with this seems to be that by splitting up the (grouped) anova test into single t-tests may reduce the power in specific cases (especially if the differences are shared among groups).

Imagine if you would have ten different groups with different type of distributions between them (but the same between groups variance).

Five of them cluster around the intercept at zero. Five of them cluster around the effect of one.
Or eight of them cluster in the middle and one of them is very large and one of them is very small.

For a Tukey-test, the second case (with one single very large difference) will be much more likely to make a significant rejection of the hypothesis of equal means in comparison to the first case (and as compensation, for type I error rate, the first case will have a higher p-value).

In an anova, the first case (many small/medium differences) may also generate a considerable influence. You will find many contrasts to be (only) slightly significant. On an individual basis slightly significant (you got p=0.2 and p=0.1) might be rejected. But if you have many of them then something more might be going one (this is what anova tells). In some procedures like the Holm method these cases are better dealt with.

see more:

I am not 100% sure whether the Tukey-anova difference is completely explained. While it is clear that there are differences, it is not certain whether such large differences (0.148 and 0.239 for Tukey versus 0.01756 for F-test) are plausible. Some simulation might show this, and I may perform this one day (In the links is a simulation, but for a simpler one-way-anova. It does not explain the discrepancy shown here). — Sextus Empiricus, Jun 21 '18 at 23:48

score 1 · Answer 2 · answered May 23 '18 at 20:16

The post-hoc test is not "off". It's telling you exactly what is in the data.

The post-hoc has low power because levels $K$ and $R$ are so close to one another and because you are correcting for multiple comparisons. When testing nested models, $S$ groups of $K$ and $R$ both strongly influence the common log odds of response under the null. You would achieve similar power to the global test using ANOVA or LRT if you collapsed $K$ and $R$ which would in turn eliminate the need for multiple comparisons (although it is a data-driven--as opposed to hypothesis-driven--decision).

It should be clear that the simple null hypothesis of the global test has great power because there is only one hypothesis to test, and it can be egregiously violated in a number of ways. The post-hoc test is a logical follow-up when you want to know in which way(s) it was violated. Since this post-hoc test only considers pairwise effects, you miss the chance to test the hypothesis $\mathcal{H}_0: \mu_{K,R} = \mu_A$ (that is that K,R share a common log oddswhich is assumed to be identical to that of A).

Differing p-values car::Anova, anova & post hoc in logistic regression

2 Answers2

p-values brglm2

p-values anova vs post-hoc

Linked