Using step-wise model reduction I have reduced my full model down (based on AIC and model comparisons using the anova() function in R). This resulted in the following model:
m1_DD4 <- glm(GotoPB ~ Pb_type + ZF_Pb + ZF_NotPb, data = DF_DD_5, family = binomial(link = "logit"))
Pb_type is a factor with 4 levels; ZF_Pb and ZF_NotPb are numeric.
The summary of this model looks as follows:
Call:
glm(formula = GotoPB ~ Pb_type + ZF_Pb + ZF_NotPb, family = binomial(link = "logit"),
data = DF_DD_5)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.15642 -1.07365 -0.00006 1.02744 1.56573
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.013410 0.136495 0.098 0.92174
Pb_typeNG 0.152678 0.163990 0.931 0.35184
Pb_typeSilence -20.152316 452.945271 -0.044 0.96451
Pb_typeSong 0.221061 0.156019 1.417 0.15652
ZF_Pb 0.079849 0.013634 5.857 4.72e-09 ***
ZF_NotPb -0.027021 0.007097 -3.807 0.00014 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2047.0 on 1559 degrees of freedom
Residual deviance: 1385.9 on 1554 degrees of freedom
AIC: 1397.9
Number of Fisher Scoring iterations: 18
From the model reduction/AIC I conclude that the model fit improves when Pb_type is included, so it should be kept. Yet here all of its levels are shown as non-significant. From what I understand/have read, this is because the p-values reported here come from Wald tests, which test whether each individual coefficient differs from 0, whereas the anova() function uses likelihood ratio tests, which indicate which model has a statistically better fit. Please correct me if I am wrong here.
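For reference, here is a sketch of how I understand the likelihood ratio test for Pb_type can be reproduced (assuming the model object m1_DD4 above; this is the same kind of nested-model comparison that anova() performs):

# Likelihood ratio test for dropping each term from the model
drop1(m1_DD4, test = "Chisq")

# Equivalently, an explicit comparison of nested models
m1_noPb <- update(m1_DD4, . ~ . - Pb_type)
anova(m1_noPb, m1_DD4, test = "Chisq")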
I have also been told not to worry about this too much and to rely on the outcome of anova(). But I would like to understand why.
Under the assumption that Pb_type indeed matters, I now wish to determine how the different levels of Pb_type differ from each other using a post-hoc pairwise comparison. I do this with the emmeans package, via the following call:
emmeans::emmeans(m1_DD4, pairwise ~ Pb_type, adjust = "Tukey")
Which results in the following output:
$emmeans
Pb_type emmean SE df asymp.LCL asymp.UCL
DC 0.128 0.120 Inf -1.72e-01 0.427
NG 0.280 0.115 Inf -5.30e-03 0.566
Silence -20.025 452.945 Inf -1.15e+03 1108.223
Song 0.349 0.104 Inf 8.96e-02 0.608
Results are given on the logit (not the response) scale.
Confidence level used: 0.95
Conf-level adjustment: sidak method for 4 estimates
$contrasts
contrast estimate SE df z.ratio p.value
DC - NG -0.1527 0.164 Inf -0.931 0.7882
DC - Silence 20.1523 452.945 Inf 0.044 1.0000
DC - Song -0.2211 0.156 Inf -1.417 0.4887
NG - Silence 20.3050 452.945 Inf 0.045 1.0000
NG - Song -0.0684 0.152 Inf -0.449 0.9698
Silence - Song -20.3734 452.945 Inf -0.045 1.0000
Results are given on the log odds ratio (not the response) scale.
P value adjustment: tukey method for comparing a family of 4 estimates
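As an aside, I believe the same call with type = "response" added would back-transform the estimated marginal means to probabilities and report the contrasts as odds ratios; I show the logit scale here since that is the scale the model works on:

# Same comparison, back-transformed to the response (probability) scale
emmeans::emmeans(m1_DD4, pairwise ~ Pb_type, adjust = "tukey", type = "response")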
Here a few things stand out:
- My df = Inf. From what I have read, I understand that this presents no issue and is simply how emmeans labels asymptotic results.
- In both my original model summary and here, Silence has a huge standard error.
- None of the comparisons are significant.
This leaves me with 2 questions:
- Why is my standard error so high, and does this present a problem? To answer this it might be necessary to know more about my data, so in short: all Pb_type treatments are different playbacks, and silence served as a control treatment. The response variable GotoPB is either Y/N and represents whether individuals approached the playback (Y) or not (N). In the case of silence, as there never truly was a playback, all values for GotoPB are N (see the cross-tabulation sketch after this list).
- How can none of the comparisons of Pb_type be significant, yet including it in the model result in a significantly better fit (with a much lower AIC)? Is something wrong here? And if not, how should I interpret this?
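For the first question, the all-N pattern for Silence that I describe can be seen with a simple cross-tabulation (a sketch, assuming the data frame and column names above):

# Cross-tabulate treatment against the response; the Silence row should be all N
with(DF_DD_5, table(Pb_type, GotoPB))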
Kind regards.