6

I constructed a negative binomial model examining the effect of one count variable ("carid_den") on another ("juv_cneb_den"), with an offset ("Area_towed") and a location factor ("Zone").

A summary command on my full model indicates that all levels of the factor are statistically insignificant (p > 0.05). When I drop this factor, however, the AIC increases, which I think means the factor somehow made the model better. Why would including the factor lower the AIC if the factor wasn't important? Aren't lower AIC values an indication of a better model? Is there an intuitive explanation?

My data:

    > head(df)
           Zone TOTAL juv_cneb_count Area_towed
    1   Whipray     2              0   383.9854
    2      West    38              0   382.2256
    3 Crocodile    25              0   408.3697
    4    Rankin     2              0   422.1000
    5    Rankin     3              0   165.5196
    6      West     6              1   266.7000

    > summary(nb_full)

    Call:
    glm.nb(formula = juv_cneb_count ~ TOTAL + Zone + offset(log(Area_towed)),
        data = dat, init.theta = 0.2371440904, link = log)

    Deviance Residuals:
         Min       1Q   Median       3Q      Max
     -1.3378  -0.7787  -0.6540   0.0000   4.0603

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept) -3.930e+01  1.575e+06   0.000   1.0000
    TOTAL        1.946e-03  9.294e-04   2.094   0.0363 *
    ZoneRankin   3.220e+01  1.575e+06   0.000   1.0000
    ZoneWest     3.282e+01  1.575e+06   0.000   1.0000
    ZoneWhipray  3.119e+01  1.575e+06   0.000   1.0000
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for Negative Binomial(0.2371) family taken to be 1)

        Null deviance: 278.96  on 449  degrees of freedom
    Residual deviance: 241.60  on 445  degrees of freedom
    AIC: 751.89

    Number of Fisher Scoring iterations: 1

              Theta:  0.2371
          Std. Err.:  0.0407

     2 x log-likelihood:  -739.8900

    > summary(base)

    Call:
    glm.nb(formula = juv_cneb_count ~ TOTAL + offset(log(Area_towed)),
        data = dat, init.theta = 0.1965321662, link = log)

    Deviance Residuals:
         Min       1Q   Median       3Q      Max
     -1.4967  -0.6980  -0.6810  -0.5667   4.1964

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -6.776742   0.135157 -50.140  < 2e-16 ***
    TOTAL        0.003362   0.000984   3.416 0.000634 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for Negative Binomial(0.1965) family taken to be 1)

        Null deviance: 252.73  on 449  degrees of freedom
    Residual deviance: 246.63  on 448  degrees of freedom
    AIC: 775.16

    Number of Fisher Scoring iterations: 1

              Theta:  0.1965
          Std. Err.:  0.0329

     2 x log-likelihood:  -769.1590

  • What does TOTAL mean? – dipetkov Apr 04 '22 at 20:12
  • It's the sum count of another species (a prey species to "juv_cneb_count"). – Nate Apr 04 '22 at 20:15
  • 1
    This is an indirect comment about your model. If I understand the design correctly, you counted two species, juv_cneb and prey (carid?), in each area. The offset accounts for the fact that a bigger area has more organisms. But the offset "applies" to the count response only, not the count predictor. Here is a related question. – dipetkov Apr 04 '22 at 22:00
  • That's actually a good point. Yes, the offset is supposed to be applied to both the response and predictor. – Nate Apr 04 '22 at 22:48
  • Each species should actually be density (counts/area surveyed), but since the neg. binomial model only uses counts, I thought the offset would account for that. – Nate Apr 04 '22 at 22:52
  • I don't think so and neither does the answer to that other question. Maybe figuring this out will help with some of the model fitting issues? – dipetkov Apr 04 '22 at 22:55
  • This is not what is happening here, but as a comment for future visitors: it can happen that a truly insignificant factor yields a better model because it makes convergence easier, for example by allowing the optimizer to move between regions around local minima. – Davidmh Apr 05 '22 at 06:56
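Following up on the comment thread, one way to act on dipetkov's point — that the offset rescales only the response, not the count predictor — is to put the predictor on a density scale explicitly inside the formula. A sketch, assuming `dat` has the columns shown in the posted data:

```r
library(MASS)  # provides glm.nb

# Sketch only: offset(log(Area_towed)) turns the *response* into a density;
# dividing TOTAL by Area_towed puts the prey predictor on the same scale.
nb_density <- glm.nb(
  juv_cneb_count ~ I(TOTAL / Area_towed) + Zone + offset(log(Area_towed)),
  data = dat
)
summary(nb_density)
```

Whether this is the right specification depends on the sampling design, but it keeps the count response (as the negative binomial model requires) while letting the predictor reflect prey density rather than raw prey count.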

2 Answers

13

In this case you are relying on the wrong test to decide that Zone is not significant. Note that the coefficients of the Zone effect are large (>30) with huge standard errors. This happens when the likelihood keeps monotonically increasing as the estimate goes to infinity. In such cases the Wald test that gives you the z and p-values is useless. What is happening, I think, is that the Crocodile zone has 0 events, so the relative risk of the other zones compared to it is infinite.

If you were to do a likelihood ratio test for Zone as a covariate, you would see that it is significant (in fact, you pretty much did it by dropping the effect and looking at the likelihood again, you just did not compute the p-value), so you would not want to drop it.

Aniko
  • Excellent point. I suspect that any other choice of reference value for Zone would have made things much clearer. – EdM Apr 04 '22 at 19:45
  • Ya, you're right! Croc didn't have any events (# of fish). In cases like this, is it also possible to tell R which level should be the reference? Btw, thank you! – Nate Apr 04 '22 at 19:55
  • I was able to change the ref. level, but none of the levels are significant for some reason. One is close...(p=0.0877). – Nate Apr 04 '22 at 20:02
  • 2
    It is quite possible that there are no differences among the other levels (but at least now you can see that). The key is that the level with 0 events seems to be very different, and GLMs with a log-link will have major trouble modeling it. You just cannot rely on the p-value from the parameter estimate table for any comparison involving that level, but you can use likelihood-based tests. There are workarounds with Firth's penalty, or a quick-and-dirty solution of adding 1 (or 0.5) event to that group, but I am not sure there are "easy" solutions. – Aniko Apr 04 '22 at 21:07
  • 2
    Never mind all the statistics. Do we really need a regression to tell us that there are fewer fish in a crocodile zone? – JDL Apr 05 '22 at 07:56
  • 1
    @JDL Crocodiles 1) eat fish 2) situate themselves to be near fish. It's not clear to me which of these two wins out before looking at the data :) – John Madden Apr 05 '22 at 19:19
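As discussed in the comments above, the reference level can be changed with `relevel()`, and the quick-and-dirty workaround for the zero-event level amounts to adding a small count to that group before refitting. A sketch, with column and level names taken from the posted data:

```r
library(MASS)  # provides glm.nb

# Make a different level (e.g. West) the reference for Zone:
dat$Zone <- relevel(factor(dat$Zone), ref = "West")
nb_full2 <- glm.nb(juv_cneb_count ~ TOTAL + Zone + offset(log(Area_towed)),
                   data = dat)

# Quick-and-dirty fix for the zero-event Crocodile level: add one event
# to a Crocodile row so its estimated rate is no longer exactly zero.
i <- which(dat$Zone == "Crocodile")[1]
dat$juv_cneb_count[i] <- dat$juv_cneb_count[i] + 1
```

Note that relevelling only changes which comparisons the Wald tests report; any comparison involving the zero-event level will still be unreliable, which is why the likelihood-based test is the better tool here.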
5

AIC is a function of the number of parameters within your model k and its likelihood L. Formally, AIC = 2k - 2 ln(L). Since a smaller AIC is better, the term 2k serves as a penalty based on the number of parameters. Thus, AIC represents a trade-off between complexity (k) and fit (L). Imagine two models with similar likelihoods, but one model features 1000 parameters and another 2 (extreme example). The simpler model with 2 parameters is typically preferred (parsimony).

In this case, you have...

  • Model 1: k = 5 + 1, 2 ln(L) = -739.89 => AIC1 = 12 + 739.89 = 751.89
  • Model 2: k = 2 + 1, 2 ln(L) = -769.16 => AIC2 = 6 + 769.16 = 775.16

So even though dropping Zone removes 3 parameters, the better fit (higher likelihood) of the full model more than offsets the parameter penalty, giving the larger model the lower AIC.
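The arithmetic can be checked directly from the reported "2 x log-likelihood" values (the +1 in each k is the estimated theta):

```r
# AIC = 2k - 2*ln(L); the glm.nb summaries report 2*ln(L) directly.
aic_full <- 2 * (5 + 1) - (-739.890)  # 751.89, matches summary(nb_full)
aic_base <- 2 * (2 + 1) - (-769.159)  # 775.159, matches summary(base)
aic_full < aic_base                   # TRUE: the larger model wins on AIC
```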

Emma Jean