
I'm a bit confused by the output of my model in R.

I have built a generalised estimating equation (GEE) model with geeglm, aiming to see the effect of time (here coded as timestrat) on a variable called new1804. I have controlled for a range of other variables.

library(geepack)

mf04 <- formula(new1804 ~ timestrat + urban +
    marital + ses + timeinsample + state +
    dependancestrat + didattemptquitinlastyear +
    plan2quit + agestrat + sex + weight23)
geeInd04 <- geeglm(mf04, id = uniqid,
    data = finaldf04, family = poisson,
    corstr = "independence")

My confusion comes from the output of the model. The anova() output says that timestrat is not significant. However, the coefficient and standard error in the summary() output suggest that it is. I have calculated the upper and lower bounds of the 95% confidence interval for the coefficient as shown below, getting -0.2947138 for the lower bound and -0.0513192 for the upper bound.

If both the upper and lower bounds of my 95% confidence interval for the coefficient are negative, why is anova() returning a non-significant result?

Calculation of upper and lower bounds:

lwrcoef <- estimate - 1.96*stderr
uprcoef <- estimate + 1.96*stderr
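
Concretely, pulling the estimate and standard error for timestrat1 from the summary() coefficient table should reproduce those bounds; a minimal sketch using the fit above:

est <- summary(geeInd04)$coefficients["timestrat1", "Estimate"]   # -0.17302
se  <- summary(geeInd04)$coefficients["timestrat1", "Std.err"]    #  0.06209
lwrcoef <- est - 1.96*se   # about -0.295
uprcoef <- est + 1.96*se   # about -0.051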

Summary of model:

summary(geeInd04)

Call:
geeglm(formula = mf04, family = poisson, data = finaldf04,
    id = uniqid, corstr = "independence")

 Coefficients:
                           Estimate  Std.err  Wald Pr(>|W|)    
(Intercept)                 0.70201  0.11849 35.10  3.1e-09 ***
timestrat1                 -0.17302  0.06209  7.76   0.0053 ** 
urban2                     -0.01993  0.04376  0.21   0.6488    
marital2                   -0.09071  0.08561  1.12   0.2893    
marital3                    0.01912  0.03451  0.31   0.5797    
marital4                   -0.05872  0.03356  3.06   0.0802 .  
sesL                       -0.02625  0.02529  1.08   0.2994    
timeinsample                0.09883  0.04358  5.14   0.0233 *  
state19                     0.03256  0.08025  0.16   0.6849    
state23                     0.15694  0.05939  6.98   0.0082 ** 
state27                     0.10763  0.06275  2.94   0.0863 .  
dependancestrat0           -0.00335  0.06022  0.00   0.9556    
dependancestrat1           -0.02140  0.04915  0.19   0.6633    
dependancestrat2           -0.02250  0.04619  0.24   0.6261    
dependancestrat3           -0.07338  0.04452  2.72   0.0993 .  
didattemptquitinlastyear1  -0.02050  0.03104  0.44   0.5090    
plan2quit2                  0.16451  0.06616  6.18   0.0129 *  
plan2quit3                  0.08602  0.06958  1.53   0.2163    
plan2quit4                  0.04591  0.06912  0.44   0.5066    
agestrat40-55               0.00398  0.02541  0.02   0.8756    
agestratOver 55            -0.05170  0.03324  2.42   0.1199    
agestratUnder 30            0.03823  0.03150  1.47   0.2249    
sex2                        0.07681  0.02360 10.59   0.0011 ** 
weight23                    0.04968  0.02310  4.63   0.0315 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = independence
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    0.454  0.0228

Number of clusters:   270  Maximum cluster size: 99

Anova:

anova(geeInd04)

Analysis of 'Wald statistic' Table
Model: poisson, link: log
Response: new1804
Terms added sequentially (first to last)

                         Df    X2 P(>|Chi|)    
timestrat                 1  1.79   0.18109    
urban                     1  0.21   0.64487    
marital                   3  2.98   0.39431    
ses                       1  0.82   0.36383    
timeinsample              1  3.75   0.05273 .  
state                     3  7.45   0.05885 .  
dependancestrat           4  4.41   0.35280    
didattemptquitinlastyear  1  0.02   0.89898    
plan2quit                 3 15.87   0.00120 ** 
agestrat                  3  8.80   0.03211 *  
sex                       1 10.95   0.00094 ***
weight23                  1  4.63   0.03150 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  • anova says Terms added sequentially (first to last). The other function is testing in the presence of all other variables. – user2554330 Apr 01 '22 at 16:54
  • See this thread about different types of anova(). The documentation in geepack doesn't seem very clear about just what anova() method it invokes for its models, or whether you could use something other than this default "Type I" sequential ANOVA. – EdM Apr 01 '22 at 17:16

1 Answer


The commenters have provided the answer; I'll enter it here.

The first output (the summary() table) presents a P-value for each individual coefficient in the model. Each of these can be interpreted as a test of $$H_0 \ :\ \beta_i = 0$$ where a significant finding suggests that the coefficient is not zero. Alternatively, the same test can be read as $$H_0 \ :\ \text{model}_\text{wo}\ \text{ IS AS GOOD AS }\ \text{model}_\text{w}$$ where we are asking whether the model without that one variable (but with all of the other variables) is essentially as good as the model that also includes it. Although the P-values are not exactly the same here as they would be in, say, an ordinary multiple regression model, the same general idea applies to either interpretation: a statistically significant P-value means the coefficient is not zero, or equivalently that the model improves when this variable is added to a model already containing all of the others.
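
To make that "with vs. without" comparison explicit, here is a sketch (assuming the objects defined in the question); geepack's anova() can also compare two nested geeglm fits, giving a Wald test of the term that was dropped:

library(geepack)

## Reduced model: same specification as mf04, but with timestrat dropped
mf04_red <- update(mf04, . ~ . - timestrat)
geeInd04_red <- geeglm(mf04_red, id = uniqid, data = finaldf04,
                       family = poisson, corstr = "independence")

## Wald test of timestrat in the presence of all the other covariates;
## this corresponds to the timestrat1 row of summary(geeInd04)
anova(geeInd04, geeInd04_red)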

The key point is that this is a one-at-a-time analysis for every coefficient listed (be it a single predictor or an indicator variable; note that your categorical predictors need more than one indicator variable to represent them in the model). As a consequence, the order in which the variables are entered into the model does not matter for these tests.

However, with the anova(·) function the order does matter (as the output itself indicates). In this case, when the time variable is the first term added to the null model, it does not produce a statistically significant improvement. But that is not the same comparison as the one above: the summary() output asks whether time improves the model after ALL THE OTHER variables have already gone in.

So, the curious thing about this finding is that, on its own, time does not appear to be much of a predictor. But, after controlling for the influence of the other variables, time is indeed a predictor of the remaining variability in the dependent variable.
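
You can see this within the sequential anova() itself by simply entering timestrat last (a sketch, reusing the variables from the question), so that the final row of the table tests it after all of the other terms:

## Same model, but with timestrat listed last in the formula
mf04_last <- formula(new1804 ~ urban + marital + ses + timeinsample + state +
    dependancestrat + didattemptquitinlastyear + plan2quit + agestrat +
    sex + weight23 + timestrat)
geeInd04_last <- geeglm(mf04_last, id = uniqid, data = finaldf04,
                        family = poisson, corstr = "independence")

## The bottom row of this sequential table now tests timestrat
## adjusted for everything else, mirroring the summary() Wald test
anova(geeInd04_last)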

Gregg H