I am trying to use nested models to investigate the influence of four factors on my dependent variable. I am not interested in interactions, only in the influence of each variable taken separately. My dependent variable is part4Auto, and my independent variables are:
- part1FlyingHours
- part1TypePilot
- part3SUS
- part5MWQ
So I fitted the following sequence of nested models, which gave this output:
> model.baseline = lm(part4Auto ~ 1, data)
> model.1 = update(model.baseline, .~. + part1FlyingHours)
> model.2 = update(model.1, .~. + part1TypePilot)
> model.3 = update(model.2, .~. + part3SUS)
> model.4 = update(model.3, .~. + part5MWQ)
> anova(model.baseline, model.1, model.2, model.3, model.4)
Analysis of Variance Table
Model 1: part4Auto ~ 1
Model 2: part4Auto ~ part1FlyingHours
Model 3: part4Auto ~ part1FlyingHours + part1TypePilot
Model 4: part4Auto ~ part1FlyingHours + part1TypePilot + part3SUS
Model 5: part4Auto ~ part1FlyingHours + part1TypePilot + part3SUS + part5MWQ
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
1     41 22.562
2     40 21.578  1    0.9846  3.2352  0.080460 .
3     38 19.665  2    1.9125  3.1419  0.055249 .
4     37 13.418  1    6.2477 20.5279 6.241e-05 ***
5     36 10.957  1    2.4612  8.0866  0.007306 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
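If I understand correctly, the same sequential table can also be obtained from a single fit, because anova() applied to one lm object tests the terms in the order they appear in the formula. A minimal sketch, reusing my variable names (model.full is a new name):

# Sequential (Type I) tests from a single full-model fit; the term order
# follows the formula, so this should reproduce the multi-model table above.
model.full = lm(part4Auto ~ part1FlyingHours + part1TypePilot + part3SUS + part5MWQ, data)
anova(model.full)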
The problem is that when I change the order in which the independent variables are entered, I obtain different results that nearly change my conclusions (here I swapped part1FlyingHours and part5MWQ):
> model.baseline = lm(part4Auto ~ 1, data)
> model.1 = update(model.baseline, .~. + part5MWQ)
> model.2 = update(model.1, .~. + part1TypePilot)
> model.3 = update(model.2, .~. + part3SUS)
> model.4 = update(model.3, .~. + part1FlyingHours)
> anova(model.baseline, model.1, model.2, model.3, model.4)
Analysis of Variance Table
Model 1: part4Auto ~ 1
Model 2: part4Auto ~ part5MWQ
Model 3: part4Auto ~ part5MWQ + part1TypePilot
Model 4: part4Auto ~ part5MWQ + part1TypePilot + part3SUS
Model 5: part4Auto ~ part5MWQ + part1TypePilot + part3SUS + part1FlyingHours
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
1     41 22.562
2     40 15.226  1    7.3367 24.1063 1.979e-05 ***
3     38 14.680  2    0.5462  0.8973  0.416588
4     37 10.978  1    3.7015 12.1619  0.001304 **
5     36 10.957  1    0.0215  0.0707  0.791882
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
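Note that both orderings end in the same full model (the final RSS is 10.957 in both tables), so only the decomposition of the explained variation differs. A quick check, assuming the two final fits are stored under the hypothetical names fit.order1 and fit.order2 (the second run above overwrites model.4):

# fit.order1 / fit.order2: the final four-predictor fits from the two
# sequences, saved separately so they can be compared.
all.equal(fitted(fit.order1), fitted(fit.order2))   # same fitted values
c(deviance(fit.order1), deviance(fit.order2))       # same residual sum of squares (10.957)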
In contrast, the output given by summary() does not change.
So my question is: for nested models, can the order in which the variables are introduced really change the results this much? Or is there an important flaw in my setup (such as unbalanced data)? And if it is the former, what can I do to ensure that my results are not biased or incomplete?
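For reference, a minimal sketch of order-invariant (marginal) tests on the full model, assuming the car package is installed; each of these tests a term given all the other terms, so the order of introduction no longer matters:

summary(model.4)            # t-test per coefficient, adjusted for all other terms
drop1(model.4, test = "F")  # F-test for dropping each term from the full model
library(car)                # assumption: the car package is installed
Anova(model.4, type = 2)    # Type II ANOVA table, order-independent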
1/ I may have missed something, but I think that your two models m.full and m.fullv2 are actually exactly the same, so it is only logical that they give the same output. 2/ With the Anova() function, do I have to test each possible model? Put differently, do I have to try the baseline against each factor, like lm(prestige ~ 1) vs lm(prestige ~ education) (which may still be significantly better, despite non-significance in the full model)? – Pyxel Sep 14 '18 at 06:36

With anova(), m.full and m.fullv2 don't give the same output despite being the same model, and this is because this single-model form of anova() is doing the kind of tests that you were doing by hand with the multi-model form of anova(). The main point here is that sequential tests like this are order dependent, and not all hypotheses are useful (testing education without controlling for income or type, for example). – Gavin Simpson Sep 14 '18 at 15:27

With Anova() you fit one model with all the covariates/effects you have hypothesised to be relevant, then you can test their effects given the other terms included. Also, rather than see this as a selection step, as usεr11852 says, use regularisation to shrink the estimated coefficient values; those that get shrunk the most are least useful. – Gavin Simpson Sep 14 '18 at 15:30

In your answer you have m.full <- lm(prestige ~ education + log2(income) + type, data = na.omit(Prestige)) and m.fullv2 <- lm(prestige ~ education + log2(income) + type, data = na.omit(Prestige)), which are the same model, independently of what anova we used after. I think you may have wanted to exchange two predictors? – Pyxel Sep 14 '18 at 17:21

You are right about m.fullv2; I have now edited the code and output to do what I originally intended to show. – Gavin Simpson Sep 14 '18 at 18:31
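A minimal sketch of the regularisation idea mentioned in the comments, assuming the glmnet package is available and reusing the variable names from the question:

# Lasso on the four predictors; coefficients that are shrunk towards zero the
# most are the least useful. model.matrix() expands the factor into dummy columns.
library(glmnet)
X = model.matrix(part4Auto ~ part1FlyingHours + part1TypePilot + part3SUS + part5MWQ, data)[, -1]
y = data$part4Auto
cvfit = cv.glmnet(X, y, alpha = 1)      # alpha = 1: lasso; alpha = 0: ridge
coef(cvfit, s = "lambda.min")           # coefficients at the cross-validated penalty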