I am trying to use nested models to investigate the influence of four factors on my dependent variable. I am not interested in interactions, only in the influence of each variable taken separately. My dependent variable is part4Auto, and my independent variables are:
- part1FlyingHours
- part1TypePilot
- part3SUS
- part5MWQ
So I fitted the following sequence of nested models, which gave this output:
> model.baseline = lm(part4Auto ~ 1, data)
> model.1 = update(model.baseline, .~. + part1FlyingHours)
> model.2 = update(model.1, .~. + part1TypePilot)
> model.3 = update(model.2, .~. + part3SUS)
> model.4 = update(model.3, .~. + part5MWQ)
> anova(model.baseline, model.1, model.2, model.3, model.4)
Analysis of Variance Table
Model 1: part4Auto ~ 1
Model 2: part4Auto ~ part1FlyingHours
Model 3: part4Auto ~ part1FlyingHours + part1TypePilot
Model 4: part4Auto ~ part1FlyingHours + part1TypePilot + part3SUS
Model 5: part4Auto ~ part1FlyingHours + part1TypePilot + part3SUS + part5MWQ
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
1     41 22.562
2     40 21.578  1    0.9846  3.2352  0.080460 .
3     38 19.665  2    1.9125  3.1419  0.055249 .
4     37 13.418  1    6.2477 20.5279 6.241e-05 ***
5     36 10.957  1    2.4612  8.0866  0.007306 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
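If I understand correctly, the same sequential table can also be obtained from a single fit, because anova() applied to one lm object tests the terms in the order they appear in the formula. A minimal sketch, reusing my variable names (model.full is a new name):

# Sequential (Type I) tests from a single full-model fit; the term order
# follows the formula, so this should reproduce the multi-model table above.
model.full = lm(part4Auto ~ part1FlyingHours + part1TypePilot + part3SUS + part5MWQ, data)
anova(model.full)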
The problem is that when I change the order in which the independent variables are entered, I obtain different results that nearly change my conclusions (here I swapped part1FlyingHours and part5MWQ):
> model.baseline = lm(part4Auto ~ 1, data)
> model.1 = update(model.baseline, .~. + part5MWQ)
> model.2 = update(model.1, .~. + part1TypePilot)
> model.3 = update(model.2, .~. + part3SUS)
> model.4 = update(model.3, .~. + part1FlyingHours)
> anova(model.baseline, model.1, model.2, model.3, model.4)
Analysis of Variance Table
Model 1: part4Auto ~ 1
Model 2: part4Auto ~ part5MWQ
Model 3: part4Auto ~ part5MWQ + part1TypePilot
Model 4: part4Auto ~ part5MWQ + part1TypePilot + part3SUS
Model 5: part4Auto ~ part5MWQ + part1TypePilot + part3SUS + part1FlyingHours
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
1     41 22.562
2     40 15.226  1    7.3367 24.1063 1.979e-05 ***
3     38 14.680  2    0.5462  0.8973  0.416588
4     37 10.978  1    3.7015 12.1619  0.001304 **
5     36 10.957  1    0.0215  0.0707  0.791882
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
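Note that both orderings end in the same full model (the final RSS is 10.957 in both tables), so only the decomposition of the explained variation differs. A quick check, assuming the two final fits are stored under the hypothetical names fit.order1 and fit.order2 (the second run above overwrites model.4):

# fit.order1 / fit.order2: the final four-predictor fits from the two
# sequences, saved separately so they can be compared.
all.equal(fitted(fit.order1), fitted(fit.order2))   # same fitted values
c(deviance(fit.order1), deviance(fit.order2))       # same residual sum of squares (10.957)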
In contrast, the output given by summary() does not change.
So my question is: for nested models, can the order in which the variables are introduced really change the results this much? Or is there an important flaw in my setup (such as unbalanced data)? And if it is the former, what can I do to ensure that my results are not biased or incomplete?
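For reference, a minimal sketch of order-invariant (marginal) tests on the full model, assuming the car package is installed; each of these tests a term given all the other terms, so the order of introduction no longer matters:

summary(model.4)            # t-test per coefficient, adjusted for all other terms
drop1(model.4, test = "F")  # F-test for dropping each term from the full model
library(car)                # assumption: the car package is installed
Anova(model.4, type = 2)    # Type II ANOVA table, order-independent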
1/ I may have missed something, but I think that your two models m.full and m.fullv2 are actually exactly the same, so it is only logical that they give the same output. 2/ With the Anova() function, do I have to test each possible model? Put differently, do I have to try the baseline against each factor, like lm(prestige ~ 1) vs lm(prestige ~ education) (which may still be significantly better, despite non-significance in the full model)? – Pyxel Sep 14 '18 at 06:36

With anova(), m.full and m.fullv2 don't give the same output despite being the same model, and this is because this single-model form of anova() is doing the kind of tests that you were doing by hand with the multi-model form of anova(). The main point here is that sequential tests like this are order dependent, and not all hypotheses are useful (testing education without controlling for income or type, for example). – Gavin Simpson Sep 14 '18 at 15:27

With Anova() you fit one model with all the covariates/effects you have hypothesised to be relevant, then you can test their effects given the other terms included. Also, rather than see this as a selection step, as usεr11852 says, use regularisation to shrink the estimated coefficient values; those that get shrunk the most are least useful. – Gavin Simpson Sep 14 '18 at 15:30

In your answer you have m.full <- lm(prestige ~ education + log2(income) + type, data = na.omit(Prestige)) and m.fullv2 <- lm(prestige ~ education + log2(income) + type, data = na.omit(Prestige)), which are the same model, independently of what anova we used after. I think you may have wanted to exchange two predictors? – Pyxel Sep 14 '18 at 17:21

You are right about m.fullv2; I have now edited the code and output to do what I originally intended to show. – Gavin Simpson Sep 14 '18 at 18:31
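A minimal sketch of the regularisation idea mentioned in the comments, assuming the glmnet package is available and reusing the variable names from the question:

# Lasso on the four predictors; coefficients that are shrunk towards zero the
# most are the least useful. model.matrix() expands the factor into dummy columns.
library(glmnet)
X = model.matrix(part4Auto ~ part1FlyingHours + part1TypePilot + part3SUS + part5MWQ, data)[, -1]
y = data$part4Auto
cvfit = cv.glmnet(X, y, alpha = 1)      # alpha = 1: lasso; alpha = 0: ridge
coef(cvfit, s = "lambda.min")           # coefficients at the cross-validated penalty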