Surrogate splits are referenced elsewhere on this site, but I can't find an explanation of what they are. For example:
how does rpart handle missing values in predictors?
How do decision tree learning algorithms deal with missing values (under the hood)
Using the documentation for rpart and these lecture notes, I give an example of a surrogate split, constructed with rpart in R:
library(rpart)

# Five class-1 rows followed by five class-0 rows. weight separates the
# classes perfectly; height almost does (rows 5 and 6 are swapped).
tmp_df <-
  data.frame(Y = as.factor(c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)),
             weight = 10:1,
             height = c(10:7, 5, 6, 4:1))
tmp_df$weight[7] <- NA  # introduce a missing value in row 7
This generates the following data frame:
Y weight height
1 1 10 10
2 1 9 9
3 1 8 8
4 1 7 7
5 1 6 5
6 0 5 6
7 0 NA 4
8 0 3 3
9 0 2 2
10 0 1 1
By construction, the cutpoint weight > 5.5 gives a perfect split on the categorical response Y whenever weight is observed.
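A quick check confirms this on the rows where weight is observed:

# Cross-tabulate the response against the candidate split, using only
# the nine rows where weight is observed: the split is perfect.
with(tmp_df[!is.na(tmp_df$weight), ], table(Y, weight > 5.5))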
Now, an algorithm that simply ignores missing values would discard row 7 and still obtain a split equivalent to weight > 5.5. The rpart package does not do this; instead, it also computes a surrogate split on the height variable, height < 3.5.
The idea behind this is as follows: weight is clearly the best variable to split on. However, when weight is missing, a split on height is a good approximation to the split that would otherwise have been made on weight.
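To make "good approximation" concrete: rpart scores a surrogate by how often it sends an observation to the same side as the primary split. The agreement can be re-computed by hand (my own calculation; the agree and adj values reappear in the summary output below):

# Side taken by the primary split (weight < 5.5 goes left) versus the
# side the surrogate (height < 3.5 goes left) would assign, over the
# nine rows where weight is observed.
obs            <- !is.na(tmp_df$weight)
primary_left   <- tmp_df$weight[obs] < 5.5
surrogate_left <- tmp_df$height[obs] < 3.5
mean(primary_left == surrogate_left)    # 8/9 = 0.889, the 'agree' value

# 'adj' rescales agreement against the naive rule "send every case with
# the majority" (here 5 of the 9 cases go right):
majority <- max(table(primary_left))
(sum(primary_left == surrogate_left) - majority) /
  (length(primary_left) - majority)     # 3/4 = 0.75, the 'adj' value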
Let's fit two models to demonstrate this: first tm_0, a tree model with surrogates disabled, and then tm, a tree model using the default surrogate behaviour in rpart:
tm_0 <- rpart(Y ~ weight + height, data = tmp_df,
              control = rpart.control(minsplit = 1,
                                      minbucket = 1,
                                      cp = 0,
                                      maxdepth = 1,
                                      usesurrogate = 0))

tm <- rpart(Y ~ weight + height, data = tmp_df,
            control = rpart.control(minsplit = 1,
                                    minbucket = 1,
                                    cp = 0,
                                    maxdepth = 1))
We see that the splits are as described, in the output of summary(tm):
Primary splits:
weight < 5.5 to the left, improve=4.444444, (1 missing)
height < 4.5 to the left, improve=3.333333, (0 missing)
Surrogate splits:
height < 3.5 to the left, agree=0.889, adj=0.75, (1 split)
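The improve figures can also be reproduced by hand. As far as I can tell, for a classification tree with the default Gini rule, the reported improvement is n times the decrease in the Gini index, computed over the observations where the split variable is non-missing:

# Gini index of a vector of class labels
gini <- function(y) 1 - sum(prop.table(table(y))^2)

# improvement = n * (I(parent) - p_L * I(left) - p_R * I(right)),
# with n, p_L and p_R taken over the rows where the split variable
# is observed
improve <- function(y, go_left) {
  p_left <- mean(go_left)
  length(y) * (gini(y) - p_left * gini(y[go_left]) -
                 (1 - p_left) * gini(y[!go_left]))
}

obs <- !is.na(tmp_df$weight)
improve(tmp_df$Y[obs], tmp_df$weight[obs] < 5.5)  # 4.444444
improve(tmp_df$Y, tmp_df$height < 4.5)            # 3.333333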
Now, compare the predictions from these two models using the following new data:
tmp_new_df <- data.frame(weight = c(rep(NA_real_, 4), 3:6), height = rep(3:6, 2))
> tmp_new_df
weight height
1 NA 3
2 NA 4
3 NA 5
4 NA 6
5 3 3
6 4 4
7 5 5
8 6 6
Contrast predict(tm_0, newdata = tmp_new_df) (the first column is the predicted probability of class 0):
0 1
1 0.5 0.5
2 0.5 0.5
3 0.5 0.5
4 0.5 0.5
5 1.0 0.0
6 1.0 0.0
7 1.0 0.0
8 0.0 1.0
to predict(tm, newdata = tmp_new_df):
0 1
1 1.0000000 0.0000000
2 0.1666667 0.8333333
3 0.1666667 0.8333333
4 0.1666667 0.8333333
5 1.0000000 0.0000000
6 1.0000000 0.0000000
7 1.0000000 0.0000000
8 0.1666667 0.8333333
In the first four rows, weight is missing, so the tree tm_0 cannot evaluate the split on weight and simply returns the class membership proportions at the root node. In contrast, the tree tm uses the surrogate split on height to give a more informative prediction for these rows. Note, however, the difference in the last four rows: the tree with surrogate splits is unable to give a 'perfect' prediction, because of how observations with missing predictor values are aggregated into the terminal nodes. (See the documentation for rpart for more details.)
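Concretely, the terminal node assignments of the training data show what happened: the row with the missing weight (row 7, class 0) has height 4, so the surrogate height < 3.5 routes it to the right-hand node alongside the five class-1 rows, which is exactly where the 1/6 and 5/6 proportions above come from:

# tm$where gives the terminal node assigned to each training row;
# the right-hand node holds five class-1 rows plus the class-0 row
# with the missing weight.
table(node = tm$where, Y = tmp_df$Y)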