Surrogate splits are referenced elsewhere on this site, but I can't find an explanation of what they are. For example:
how does rpart handle missing values in predictors?
How do decision tree learning algorithms deal with missing values (under the hood)
Using the documentation for rpart and these lecture notes, I give an example of a surrogate split, constructed with rpart in R:
library(rpart)

# Five class-1 rows followed by five class-0 rows. weight separates the
# classes perfectly; height almost does (rows 5 and 6 are swapped).
tmp_df <-
  data.frame(Y = as.factor(c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)),
             weight = 10:1,
             height = c(10:7, 5, 6, 4:1))
tmp_df$weight[7] <- NA  # introduce a missing value in row 7
This generates the following data frame:
Y weight height
1 1 10 10
2 1 9 9
3 1 8 8
4 1 7 7
5 1 6 5
6 0 5 6
7 0 NA 4
8 0 3 3
9 0 2 2
10 0 1 1
By construction, the cutpoint weight > 5.5 gives a perfect split on the categorical response Y whenever weight is observed.
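A quick check confirms this on the rows where weight is observed:

# Cross-tabulate the response against the candidate split, using only
# the nine rows where weight is observed: the split is perfect.
with(tmp_df[!is.na(tmp_df$weight), ], table(Y, weight > 5.5))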
Now, an algorithm that simply ignores missing values would discard row 7 and still obtain a split equivalent to weight > 5.5. The rpart package does not do this; instead, it also computes a surrogate split on the height variable, height < 3.5.
The idea behind this is as follows: weight is clearly the best variable to split on. However, when weight is missing, a split on height is a good approximation to the split that would otherwise have been made on weight.
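To make "good approximation" concrete: rpart scores a surrogate by how often it sends an observation to the same side as the primary split. The agreement can be re-computed by hand (my own calculation; the agree and adj values reappear in the summary output below):

# Side taken by the primary split (weight < 5.5 goes left) versus the
# side the surrogate (height < 3.5 goes left) would assign, over the
# nine rows where weight is observed.
obs            <- !is.na(tmp_df$weight)
primary_left   <- tmp_df$weight[obs] < 5.5
surrogate_left <- tmp_df$height[obs] < 3.5
mean(primary_left == surrogate_left)    # 8/9 = 0.889, the 'agree' value

# 'adj' rescales agreement against the naive rule "send every case with
# the majority" (here 5 of the 9 cases go right):
majority <- max(table(primary_left))
(sum(primary_left == surrogate_left) - majority) /
  (length(primary_left) - majority)     # 3/4 = 0.75, the 'adj' value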
Let's fit two models to demonstrate this: first tm_0, a tree model with surrogates disabled, and then tm, a tree model using the default surrogate behaviour in rpart:
tm_0 <- rpart(Y ~ weight + height, data = tmp_df,
              control = rpart.control(minsplit = 1,
                                      minbucket = 1,
                                      cp = 0,
                                      maxdepth = 1,
                                      usesurrogate = 0))

tm <- rpart(Y ~ weight + height, data = tmp_df,
            control = rpart.control(minsplit = 1,
                                    minbucket = 1,
                                    cp = 0,
                                    maxdepth = 1))
We see that the splits are as described, in the output of summary(tm):
Primary splits:
weight < 5.5 to the left, improve=4.444444, (1 missing)
height < 4.5 to the left, improve=3.333333, (0 missing)
Surrogate splits:
height < 3.5 to the left, agree=0.889, adj=0.75, (1 split)
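The improve figures can also be reproduced by hand. As far as I can tell, for a classification tree with the default Gini rule, the reported improvement is n times the decrease in the Gini index, computed over the observations where the split variable is non-missing:

# Gini index of a vector of class labels
gini <- function(y) 1 - sum(prop.table(table(y))^2)

# improvement = n * (I(parent) - p_L * I(left) - p_R * I(right)),
# with n, p_L and p_R taken over the rows where the split variable
# is observed
improve <- function(y, go_left) {
  p_left <- mean(go_left)
  length(y) * (gini(y) - p_left * gini(y[go_left]) -
                 (1 - p_left) * gini(y[!go_left]))
}

obs <- !is.na(tmp_df$weight)
improve(tmp_df$Y[obs], tmp_df$weight[obs] < 5.5)  # 4.444444
improve(tmp_df$Y, tmp_df$height < 4.5)            # 3.333333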
Now, compare the predictions from these two models using the following new data:
tmp_new_df <- data.frame(weight = c(rep(NA_real_, 4), 3:6), height = rep(3:6, 2))
> tmp_new_df
weight height
1 NA 3
2 NA 4
3 NA 5
4 NA 6
5 3 3
6 4 4
7 5 5
8 6 6
Contrast predict(tm_0, newdata = tmp_new_df) (the first column is the predicted probability of class 0):
0 1
1 0.5 0.5
2 0.5 0.5
3 0.5 0.5
4 0.5 0.5
5 1.0 0.0
6 1.0 0.0
7 1.0 0.0
8 0.0 1.0
to predict(tm, newdata = tmp_new_df):
0 1
1 1.0000000 0.0000000
2 0.1666667 0.8333333
3 0.1666667 0.8333333
4 0.1666667 0.8333333
5 1.0000000 0.0000000
6 1.0000000 0.0000000
7 1.0000000 0.0000000
8 0.1666667 0.8333333
In the first four rows, weight is missing, so the tree tm_0 cannot evaluate the split on weight and simply returns the class membership proportions at the root node. In contrast, the tree tm uses the surrogate split on height to give a more informative prediction for these rows. Note, however, the difference in the last four rows: the tree with surrogate splits is unable to give a 'perfect' prediction, because of how observations with missing predictor values are aggregated into the terminal nodes. (See the documentation for rpart for more details.)
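Concretely, the terminal node assignments of the training data show what happened: the row with the missing weight (row 7, class 0) has height 4, so the surrogate height < 3.5 routes it to the right-hand node alongside the five class-1 rows, which is exactly where the 1/6 and 5/6 proportions above come from:

# tm$where gives the terminal node assigned to each training row;
# the right-hand node holds five class-1 rows plus the class-0 row
# with the missing weight.
table(node = tm$where, Y = tmp_df$Y)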