5

Let's say I have a response variable $Y$ satisfying a complicated tree-structure:

$$Y=f(X_1,X_2,X_3,\ldots,X_p) + \varepsilon$$

where:

  • $\varepsilon = 0$ (the data are noiseless)
  • $f$ is a known, deep tree-structured function (not necessarily binary); that is, we can generate as many observations as we want.

My question is: which highly interpretable algorithm can learn $f$ perfectly? I first thought of CART, but my experiment in R via rpart doesn't give the desired result.

I tried two settings of the complexity parameter in rpart: $cp = -1$ and $cp = 0$, so that every split decreasing the overall lack of fit is attempted.
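For concreteness, here is a minimal sketch of how these settings can be passed to rpart (the data frame `D` with response `y` and predictors is a placeholder; the relaxed `minsplit`/`minbucket`/`maxdepth` values are assumptions to remove rpart's other complexity limits, not something tested above):

```r
library(rpart)

# Hypothetical data frame D with response y and the predictors as columns.
# cp = -1 means no split is rejected for failing to improve the fit;
# minsplit = 2 and minbucket = 1 remove the node-size limits;
# maxdepth = 30 is rpart's hard upper bound on tree depth.
fit <- rpart(y ~ ., data = D,
             control = rpart.control(cp = -1, minsplit = 2,
                                     minbucket = 1, maxdepth = 30))
```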

Metariat
  • What parameters did you use on rpart? What are the characteristics of the tree that generates your data? – Firebug Jul 04 '16 at 14:43
  • I updated the information in the question. As for the characteristics of the tree, I asked about tree structures in general, in no particular form. But does that change something? – Metariat Jul 04 '16 at 14:49
  • I'm not sure, but perhaps it's a tree so deep/complex you have to take some extreme parameters to recreate it. – Firebug Jul 04 '16 at 15:06
  • @Firebug What do you mean by "extreme parameters"? I had the same thought, which is why I tried cp = 0 first and then changed it to -1; the result got better, but I still didn't recover $f$. The other parameters don't make sense to me in this context. – Metariat Jul 04 '16 at 15:10
  • minsplit/minbucket comes to mind, and also is depth beyond 30? – Firebug Jul 04 '16 at 15:20
  • @Firebug All these parameters will only limit the complexity of the tree, if I understood correctly. – Metariat Jul 04 '16 at 15:34
  • Why add $\varepsilon$ if it's equal to $0$. Might you have meant $\operatorname{E}(\varepsilon) = 0$ rather then $\varepsilon=0$? $\qquad$ – Michael Hardy Jul 04 '16 at 17:14

2 Answers

4

General-purpose decision tree procedures (rpart included) do not attempt to find the tree that best fits the data. Instead, they search greedily for a "good" tree.

Greedy means that whenever a split is under consideration, the best split at that node is chosen without considering any information from potential future splits. It is rather easy to conceive of a situation where this fails to find the optimal tree:

x_1 <- runif(100)
x_2 <- runif(100)

# The first split (on x_1) changes the response by only ~2% (0.99 vs 1.01),
# and the second-level thresholds differ only slightly (.51 vs .49).
y <- ifelse(x_1 < .5,
            0.99 * ifelse(x_2 < .51, 0, 1),
            1.01 * ifelse(x_2 < .49, 0, 1)
)

In this example, y is generated by a true tree on (x_1, x_2):

                  root
              /          \
        x_1 < .5         x_1 >= .5
        /      \         /        \
x_2 < .51  x_2 >= .51  x_2 < .49   x_2 >= .49

but constructed in a way that

  • the effect of the true first split (on x_1) on the response is nearly imperceptible compared to the second split, and
  • the difference between the two second-level thresholds (.51 vs. .49) is very small.

Here's a visualization of the effect:

color = ifelse(x_1 < .5, "red", "blue")
plot(x_2, y, col = color)

[Plot of y against x_2, colored red for x_1 < .5 and blue otherwise]

This should trick a greedy algorithm into choosing x_2 < .5 as its first split:

D <- data.frame(x_1, x_2, y)
rpart(y ~ x_1 + x_2, data = D)

which results in

1) root 100 25.124860 0.5012  
  2) x_2< 0.4971595 50  0.000000 0.0000 *
  3) x_2>=0.4971595 50  0.004712 1.0024 *

which is pretty much what I predicted would happen (maybe even a bit worse, as the fitted tree does not find any dependence on x_1 at all).

To avoid this, you would have to use a non-greedy algorithm that searches the entire collection of possible trees and finds the truly optimal one. The only way I can think of to do this is an exhaustive search. That is feasible for shallow trees, but the computational cost blows up exponentially with depth, so it quickly becomes impossible for deep trees.
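As a middle ground (my suggestion, not something the answer above tested), the evtree R package searches the tree space globally with an evolutionary algorithm rather than greedily: whole trees are scored at once, so the first split is evaluated jointly with its children. It is stochastic and not guaranteed to find the true optimum, but on data like the simulation above it at least has a chance of recovering the x_1 split. A sketch:

```r
library(evtree)  # evolutionary (global, non-greedy) tree search

# Same simulated data as above
D <- data.frame(x_1 = runif(100), x_2 = runif(100))
D$y <- ifelse(D$x_1 < .5,
              0.99 * ifelse(D$x_2 < .51, 0, 1),
              1.01 * ifelse(D$x_2 < .49, 0, 1))

# Candidate trees are mutated and recombined over many generations,
# with fitness measured on the whole tree rather than one split at a time.
fit <- evtree(y ~ x_1 + x_2, data = D,
              control = evtree.control(maxdepth = 3))
print(fit)
```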

Note: I suspect that fitting an optimal tree exactly is an NP-hard problem, but I'm still searching for a reference.

Matthew Drury
  • Very good explanation of why rpart fails to find the optimal tree. Could you elaborate on how to construct a "non-greedy algorithm that searches the entirety of the collection of possible trees and finds the truly optimal one"? – Metariat Jul 04 '16 at 20:28
-1

I think you may try the Bayesian Additive Regression Trees (BART) method; there is a bartMachine R package you can use.
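A minimal sketch of how bartMachine is typically invoked on the simulated data from the other answer (bartMachine requires Java via rJava; the data here are an assumption for illustration):

```r
library(bartMachine)

# Predictors as a data frame, response as a numeric vector,
# reusing the noiseless tree-structured simulation from above.
X <- data.frame(x_1 = runif(100), x_2 = runif(100))
y <- ifelse(X$x_1 < .5,
            0.99 * ifelse(X$x_2 < .51, 0, 1),
            1.01 * ifelse(X$x_2 < .49, 0, 1))

bm <- bartMachine(X, y)   # fit a Bayesian sum-of-trees model
preds <- predict(bm, X)   # posterior mean predictions
```

Note that BART averages many small trees, so it can approximate $f$ well but trades away some of the single-tree interpretability the question asks for.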

tairen