5

Let's say I have a response variable $Y$ satisfying a complicated tree-structure:

$$Y=f(X_1,X_2,X_3,\ldots,X_p) + \varepsilon$$

where:

  • $\varepsilon = 0$ (the data are noiseless)
  • $f$ is a known, deep tree-structured function (not necessarily binary); that is, we can generate as many observations as we want.

My question is: which highly interpretable algorithm can learn $f$ perfectly? I first thought of CART, but my experiment in R via rpart doesn't give the desired result.

I tried two settings of the complexity parameter in rpart: $cp = -1$ and $cp = 0$, so that every split decreasing the overall lack of fit is attempted.
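For concreteness, here is a minimal sketch of how these settings can be passed to rpart (the data frame `D` with response `y` and predictors is a placeholder; the relaxed `minsplit`/`minbucket`/`maxdepth` values are assumptions to remove rpart's other complexity limits, not something tested above):

```r
library(rpart)

# Hypothetical data frame D with response y and the predictors as columns.
# cp = -1 means no split is rejected for failing to improve the fit;
# minsplit = 2 and minbucket = 1 remove the node-size limits;
# maxdepth = 30 is rpart's hard upper bound on tree depth.
fit <- rpart(y ~ ., data = D,
             control = rpart.control(cp = -1, minsplit = 2,
                                     minbucket = 1, maxdepth = 30))
```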

Metariat
  • What parameters did you use on rpart? What are the characteristics of the tree that generates your data? – Firebug Jul 04 '16 at 14:43
  • I updated the information in the question. As for the characteristics of the tree, I asked about tree structures in general, in no particular form. But does that change something? – Metariat Jul 04 '16 at 14:49
  • I'm not sure, but perhaps it's a tree so deep/complex you have to take some extreme parameters to recreate it. – Firebug Jul 04 '16 at 15:06
  • @Firebug What do you mean by "extreme parameters"? I had the same thought, which is why I tried cp = 0 first and then changed it to -1; the result got better, but I still didn't recover $f$. The other parameters don't make sense to me in this context. – Metariat Jul 04 '16 at 15:10
  • minsplit/minbucket comes to mind, and also is depth beyond 30? – Firebug Jul 04 '16 at 15:20
  • @Firebug All these parameters will only limit the complexity of the tree, if I understood correctly. – Metariat Jul 04 '16 at 15:34
  • Why add $\varepsilon$ if it's equal to $0$. Might you have meant $\operatorname{E}(\varepsilon) = 0$ rather then $\varepsilon=0$? $\qquad$ – Michael Hardy Jul 04 '16 at 17:14

2 Answers

4

General-purpose decision tree procedures (rpart included) do not attempt to find the tree that best fits the data. Instead, they search greedily for a "good" tree.

Greedy means that whenever a split is under consideration, the best split at that node is chosen without considering any information from potential future splits. It is rather easy to conceive of a situation where this fails to find the optimal tree:

x_1 <- runif(100)
x_2 <- runif(100)

# The first split (on x_1) changes the response by only ~2% (0.99 vs 1.01),
# and the second-level thresholds differ only slightly (.51 vs .49).
y <- ifelse(x_1 < .5,
            0.99 * ifelse(x_2 < .51, 0, 1),
            1.01 * ifelse(x_2 < .49, 0, 1)
)

In this example, y is generated by a true tree on (x_1, x_2):

                  root
              /          \
        x_1 < .5         x_1 >= .5
        /      \         /        \
x_2 < .51  x_2 >= .51  x_2 < .49   x_2 >= .49

but constructed in a way that

  • the effect of the true first split (on x_1) on the response is nearly imperceptible compared to the second split, and
  • the difference between the two second-level thresholds (.51 vs. .49) is very small.

Here's a visualization of the effect:

color = ifelse(x_1 < .5, "red", "blue")
plot(x_2, y, col = color)

[Plot of y against x_2, colored red for x_1 < .5 and blue otherwise]

This should trick a greedy algorithm into choosing x_2 < .5 as its first split:

D <- data.frame(x_1, x_2, y)
rpart(y ~ x_1 + x_2, data = D)

which results in

1) root 100 25.124860 0.5012  
  2) x_2< 0.4971595 50  0.000000 0.0000 *
  3) x_2>=0.4971595 50  0.004712 1.0024 *

which is pretty much what I predicted would happen (maybe even a bit worse, as the fitted tree does not find any dependence on x_1 at all).

To avoid this, you would have to use a non-greedy algorithm that searches the entire collection of possible trees and finds the truly optimal one. The only way I can think of to do this is an exhaustive search. That is feasible for shallow trees, but the computational cost blows up exponentially with depth, so it quickly becomes impossible for deep trees.
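As a middle ground (my suggestion, not something the answer above tested), the evtree R package searches the tree space globally with an evolutionary algorithm rather than greedily: whole trees are scored at once, so the first split is evaluated jointly with its children. It is stochastic and not guaranteed to find the true optimum, but on data like the simulation above it at least has a chance of recovering the x_1 split. A sketch:

```r
library(evtree)  # evolutionary (global, non-greedy) tree search

# Same simulated data as above
D <- data.frame(x_1 = runif(100), x_2 = runif(100))
D$y <- ifelse(D$x_1 < .5,
              0.99 * ifelse(D$x_2 < .51, 0, 1),
              1.01 * ifelse(D$x_2 < .49, 0, 1))

# Candidate trees are mutated and recombined over many generations,
# with fitness measured on the whole tree rather than one split at a time.
fit <- evtree(y ~ x_1 + x_2, data = D,
              control = evtree.control(maxdepth = 3))
print(fit)
```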

Note: I suspect that fitting an optimal tree exactly is an NP-hard problem, but I'm still searching for a reference.

Matthew Drury
  • Very good explanation of why rpart fails to find the optimal tree. Could you elaborate on how to construct a "non-greedy algorithm that searches the entirety of the collection of possible trees and finds the truly optimal one"? – Metariat Jul 04 '16 at 20:28
-1

I think you may try the Bayesian Additive Regression Trees (BART) method; there is a bartMachine R package you can use.
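A minimal sketch of how bartMachine is typically invoked on the simulated data from the other answer (bartMachine requires Java via rJava; the data here are an assumption for illustration):

```r
library(bartMachine)

# Predictors as a data frame, response as a numeric vector,
# reusing the noiseless tree-structured simulation from above.
X <- data.frame(x_1 = runif(100), x_2 = runif(100))
y <- ifelse(X$x_1 < .5,
            0.99 * ifelse(X$x_2 < .51, 0, 1),
            1.01 * ifelse(X$x_2 < .49, 0, 1))

bm <- bartMachine(X, y)   # fit a Bayesian sum-of-trees model
preds <- predict(bm, X)   # posterior mean predictions
```

Note that BART averages many small trees, so it can approximate $f$ well but trades away some of the single-tree interpretability the question asks for.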

tairen