
The questions: Why is Random Forest not better at finding an interaction of a simple indicator $\times$ continuous variable? What kind of machine learning model would be better at finding this interaction?

First, please see this background question about an interaction formed as the product of two continuous variables:

Including Interaction Terms in Random Forest

I reproduce the answer in Python as follows:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def abline(intercept, slope):
    # draw a reference line y = intercept + slope * x across the current axes
    axes = plt.gca()
    x_vals = np.array(axes.get_xlim())
    y_vals = intercept + slope * x_vals
    plt.plot(x_vals, y_vals, '--', c='orange')

X = np.random.normal(scale=3, size=(1000, 2))           # two continuous predictors
random_noise = np.random.normal(scale=5, size=(1000,))
y = X[:,0] + X[:,1] + 5*X[:,0]*X[:,1] + random_noise     # main effects plus a strong x1*x2 interaction

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

mod_reg = LinearRegression()
preds_reg = mod_reg.fit(X_train, y_train).predict(X_test)

mod_tree = RandomForestRegressor(n_estimators=100)
preds_tree = mod_tree.fit(X_train, y_train).predict(X_test)

plt.scatter(y_test, preds_reg, c='k')
plt.scatter(y_test, preds_tree)
plt.legend(labels=['linear', 'tree'])
abline(0, 1)

As we can see, the Random Forest is very successful at finding an interaction term of the form $x_1 \times x_2$.

[Figure: predicted vs. actual $y$ on the test set for the linear model (black) and the Random Forest, with the $y = x$ reference line]
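
As a quick numerical check (a sketch assuming the objects fitted above are still in scope; exact values will vary with the random draw), the held-out scores can be printed directly:

print(mod_reg.score(X_test, y_test))    # R^2 of the linear model, which misses the x1*x2 term
print(mod_tree.score(X_test, y_test))   # R^2 of the Random Forest, expected to be much higher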

I am interested though in finding interactions based on categorical variables. Specifically, I am positing the following kind of interaction:

X = np.random.normal(scale=3, size=(1000, 2))
random_noise = np.random.normal(scale=5, size=(1000,))
y = X[:,0] + X[:,1] + random_noise                       # no multiplicative interaction this time

elements = [0, 1]
probabilities = [0.20, 0.80]
flag = np.random.choice(elements, 1000, p=probabilities)                 # indicator I: 0 with prob 0.2
leaky = np.where(flag==0, y, np.random.normal(scale=5, size=(1000,)))    # equals y when I==0, pure noise otherwise

X = np.column_stack([X, flag, leaky])                    # columns: x1, x2, flag, leaky

So I have two continuous variables, $x_1$ and $x_2$; a dummy variable $I$ that takes values in $\{0, 1\}$; and a fourth continuous variable, $x_3$ (leaky), which equals the ground truth $y$ when $I=0$ (i.e., it is perfectly informative) but carries no information when $I=1$.

On the face of it, this seems like the perfect situation for a Random Forest because if it splits on $I$, then it gets into a branch where it has perfect knowledge of the target.
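
One way to see whether a tree actually discovers this structure (a sketch; the shallow probe model and its variable name are illustrative additions, not part of the original analysis) is to fit a single short decision tree on the augmented data and inspect which column it splits on first:

from sklearn.tree import DecisionTreeRegressor

probe = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(probe.tree_.feature[0])        # column index used at the root split (2 = flag, 3 = leaky)
print(probe.feature_importances_)    # impurity-based importances of the four columns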

However, results are very disappointing:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

mod_reg = LinearRegression().fit(X_train, y_train)
preds_reg = mod_reg.predict(X_test)

mod_tree = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
preds_tree = mod_tree.predict(X_test)


print(mod_reg.score(X_test, y_test))                                     # R^2 on the full test set
print(mod_reg.score(X_test[X_test[:, 2]==0], y_test[X_test[:, 2]==0]))   # R^2 on the I==0 subset

0.471554024327
0.583852124819


print(mod_tree.score(X_test, y_test))                                    # R^2 on the full test set
print(mod_tree.score(X_test[X_test[:, 2]==0], y_test[X_test[:, 2]==0]))  # R^2 on the I==0 subset

0.35785497514
0.633730050982

The $R^2$ of the tree when predicting the $y$'s where $I=0$ is only about 0.63, which is only marginally better than the linear model's 0.58. I would have expected $\gg 0.90$.
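
For reference, by construction the leaky column equals $y$ exactly on the $I=0$ rows, so an "oracle" that simply returns that column on this subset scores essentially 1.0. A minimal check (a sketch using scikit-learn's r2_score):

from sklearn.metrics import r2_score

mask = X_test[:, 2] == 0
print(r2_score(y_test[mask], X_test[mask, 3]))   # ~1.0, since leaky == y wherever flag == 0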

So, the question above: why isn't it better at finding this and what would be better?


1 Answer


I realized that I made an error in generating the fake data here. The variable leaky is getting information about $x_1$ and $x_2$, which are mean-0 random variables. Leaking information about the realizations of pure noise in the training set is not going to help you predict pure noise in the test set. When I correct for this (giving $x_1$ and $x_2$ nonzero means and using a larger sample), the Random Forest does indeed find the interaction effect.

n_samples = 10000
x1 = np.random.normal(loc=0.5, scale=1, size=(n_samples,))    # nonzero means this time
x2 = np.random.normal(loc=0.75, scale=1, size=(n_samples,))
random_noise = np.random.normal(scale=5, size=(n_samples,))
y = x1 + x2 + random_noise

elements = [0, 1]
probabilities = [0.20, 0.80]
flag = np.random.choice(elements, n_samples, p=probabilities)
leaky = np.where(flag==0, y, np.random.normal(scale=10, size=(n_samples,)))   # equals y when flag==0
X = np.column_stack([x1, x2, flag, leaky])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

mod_reg = LinearRegression().fit(X_train, y_train)
preds_reg = mod_reg.predict(X_test)

mod_tree = RandomForestRegressor(n_estimators=1000, min_samples_leaf=3).fit(X_train, y_train)
preds_tree = mod_tree.predict(X_test)

print(mod_reg.score(X_test, y_test))
print(mod_reg.score(X_test[X_test[:, 2]==0], y_test[X_test[:, 2]==0]))

0.0610796852093
0.140946827667

print(mod_tree.score(X_test, y_test))
print(mod_tree.score(X_test[X_test[:, 2]==0], y_test[X_test[:, 2]==0]))

0.175594757301
0.999790235233
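
As a final sanity check (a sketch; exact numbers depend on the random draw), the refit forest's impurity-based feature importances should now put most of the weight on the leaky column:

print(mod_tree.feature_importances_)   # expect column 3 (leaky) to dominate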