The questions: Why is a Random Forest not better at finding an interaction of the form simple indicator $\times$ continuous variable? What kind of machine learning model would be better at finding this interaction?
First, please see this background question about an interaction formed by the product of two continuous variables:
Including Interaction Terms in Random Forest
I reproduce the answer in Python as follows:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# draw the reference line y = intercept + slope * x over the current x-limits
def abline(intercept, slope):
axes = plt.gca()
x_vals = np.array(axes.get_xlim())
y_vals = intercept + slope * x_vals
plt.plot(x_vals, y_vals, '--', c='orange')
X = np.random.normal(scale=3, size=(1000, 2))
random_noise = np.random.normal(scale=5, size=(1000,))
y = X[:,0] + X[:,1] + 5*X[:,0]*X[:,1] + random_noise  # target with a strong x1*x2 interaction
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
mod_reg = LinearRegression()
preds_reg = mod_reg.fit(X_train, y_train).predict(X_test)
mod_tree = RandomForestRegressor(n_estimators=100)
preds_tree = mod_tree.fit(X_train, y_train).predict(X_test)
plt.scatter(y_test, preds_reg, c='k')
plt.scatter(y_test, preds_tree)
plt.legend(labels=['linear', 'tree'])
abline(0, 1)
As we can see, the Random Forest is very successful at capturing an interaction term of the form $x_1 x_2$.
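To put a number on the visual comparison, one can also print the held-out $R^2$ of each model (a small sketch reusing the fitted models from the snippet above; note that .score returns $R^2$):

print(mod_reg.score(X_test, y_test))   # R^2 of the linear regression on the test set
print(mod_tree.score(X_test, y_test))  # R^2 of the random forest on the test set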
I am interested, though, in finding interactions based on a categorical variable. Specifically, I posit the following kind of interaction:
X = np.random.normal(scale=3, size=(1000, 2))
random_noise = np.random.normal(scale=5, size=(1000,))
y = X[:,0] + X[:,1] + random_noise
# binary flag: 0 with probability 0.20, 1 with probability 0.80
elements = [0, 1]
probabilities = [0.20, 0.80]
flag = np.random.choice(elements, 1000, p=probabilities)

# "leaky" feature: equals y exactly when flag == 0, pure noise when flag == 1
leaky = np.where(flag==0, y, np.random.normal(scale=5, size=(1000,)))

X = np.column_stack([X, flag, leaky])
So I have two continuous variables, $x_1$ and $x_2$; a dummy variable $I$ that takes values in $\{0, 1\}$; and a fourth, "leaky" continuous variable, $x_3$, which equals the target $y$ exactly when $I=0$ (i.e., it is perfectly informative) but is pure noise when $I=1$.
On the face of it, this seems like an ideal situation for a Random Forest: if it splits on $I$, it lands in a branch where, through $x_3$, it has essentially perfect knowledge of the target.
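As a sanity check that the information really is there, one can restrict to the rows where the flag is 0 and regress $y$ on the leaky column alone (a minimal sketch reusing flag, leaky, and y from the snippet above; since leaky equals y exactly on those rows, the in-sample $R^2$ should be essentially 1):

# On the I == 0 rows, the leaky feature alone explains y perfectly by construction
mask = flag == 0
check = LinearRegression().fit(leaky[mask].reshape(-1, 1), y[mask])
print(check.score(leaky[mask].reshape(-1, 1), y[mask]))  # ~1.0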
However, the results are very disappointing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
mod_reg = LinearRegression().fit(X_train, y_train)
preds_reg = mod_reg.predict(X_test)
mod_tree = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
preds_tree = mod_tree.predict(X_test)
print(mod_reg.score(X_test, y_test))                                    # linear, all test rows
print(mod_reg.score(X_test[X_test[:, 2]==0], y_test[X_test[:, 2]==0]))  # linear, I == 0 rows only
0.471554024327
0.583852124819
print(mod_tree.score(X_test, y_test))                                   # forest, all test rows
print(mod_tree.score(X_test[X_test[:, 2]==0], y_test[X_test[:, 2]==0])) # forest, I == 0 rows only
0.35785497514
0.633730050982
The $R^2$ of the forest (which is what .score returns, not RMSE), when predicting $y$ on the rows where $I=0$, is only about 0.63 and only marginally better than the linear model. I would have expected something well above 0.9.
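One diagnostic that might help probe this is the forest's fitted feature importances, to see how much it actually relies on the flag and the leaky column (a sketch assuming mod_tree from above; the column names are just labels I am attaching here for readability):

# Label the four columns of X in the order they were stacked
for name, imp in zip(['x1', 'x2', 'flag', 'leaky'], mod_tree.feature_importances_):
    print(name, imp)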
So, back to the questions above: why isn't the Random Forest better at finding this interaction, and what kind of model would be?
