
I am using BorutaPy, a Python implementation of the Boruta algorithm, with a random forest estimator:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

model = RandomForestClassifier(n_estimators=5000, n_jobs=-1)

boruta = BorutaPy(model, max_iter=10, verbose=2, random_state=1)

And I fit boruta like so:

boruta.fit(np.array(X_train), np.array(Y_train)) # X_train is a DataFrame

Then I transform the inputs:

X_train_br = boruta.transform(np.array(X_train))
X_test_br = boruta.transform(np.array(X_test))

Then I fit the RF estimator:

model.fit(X_train_br, Y_train)

My input has 240 features, i.e.:

>>> X_train.shape
(67092, 240)

I fit Boruta with max_iter=10, but strangely not a single feature is classified as Confirmed or Rejected:

...
building tree 1000 of 1000
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.2min finished
Iteration:  5 / 10
Confirmed:  0
Tentative:  240
Rejected:   0
....

What am I doing incorrectly here?

arilwan

2 Answers


Why would you expect anything to be wrong? You are using the default settings, and this is a valid result. You could try more iterations to see whether they lead to more decisive results. It's better for the algorithm to say “I don't know” than to be forced into an arbitrary choice.

But why are you using it in the first place? A random forest will pick the features it needs on its own; even collinear features are not a problem.
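A small sketch of that point (my own illustration on synthetic data, not OP's setup): a random forest trains fine even when a feature is duplicated outright, simply spreading importance across the copies.

```python
# Minimal sketch (synthetic data, not OP's): a random forest copes with a
# perfectly collinear (duplicated) feature; the fit succeeds and the two
# copies simply share importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_dup = np.hstack([X, X[:, [0]]])  # append an exact copy of the first column

rf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
rf.fit(X_dup, y)

# Importances still sum to 1; columns 0 and 10 split column 0's share.
print(rf.feature_importances_[0], rf.feature_importances_[-1])
```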

Tim
    While correlated or even duplicated features are not a problem for estimating an RF model, they can be a problem for estimating feature importance metrics because 2+ features are “competing” for the same importance value. This is one of the quirks of using Boruta in the presence of highly correlated/duplicated features: informative features can have their importance metrics "deflated." If such features are present in OP's case, I wonder if omitting the redundant features would give more decisive results. – Sycorax May 15 '23 at 17:00
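A sketch of the comment's suggestion (illustrative synthetic frame, not OP's data): one common recipe is to drop one member of each highly correlated pair before running Boruta.

```python
# Sketch: drop near-duplicate columns (|corr| > 0.95) before feature selection,
# so informative features don't split their importance. Data here is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"f{i}" for i in range(5)])
df["f5"] = df["f0"] + rng.normal(scale=0.01, size=100)  # near-duplicate of f0

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)
```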

You can try two things:

a) the recommended arguments;

b) non-default arguments for stricter/more robust estimation.

Concretely:

a1) "We highly recommend using pruned trees with a depth between 3-7":

model = RandomForestClassifier(class_weight='balanced',
                               max_depth=3, n_jobs=-1)

b1) Lower alpha to reject more features.

b2) Set n_estimators to 'auto': the more you overfit, the fewer features are rejected.

b3) Increase max_iter; if it is too low, most features will stay Tentative, as in your output.

b4) Try a different random_state, because the results vary.

feat_selector = BorutaPy(rf,
    n_estimators='auto',
    max_iter=300,
    verbose=2,
    random_state=42,
    alpha=0.01)

If it still scores all features as Tentative, you can rank the features with the SHAP method instead and set a threshold there, so the features to exclude will show up explicitly.
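A sketch of that fallback idea (my own illustration; I use scikit-learn's permutation importance as a stand-in for the score, but the same keep-above-a-threshold recipe applies to mean |SHAP| values from the shap package):

```python
# Sketch: rank features by an importance score and keep those above a
# threshold. Synthetic data; permutation importance stands in for SHAP.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
scores = result.importances_mean

threshold = 0.01  # chosen by inspection; tune for your own data
keep = np.where(scores > threshold)[0]
print(keep)  # indices of features to retain
```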