I am using scikit-learn Random Forest Classifier and I want to plot the feature importance such as in this example.
However, my result is completely different: the feature importance standard deviation is almost always bigger than the feature importance itself (see attached image).
Is it possible to have this kind of behaviour, or am I making a mistake somewhere when plotting it?
My code is the following:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# predictors is a pandas DataFrame (m x n), outcome is a pandas DataFrame (m x 1)
clf = RandomForestClassifier()
clf.fit(predictors.values, outcome.values.ravel())

importance = pd.DataFrame(clf.feature_importances_,
                          index=predictors.columns,
                          columns=["Importance"])
# per-tree standard deviation of the importances
importance["Std"] = np.std([tree.feature_importances_
                            for tree in clf.estimators_], axis=0)

x = range(importance.shape[0])
y = importance["Importance"]   # .ix is deprecated/removed; use label indexing
yerr = importance["Std"]

plt.bar(x, y, yerr=yerr, align="center")
plt.show()
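For a self-contained check of the same pipeline, here is a minimal sketch on synthetic data (an assumption: `make_classification` stands in for the original `predictors`/`outcome`, which are not shown in the question). It computes the per-tree importances and their standard deviation exactly as above, without the plotting step:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data for the question's predictors/outcome.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
predictors = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(predictors.values, y)

# One row of importances per tree in the forest.
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])
mean_imp = per_tree.mean(axis=0)
std_imp = per_tree.std(axis=0)

# With few trees, the per-tree std of a weak feature can exceed its mean
# importance, so error bars larger than the bars themselves are plausible.
print(pd.DataFrame({"Importance": mean_imp, "Std": std_imp},
                   index=predictors.columns))
```

Since each tree's `feature_importances_` sums to 1, the mean across trees does too, while the spread across trees depends on how consistently each feature is chosen for splits.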

Comments:

- `predictors` returns a numpy array which you are referencing to a pandas DataFrame object by its columns, which is incorrect, as numpy arrays do not have the attribute `columns`. – Nickil Maveli Aug 05 '16 at 10:31
- `predictors` is a pandas DataFrame with shape m x n and m x 1. It should be clear now. – gc5 Aug 05 '16 at 10:34
- …tsfresh that helped me identify relevant features and cut my features from 600+ to around 400.
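The point raised in the first comment can be checked in two lines (a small sketch with hypothetical data): `.values` yields a plain NumPy array, which has no `columns` attribute, while the DataFrame itself does.

```python
import pandas as pd

# Hypothetical two-column frame standing in for predictors.
predictors = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(hasattr(predictors, "columns"))         # DataFrame: True
print(hasattr(predictors.values, "columns"))  # ndarray: False
```

So indexing the result by `predictors.columns` works only as long as `predictors` really is a DataFrame, as gc5 confirms.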