7

How to determine feature importance while using xgboost (XGBClassifier or XGBRegressor) in a pipeline?

AttributeError: 'Pipeline' object has no attribute 'get_fscore'

The answer provided here is similar, but I couldn't follow it.

ebrahimi

2 Answers

6

As I found, there are several ways to determine feature importance. First:

print(grid_search.best_estimator_.named_steps["clf"].feature_importances_)

result:

[ 0.14582562  0.08367272  0.06409663  0.07631433  0.08705109  0.03827286
  0.0592836   0.05025916  0.07076083  0.0699278   0.04993521  0.07756387
  0.05095335  0.07608293]

Second (on older versions of xgboost, where the sklearn wrapper exposes the underlying booster via booster()):

print(grid_search.best_estimator_.named_steps["clf"].booster().get_fscore())

result:

{'f2': 1385, 'f11': 1676, 'f12': 1101, 'f6': 1281, 'f9': 1511, 'f7': 1086, 'f5': 827, 'f0': 3151, 'f10': 1079, 'f1': 1808, 'f3': 1649, 'f13': 1644, 'f8': 1529, 'f4': 1881}

Third (on newer versions of xgboost, where booster() has been renamed to get_booster()):

print(grid_search.best_estimator_.named_steps["clf"].get_booster().get_fscore())
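
If you want those generic f0, f1, … keys mapped back to the real column names, here is a minimal sketch (assuming the training data X is a pandas DataFrame, so X.columns holds the names; on older xgboost versions replace get_booster() with booster()):

booster = grid_search.best_estimator_.named_steps["clf"].get_booster()
fscore = booster.get_fscore()  # e.g. {'f0': 3151, 'f1': 1808, ...}
# 'f0' -> X.columns[0], 'f1' -> X.columns[1], ...
named = {X.columns[int(k[1:])]: v for k, v in fscore.items()}
print(sorted(named.items(), key=lambda kv: kv[1], reverse=True))
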
ebrahimi
  • The problem I found is that the resulting feature names were not present in the training data itself; I obviously had to rename f2 to X2 myself. Did you face such an issue? – fixxxer Jun 24 '17 at 08:10
  • @fixxxer First, I am sorry for being late. In the above example my features don't have names, so the result is reported as f1, f2, etc. However, if I use specific names there is no problem. For example, I do this: ln = X.shape; names = ["x%s" % i for i in range(1, ln[1]+1)]; print(sorted(zip(map(lambda x: round(x, 4), grid_search.best_estimator_.named_steps["clf"].feature_importances_), names), reverse=True)). – ebrahimi Jun 28 '17 at 07:55
  • Thanks @ebrahimi. It's an issue with the sklearn wrapper: the feature names are not exposed through the API. What worked for me to get the exact column names was dict(zip(t.feature_names, train.columns)). – fixxxer Jun 28 '17 at 12:00
  • @fixxxer Could you please let me know how to print the sorted features together with their names, and then select the five features with the highest importance? For example, f1:20, f5:17, f7:14, f10:13, f6:10, f3:8, … and then f1, f5, f7, f10, f6. – ebrahimi Dec 14 '17 at 18:41
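
A hedged sketch for the question in the last comment (it assumes names is the list of column names built as in the earlier comment, and that the pipeline step is called "clf"):

importances = grid_search.best_estimator_.named_steps["clf"].feature_importances_
ranked = sorted(zip(importances, names), reverse=True)  # highest importance first
print(ranked)                                           # e.g. [(0.1458, 'x1'), ...]
print([name for score, name in ranked[:5]])             # the five most important features
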
2

Getting a reference to the xgboost object

You should first get the XGBClassifier or XGBRegressor element from the pipeline. You could do this either by getting the n-th element or by specifying the name.

clf = XGBClassifier()
pipe = Pipeline([('other', other_element), ('xgboost', clf)])

To get the XGBClassifier back, you could:

  • use clf if you still have a reference to it
  • index the pipeline by name: pipe.named_steps['xgboost']
  • index the pipeline by location: pipe.steps[1][1] (note that pipe.steps[1] returns the ('xgboost', clf) tuple), as in the sketch below
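
For example, on the pipeline defined above (a minimal sketch; pipe must be fitted before the xgboost step has any importances to report):

xgb = pipe.named_steps['xgboost']  # by name
xgb = pipe.steps[1][1]             # by position: steps[1] is the ('xgboost', clf) tuple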

Getting the importance

Secondly, it seems that importance is not implemented for the sklearn implementation of xgboost. See this github issue. A solution to add this to your XGBClassifier or XGBRegressor is also offered there; it boils down to adding the methods to the class yourself.
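
A hedged sketch of that kind of workaround (the helper name and X_train are illustrative, not taken from the linked issue; it assumes the pipe defined above has been fitted, and that get_booster() is available; on older xgboost releases use booster() instead):

import numpy as np

def xgb_feature_importances(model, n_features):
    # Turn the booster's split counts into a normalised importance array.
    fscore = model.get_booster().get_fscore()  # {'f0': count, 'f1': count, ...}
    importances = np.zeros(n_features)
    for key, count in fscore.items():
        importances[int(key[1:])] = count
    return importances / importances.sum()

print(xgb_feature_importances(pipe.named_steps['xgboost'], n_features=X_train.shape[1]))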

Pieter
  • Thanks. Does 'feature_importances_' do the same as 'get_fscore'? Here (stackoverflow.com/questions/38212649/…) 'feature_importances_' is used. @Pieter – ebrahimi Jan 01 '17 at 14:05
  • If you look at the code you see that it uses get_fscore() internally – Pieter Jan 01 '17 at 15:23