
I am looking at this tutorial: https://www.dataquest.io/mission/75/improving-your-submission

In section 8, on finding the best features, it shows the following code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

What is k=5 doing here, since it is never used? (The graph still lists all of the features, whether I use k=1 or k="all".) How does it determine the best features, and are they independent of the method one wants to use (logistic regression, random forests, or whatever)?

user

2 Answers


The SelectKBest class just scores the features using a function (in this case f_classif, but it could be others) and then "removes all but the k highest scoring features": http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest

So it's kind of a wrapper; the important thing here is the function you use to score the features.

For other feature selection techniques in sklearn read: http://scikit-learn.org/stable/modules/feature_selection.html

And yes, f_classif and chi2 are independent of the predictive method you use.
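To see this concretely, here is a minimal sketch (assuming the titanic DataFrame and the predictors list from your question): the scores and p-values are computed for every feature regardless of k, which is why your plot always shows all of them; k only determines which features get_support() and transform() keep.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Scores and p-values exist for *all* features, whatever k is,
# so plotting them looks the same for k=1, k=5, or k="all".
print(selector.scores_)

# get_support() is a boolean mask marking the k highest-scoring features.
mask = selector.get_support()
print([name for name, keep in zip(predictors, mask) if keep])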

pgalilea

The k parameter is important if you use selector.fit_transform(), which returns a new array in which the feature set has been reduced to the best k features.
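For example, a minimal sketch (again assuming the titanic DataFrame and predictors list from the question):

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=5)

# fit_transform scores the features and drops all but the k best,
# so the returned array has only 5 columns.
X_new = selector.fit_transform(titanic[predictors], titanic["Survived"])
print(X_new.shape)  # (number of rows, 5)

You would then fit your model (logistic regression, random forest, etc.) on X_new instead of the full feature set.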

Ethan