
I am looking at this tutorial: https://www.dataquest.io/mission/75/improving-your-submission

In section 8, on finding the best features, it shows the following code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

What is k=5 doing here, since it is never used? (The graph still lists all of the features, whether I use k=1 or k="all".) How does it determine the best features, and are they independent of the method one wants to use (logistic regression, random forests, or whatever)?

user

2 Answers


The SelectKBest class just scores the features using a function (in this case f_classif, but it could be others) and then "removes all but the k highest scoring features": http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest

So it's kind of a wrapper; the important thing here is the function you use to score the features.

For other feature selection techniques in sklearn read: http://scikit-learn.org/stable/modules/feature_selection.html

And yes, f_classif and chi2 are independent of the predictive method you use.
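To see this concretely, here is a minimal sketch (assuming the titanic DataFrame and the predictors list from your question): the scores and p-values are computed for every feature regardless of k, which is why your plot always shows all of them; k only determines which features get_support() and transform() keep.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Scores and p-values exist for *all* features, whatever k is,
# so plotting them looks the same for k=1, k=5, or k="all".
print(selector.scores_)

# get_support() is a boolean mask marking the k highest-scoring features.
mask = selector.get_support()
print([name for name, keep in zip(predictors, mask) if keep])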

pgalilea

The k parameter is important if you use selector.fit_transform(), which returns a new array in which the feature set has been reduced to the best k features.
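For example, a minimal sketch (again assuming the titanic DataFrame and predictors list from the question):

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=5)

# fit_transform scores the features and drops all but the k best,
# so the returned array has only 5 columns.
X_new = selector.fit_transform(titanic[predictors], titanic["Survived"])
print(X_new.shape)  # (number of rows, 5)

You would then fit your model (logistic regression, random forest, etc.) on X_new instead of the full feature set.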

Ethan