
I was expecting the scores_ provided by SelectKBest() to be the result of the score_func (e.g. the value of the F statistic itself when score_func = f_classif, or the chi2 statistic when score_func = chi2).

But it is apparently not the case. I get

selector = SelectKBest(f_classif, k=2)
selector.fit(x_train, y_train)

pd.DataFrame({'variable': x_train.columns,
              'score_f_anova': selector.scores_})
     variable  score_f_anova
0  variable_1      43.525376
1  variable_2      28.432673

which is different from the F statistics I get from:

from scipy.stats import f_oneway
print(f_oneway(x_train['variable_1'], y_train)[0])
print(f_oneway(x_train['variable_2'], y_train)[0])
141.423192895752
90.45067843845831

Therefore:

  • How is scores_ computed?
  • Which of scores_ and pvalues_ should I use to rank the features by importance?

2 Answers


f_classif and f_oneway produce the same results but differ in implementation and use.

First, recall that 1-way ANOVA tests the null hypothesis that samples in two or more classes have the same population mean. In your case, I suppose y_train is an array-like categorical variable containing some classes, and x_train is an array-like or dataframe containing features. Then the right way to pass sample measurements for each class into f_oneway is

f_oneway(x_train[y_train == 'class1'], x_train[y_train == 'class2'], # ...,
         x_train[y_train == 'classk'])

or more conveniently

f_oneway(*[x_train[y_train == c] for c in np.unique(y_train)])

f_classif and other scoring functions from scikit-learn require both features and labels:

f_classif(x_train, y_train)

Example:

>>> import numpy as np
>>> from scipy.stats import f_oneway
>>> from sklearn.datasets import make_classification
>>> from sklearn.feature_selection import f_classif
>>> n_samples, n_features, n_classes = 1000, 4, 3
>>> X, y = make_classification(n_samples=n_samples, n_features=n_features,
...     n_classes=n_classes, n_informative=3, n_redundant=1,
...     scale=0.2, random_state=0)
>>> f_tuple = f_oneway(*[X[y == k] for k in range(n_classes)])
>>> f_stat, f_prob = f_classif(X, y)
>>> np.allclose(f_tuple.statistic, f_stat), np.allclose(f_tuple.pvalue, f_prob)
(True, True)
>>> f_stat.shape[0] == n_features
True

Finally, SelectKBest stores exactly what the provided score_func returns: the scores go into the scores_ attribute (and the p-values into pvalues_). So your selector's scores are correct and will equal f_oneway's output if you pass the arguments properly.
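
For instance, a quick check (a minimal sketch reusing X, y, np, and f_classif from the example above):

>>> from sklearn.feature_selection import SelectKBest
>>> selector = SelectKBest(f_classif, k=2).fit(X, y)
>>> np.allclose(selector.scores_, f_classif(X, y)[0])
True
>>> np.allclose(selector.pvalues_, f_classif(X, y)[1])
True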


If you rank features manually, it is up to you whether to rely on scores or p-values (a minimal manual-ranking sketch follows the example below). But if you apply scikit-learn's feature selection techniques, it depends on the implementation: SelectKBest and SelectPercentile rank by scores, while SelectFpr, SelectFwe, and SelectFdr rank by p-values. If a scoring function returns p-values, you can use the latter three selectors with it, together with an alpha threshold (the highest p-value a feature may have and still be kept).

Example:

>>> from sklearn.feature_selection import SelectFpr
>>> s = SelectFpr(score_func=f_classif, alpha=0.01)
>>> X_t = s.fit_transform(X, y)
>>> X_t.shape
(1000, 3) # Recall that one feature was uninformative (redundant)
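
If you do rank manually, here is a minimal sketch reusing f_stat and f_prob from the first example; for f_classif the two orderings agree (up to ties), because the p-value decreases monotonically as the F statistic grows at fixed degrees of freedom:

>>> order_by_score = np.argsort(f_stat)[::-1]   # largest F statistic first
>>> order_by_pvalue = np.argsort(f_prob)        # smallest p-value first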

I would like to add that you can use the following function to plug a scipy.stats test into GenericUnivariateSelect.

import numpy as np

def sk_func_from_scipy(scipy_func):
    # Wrap a scipy.stats test so it matches the (scores, pvalues) signature
    # expected by scikit-learn's univariate feature selectors.
    def sk_func(X, y):
        sc_vect = []
        pval_vect = []
        for i in range(X.shape[1]):
            # Run the test on the i-th feature, split by class
            result = scipy_func(*[X[y == c][:, i] for c in np.unique(y)])
            sc_vect.append(result[0])
            pval_vect.append(result[1])
        return np.array(sc_vect), np.array(pval_vect)
    return sk_func

Then you can use scipy.stats functions such as kruskal or mannwhitneyu:

from scipy import stats
from sklearn.feature_selection import GenericUnivariateSelect

fs = GenericUnivariateSelect(score_func=sk_func_from_scipy(stats.kruskal), mode='k_best', param=5)
X_selected = fs.fit_transform(X, y)

and

fs = GenericUnivariateSelect(score_func=sk_func_from_scipy(stats.mannwhitneyu), mode='fpr', param=1e-5)
X_selected = fs.fit_transform(X, y)

Note that

GenericUnivariateSelect(score_func=sk_func_from_scipy(stats.f_oneway), mode='k_best', param=5)

should give you the same results as

fs = SelectKBest(score_func=f_classif, k=5)
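
As a quick sanity check, here is a sketch assuming X, y, and sk_func_from_scipy are defined as above (and numpy imported as np):

from sklearn.feature_selection import GenericUnivariateSelect, SelectKBest, f_classif

fs_scipy = GenericUnivariateSelect(score_func=sk_func_from_scipy(stats.f_oneway), mode='k_best', param=5).fit(X, y)
fs_sklearn = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print(np.allclose(fs_scipy.scores_, fs_sklearn.scores_))    # expected: True
print(np.allclose(fs_scipy.pvalues_, fs_sklearn.pvalues_))  # expected: True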