I'm trying to run a quick univariate filter on some data using an independent-samples t-test, since my target is binary. However, when I run the filter with sklearn's SelectKBest, I get the same top features as with my manual filter, but in a different order. The only information about SelectKBest I could find is here and in the documentation, and both suggest it should behave like my manual method. My code is:

import numpy as np
from sklearn.feature_selection import SelectKBest
from scipy.stats import ttest_ind

np.random.seed(0)

data = np.random.random((100,50))
target = np.random.randint(2, size = 100).reshape((100,1))

X = data
y = target.ravel()

k = 10
p_values = []
for i in range(data.shape[1]):

    t, p = ttest_ind(data[:,i], target)
    p_values.append([i,p[0],t[0]])

p_values = sorted(p_values, key = lambda x: x[1])
p_values = p_values[:k]

# Indices of the ranked p-values
idx = [i[0] for i in p_values]

# SelectKBest features
mdl = SelectKBest(ttest_ind, k = k)

X_new = mdl.fit_transform(X, y)
# Manually selected k best features
X_new2=X[:,idx]


# Print first row of sklearn features
print(X_new[0])
array([0.4236548 , 0.96366276, 0.38344152, 0.87001215, 0.63992102,
   0.52184832, 0.41466194, 0.06022547, 0.67063787, 0.31542835])

# Print first row of manually selected features
print(X_new2[0])
array([0.67063787, 0.4236548 , 0.31542835, 0.87001215, 0.38344152,
   0.63992102, 0.06022547, 0.52184832, 0.41466194, 0.96366276])

It seems SelectKBest is not ordering the features based solely on their p-values or their t-values. How does SelectKBest order the features then?

m13op22

1 Answer

SelectKBest and the other *Select* transformers in sklearn.feature_selection do not reorder features; they only drop the ones that were not selected, so the remaining columns keep their original order. In any case, machine learning models generally do not depend on the relative order of features.
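To check this yourself, here is a minimal sketch. It assumes f_classif as the score function (a stand-in for the t-test in the question) and random data shaped like yours; the variable names are only illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Random data shaped like the question's (assumption: 100 rows, 50 features).
rng = np.random.RandomState(0)
X = rng.random_sample((100, 50))
y = rng.randint(2, size=100)

selector = SelectKBest(f_classif, k=10).fit(X, y)

# Column indices of the selected features, in ascending (original) order.
kept = selector.get_support(indices=True)
X_new = selector.transform(X)

# The transformed matrix is just the original columns, in their original order.
print(kept)
print(np.array_equal(X_new, X[:, kept]))
```

If the last line prints True, the transformer only dropped columns and did not rank them.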

If you need to inspect or reorder the selected features, use the scores_ and/or pvalues_ attributes of the fitted transformer (e.g. SelectKBest).
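A rough sketch of that reordering, again assuming f_classif as the score function and made-up data; ranked_idx and X_ranked are hypothetical names, not part of sklearn.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X = rng.random_sample((100, 50))
y = rng.randint(2, size=100)

k = 10
selector = SelectKBest(f_classif, k=k).fit(X, y)
kept = selector.get_support(indices=True)

# Sort the kept columns by score, highest first.
order = np.argsort(selector.scores_[kept])[::-1]
ranked_idx = kept[order]

# Alternative: sort by p-value, smallest first. For f_classif this is
# expected to give the same ranking, since a larger F means a smaller p.
ranked_by_p = kept[np.argsort(selector.pvalues_[kept])]

# Selected features, now ordered from best to worst score.
X_ranked = X[:, ranked_idx]
```

Indexing X directly with ranked_idx, rather than reshuffling the output of transform, keeps the mapping back to the original column numbers explicit.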

Under the hood, those classes use a boolean mask to identify which features to keep. The mask is built from the first (or only) array returned by the score function, which would be the t-values in your case. According to the source code, SelectKBest does not currently use the p-values to select features.
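The mask logic can be approximated by hand. This is only a sketch of the idea, not sklearn's actual implementation: it assumes f_classif as the score function and scores with no ties and no NaNs, which the library handles with extra care.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X = rng.random_sample((100, 50))
y = rng.randint(2, size=100)

k = 10
selector = SelectKBest(f_classif, k=k).fit(X, y)

# Build a mask that is True for the k highest scores; p-values play no part.
manual_mask = np.zeros(X.shape[1], dtype=bool)
manual_mask[np.argsort(selector.scores_)[-k:]] = True

# Compare with the mask sklearn exposes through get_support().
print(np.array_equal(manual_mask, selector.get_support()))
```

Because the result is a mask over the original columns, any ranking information is lost at this point, which is why the selected features come back in column order.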

ebrahimi