I'm working on a small dataset with many features, most of which are just garbage. The goal is to get good classification accuracy on this binary classification task.
So, I made a small example illustrating the problem. The code simply creates a binary dataset with a lot of random features and one useful feature for class label 1. Then I perform a simple model selection via grid search on a linear SVM. The problem is that the classification accuracy is very poor, or even random. (I also tried with a StratifiedKFold, with equal results; a sketch of that variant is after the code below.)
So, why does the SVM struggle so much to find the good pattern? This could also be a problem of the small number of samples, but I can't increase the dataset.
P.S. I would like to solve the problem (if a solution exists) without feature selection.
import numpy as np
from sklearn.utils import shuffle
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedShuffleSplit
# dataset: class 0 is pure noise; class 1 has one informative feature (column 100 set to 1)
N = 10
X1 = np.random.rand(N,500)
X2 = np.random.rand(N,500)
X2[:,100] = 1
data = np.concatenate((X1,X2), axis=0)
labels = np.concatenate((np.zeros(N),np.ones(N)))
# shuffle and normalization
data, labels = shuffle(data, labels)
scaler = preprocessing.StandardScaler().fit(data)
data_n = scaler.transform(data)
# CV
sss = StratifiedShuffleSplit(labels, n_iter=100, test_size=0.4, random_state=0)
clf = GridSearchCV(SVC(kernel='linear'), {'C': np.logspace(-4,2,100)}, cv=sss)
clf.fit(data_n, labels)
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() * 2, params))
Thanks.