
I'm working on a small data set with many features, most of which are just garbage. The goal is to get good classification accuracy on this binary classification task.

So, I made up a small example code illustrating the problem. The code simply creates a binary data set with a lot of random features and one feature that is informative for class label 1. Then I perform a simple model selection via grid search on a linear SVM. The problem is that the classification accuracy is very poor or even at chance level. (I also tried a StratifiedKFold with equal results.)

So, why does the SVM struggle so much to find the informative pattern? It could also be a problem of the small number of samples, but I can't make the dataset any larger.

P.S. I would like to solve the problem (if a solution exists) without feature selection.

import numpy as np
from sklearn.utils import shuffle
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# dataset: 500 noisy features; only column 100 is informative for class 1
N = 10
X1 = np.random.rand(N,500)
X2 = np.random.rand(N,500)
X2[:,100] = 1
data = np.concatenate((X1,X2), axis=0)
labels = np.concatenate((np.zeros(N),np.ones(N)))

# shuffle and normalization
data, labels = shuffle(data, labels)
scaler = preprocessing.StandardScaler().fit(data)                                      
data_n = scaler.transform(data)

# CV: stratified shuffle splits, grid search over C
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.4, random_state=0)
clf = GridSearchCV(SVC(kernel='linear'), {'C': np.logspace(-4, 2, 100)}, cv=sss)
clf.fit(data_n, labels)
for mean_score, std_score, params in zip(clf.cv_results_['mean_test_score'],
                                         clf.cv_results_['std_test_score'],
                                         clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, std_score * 2, params))

Thanks.

1 Answer

TL;DR: Garbage in, garbage out. Selecting better features will promote a better model. (Sometimes the answer really is that simple!) What follows is a description of one path forward to selecting higher-quality features in the context of fitting an SVM.

SVM performance can suffer when presented with many garbage features because the model only works with the data through the kernel function, rather than working with the features directly as in a more traditional regression analysis. I'll illustrate this by comparing the standard linear kernel with the so-called "automatic relevance determination" (ARD) approach.

The standard linear kernel function is $K_1(x,x^\prime)=x^Tx^\prime.$ All features contribute to the output of $K_1$: first we compute the element-wise product, then we sum the products. There's no step that assesses which components of $x$ are more useful than others.
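To make this concrete, here is a minimal NumPy sketch of $K_1$ (the variable names are mine):

import numpy as np

def linear_kernel(x, x_prime):
    # element-wise product, then sum: every feature contributes,
    # informative or not -- there is no weighting step
    return np.sum(x * x_prime)

rng = np.random.RandomState(0)
x, x_prime = rng.rand(500), rng.rand(500)
print(linear_kernel(x, x_prime))  # identical to np.dot(x, x_prime)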

We could, if we were so inclined, include a scalar factor $\gamma$ to yield $K_2(x,x^\prime)=x^T\gamma x^\prime,$ but a scalar $\gamma$ simply has the effect of re-scaling $C$, so there are contours of equal performance quality in $(\gamma,C)$ space.
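If you want to check that equivalence numerically, here is a quick sketch using SVC's callable-kernel interface (the toy data and the value of $\gamma$ are arbitrary choices of mine); the two decision functions should agree up to the solver's tolerance:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = rng.rand(40, 5), rng.randint(0, 2, 40)
gamma, C = 10.0, 1.0

# K_2(x, x') = gamma * x^T x', supplied as a callable Gram-matrix function
svc_scaled_kernel = SVC(kernel=lambda A, B: gamma * A @ B.T, C=C).fit(X, y)
# plain linear kernel with C rescaled by gamma instead
svc_scaled_C = SVC(kernel='linear', C=gamma * C).fit(X, y)

# should be ~0: the two fits are equivalent up to solver tolerance
print(np.max(np.abs(svc_scaled_kernel.decision_function(X)
                    - svc_scaled_C.decision_function(X))))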

But if we replace $\gamma$ with a diagonal, symmetric positive semi-definite (SPSD) matrix $\Gamma$, we have $K_3(x,x^\prime)=x^T\Gamma x^\prime.$ We can think of this as estimating a coefficient for each entry in $x$, i.e. each feature. We may interpret diagonal elements of $\Gamma$ nearer to zero as contributing relatively little to the classification output, while diagonal elements larger in absolute value contribute more to the output. On the one hand, for $d$ features, you now have $d+1$ tuning parameters (each diagonal element of $\Gamma$ plus $C$), but on the other, you may submit all features directly to the SVM.
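As a sketch of how $K_3$ could be plugged into scikit-learn, here is one way to encode a diagonal $\Gamma$ as a vector of per-feature weights and pass the resulting kernel to SVC; the particular weights below are hard-coded placeholders, whereas in practice they would be tuned:

import numpy as np
from sklearn.svm import SVC

def ard_linear_kernel(gamma_diag):
    # K_3(x, x') = x^T diag(gamma_diag) x', evaluated for all pairs of rows
    def kernel(A, B):
        return (A * gamma_diag) @ B.T
    return kernel

# toy data in the spirit of the question: only column 100 is informative
rng = np.random.RandomState(0)
X, y = rng.rand(20, 500), np.array([0] * 10 + [1] * 10)
X[y == 1, 100] = 1.0

gamma_diag = np.full(500, 1e-3)   # downweight the garbage features...
gamma_diag[100] = 1.0             # ...and keep the informative one

clf = SVC(kernel=ard_linear_kernel(gamma_diag), C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy with the weighted kernel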

This procedure can be further generalized to non-diagonal, but still SPSD, $\Gamma$ to admit nonzero correlation among features. This will yield $\frac{d(d+1)}{2}+1$ tuning parameters, which rapidly becomes unattractive as $d$ grows.
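For completeness, the full-matrix version only changes one line of the sketch above (here $\Gamma$ is any SPSD matrix you supply):

import numpy as np

def ard_full_kernel(Gamma):
    # K(x, x') = x^T Gamma x' for a general SPSD Gamma
    def kernel(A, B):
        return A @ Gamma @ B.T
    return kernel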

Finally, this ARD approach may be extended to other kernels. The RBF kernel depends on its inputs only through the squared Euclidean distance, so we may write $K_4(x,x^\prime)=\exp\left(-\frac{(x-x^\prime)^T\Gamma(x-x^\prime)}{\sigma}\right),$ and in general we can replace any squared Euclidean distance with $(x-x^\prime)^T\Gamma(x-x^\prime).$
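A corresponding sketch of the ARD-style RBF kernel (again with placeholder weights, and written for clarity rather than speed):

import numpy as np

def ard_rbf_kernel(gamma_diag, sigma=1.0):
    # exp(-(x - x')^T diag(gamma_diag) (x - x') / sigma) for all pairs of rows
    def kernel(A, B):
        diff = A[:, None, :] - B[None, :, :]
        sq_dist = np.einsum('ijk,k,ijk->ij', diff, gamma_diag, diff)
        return np.exp(-sq_dist / sigma)
    return kernel

# usable in the same way as above, e.g. SVC(kernel=ard_rbf_kernel(gamma_diag))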

P.S. I would like to solve the problem (if a solution exists) without feature selection.

So... you want to find a subset of features which have high predictive value, but you don't want to do "feature selection"? Perhaps I'm being dense, but it sounds like what you're after is a contradiction in terms. I suppose what an acceptable solution would look like to you depends on how expansive your definition of "feature selection" is.

  • Very good explanation, thanks. So, by including a symmetric positive semi-definite matrix in the linear kernel we should be able to shrink the garbage features, but how do you calculate the diagonal coefficients of the matrix? Concerning the P.S.: firstly, I would like to have good classification accuracy. Afterwards, if the score is good, I will be able to extract the useful feature from the garbage, for instance by looking at the weights of the hyperplane, if that makes sense. – lcit Jul 30 '15 at 17:57
  • You tune $\Gamma$ at the same time that you tune $C$. If $d>2$, I'd recommend a more sophisticated hyperparameter estimation algorithm than grid search. – Sycorax Jul 30 '15 at 17:59
  • Re: feature selection: If you only take features with large $\Gamma_{ii}$ for diagonal $\Gamma$, you'll need to re-fit the model with just that subset of features to get accurate performance estimates. Be sure that you perform CV correctly to avoid information leakage across steps. – Sycorax Jul 30 '15 at 18:06
  • I will give it a try, but it seems too expensive. Perhaps an initialisation using, for instance, the Pearson correlation or the mutual information would help. – lcit Jul 30 '15 at 18:19
  • It is expensive. Packages like Optunity and DiceOptim implement global optimization algorithms. I haven't used either. – Sycorax Jul 30 '15 at 18:24