In general, adding more and more features (even if they are not necessarily useful for making predictions) tends to improve the accuracy of the trained model on the training data. Once the number of features is equal to or greater than the number of samples used for training, then (not surprisingly) one would observe accuracy close to 100% on the training set (even under cross-validation). This is simply over-fitting.
This can be illustrated using a simple example:
If you have two data points, you can always fit a line passing exactly through both of them. In this example, the fitted parameters would be the slope and intercept of the line [no. of data points = no. of parameters = 2].
On the other hand, if there were three or more non-collinear points, you would have to settle for a least-squares fit. As a result, the linear model would fit the training data with less than 100% accuracy, as sketched below.
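A minimal numpy sketch of this (the data points here are made up purely for illustration):

```python
import numpy as np

# Two points: a line (2 parameters) fits them exactly.
x2, y2 = np.array([0.0, 1.0]), np.array([1.0, 3.0])
slope, intercept = np.polyfit(x2, y2, deg=1)
print(np.allclose(slope * x2 + intercept, y2))   # True: zero training error

# Three non-collinear points: least squares cannot pass through all of them.
x3, y3 = np.array([0.0, 1.0, 2.0]), np.array([1.0, 3.0, 4.0])
slope, intercept = np.polyfit(x3, y3, deg=1)
print(np.allclose(slope * x3 + intercept, y3))   # False: residual error remains
```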
Coming back to the scikit-learn example in your query:
The example restricts the digits dataset to 200 samples, each with 64 features. Additionally, 200 randomly generated (non-informative) features are appended, so that the dataset ends up in a curse-of-dimensionality setting (i.e. no. of features > no. of samples). So one would naturally observe higher accuracy for the 100th-percentile case, where all 264 features are kept. Remember, this higher accuracy comes at the cost of generalization.
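Here is a minimal sketch of that setup, assuming the example in question is scikit-learn's SVM-ANOVA feature-selection demo (the percentile values below are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# First 200 digit samples, each with 64 informative features.
X, y = load_digits(return_X_y=True)
X, y = X[:200], y[:200]

# Append 200 random, non-informative features: 264 features > 200 samples.
rng = np.random.RandomState(0)
X = np.hstack([X, 2 * rng.uniform(size=(X.shape[0], 200))])

# Keep the top `percentile` percent of features (ranked by ANOVA F-score),
# train an SVM on them, and cross-validate the whole pipeline.
for percentile in (10, 50, 100):
    pipe = Pipeline([
        ("anova", SelectPercentile(f_classif, percentile=percentile)),
        ("svc", SVC(gamma="auto")),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"percentile={percentile:3d}: CV accuracy {scores.mean():.2f}")
```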
If you wanted to figure out what fraction of the total 264 features is actually meaningful for prediction, you would have to keep aside a portion of the dataset purely for testing (even when running cross-validation on the rest).
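One way to do that, sketched here under the same 264-feature setup as above: choose the percentile by cross-validation on a training split only, then judge the chosen fraction of features on the held-out test set.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Same 200-sample, 264-feature setup as above.
X, y = load_digits(return_X_y=True)
X, y = X[:200], y[:200]
rng = np.random.RandomState(0)
X = np.hstack([X, 2 * rng.uniform(size=(X.shape[0], 200))])

# Hold out a test split that the selection procedure never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ("anova", SelectPercentile(f_classif, percentile=10)),
    ("svc", SVC(gamma="auto")),
])

# Pick the feature percentile by cross-validation on the training part only...
search = GridSearchCV(
    pipe, param_grid={"anova__percentile": [1, 5, 10, 20, 50, 100]}, cv=5
)
search.fit(X_train, y_train)

# ...then evaluate that choice once, on the untouched test set.
print("best percentile:", search.best_params_["anova__percentile"])
print("held-out accuracy:", search.score(X_test, y_test))
```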