Kernel methods are very effective in many supervised classification tasks. What are the limitations of kernel methods, and when should they be used? In particular, in the era of large-scale data, what are the advantages of kernel methods? And what is the difference between kernel methods and multiple instance learning?
If the data is 500x10000, where 500 is the number of samples and 10000 is the dimension of each feature vector, can we use kernel methods in this circumstance?
1 Answer
Kernel methods can be used for supervised and unsupervised problems. Well-known examples are the support vector machine and kernel spectral clustering, respectively.
Kernel methods provide a structured way to use a linear algorithm in a transformed feature space, where the transformation is typically nonlinear and maps to a higher-dimensional space. The key advantage of this so-called kernel trick is that nonlinear patterns can be found at a reasonable computational cost.
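As a small sketch of the kernel trick (assuming NumPy; the degree-2 polynomial kernel and its explicit feature map are illustrative choices, not something specific from this answer): evaluating $(\mathbf{x}^\top\mathbf{z})^2$ directly gives the same value as an ordinary dot product in the explicit feature space of all pairwise products, without ever constructing that space.

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.standard_normal(3), rng.standard_normal(3)

# Kernel trick: evaluate the degree-2 polynomial kernel directly in input space.
k_direct = (x @ z) ** 2

# Versus explicitly mapping to the 9-dimensional feature space of all
# pairwise products x_i * x_j, then taking an ordinary dot product.
phi = lambda v: np.outer(v, v).ravel()
k_explicit = phi(x) @ phi(z)

print(np.isclose(k_direct, k_explicit))  # True
```

For higher degrees (or the RBF kernel, whose feature space is infinite-dimensional) the explicit map becomes intractable while the kernel evaluation stays cheap.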
Note that I said the computational cost is reasonable, but not negligible. Kernel methods typically construct a kernel matrix $\mathbf{K} \in \mathbb{R}^{N\times N}$ with $N$ the number of training instances. The complexity of kernel methods is therefore a function of the number of training instances, rather than the number of input dimensions. Support vector machines, for example, have a training complexity between $O(N^2)$ and $O(N^3)$. For problems with very large $N$, this complexity is currently prohibitive.
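To make the scaling concrete, here is a minimal NumPy sketch (the gamma value and random data are placeholders): building an RBF kernel matrix for data of the question's shape produces a $500 \times 500$ matrix whose size depends only on $N$, not on the 10000 input dimensions.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # Pairwise squared Euclidean distances via the expansion
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j,
    # clamped at zero to absorb floating-point noise.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

X = np.random.default_rng(1).standard_normal((500, 10000))
K = rbf_kernel_matrix(X)
print(K.shape)  # (500, 500): grows with N^2, independent of the input dimension
```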
This makes kernel methods very interesting from a computational perspective when the number of dimensions is large and the number of samples is relatively low (say, less than 1 million).
Related:
- Linear kernel and non-linear kernel for support vector machine?
- SVM for Large Scale Problems
For very high dimensional problems, such as the 10000 dimensions you mention in the question, there is often no need to map to a higher dimensional feature space. The input space is already good enough. For such problems, linear methods are orders of magnitude faster with almost the same predictive performance. Examples of these methods can be found in LIBLINEAR or Vowpal Wabbit.
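A hedged sketch of this comparison, assuming scikit-learn is available (its LinearSVC wraps LIBLINEAR and SVC wraps LIBSVM; the synthetic data is a stand-in for the question's 500x10000 setting, shrunk so the snippet runs quickly):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# High-dimensional synthetic data: few samples, many features.
X, y = make_classification(n_samples=500, n_features=2000,
                           n_informative=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# LinearSVC (LIBLINEAR backend): linear decision boundary in input space.
linear = LinearSVC(dual=True, max_iter=10000).fit(X_tr, y_tr)

# SVC with RBF kernel (LIBSVM backend): implicit nonlinear feature map.
rbf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

print("linear:", linear.score(X_te, y_te), "rbf:", rbf.score(X_te, y_te))
```

On data like this, the two accuracies are typically close, while the linear model trains much faster as $N$ grows.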
Linear methods are particularly interesting when you have many samples in a high-dimensional input space. When you have only $500$ samples, using a nonlinear kernel method will also be cheap (since $N$ is small). If you had, say, $5{,}000{,}000$ samples in $10{,}000$ dimensions, kernel methods would be infeasible.
For low-dimensional problems with many training instances (so-called large $N$ small $p$ problems), linear methods may yield poor predictive accuracy. For such problems, ensemble methods such as EnsembleSVM provide nonlinear decision boundaries at significantly reduced computational cost compared to standard SVM.
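EnsembleSVM itself is a C++ library, but the underlying idea can be sketched with scikit-learn's bagging (this is an illustration of the ensemble approach, not EnsembleSVM's actual API): each base SVM is trained on a small bootstrap subsample, so no single $O(N^2)$–$O(N^3)$ solve ever touches all $N$ points.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Large-N, low-p setting: many samples, few dimensions.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# Ten RBF-kernel SVMs, each trained on only 500 bootstrap samples;
# predictions are aggregated by majority vote.
bag = BaggingClassifier(SVC(kernel="rbf"), n_estimators=10,
                        max_samples=500, random_state=0).fit(X, y)
print(bag.score(X, y))
```

Each base learner costs at most $O(500^3)$ rather than $O(5000^3)$, and the base models can be trained independently in parallel.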
… the RBF kernel in libsvm always overfits: the classifier achieves high accuracy on the training set but low accuracy on the testing set. If I do dimension reduction before classification, and the reduced dimension is close to the number of training samples, the classifier may achieve a good balance between the training and testing sets. Do these results fit most empirical results? Thanks. – mining Oct 28 '13 at 13:06

… gamma for the RBF kernel. The optimal value for gamma is related to the number of input dimensions. The most common tuning approach is cross-validation. If you used the same value for gamma with and without dimensionality reduction, you are probably making a mistake. – Marc Claesen Oct 28 '13 at 13:10

… grid.py in the libsvm package to do cross-validation. And in most circumstances, for high-dimensional data, gamma is always very small, such as 0.00001. – mining Oct 28 '13 at 13:14

… EnsembleSVM, does it need to make the cross-validation procedure multithreaded? And I think in the prediction stage, it would be good to predict huge data in batches, using multithreading or multiple machines. – mining Oct 28 '13 at 13:46

… esvm-train and esvm-predict. To disable multithreading, use the following flag in those tools: -threads 1. – Marc Claesen Oct 28 '13 at 13:48

… the libsvm package seems to have no multithreading, but I think it's very important in large-scale data processing. – mining Oct 28 '13 at 13:53

libsvm does indeed not offer multithreading. Training a single SVM is not straightforward to parallelize. – Marc Claesen Oct 28 '13 at 13:57

… EnsembleSVM, I will give feedback to you. – mining Oct 28 '13 at 14:03