I will start off by saying that I don't have a concrete understanding of what's under the hood of an SVM classifier.
I am interested in using an SVM with the RBF kernel to train a two-class classifier. However, I find that training (and even prediction) takes a lot of time when working with the RBF kernel (I have been using libsvm in MATLAB and sklearn in Python).
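For reference, this is roughly what my current sklearn setup looks like (the data here is just a toy stand-in, and the C and gamma values are placeholders, not my real settings):

    import numpy as np
    from sklearn.svm import SVC

    # toy stand-in for my real two-class data
    rng = np.random.RandomState(0)
    X_train = rng.randn(1000, 20)
    y_train = (X_train[:, 0] > 0).astype(int)
    X_test = rng.randn(500, 20)

    # placeholder C and gamma; these are the steps that get slow on my real data
    clf = SVC(kernel='rbf', C=1.0, gamma=0.1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)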
My question is: is it possible to project my data into the higher-dimensional space using the RBF kernel on its own, and then apply a linear SVM to the transformed data? That is, would this yield the same results as using an RBF SVM, as long as I use the same C and gamma? I am not too sure how kernels are applied, so I hope this part makes sense.
If that is true, then I could pre-process the data into the higher-dimensional feature space using the RBF kernel once, and make training and prediction much faster by using a simple linear classifier (rough sketch of the idea below).
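To make the idea concrete, here is roughly what I have in mind in sklearn, using RBFSampler as an approximate explicit map (I don't know whether this gives the same result as the exact RBF SVM, which is essentially my question; the C, gamma and n_components values are just placeholders):

    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)
    X_train = rng.randn(1000, 20)
    y_train = (X_train[:, 0] > 0).astype(int)
    X_test = rng.randn(500, 20)

    # map the data into an (approximate) RBF feature space once...
    feature_map = RBFSampler(gamma=0.1, n_components=500, random_state=0)
    X_train_mapped = feature_map.fit_transform(X_train)
    X_test_mapped = feature_map.transform(X_test)

    # ...then train and predict with a plain linear SVM on the mapped data
    clf = LinearSVC(C=1.0)
    clf.fit(X_train_mapped, y_train)
    y_pred = clf.predict(X_test_mapped)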
During prediction on a test set of T_s = 50,000 observations, it takes about 6 minutes, which is too slow for my purposes. I know I could parallelise the prediction, but I was wondering whether it would be possible to store the full kernel matrix beforehand.
Can you please provide more details on how much space would be needed to store this for an N x D data set? Will it be N^2 x D? Also, does the prediction time scale linearly as T_s increases? (A rough back-of-envelope of what I mean is sketched below.)
– Sooshii Jul 25 '14 at 04:31
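A rough back-of-envelope for the storage question, assuming 8-byte double-precision kernel entries and made-up values for N, T_s and the number of support vectors:

    N = 100000       # training points (made-up)
    T_s = 50000      # test points
    n_sv = 20000     # support vectors after training (made-up)

    bytes_per_entry = 8                               # float64
    train_kernel_gb = bytes_per_entry * N * N / 1e9   # full N x N training kernel
    test_kernel_gb = bytes_per_entry * T_s * n_sv / 1e9  # T_s x n_SV block at prediction

    print("Full N x N training kernel: %.1f GB" % train_kernel_gb)
    print("T_s x n_SV kernel needed at prediction: %.1f GB" % test_kernel_gb)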