21

The kernel trick is used in several machine learning models (e.g., the SVM). It was first introduced in the 1964 paper "Theoretical foundations of the potential function method in pattern recognition learning."

The Wikipedia definition says that it is

a method for using a linear classifier algorithm to solve a non-linear problem by mapping the original non-linear observations into a higher-dimensional space, where the linear classifier is subsequently used; this makes a linear classification in the new space equivalent to non-linear classification in the original space.

One example of a linear model that has been extended to non-linear problems is kernel PCA. Can the kernel trick be applied to any linear model, or does it have certain restrictions?
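For concreteness, here is a minimal sketch (Python/NumPy, purely illustrative and not part of the original question) of the mapping described in the definition above: the quadratic kernel $k(x,z) = (x \cdot z)^2$, computed directly in the original 2-D space, equals an ordinary dot product after an explicit feature map into a 3-D space.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def k(x, z):
    """Quadratic kernel evaluated directly in the original space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))  # 16.0 -- dot product in the mapped 3-D space
print(k(x, z))                 # 16.0 -- same value, without ever mapping explicitly
```

A linear classifier trained on the mapped data is therefore non-linear in the original coordinates, but the kernel lets us avoid computing the map at all.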

Danica
  • 24,685
Shane
  • 12,461
  • 2
    BTW, kernels are not really essential for SVMs. The "heart" of the SVM is the principle of soft margin maximization. Going to the kernel representation makes your problem's dimensionality O(m^2) instead of O(d), where m is the number of examples and d is the dimension of your feature space, so if m^2 is more than d, you might be better off doing away with kernels: http://jmlr.csail.mit.edu/papers/v6/keerthi05a.html – Yaroslav Bulatov Aug 27 '10 at 17:53
  • @Yaroslav: Thanks for the reference. Are you aware of any implementations of that "Modified Finite Newton Method"? – Shane Sep 01 '10 at 15:23
  • no, but Keerthi and Langford's pages have links to some software that may be related, since they both worked at Yahoo Research – Yaroslav Bulatov Sep 02 '10 at 03:53

4 Answers

18

The kernel trick can only be applied to linear models where the examples in the problem formulation appear as dot products (Support Vector Machines, PCA, etc).
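To make that concrete, here is a minimal sketch (Python/NumPy, illustrative only and not part of the original answer) of a dual-form perceptron: the training examples enter the decision function only through dot products $x_i \cdot x$, so that dot product can be replaced by any kernel $k(x_i, x)$ without touching the rest of the algorithm.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    """RBF kernel; swapping this in for np.dot is the kernel trick."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, kernel=rbf, epochs=20):
    """Dual perceptron: examples appear only inside kernel evaluations."""
    n = len(X)
    alpha = np.zeros(n)                                   # one dual coefficient per example
    K = np.array([[kernel(a, b) for b in X] for a in X])  # Gram matrix
    for _ in range(epochs):
        for i in range(n):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:   # misclassified example
                alpha[i] += 1.0                           # mistake-driven update
    return alpha

# XOR-like toy data: not linearly separable in the original space,
# but separable after the implicit RBF feature mapping
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
alpha = kernel_perceptron(X, y)
```

Models whose formulation cannot be rewritten so that inputs appear only through such inner products cannot be kernelized this way.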

ebony1
  • 2,203
  • @Yaroslav, @chl, @Shane I've started this topic on chat; I think it is a more comfortable place to discuss than the comments ;-) http://chat.meta.stackoverflow.com/rooms/170/stats –  Aug 27 '10 at 22:39
  • When you say "can only be applied to linear models", do you mean that you can prove that no application exists to anything other than linear models? – robin girard Aug 28 '10 at 08:14
  • What do you mean by "examples"? As you state it, it seems to be something that can be interpreted as a dot product. – robin girard Aug 28 '10 at 08:16
7

Two further references from B. Schölkopf:

and a website dedicated to kernel machines.

chl
  • 53,725
2

@ebony1 gives the key point (+1). I was a co-author of a paper discussing how to kernelize generalised linear models, e.g. logistic regression and Poisson regression; it is pretty straightforward.

G. C. Cawley, G. J. Janacek and N. L. C. Talbot, "Generalised kernel machines," in Proceedings of the IEEE/INNS International Joint Conference on Neural Networks (IJCNN-2007), pages 1732-1737, Orlando, Florida, USA, August 12-17, 2007.

I also wrote a (research quality) MATLAB toolbox (sadly no instructions), which you can find here.

Being able to model the target distribution is pretty useful in uncertainty quantification etc., so it is a useful (if fairly incremental) addition to kernel learning methods.
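As a rough illustration of the idea (Python/NumPy, not taken from the paper above, which uses a different fitting procedure): a kernelized GLM replaces the linear predictor $w \cdot x$ with a kernel expansion $f(x) = \sum_j \alpha_j k(x, x_j)$. Here is a minimal gradient-descent sketch for kernel logistic regression.

```python
import numpy as np

def fit_kernel_logistic(K, y, lam=1e-2, lr=0.1, iters=2000):
    """Kernel logistic regression fitted by gradient descent.
    K: (n, n) Gram matrix, y: labels in {0, 1}, lam: regularization strength."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))     # predicted probabilities
        grad = K @ (p - y) + lam * (K @ alpha)   # gradient of penalized neg. log-likelihood
        alpha -= lr * grad / n
    return alpha
```

Prediction on a new point then needs only kernel evaluations against the training set, exactly as in the SVM case.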

Dikran Marsupial
  • 54,432
  • 9
  • 139
  • 204
1

I wonder whether we have an elephant in the room here, namely kernel ridge regression, to add to the examples already cited.

  1. Ridge regression is a regularized version of linear regression with regularization parameter $\lambda$ (weight decay). When $\lambda=0$, ridge regression and linear regression are equivalent. It is a linear model.

  2. Kernel ridge regression is a kernelized version of ridge regression. If we again set $\lambda=0$, we get a kernelized version of linear regression (probably not very useful).

I think adding this example to the list may be helpful for the reader of this question.
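To make the example concrete, here is a minimal sketch (Python/NumPy, added for illustration and not part of the original answer) of point 2: the dual solution of kernel ridge regression is $\alpha = (K + \lambda I)^{-1} y$, and predictions are $f(x) = \sum_i \alpha_i\, k(x_i, x)$.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF Gram matrix between the rows of A and the rows of B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def fit_krr(X, y, lam=1e-2, gamma=1.0):
    """Solve (K + lam * I) alpha = y for the dual coefficients."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_new, gamma=1.0):
    """f(x) = sum_i alpha_i k(x_i, x)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# toy 1-D regression example
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 30)[:, None]
y = np.sin(2.0 * np.pi * X).ravel() + 0.1 * rng.standard_normal(30)
alpha = fit_krr(X, y, lam=1e-2, gamma=10.0)
y_hat = predict_krr(X, alpha, X, gamma=10.0)
```

Setting lam=0 would recover the unregularized case mentioned in point 2, although the linear system is then often ill-conditioned.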

Thomas
  • 880
  • 4
  • 16
  • Kernel ridge regression is not really an elephant in the room; it is quite a well-known model in ML (https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=a80bb875563cf3cd2e08b6eec36654fdfe37a92a), but better known as the Least-Squares Support Vector Machine (e.g. https://link.springer.com/article/10.1023/a:1018628609742) and also as Kernel Fisher Discriminant analysis. It is implemented in the toolbox I mention in my answer and is one of the examples in the paper. – Dikran Marsupial Aug 10 '23 at 08:56
  • A nice feature of KRR is that it supports the "ridge trace" method for efficiently finding the regularisation parameter https://link.springer.com/article/10.1023/a:1018628609742 – Dikran Marsupial Aug 10 '23 at 09:00
  • 1
    Thanks. The derivation of KRR that I know does not involve SVMs, which is perhaps why I thought it was an independent method. I will check the paper; thanks for pointing it out again. – Thomas Aug 10 '23 at 09:54
  • It is a pity that IIRC the KRR paper came out first, and it is a shame that the algorithm isn't often known by that name. – Dikran Marsupial Aug 10 '23 at 10:00