
I am learning kernel methods. Kernel methods are less of a "black box" than neural networks.

Nowadays, it seems neural networks are gaining more popularity and proving more powerful in various applications, such as image processing and RL. Can anyone give some ideas or literature on the advantages of neural networks over kernel methods? Or are there any deficits of kernel methods?

I guess the easy implementation and intuitiveness of neural networks may be an advantage.

  • I don't remember where I read the following claims: kernel methods may work better on middle-sized datasets, but neural networks show a huge empirical advantage on large datasets. – Qinsheng Zhang Jun 17 '20 at 02:53
    This thread is an interesting read, in hindsight. https://stats.stackexchange.com/questions/30042/neural-networks-vs-support-vector-machines-are-the-second-definitely-superior Also: https://stats.stackexchange.com/questions/366581/what-can-deep-neural-networks-do-that-support-vector-machines-cant – Sycorax Jun 17 '20 at 03:40
    "Nowadays, it seems neural networks gain more popularity" popularity is not necessarily based on good reason - a lot of it is fashion/hype. Neural nets as a field have been through multiply hype-bust cycles during my career, where the tasks for which they are suitable become interesting for a while, and then people over-apply them because of the hype and find them dissapointing. Best to have lots of varied tools in your toolbox and know how to choose the right tool and apply them well. – Dikran Marsupial Apr 18 '23 at 13:35

2 Answers


I have been asking the same question myself recently. Intuitively, you can think of kernel methods as obtaining a smooth function that approximates the data, with the smoothness controlled by a few hyperparameters. The fact that we only have a few parameters to tune suggests that there are constraints on the way the function fits the data. As an example, consider the plot below, where I trained an SVM with a Gaussian kernel to classify a cross on a 20x20 dot matrix.

[plot: decision surface of the Gaussian-kernel SVM over the 20x20 grid]
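Here is a minimal sketch of that setup using scikit-learn; the exact cross shape and the C and gamma values below are illustrative choices, not the settings behind the plot:

```python
# Sketch of the experiment above: an RBF-kernel SVM classifying a plus-shaped
# cross on a 20x20 grid. The cross geometry, C and gamma are assumptions.
import numpy as np
from sklearn.svm import SVC

# Build the 20x20 grid of input points.
gx, gy = np.meshgrid(np.arange(20), np.arange(20))
X = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)

# Label a plus-shaped cross in the middle of the grid as class 1.
on_cross = (((np.abs(X[:, 0] - 9.5) < 2) | (np.abs(X[:, 1] - 9.5) < 2))
            & (np.abs(X[:, 0] - 9.5) < 6) & (np.abs(X[:, 1] - 9.5) < 6))
y = on_cross.astype(int)

# Gaussian (RBF) kernel; a single gamma controls the smoothness everywhere.
clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

# Evaluate the decision surface on a finer grid to inspect the "bumpiness"
# outside the cross discussed below.
fx, fy = np.meshgrid(np.linspace(0, 19, 100), np.linspace(0, 19, 100))
fine = np.column_stack([fx.ravel(), fy.ravel()])
surface = clf.decision_function(fine).reshape(100, 100)
print(surface.min(), surface.max())
```

Plotting `surface` as a heat map shows the kind of decision surface I describe next.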

Unsurprisingly, it is able to highlight the area with the cross. However, we also see that outside the area of the cross, the surface can be a little "bumpy". This is akin to fitting a cubic spline to a density, where you cannot in general ensure that the predicted values don't fall below zero. I take these as the "constraints" in an SVM. More generally, in higher dimensions, these kinds of constraints limit the way the function approximates the data. For example, it is not possible to have one smoothness parameter for one part of the data and another for another part, unless you define a priori what these parts are. (In the example, I cannot constrain the classifier to have low smoothness around the cross and high smoothness elsewhere.)

I suppose NNs don't have these kinds of constraints, as the large number of parameters allows them to take basically any shape they want. Hence, as long as we have a reasonably large amount of data, they outperform kernel methods.
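To illustrate the contrast, here is a minimal sketch of fitting a small neural network to the same cross-on-a-grid task; the architecture and hyperparameters are arbitrary choices, and this is scikit-learn's basic MLP rather than anything state of the art:

```python
# Sketch of the contrast: a small MLP on the same 20x20 cross task as above.
# The architecture and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

gx, gy = np.meshgrid(np.arange(20), np.arange(20))
X = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
on_cross = (((np.abs(X[:, 0] - 9.5) < 2) | (np.abs(X[:, 1] - 9.5) < 2))
            & (np.abs(X[:, 0] - 9.5) < 6) & (np.abs(X[:, 1] - 9.5) < 6))
y = on_cross.astype(int)

# Many weights and no single global smoothness knob: the network can be
# sharp near the cross boundary and flat elsewhere.
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), activation="relu",
                    max_iter=5000, random_state=0).fit(X, y)
print(mlp.score(X, y))
```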

Tim Mak

To my mind there are a few reasons for this:

  • We don't know how to train kernel methods efficiently on datasets as large as those we can handle with neural networks
  • Similarly, neural networks allow more parallel computation
  • Mathematically, many kernel methods (such as kernel ridge regression) satisfy a representer theorem, which says that the learned function lives in the span of a finite set of evaluations of the kernel (see the sketch below). I would guess that this is quite restrictive, especially in high dimensions (e.g., for inner product kernels), and that neural networks don't have this drawback.

(It's worth saying that what I've called a drawback above for kernel machines is actually a great strength when it comes to theoretical analysis. It might just not help practically.)
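To make the representer-theorem bullet concrete: for kernel ridge regression the fitted function has the form f(x) = Σ_i α_i k(x_i, x) with α = (K + λI)^(-1) y, so predictions are pinned to the span of kernel evaluations at the n training points. A minimal sketch (the data, kernel bandwidth, and regularisation strength are arbitrary illustrative choices):

```python
# Minimal kernel ridge regression from scratch, illustrating the representer
# theorem: the prediction at a new point is a weighted sum of kernel
# evaluations at the n training points, nothing more. The data, bandwidth,
# and regularisation strength are arbitrary illustrative choices.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(50, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=50)

lam = 1e-2
K = rbf_kernel(X_train, X_train)
# Representer theorem: f(x) = sum_i alpha_i k(x_i, x), with alpha solving
# (K + lam * I) alpha = y for the ridge objective.
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = rbf_kernel(X_test, X_train) @ alpha  # lives in span{k(x_i, .)}
print(f_test)
```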

user27182
  • "I would guess that this is quite restrictive" quite the opposite, especially if your kernel is one with an infinite dimensional feature space and gives rise to universal approximation. Note there is a kernel that is very much like a sigmoidal neuron and very large neural networks (at least shallow ones) have been shown to be equivalent to Gaussian processes, which are (Bayesian) kernel methods. The representer theorem means you have a much better chance of finding the optimal solution. For neural networks we often rely on finding a good local minima to avoid overfitting - bit of a hack! – Dikran Marsupial Apr 18 '23 at 13:32
  • You are correct about universal approximation. In fact, it's not too hard to show that if the kernel function is strictly positive definite then the RKHS is dense in the space of continuous functions with respect to (IIRC) the uniform topology.

    However, this guarantees that approximation is possible in principle, not in practice. Moreover, for infinite dimensional RKHSs, for any finite training samples there is an infinite dimensional space that's orthogonal to your learned predictor...

    – user27182 Apr 18 '23 at 13:52
  • ...The corresponding component of the target function can be shown to give a lower bound on the generalisation error in kernel regression. AFAIK the same issue does not seem to exist for NNs.

    By the way, the NNGP correspondence holds for deep networks too https://arxiv.org/abs/1711.00165

    – user27182 Apr 18 '23 at 13:53
  • I should say: I don't have any expertise (or much knowledge at all!) in approximation theory of either NNs or kernel machines. It could be that kernels have just as good approximation properties (e.g., in terms of rate, dependence on intrinsic data dimensionality, etc.), but this isn't so relevant to my comment about generalisation and the finite dimensional span. – user27182 Apr 18 '23 at 14:00
  • "However, this guarantees that approximation is possible in principle, not in practice. " I don't see how that could be any different for NNs, especially as the NN has a non-convex loss function. Yes, obviously the Gram matrix describes a subspace of the infinite dimensional feature space, which is why they can be evaluated with finite maths, but the orthogonal space is irrelevant to minimising the cost function, so it is not clear why that is a problem. Thanks for the link to the report - doesn't it suggest though that GPs are more capable than DNNs (ignoring computational expense)? – Dikran Marsupial Apr 18 '23 at 14:01
  • "the finite dimensional span" this is the whole point of the representer theorem, it tells you that no point in the space outside the span of the images of the training data are useful in minimising the loss - so they can be safely ignored. This is a pure bonus, not a disadvantage AFAICS. – Dikran Marsupial Apr 18 '23 at 14:03
  • It's irrelevant to minimising the training loss, but it's the population error that we care about. For this, the space outside of the span of representers cannot be ignored. – user27182 Apr 18 '23 at 14:15
  • all of the information the model has with which it can generalise is contained in the training data. The same is true for neural nets. If we could build classifiers by minimising the generalisation loss then inference would be much easier, but unfortunately we can't. So we have to make do with the training data and the loss defined over it. – Dikran Marsupial Apr 18 '23 at 14:20
  • Yes, we train the models by looking at the training data. But the performance of the model is dictated by how well it does on the population distribution. The whole point of supervised machine learning is to extrapolate from the former to the latter. I'm telling you about a limitation in doing this for kernel methods, and the limitation has to do with the limitation of the learned predictor to the finite dimensional span of representers. – user27182 Apr 18 '23 at 14:38
  • see this paper: https://arxiv.org/abs/2205.13525 – user27182 Apr 18 '23 at 14:41
  • There is no information in the training data outside the subspace that is not also present inside. Please explain how the performance of the model may be improved if there is no additional information to be extracted from the training data. " I'm telling you about a limitation " it isn't a limitation. – Dikran Marsupial Apr 18 '23 at 14:41
  • A paper on "Kernel Ridgeless Regression" tells you nothing about the sort of kernel machines that people actually use. Kernel machines don't rely on "benign overfitting" in the way that DNN do, they use regularsation to control complexity. – Dikran Marsupial Apr 18 '23 at 14:43
  • The whole point is that the training data doesn't tell you everything about the population distribution, against which the ultimate performance of the model is evaluated.

    If you look at the paper I linked, the pertinent aspect is that there is a component of the target outside of the span of representers and this contributes to the population risk.

    – user27182 Apr 18 '23 at 15:49
  • The performance of a model outside of the training data depends on its inductive bias. What makes one model perform better than another, modulo fit on the training data, is its inductive bias. The point here is that the inductive bias of kernel machines is different to that of neural networks. I am pointing out that the inductive bias of kernel machines has a weakness, i.e., the limitation to the span of representers. – user27182 Apr 18 '23 at 15:52
  • you have not answered my question, so it seems pointless for me to continue this discussion. I also pointed out why that paper was irrelevant to the question and you ignored that point. "I am pointing out that the inductive bias of kernel machines has a weakness" is moving the goal posts. This does not suggest further discussion would be productive. – Dikran Marsupial Apr 18 '23 at 15:57
  • It's not moving the goal posts at all. How would you judge the performance of a model? Surely by its population risk, no? – user27182 Apr 18 '23 at 16:05
  • Sorry, I am not going to be drawn back into the discussion, especially as you still have not answered my question, nor addressed the point of the irrelevance of the paper you mentioned. – Dikran Marsupial Apr 18 '23 at 16:08
  • I did! I literally told you that the paper shows that because the predictor is in the span of representers, it has test risk that's bounded away from 0. ANY predictor in this span will clearly have this issue. The fact that it's ridgeless just changes where it is in the span. – user27182 Apr 18 '23 at 16:09