
I'm reading this article and I can't really grasp the idea of this so-called kernel trick.

So far, what is presented is:

$\Phi(x)^T \Phi(y) = \sum_{i,j} x_i x_j y_i y_j$ and

$k(x, y) = (x^T y)^2 = \sum_{i,j} x_i x_j y_i y_j$

I don't see the difference. The "trick" is presumably that the two expressions are equal(?), so it is sufficient to calculate the lower one. But where exactly is the "trick"? Where is the computational effort saved, given that I would have to calculate the sum in both cases anyway?
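A quick NumPy check with made-up 3-dimensional vectors (my own sketch, just to convince myself) does give the same number for both expressions:

```python
import numpy as np

# Made-up 3-dimensional example vectors (not from the article).
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Explicit feature map: Phi(v) stacks all 9 products v_i * v_j.
phi = lambda v: np.outer(v, v).ravel()

lhs = phi(x) @ phi(y)   # dot product in the 9-dimensional feature space
rhs = (x @ y) ** 2      # squared dot product in the original 3-dimensional space

print(lhs, rhs)         # both print 1024.0
```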

Ben

2 Answers


The example considers a particular feature mapping, $\Phi(x)$, which is 9-dimensional. Its dot product involves $9$ multiplications. In the kernel-trick version, you have only $3$ multiplications (the dot product of three-dimensional vectors, $x^Ty$) plus a squaring operation. That is where the computational saving comes from.

Sometimes the feature dimension can even be infinite, as with the RBF kernel. In such cases the explicit feature map cannot be computed at all, so the computational advantage of the kernel is beyond question.
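As a rough sketch of the operation counts (my own illustration, assuming 3-dimensional inputs and, for the RBF kernel, a bandwidth of 1):

```python
import numpy as np

d = 3                               # dimensionality of the original inputs (assumed)
rng = np.random.default_rng(0)
x, y = rng.random(d), rng.random(d)

# Explicit route: build the 9-dimensional features first, then take their dot product.
phi_x = np.outer(x, x).ravel()      # d*d multiplications
phi_y = np.outer(y, y).ravel()      # d*d multiplications
explicit = phi_x @ phi_y            # d*d more multiplications

# Kernel route: a d-dimensional dot product plus one squaring.
kernel = (x @ y) ** 2               # d multiplications + 1 squaring

# The RBF kernel has an infinite-dimensional feature map, yet each evaluation
# is just as cheap (bandwidth gamma = 1 assumed here):
rbf = np.exp(-np.sum((x - y) ** 2))

print(np.isclose(explicit, kernel))  # True
```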

gunes
  • Thanks, but why wouldn't I use the sum expressions? Maybe it's too obvious and I can't see the forest for the trees... in other words: why do I even care about the dot products when I can use the sum expressions? – Ben Jun 02 '21 at 06:34
  • The sum expression contains more multiplications. Also, sometimes, as with the RBF kernel, you don't have a finite sum. – gunes Jun 02 '21 at 07:02
  • OK, so in this case the sum expressions are used just to show that the two equations are equal. But in practice I will always(?) use $k(x, y)$, which varies from case to case. In (only) this case it is $(x^Ty)^2$? – Ben Jun 02 '21 at 08:56
  • Yes, you'll use $k(x,y)$ because $\Phi(x)$ usually has a higher dimension than $x$ and would lead to more multiplications. In this case it is $(x^Ty)^2$, as you mentioned. – gunes Jun 02 '21 at 09:20
  • Alright, got it! Thanks a lot! – Ben Jun 02 '21 at 11:24

Yes, the trick is that both equations compute exactly the same thing. A kernel effectively computes an inner product between two objects in some high-dimensional space (from which distances in that space can be derived), and the trick is that you never actually need to put anything into the high-dimensional space.

Mapping things into a high-dimensional space can be expensive. The naive approach to finding the distance in the high-dimensional space is to map both samples into that space and then compute the distance between them there. The kernel trick lets you compute that quantity directly from the original features, so you never need to build the high-dimensional representation $\Phi(x)$ of any sample, which can be costly.
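For instance, with the squared dot-product kernel from the question, the feature-space distance between two samples can be obtained from kernel evaluations alone (a small NumPy sketch with made-up vectors):

```python
import numpy as np

# The polynomial kernel from the question: k(x, y) = (x^T y)^2.
def k(x, y):
    return (x @ y) ** 2

x = np.array([1.0, 2.0, 3.0])   # made-up example vectors
y = np.array([4.0, 5.0, 6.0])

# Squared Euclidean distance between Phi(x) and Phi(y) in the feature space,
# obtained from kernel evaluations alone:
#   ||Phi(x) - Phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y)
dist_via_kernel = k(x, x) - 2 * k(x, y) + k(y, y)

# Sanity check against the explicit 9-dimensional feature map.
phi = lambda v: np.outer(v, v).ravel()
dist_explicit = np.sum((phi(x) - phi(y)) ** 2)

print(np.isclose(dist_via_kernel, dist_explicit))  # True
```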

  • Thanks, but why would I worry about the effort when I can just use the sum expressions (in both cases)? – Ben Jun 02 '21 at 06:37