Questions tagged [kernel-trick]

Kernel methods are used in machine learning to generalize linear techniques to nonlinear settings, most notably in SVMs, PCA, and Gaussian processes. Not to be confused with [kernel-smoothing], which covers kernel density estimation (KDE) and kernel regression.

In machine learning, the kernel trick is a widely applied method to generalize linear techniques to non-linear cases. The most widely used applications include support vector machines (for classification, regression, and anomaly detection), Gaussian processes (for classification and regression), and principal components analysis (for dimensionality reduction). Such uses are also known as kernel methods.

A kernel is a function $k : \mathcal X \times \mathcal X \to \mathbb R$ that can be thought of, roughly, as a similarity function on the domain $\mathcal X$. Kernel functions exist for many domains, including $\mathbb R^n$ (where they can capture nonlinear relationships that the standard inner product cannot) as well as sets, graphs, strings, probability distributions, and other structured objects. César R. Souza has cataloged many common kernel functions in Kernel Functions for Machine Learning Applications.

The kernel trick works because if $k$ is a positive semidefinite function, then there is a corresponding Hilbert space $\mathcal H$, known as the reproducing kernel Hilbert space (RKHS) of $k$, and a "feature map" $\varphi : \mathcal X \to \mathcal H$ such that $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$. Thus, if an algorithm accesses the data only in the form of inner products $x^T y$, it can be "kernelized" by simply replacing those inner products with $k(x, y)$, in which case it corresponds to performing the algorithm in the Hilbert space $\mathcal H$. For many common kernels, $\mathcal H$ is very high- or even infinite-dimensional, so that actually representing the data in that space would be impossible, but by using pairwise kernel evaluations the algorithm can still be run.
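As a concrete illustration, here is a minimal NumPy sketch of one kernelized algorithm, kernel ridge regression on made-up 1-D data: both the fit and the predictions touch the data only through pairwise kernel evaluations, never through an explicit feature representation. The data, length scale, and regularization value are purely illustrative.

```python
import numpy as np

# Illustrative 1-D regression data (purely synthetic)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * length_scale**2))

# Kernel ridge regression in the dual: alpha = (K + lambda * I)^{-1} y
lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predictions at new points also need only kernel evaluations k(x_new, x_i)
X_new = np.linspace(-3, 3, 5)[:, None]
print(rbf_kernel(X_new, X) @ alpha)
```

The same pattern, replacing $x^T y$ with $k(x, y)$ and working with the Gram matrix, is what "kernelizing" means for SVMs, kernel PCA, and Gaussian process regression.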

For large datasets, pairwise evaluations can be too computationally expensive to be practical. In these cases approximations such as the Nyström method (which approximates the kernel function based on kernel evaluations to landmark points) or approximate embeddings (which give a function $z : \mathcal X \to \mathbb R^D$ such that $z(x)^T z(y) \approx k(x, y)$) can be used.
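For instance, here is a rough sketch of an approximate embedding in the spirit of random Fourier features for the Gaussian kernel; the embedding dimension $D$ and length scale are arbitrary illustrative choices, and the approximation error shrinks as $D$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, D = 200, 5, 2000          # sample size, input dim, embedding dim (illustrative)
length_scale = 1.0
X = rng.standard_normal((n, d))

# Random features for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 l^2)):
# z(x) = sqrt(2/D) * cos(W^T x + b), with columns of W ~ N(0, l^{-2} I), b ~ Uniform[0, 2*pi]
W = rng.standard_normal((d, D)) / length_scale
b = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

# z(x)^T z(y) approximates the exact kernel value k(x, y)
K_approx = Z @ Z.T
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K_exact = np.exp(-sq / (2 * length_scale**2))
print(np.abs(K_approx - K_exact).max())   # decreases as D grows
```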


Note that the word "kernel" is also used to refer to the local similarity functions of kernel smoothing techniques like kernel density estimation and Nadaraya-Watson kernel regression. See [kernel-smoothing] for this usage.

751 questions
15
votes
2 answers

Eigenfunctions and eigenvalues of the exponential kernel

What are the eigenfunctions and the eigenvalues of the exponential kernel? The exponential kernel is defined as $$k(x,x')=\sigma^2\exp\left(-\frac{||x-x'||}{l}\right)$$ where both $\sigma>0$ and $l>0$. Mercer's theorem tells us that for every kernel…
Julian Karch
  • 1,890
  • 1
  • 18
  • 29
9
votes
2 answers

Linear combination of two kernel functions

How can I prove that a linear combination of two kernel functions is also a kernel function? \begin{align} k_{p}(x, y) = a_1k_1(x, y) + a_2k_2(x,y) \end{align} given that $k_1(\cdot,\cdot)$ and $k_2(\cdot,\cdot)$ are valid kernel functions. In general, to prove any such…
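A quick numerical sanity check (not a proof) that goes with this question, assuming the coefficients $a_1, a_2$ are non-negative: on some illustrative data, the combined Gram matrix has no negative eigenvalues.

```python
import numpy as np

# Numerical sanity check (not a proof): with non-negative coefficients,
# a1*K1 + a2*K2 built from two valid kernels stays positive semidefinite.
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 3))

K1 = X @ X.T                                          # linear kernel
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K2 = np.exp(-sq / 2)                                  # Gaussian kernel
a1, a2 = 0.7, 2.5                                     # illustrative non-negative weights

print(np.linalg.eigvalsh(a1 * K1 + a2 * K2).min())    # >= 0 up to round-off
```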
6
votes
2 answers

Is this a decent summary of the kernel trick?

Here's my understanding of the kernel trick. The motivation is to find a linear separator in a higher-dimensional space than the original one (because the data are not linearly separable as given). You take the dot product, and then apply the…
6
votes
1 answer

Is $\min(f(x)g(y),f(y)g(x))$ a positive definite kernel?

It is known that $(x,y)\in \mathbb{R}^2 \mapsto \min(x,y)$ is a positive definite kernel. Can we generalize this result in the following way: Let $X$ be any set and $f,g:X\longrightarrow \mathbb{R}^{+}$ be non-negative functions. Is $$k:(x,y)\in…
dada
  • 219
  • 1
  • 6
5
votes
1 answer

Moore-Aronszajn Theorem and Mercer theorem for the kernel trick

I have been reading about the RKHS and the kernel trick in machine learning, mainly from https://ngilshie.github.io/jekyll/update/2018/02/01/RKHS.html (1) and https://arxiv.org/pdf/2106.08443.pdf (2). But in (1), it is stated that because the…
endeavor
  • 183
5
votes
1 answer

Proof that exponential of a kernel is a kernel

How can I prove that the exponential $\exp(K)$ of a kernel function $K$ is again a kernel? I think it can be proved using a Taylor expansion, but I am not sure how.
Andreas G.
  • 1,445
  • 1
  • 11
  • 20
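A small numerical check (not a proof) in line with the Taylor-series argument hinted at in this question: on illustrative data, the entrywise exponential of a Gram matrix built from a valid kernel has no negative eigenvalues.

```python
import numpy as np

# Entrywise exp of a kernel Gram matrix: by the Taylor series, exp(K) is a sum of
# Hadamard (entrywise) powers of K with positive coefficients, each of which is
# PSD by the Schur product theorem, so exp(K) should be PSD as well.
rng = np.random.default_rng(3)
X = rng.standard_normal((20, 2))
K = X @ X.T                                   # linear kernel Gram matrix

print(np.linalg.eigvalsh(np.exp(K)).min())    # >= 0 up to round-off
```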
4
votes
2 answers

Idea of the kernel trick

I'm reading this article and I can't really grasp the idea of this so-called kernel trick. So far, what is presented is: $\Phi(x)^T \Phi(y) = \sum x_ix_jy_iy_j$ and $k(x, y) = (x^T y)^2 = \sum x_ix_jy_iy_j$. I don't see the difference. The…
Ben
  • 3,443
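A tiny numerical illustration of the identity quoted in this excerpt: for the quadratic kernel $k(x,y) = (x^T y)^2$, the explicit feature map $\Phi(x)$ consisting of all pairwise products $x_i x_j$ gives exactly the same value, so the kernel computes the feature-space inner product without ever forming $\Phi$. The test vectors are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
x, y = rng.standard_normal(4), rng.standard_normal(4)

phi = lambda v: np.outer(v, v).ravel()   # all pairwise products v_i * v_j

print((x @ y) ** 2)                      # kernel evaluation, O(n) work
print(phi(x) @ phi(y))                   # identical value via explicit features, O(n^2) work
```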
4
votes
1 answer

How can we prove that a normalized kernel is also a kernel?

How can we prove that the normalized kernel is a kernel? That is, how can we show that $\frac{K(x,y)}{\sqrt{K(x,x)\,K(y,y)}}$ is a valid kernel? Also, in the real world, why do we normalize kernels? Do we also normalize the Gaussian kernel…
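A short numerical sanity check (not a proof) for this question: entrywise normalization of a Gram matrix by $\sqrt{K(x,x)K(y,y)}$ is the congruence transformation $D^{-1/2} K D^{-1/2}$, so it preserves positive semidefiniteness and puts ones on the diagonal, which amounts to cosine similarity in feature space. The data and kernel below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((25, 3))
K = (X @ X.T + 1.0) ** 2                 # inhomogeneous polynomial kernel (illustrative)

d = np.sqrt(np.diag(K))
K_norm = K / np.outer(d, d)              # K(x,y) / sqrt(K(x,x) * K(y,y))

print(np.allclose(np.diag(K_norm), 1.0))     # unit self-similarity
print(np.linalg.eigvalsh(K_norm).min())      # >= 0 up to round-off
```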
3
votes
0 answers

Kernel: Why is the dot product a "measure of similarity" of instances?

Not a duplicate since the linked question does not answer this question: A measure of similarity should be maximal for instances which are the same (e.g. similarity between (1,1) and (1,1) should be higher than the similarity between (1, 1) and (1,…
PascalIv
  • 819
3
votes
1 answer

Finding the feature map corresponding to a specific Kernel? (Polynomial Kernels)

I am just getting into machine learning and I am kind of confused about how to find the feature map corresponding to a given kernel. For example, how would I find the feature map for this kernel? $K(x,y) = (x \cdot y)^3 + x \cdot y$ Any help…
skidjoe
  • 211
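A small numerical sketch that goes with this question: one explicit (redundant) feature map for $K(x,y) = (x \cdot y)^3 + x \cdot y$ concatenates all degree-3 monomials $x_i x_j x_k$ with the coordinates $x_i$ themselves, and its inner product reproduces the kernel value. The test vectors are arbitrary.

```python
import numpy as np

def phi(v):
    """Explicit (redundant) feature map for K(x, y) = (x.y)**3 + x.y."""
    cubic = np.einsum('i,j,k->ijk', v, v, v).ravel()   # all products v_i * v_j * v_k
    return np.concatenate([cubic, v])

rng = np.random.default_rng(6)
x, y = rng.standard_normal(3), rng.standard_normal(3)

print((x @ y) ** 3 + x @ y)     # kernel evaluation
print(phi(x) @ phi(y))          # same value via the explicit feature map
```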
3
votes
1 answer

Inner Product Kernel $k(x,y) = (1+\epsilon)^{\langle x, y \rangle}$

Where in the literature is the inner product kernel $k(x,y) = (1+\epsilon)^{\langle x, y \rangle}$ mentioned? Does it have a name?
Max Flow
  • 133
2
votes
1 answer

Is $\min(k_1(x, y), k_2(x, y))$ a positive definite kernel?

It is known that $(x,y)\in \mathbb{R}^2 \mapsto \min(x,y)$ is a positive definite kernel. Can we generalize this result in the following way: Let $k_1(x, y)$ and $k_2(x, y)$ be any two positive definite kernels. Is $$k:(x,y)\in X^2 \mapsto…
Peyman
  • 309
2
votes
0 answers

Why isn't the reproducing kernel map unique?

I am working on a project using kernel PCA with a Gaussian kernel, and I am trying to understand a part of the theory. According to Mercer's theorem, I know that since the RBF kernel is PDS, there exists a reproducing kernel map $\phi_x$ and an associated…
Paul
  • 436
  • 1
  • 5
  • 14
2
votes
1 answer

Combination of two kernel functions

Could you help me with this kernel function? \begin{equation} K(x,y) = (x \cdot y)^{2} + (x \cdot y), \text{ where } x = (x_1, x_2)', y = (y_1, y_2)' \end{equation} I want to know if the combination of two kernel functions is still a kernel…
2
votes
1 answer

Is $f(x)=e^{x^Tx'}$ a suitable kernel to be chosen?

Is $f(x)=e^{x^Tx'}$ a suitable kernel to be chosen? If so, to what dimension does it transform the data?
Gigili
  • 845