
I understand that kernel methods are used to exploit nonlinearity in a data set. For example, let $\mathbf{x} = \begin{bmatrix}x_1\\x_2 \end{bmatrix}$. We can define the feature map $\phi(\mathbf{x}) = \begin{bmatrix}x_1^2 \\ \sqrt{2}x_1x_2 \\ x_2^2 \end{bmatrix}$.

I understand that once we work with the inner product $\left <\phi(\mathbf{x}),\phi(\mathbf{y}) \right> = \phi(\mathbf{x})^T\phi(\mathbf{y})$, the tedious process of explicitly computing each feature mapping and multiplying them is avoided, and that for this specific example the inner product reduces to $\left < \mathbf{x},\mathbf{y}\right>^2$.
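For concreteness, the identity $\phi(\mathbf{x})^T\phi(\mathbf{y}) = x_1^2y_1^2 + 2x_1x_2y_1y_2 + x_2^2y_2^2 = (x_1y_1 + x_2y_2)^2 = \left<\mathbf{x},\mathbf{y}\right>^2$ can also be checked numerically. Here is a minimal sketch (assuming NumPy; the vectors `x` and `y` are arbitrary example values, not from the original post):

```python
import numpy as np

# Arbitrary example points, chosen only to check the identity numerically.
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

def phi(v):
    """Explicit feature map phi(v) = [v1^2, sqrt(2)*v1*v2, v2^2]."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

lhs = phi(x) @ phi(y)   # inner product computed in feature space
rhs = (x @ y) ** 2      # kernel evaluated directly in input space
print(lhs, rhs)         # both print 121.0
```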

Questions:

  1. What is the kernel matrix (Gram matrix) for? I do not see the link: the kernel is a dot product, yet somehow this turns into a matrix.

$K = \begin{bmatrix} \phi(\mathbf{x}_1)^T\phi(\mathbf{x}_1) & \phi(\mathbf{x}_1)^T\phi(\mathbf{x}_2) & \cdots \\ \phi(\mathbf{x}_2)^T\phi(\mathbf{x}_1) & \phi(\mathbf{x}_2)^T\phi(\mathbf{x}_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$

  2. Many of the lectures I have read are hand-wavy, and subscripts suddenly appeared without explanation (or perhaps I was too thick to follow them). Can you kindly provide a sample calculation showing how the explicit feature mapping $\phi$ turns out to be unnecessary, so that the matrix above becomes the following (see also the sketch after this list)?

$K = \begin{bmatrix} k(\mathbf{x}_1,\mathbf{x}_1) & k(\mathbf{x}_1,\mathbf{x}_2) & \cdots \\ k(\mathbf{x}_2,\mathbf{x}_1) & k(\mathbf{x}_2,\mathbf{x}_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$

  3. Is there a closed-form solution to the linear regression problem when the kernel trick is used?
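
Regarding questions 1 and 2, here is a minimal sketch of what I mean (assuming NumPy; the sample points in `X` are illustrative, and the kernel is the polynomial kernel $k(\mathbf{x},\mathbf{y}) = \left<\mathbf{x},\mathbf{y}\right>^2$ from the example above). It builds the Gram matrix once via the explicit feature map, $K_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$, and once via the kernel alone, $K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$, to check that the two constructions agree:

```python
import numpy as np

# A few sample points x_1, ..., x_n stacked as rows (illustrative data).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [0.5, -1.0]])

def phi(v):
    """Explicit feature map for the polynomial kernel k(x, y) = <x, y>^2."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def k(u, v):
    """Kernel evaluated purely in input space -- no feature map formed."""
    return (u @ v) ** 2

n = X.shape[0]

# Gram matrix via the explicit feature map: K_ij = phi(x_i)^T phi(x_j).
K_explicit = np.array([[phi(X[i]) @ phi(X[j]) for j in range(n)] for i in range(n)])

# Gram matrix via the kernel trick: K_ij = k(x_i, x_j), phi never computed.
K_kernel = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

print(np.allclose(K_explicit, K_kernel))  # True -- the two matrices coincide
```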
