
Can anyone give some examples of widely-used kernels between sets of vectors?

A kernel between sets of vectors is a positive semidefinite map $k:(S,T)\mapsto k(S,T)\in \mathbb{R}$ where $S$ and $T$ are sets of vectors (i.e., $S=\{s_1,s_2,\ldots,s_m\}$, $T=\{t_1,t_2,\ldots,t_n\}$, and $s_i,t_j\in \mathbb{R}^d$).

Thank you!

Alex
  • There are quite a few in the Gaussian Processes for Machine Learning book. Probably the most common is the squared exponential kernel, where you give a positive definite matrix $\Sigma$ and the kernel is $\kappa(x,y; \Sigma) = \exp\left(-\tfrac{1}{2}(x-y)^\top \Sigma^{-1} (x-y)\right)$. Using this kernel is akin to assuming a Gaussian relationship between the components of $x_i$. Another excellent reference is the kernel cookbook, which has a good set of visualizations to go with it. – combo Jan 24 '17 at 21:55
  • @StefanJorgensen I think you missed that the inputs to the kernel are supposed to be sets of vectors, not single vectors. – Danica Jan 24 '17 at 22:10
  • Indeed. But is it fundamentally different? The collection of vectors $S = {s_1,\dots,s_m}$ is just a matrix, which one could vectorize and use the standard kernels for, right? – combo Jan 25 '17 at 16:22
  • 2
    @StefanJorgensen Sure...if you're willing to only support sets of a single fixed size, and to require factorially times as many inputs for the learning method to pick up on the set's permutation invariance. – Danica Jan 30 '17 at 23:47

1 Answer


Probably the most common one in machine learning usage is the mean map kernel (the kernel induced by the maximum mean discrepancy [MMD] distance). Here we define an auxiliary kernel $\kappa$, and then, with $|S| = m$ and $|T| = n$, $$k(S, T) = \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n \kappa(s_i, t_j).$$

This essentially makes the assumption that $S$ and $T$ are iid samples from some distributions $P$ and $Q$, and estimates a similarity between $P$ and $Q$. If you use a Gaussian RBF kernel of bandwidth $\sigma$, the mean map embedding distance converges to a multiple of the $L_2$ distance between the densities as $\sigma \to 0$, but you'd typically use some fixed larger $\sigma$, which gives better statistical estimation properties.
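As a minimal NumPy sketch of the formula above, assuming a Gaussian RBF auxiliary kernel $\kappa$ (the function name and default bandwidth here are illustrative, not from any particular library):

```python
import numpy as np

def mean_map_kernel(S, T, sigma=1.0):
    """Mean map (MMD-induced) kernel between two sets of vectors.

    S is an (m, d) array, T an (n, d) array; the auxiliary kernel is a
    Gaussian RBF of bandwidth sigma. Returns the average of
    kappa(s_i, t_j) over all m * n pairs.
    """
    # Pairwise squared Euclidean distances between rows of S and rows of T
    sq_dists = np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=-1)
    # Average the RBF auxiliary kernel over all pairs (s_i, t_j)
    return np.exp(-sq_dists / (2.0 * sigma ** 2)).mean()
```

Since the auxiliary kernel is symmetric and bounded in $(0, 1]$, the set kernel is too, and `mean_map_kernel(S, T) == mean_map_kernel(T, S)` for sets of any (possibly different) sizes.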

You can see an overview of many such kernels and estimators of them in, uh, my PhD thesis from last year.

Danica