
In the paper

Rahimi, Ali, and Benjamin Recht. "Random features for large-scale kernel machines." Advances in neural information processing systems. 2008.

the authors introduce a way to approximate a stationary kernel function by random sampling in the feature space. They run an experiment that solves the following ridge regression:

$$ \min_{w} ||Z'w - y||_2^2 + \lambda ||w||_2^2 $$

where $w$ is a $D$-dimensional vector.

The dimension of $w$ does not make sense to me. To approximate the kernel function with sufficient accuracy, we need a large $D$. That seems to carry a high risk of overfitting, and indeed my model appears to overfit when I try to use this method.

A kernel ridge regression can be described by the following equation:

$$ f(x) = \sum_{i=1}^N \alpha_i k(x,x_i) $$

where $\alpha$ is computed as $\alpha = (K + \sigma^2 I)^{-1} y$.
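For reference, here is a minimal NumPy sketch of this exact kernel ridge computation as I understand it (the Gaussian kernel, the toy data, and all variable names are my own choices, not from the paper):

```python
# Minimal sketch of exact kernel ridge regression.  The Gaussian (RBF) kernel,
# the toy data, and the variable names are my own choices, not from the paper.
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2) for every pair of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

sigma2 = 0.1
K = rbf_kernel(X, X)                                      # N x N Gram matrix
alpha = np.linalg.solve(K + sigma2 * np.eye(N), y)        # alpha = (K + sigma^2 I)^{-1} y

x_new = rng.normal(size=(1, d))
f_new = rbf_kernel(x_new, X) @ alpha                      # f(x) = sum_i alpha_i k(x, x_i)
```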

In the paper, the authors approximate $k(x,y)$ as

$$ \begin{aligned} k(x,y) &= k(x-y) \\ &= \operatorname{E}_\omega\!\left[z_\omega(x)' z_\omega(y)\right] \\ &\approx \frac{1}{D} \sum_{j=1}^{D} z_{\omega_j}(x) z_{\omega_j}(y), \quad \text{with each } \omega_j \text{ drawn from } p(\omega). \end{aligned} $$
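To make sure I read the approximation correctly, here is a quick numerical check I wrote for the Gaussian kernel $k(x-y) = \exp(-\|x-y\|^2/2)$, for which $p(\omega)$ is a standard normal. The feature map $z_\omega(x) = \sqrt{2}\cos(\omega'x + b)$ and all names below are my own; this is not code from the paper:

```python
# Quick numerical check of the Monte Carlo approximation above for the Gaussian
# kernel k(x - y) = exp(-||x - y||^2 / 2), whose sampling density p(omega) is a
# standard normal.  The feature map z_omega(x) = sqrt(2) * cos(omega'x + b) and
# all names below are my own assumptions, not code from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 10000                       # input dimension, number of sampled frequencies
x = rng.normal(size=d)
y = rng.normal(size=d)

omega = rng.normal(size=(D, d))       # omega_j drawn i.i.d. from p(omega) = N(0, I)
b = rng.uniform(0, 2 * np.pi, size=D)

z_x = np.sqrt(2) * np.cos(omega @ x + b)    # z_{omega_j}(x) for j = 1, ..., D
z_y = np.sqrt(2) * np.cos(omega @ y + b)

approx = z_x @ z_y / D                          # (1/D) * sum_j z_{omega_j}(x) z_{omega_j}(y)
exact = np.exp(-np.sum((x - y) ** 2) / 2)       # k(x - y)
print(approx, exact)                            # close for large D
```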

Shouldn't the true approximation to the kernel machine be:

  1. For each pair of data points $x_i$, $x_j$:
  2. Compute $K_{ij} = k(x_i,x_j) \approx \frac{1}{D} \sum_{k=1}^{D} z_{\omega_k}(x_i)'z_{\omega_k}(x_j)$
  3. Solve $\alpha = (K + \sigma^2I)^{-1} y$

rather than sampling $D$ values of $z_\omega(x)$ and then fitting a parametric model with $D$ parameters?

I think the authors are fitting a linear model in the feature space (since $z(x)$ is the feature map) rather than using the standard kernel trick, which never needs to evaluate the feature map explicitly. But I don't understand why the authors do not need to compute the sample average of $K$ (or do something similar to $z$).

The implementation here also fits a model with $D$ parameters and performs no averaging step, which I find quite confusing.

Let's say there are 1000 data points, the feature space is of infinite dimension, and $D = 10000$ is required to approximate $K$ with sufficient accuracy. Am I going to fit 10000 parameters from only 1000 data points?

What am I missing?


1 Answer


Shouldn't the true approximation to the kernel machine be:

  1. For each pair of data points $x_i$, $x_j$:
  2. Compute $K_{ij} = k(x_i,x_j) \approx \frac{1}{D} \sum_{k=1}^{D} z_{\omega_k}(x_i)'z_{\omega_k}(x_j)$
  3. Solve $\alpha = (K + \sigma^2I)^{-1} y$

rather than sampling $D$ values of $z_\omega(x)$ and then fitting a parametric model with $D$ parameters?

No, the whole point of random kitchen sinks is to turn the problem $y = K\alpha$ into the problem $y \approx Z^\top Z \alpha = Z^\top \beta$, where the columns of $Z$ are the random kitchen sinks feature vectors $z(x_i)$. I've explained how this works in this answer.
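Here is a minimal sketch of that substitution (my own illustration, not the paper's code), assuming the Gaussian kernel and the feature map $z_\omega(x) = \sqrt{2}\cos(\omega'x + b)$. The columns of $Z$ are the feature vectors $z(x_i)$, and the ridge solve only ever touches a $D \times D$ matrix:

```python
# Sketch of the substitution y = K alpha  ->  y ~= Z' beta, with the columns of
# Z being the random feature vectors z(x_i).  The Gaussian kernel, the feature
# map, and all names are my own assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
N, d, D = 1000, 5, 200
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

omega = rng.normal(size=(D, d))                            # frequencies drawn from p(omega)
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(omega @ X.T + b[:, None])    # D x N, so K ~= Z'Z

lam = 0.1
# min_beta ||Z' beta - y||^2 + lam ||beta||^2 : only a D x D system to solve.
beta = np.linalg.solve(Z @ Z.T + lam * np.eye(D), Z @ y)

x_new = rng.normal(size=(1, d))
z_new = np.sqrt(2.0 / D) * np.cos(omega @ x_new.T + b[:, None])   # D x 1
f_new = z_new.T @ beta                                            # prediction z(x_new)' beta
```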

But I don't understand why the authors do not need to compute the sample average of $K$ (or do something similar to $z$).

"Random Features for Large-Scale Kernel Machines" gives a coefficient $\sqrt{\frac{1}{D}}$ to the Fourier basis matrix. Expressed in this way, $K \approx Z^\top Z$ has the desired averages in each element. (This is described in the two paragraphs after equation 2.)

Let's say there are 1000 data points, the feature space is of infinite dimension ...

The phrase "Let's say there are 1000 data points and the feature space is of infinite dimension" makes me think that you've misunderstood. The dimension of the feature space is how many "columns" your data has. If you truly have infinite columns, this method is of no help.

If you mean that the RBF kernel's explicit feature map has infinite dimension, then this is not relevant to random kitchen sinks. The whole point of the random kitchen sinks procedure is to create a feature map with relatively few features, such that its inner product is approximately the RBF kernel.

Let's say there are 1000 data points, the feature space is of infinite dimension, and $D = 10000$ is required to approximate $K$ with sufficient accuracy. Am I going to fit $10000$ parameters from only $1000$ data points?

It's up to you to choose $D$.

  • You can estimate more parameters than you have data points, using regularization to mitigate overfitting.
  • Or you can accept a larger amount of error in the approximation to $K$ and choose smaller $D$.

The big idea of random kitchen sinks is to use an approximation that trades a problem involving the inversion of $K$ (which is $N \times N$) for a problem of solving a linear system in the random-feature matrix $Z$ (which is $D \times N$), so that only a $D \times D$ matrix ever needs to be factored. Whenever $D \ll N$, random kitchen sinks is much cheaper. On the other hand, you might want more features to improve model accuracy. From the perspective of model quality, the approximation to $K$ is not really the most important concern; what you care about is whether the final model solves your problem (has error low enough to be useful).
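As a rough illustration of that cost trade-off (my own sketch; the sizes, the Gaussian kernel, and the regularization strength are arbitrary choices):

```python
# Rough timing sketch of the trade-off: exact kernel ridge factors an N x N
# matrix, random kitchen sinks only factors a D x D matrix.  Sizes, kernel,
# and regularization strength are arbitrary choices of mine.
import time
import numpy as np

rng = np.random.default_rng(0)
N, d, D = 4000, 5, 200
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

t0 = time.time()
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-np.maximum(sq, 0) / 2)                         # exact N x N Gram matrix
alpha = np.linalg.solve(K + 0.1 * np.eye(N), y)            # O(N^3) solve on N x N
print("exact kernel ridge:", time.time() - t0, "s")

t0 = time.time()
omega = rng.normal(size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(omega @ X.T + b[:, None])    # D x N feature matrix
beta = np.linalg.solve(Z @ Z.T + 0.1 * np.eye(D), Z @ y)   # O(D^3) solve on D x D
print("random kitchen sinks:", time.time() - t0, "s")
```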
