I'm using the GPML toolbox by C. E. Rasmussen to solve the basic GP regression problem with noisy observations (as presented in the Rasmussen & Williams book). That is to say, estimate the underlying function $f$ of a static noisy mapping
$$y = f(\mathbf{x}) + e, \qquad e \sim \mathcal{N}(0, \sigma^2)$$
from a set of training examples $\{ (\mathbf{x}_i, y_i) \}_{i=1}^{n}$. As far as I understand it, I should account for the noisiness of the observations by choosing the kernel as the sum
$$ k(\mathbf{x}_i, \mathbf{x}_j) = k_f(\mathbf{x}_i, \mathbf{x}_j) + \sigma^2_{e}\delta_{ij}$$
where the final term in the sum is the white-noise kernel (that is, the kernel modelling the observation noise) and $\delta_{ij}$ is the Kronecker delta.
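In GPML terms (if I understand the toolbox correctly), this sum can be composed with `covSum`; here is a minimal sketch, assuming a squared-exponential `covSEiso` for $k_f$ (my choice, just for concreteness):

```matlab
% k(x_i, x_j) = k_f(x_i, x_j) + sigma_e^2 * delta_ij
covfunc = {@covSum, {@covSEiso, @covNoise}};  % covNoise supplies the white-noise term

% GPML stores hyperparameters on a log scale:
% covSEiso: [log(ell); log(sf)],  covNoise: [log(sigma_e)]
hyp.cov = [log(1.0); log(1.0); log(0.1)];
```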
When using the GPML toolbox, for those who are familiar, you have to specify a likelihood function. In my case I chose the Gaussian likelihood (`likGauss`), which has one hyperparameter; in the code documentation this corresponds to the parameter $s_n$.
So, all together, when I perform optimization, I have one hyperparameter for the noise kernel ($\sigma_e$), one for the likelihood ($s_n$), and, say, $d$ hyperparameters for $k_f$.
I am confused about the meaning of the hyperparameters $\sigma_e$ and $s_n$. Which one of the hyperparameters ($\sigma_e$ or $s_n$) represents the variance of the noise in the observations?
If the Gaussian likelihood is the measurement model, then $s_n^2$ should be the noise variance of the observations $y_i$. But then why do we add the noise kernel at all, with its additional hyperparameter $\sigma_e$? It seems redundant at this point, since we already have $s_n$ to do the job. Perhaps they're one and the same and should be tied together during optimization. I'm confused.
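Writing out the marginal distribution of the training targets makes my suspicion concrete (here $K_f$ is the Gram matrix of $k_f$ and $\mathbf{m}$ the mean vector):
$$\mathbf{y} \sim \mathcal{N}\left(\mathbf{m},\; K_f + (\sigma^2_e + s_n^2)\, I\right),$$
so the marginal likelihood appears to depend on the two noise parameters only through the sum $\sigma^2_e + s_n^2$.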
GPML code for exact inference:
```matlab
[n, D] = size(x);
K = feval(cov{:}, hyp.cov, x);        % evaluate covariance matrix
m = feval(mean{:}, hyp.mean, x);      % evaluate mean vector
sn2 = exp(2*hyp.lik);                 % noise variance of likGauss
if sn2 < 1e-6                         % very tiny sn2 can lead to numerical trouble
  L = chol(K + sn2*eye(n)); sl = 1;   % Cholesky factor of covariance with noise
  pL = -solve_chol(L, eye(n));        % L = -inv(K+inv(sW^2))
else
  L = chol(K/sn2 + eye(n)); sl = sn2; % Cholesky factor of B
  pL = L;                             % L = chol(eye(n)+sW*sW'.*K)
end
alpha = solve_chol(L, y-m)/sl;
```
Here `sn2` is the likelihood noise variance ($s_n^2$), and `hyp.cov` contains the kernel hyperparameters (including the noise-kernel hyperparameter $\sigma_e$). So, as far as I can tell, noise is added to the diagonal twice: once inside `K` via the noise kernel, and once explicitly via `sn2*eye(n)`.
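For completeness, here is roughly how I set up and optimize the whole model, building on the covariance above; the zero mean, the initial values, and the 100 function evaluations are my own choices, not something the toolbox prescribes:

```matlab
meanfunc = @meanZero;                          % zero mean function
covfunc  = {@covSum, {@covSEiso, @covNoise}};  % k_f plus white-noise kernel (sigma_e)
likfunc  = @likGauss;                          % Gaussian likelihood (s_n)

hyp.mean = [];
hyp.cov  = [log(1.0); log(1.0); log(0.1)];     % [log(ell); log(sf); log(sigma_e)]
hyp.lik  = log(0.1);                           % log(s_n)

% Minimize the negative log marginal likelihood; note that log(sigma_e)
% and log(s_n) are optimized as two independent parameters here.
hyp = minimize(hyp, @gp, -100, @infExact, meanfunc, covfunc, likfunc, x, y);

% Predictions at test inputs xs
[ymu, ys2] = gp(hyp, @infExact, meanfunc, covfunc, likfunc, x, y, xs);
```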
