
Background and problem

I am using Gaussian processes (GPs) for regression and subsequent Bayesian optimization (BO). For regression I use the gpml package for MATLAB with several custom modifications, but the problem is general.

It is a well-known fact that when two training inputs are too close in input space, the covariance matrix may become (numerically) non-positive definite (there are several questions about this on this site). As a result, the Cholesky decomposition of the covariance matrix, which is needed for various GP computations, may fail due to numerical error. This has happened to me in several cases when performing BO with the objective functions I am using, and I'd like to fix it.
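To make the failure mode concrete, here is a minimal standalone example in plain MATLAB (not gpml; the inputs and hyperparameter values are made up purely for illustration): a noise-free squared-exponential covariance over closely spaced inputs is so ill-conditioned that chol typically gives up.

    % Minimal illustration (plain MATLAB, not gpml): closely spaced inputs
    % under a noise-free squared-exponential kernel give a matrix whose
    % smallest eigenvalues fall below machine precision, so chol can fail.
    n   = 100;
    x   = linspace(0, 1, n)';                % densely packed training inputs
    ell = 1; sf2 = 1;                        % illustrative hyperparameters
    D2  = bsxfun(@minus, x, x').^2;          % squared pairwise distances
    K   = sf2 * exp(-0.5 * D2 / ell^2);      % noise-free SE covariance matrix
    fprintf('condition number of K: %.3g\n', cond(K));
    [~, p] = chol(K);                        % p > 0 means chol gave up
    if p > 0
        disp('Cholesky failed: K is not numerically positive definite.');
    end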

Proposed solutions

AFAIK, the standard solution to alleviate ill-conditioning is to add a ridge or nugget to the diagonal of the covariance matrix. For GP regression, this amounts to adding (or increasing, if already present) observation noise.
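In code, that fix typically looks like the retry loop sketched below. This is only an illustration of the idea, not gpml's implementation; the function name, the starting jitter value, and the escalation schedule are arbitrary choices of mine.

    % Sketch of the standard nugget/jitter fix: retry the Cholesky
    % factorization with an exponentially growing diagonal offset.
    % (Save as chol_with_jitter.m; name and constants are illustrative.)
    function [L, jitter] = chol_with_jitter(K)
        jitter = 0;                               % first try without any nugget
        for attempt = 1:10
            [L, p] = chol(K + jitter * eye(size(K)), 'lower');
            if p == 0
                return;                           % K + jitter*I is numerically PD
            end
            if jitter == 0
                jitter = 1e-10 * mean(diag(K));   % heuristic starting value
            else
                jitter = 10 * jitter;             % escalate and retry
            end
        end
        error('chol_with_jitter:notPD', ...
              'could not make the matrix positive definite (last jitter %g)', jitter);
    end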

So far so good. I modified gpml's exact inference code so that whenever the Cholesky decomposition fails, I replace the covariance matrix with the closest symmetric positive definite (SPD) matrix in Frobenius norm, inspired by this MATLAB code by John d'Errico. The rationale is to minimize the intervention on the original matrix.
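For reference, that projection amounts to Higham's nearest-SPD construction. The sketch below follows the same steps as d'Errico's nearestSPD, but it is not his code verbatim, nor the exact patch I made to gpml.

    % Rough sketch of the nearest-SPD projection (Higham, 1988), along the
    % lines of d'Errico's nearestSPD; not his code verbatim, and not the
    % exact modification made to gpml. Save as nearest_spd.m.
    function Ahat = nearest_spd(A)
        B = (A + A') / 2;                 % symmetric part of A
        [~, S, V] = svd(B);
        H = V * S * V';                   % symmetric polar factor of B
        Ahat = (B + H) / 2;               % nearest symmetric PSD matrix in Frobenius norm
        Ahat = (Ahat + Ahat') / 2;        % enforce exact symmetry after round-off
        % The projection can still be only semidefinite numerically, so nudge
        % the diagonal until chol succeeds.
        k = 0;
        [~, p] = chol(Ahat);
        while p > 0
            k = k + 1;
            mineig = min(eig(Ahat));
            Ahat = Ahat + (k^2 * max(-mineig, 0) + eps(norm(Ahat))) * eye(size(A));
            [~, p] = chol(Ahat);
        end
    end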

This workaround does the job, but I noticed that the performance of BO degraded substantially for some functions -- possibly whenever the algorithm needs to zoom in on some area (e.g., because it is getting nearer to the minimum, or because the length scales of the problem become non-uniformly small). This behaviour makes sense, since I am effectively increasing the noise whenever two input points get too close, but of course it's not ideal. Alternatively, I could just remove problematic points, but again, sometimes I need the input points to be close.

Question

I don't think that numerical issues with the Cholesky factorization of GP covariance matrices are a novel problem, but to my surprise I couldn't find many solutions so far, aside from increasing the noise or removing points that are too close to each other. On the other hand, it is true that some of my functions are pretty badly behaved, so perhaps my situation is not so typical.

Any suggestion/reference that could be useful here?

  • You might look into forming the entries of the covariance matrix, as well as computing or updating its Cholesky factorization, in higher precision, for instance, quad precision or even higher. Aside from the hassle, the calculations may be orders of magnitude slower. There are arbitrary precision add-ons for MATLAB. I'm not saying this is ideal, but it may be an option. I don't know how well they play with gpml, but if you can change gpml source code (m files), perhaps you can do it. – Mark L. Stone Jan 07 '16 at 19:36
  • Did you try to add a small jitter to the diagonal of the covariance matrix? – Zen Jan 07 '16 at 20:24
  • @MarkL.Stone Thanks for the suggestion. Unfortunately I need the training code to be fast, so high-precision numerics is probably not going to be a good choice for my application. – lacerbi Jan 07 '16 at 20:27
  • @Zen You mean random jitter (e.g. normally distributed)? No, I didn't try, I just added a constant on the diagonal, if necessary, after converting to the closest SPD matrix. Why would random jitter work better than a constant? – lacerbi Jan 07 '16 at 20:33
  • 1. If you add enough nugget/ridge the covariance matrix will be PD. Why do you then approximate (instead or in addition)? 2. Do you use the quadratic exponential as covariance function? If so then you might want to try other functions. The problem with ill-conditioning is particularly severe for the quadratic exponential. – g g Jan 07 '16 at 22:22
  • @gg 1. I appreciate that an arbitrarily big nugget/ridge would fix the problem, but ideally here we want to minimize intervention on the original matrix. "The closest PD matrix in Frobenius norm" sounded like a reasonable solution. To be honest, I am not sure if this method works better in practice, I haven't tested it extensively. Anyhow, this is secondary, I get the same problem even if I simply use a ridge (unless I ramp it up to massive levels, with awful results on BO). – lacerbi Jan 07 '16 at 22:53
  • 2. I know -- I am using the rational quadratic (RQ), which is even worse than the squared exponential. However, there is a good reason: RQ works really well for BO with my functions (up to the point when it crashes), so I was trying to keep it. Moreover, I get the same problem with a Matérn-5 covariance (possibly a bit less, I didn't check -- but it's definitely still there). – lacerbi Jan 07 '16 at 22:58
  • 2
    This question is really interesting. When adding the nugget effect to your covaraince matrix such as $\sigma^{2}I$ do you optimize sigma in your likelihood , or is $\sigma$ given. I have noticed that optimizing the nugget effect captures measurement noise and help he gausssian process – Wis Aug 31 '16 at 21:25
  • 1
    I usually optimize. In a few cases I tried to marginalize over it, but didn't get much of an improvement wrt optimization (I assume the posterior was very narrow). – lacerbi Aug 31 '16 at 21:53