
Background and problem

I am using Gaussian processes (GPs) for regression and subsequent Bayesian optimization (BO). For regression I use the gpml package for MATLAB with several custom modifications, but the problem is general.

It is a well-known fact that when two training inputs are too close in input space, the covariance matrix may become (numerically) non-positive definite (there are several questions about this on this site). As a result, the Cholesky decomposition of the covariance matrix, which is needed for various GP computations, may fail due to numerical error. This has happened to me in several cases when performing BO with the objective functions I am using, and I'd like to fix it.
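To make the failure mode concrete, here is a minimal standalone example in plain MATLAB (not gpml; the inputs and hyperparameter values are made up purely for illustration): a noise-free squared-exponential covariance over closely spaced inputs is so ill-conditioned that chol typically gives up.

    % Minimal illustration (plain MATLAB, not gpml): closely spaced inputs
    % under a noise-free squared-exponential kernel give a matrix whose
    % smallest eigenvalues fall below machine precision, so chol can fail.
    n   = 100;
    x   = linspace(0, 1, n)';                % densely packed training inputs
    ell = 1; sf2 = 1;                        % illustrative hyperparameters
    D2  = bsxfun(@minus, x, x').^2;          % squared pairwise distances
    K   = sf2 * exp(-0.5 * D2 / ell^2);      % noise-free SE covariance matrix
    fprintf('condition number of K: %.3g\n', cond(K));
    [~, p] = chol(K);                        % p > 0 means chol gave up
    if p > 0
        disp('Cholesky failed: K is not numerically positive definite.');
    end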

Proposed solutions

AFAIK, the standard solution to alleviate ill-conditioning is to add a ridge or nugget to the diagonal of the covariance matrix. For GP regression, this amounts to adding (or increasing, if already present) observation noise.
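In code, that fix typically looks like the retry loop sketched below. This is only an illustration of the idea, not gpml's implementation; the function name, the starting jitter value, and the escalation schedule are arbitrary choices of mine.

    % Sketch of the standard nugget/jitter fix: retry the Cholesky
    % factorization with an exponentially growing diagonal offset.
    % (Save as chol_with_jitter.m; name and constants are illustrative.)
    function [L, jitter] = chol_with_jitter(K)
        jitter = 0;                               % first try without any nugget
        for attempt = 1:10
            [L, p] = chol(K + jitter * eye(size(K)), 'lower');
            if p == 0
                return;                           % K + jitter*I is numerically PD
            end
            if jitter == 0
                jitter = 1e-10 * mean(diag(K));   % heuristic starting value
            else
                jitter = 10 * jitter;             % escalate and retry
            end
        end
        error('chol_with_jitter:notPD', ...
              'could not make the matrix positive definite (last jitter %g)', jitter);
    end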

So far so good. I modified gpml's exact inference code so that whenever the Cholesky decomposition fails, I replace the covariance matrix with the closest symmetric positive definite (SPD) matrix in Frobenius norm, inspired by this MATLAB code by John d'Errico. The rationale is to minimize the intervention on the original matrix.
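For reference, that projection amounts to Higham's nearest-SPD construction. The sketch below follows the same steps as d'Errico's nearestSPD, but it is not his code verbatim, nor the exact patch I made to gpml.

    % Rough sketch of the nearest-SPD projection (Higham, 1988), along the
    % lines of d'Errico's nearestSPD; not his code verbatim, and not the
    % exact modification made to gpml. Save as nearest_spd.m.
    function Ahat = nearest_spd(A)
        B = (A + A') / 2;                 % symmetric part of A
        [~, S, V] = svd(B);
        H = V * S * V';                   % symmetric polar factor of B
        Ahat = (B + H) / 2;               % nearest symmetric PSD matrix in Frobenius norm
        Ahat = (Ahat + Ahat') / 2;        % enforce exact symmetry after round-off
        % The projection can still be only semidefinite numerically, so nudge
        % the diagonal until chol succeeds.
        k = 0;
        [~, p] = chol(Ahat);
        while p > 0
            k = k + 1;
            mineig = min(eig(Ahat));
            Ahat = Ahat + (k^2 * max(-mineig, 0) + eps(norm(Ahat))) * eye(size(A));
            [~, p] = chol(Ahat);
        end
    end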

This workaround does the job, but I noticed that the performance of BO degraded substantially for some functions -- possibly whenever the algorithm needs to zoom in on some area (e.g., because it is getting nearer to the minimum, or because the length scales of the problem become non-uniformly small). This behaviour makes sense, since I am effectively increasing the noise whenever two input points get too close, but of course it's not ideal. Alternatively, I could just remove problematic points, but again, sometimes I need the input points to be close.

Question

I don't think that numerical issues with the Cholesky factorization of GP covariance matrices are a novel problem, but to my surprise I couldn't find many solutions so far, aside from increasing the noise or removing points that are too close to each other. On the other hand, it is true that some of my functions are pretty badly behaved, so perhaps my situation is not so typical.

Any suggestion/reference that could be useful here?

  • You might look into forming the entries of the covariance matrix, as well as computing or updating its Cholesky factorization, in higher precision, for instance, quad precision or even higher. Aside from the hassle, the calculations may be orders of magnitude slower. There are arbitrary precision add-ons for MATLAB. I'm not saying this is ideal, but it may be an option. I don't know how well they play with gpml, but if you can change gpml source code (m files), perhaps you can do it. – Mark L. Stone Jan 07 '16 at 19:36
  • Did you try to add a small jitter to the diagonal of the covariance matrix? – Zen Jan 07 '16 at 20:24
  • @MarkL.Stone Thanks for the suggestion. Unfortunately I need the training code to be fast, so high-precision numerics is probably not going to be a good choice for my application. – lacerbi Jan 07 '16 at 20:27
  • @Zen You mean random jitter (e.g. normally distributed)? No, I didn't try, I just added a constant on the diagonal, if necessary, after converting to the closest SPD matrix. Why would random jitter work better than a constant? – lacerbi Jan 07 '16 at 20:33
  • 1. If you add enough nugget/ridge the covariance matrix will be PD. Why do you then approximate (instead or in addition)? 2. Do you use the quadratic exponential as covariance function? If so then you might want to try other functions. The problem with ill-conditioning is particularly severe for the quadratic exponential. – g g Jan 07 '16 at 22:22
  • @gg 1. I appreciate that an arbitrarily big nugget/ridge would fix the problem, but ideally here we want to minimize intervention on the original matrix. "The closest PD matrix in Frobenius norm" sounded like a reasonable solution. To be honest, I am not sure if this method works better in practice, I haven't tested it extensively. Anyhow, this is secondary, I get the same problem even if I simply use a ridge (unless I ramp it up to massive levels, with awful results on BO). – lacerbi Jan 07 '16 at 22:53
  • 2. I know -- I am using the rational quadratic (RQ), which is even worse than the squared exponential. However, there is a good reason: RQ works really well for BO with my functions (up to the point when it crashes), so I was trying to keep it. Moreover, I get the same problem with a Matérn-5 covariance (possibly a bit less, I didn't check -- but it's definitely still there). – lacerbi Jan 07 '16 at 22:58
  • 2
    This question is really interesting. When adding the nugget effect to your covaraince matrix such as $\sigma^{2}I$ do you optimize sigma in your likelihood , or is $\sigma$ given. I have noticed that optimizing the nugget effect captures measurement noise and help he gausssian process – Wis Aug 31 '16 at 21:25
  • 1
    I usually optimize. In a few cases I tried to marginalize over it, but didn't get much of an improvement wrt optimization (I assume the posterior was very narrow). – lacerbi Aug 31 '16 at 21:53