I think this is an open-ended question, because a lot will depend on the actual dataset you are optimising against, how close your first candidate solution $s_0$ is to a local optimum, and whether you are interested in (and able to) supply derivative information.
I have used R's standard optim function, and generally I have found that the L-BFGS-B algorithm is the fastest, or close to the fastest, of the default optimisation algorithms available, provided I supply a derivative function. The authors of the GPML Matlab Code also provide an L-BFGS-B implementation, so I suspect they too found that L-BFGS-B is reasonably competitive in the context of a general application when derivative information is provided.
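To make this concrete, here is a minimal sketch (my own illustrative example, not taken from GPML) of fitting the hyper-parameters of a squared-exponential kernel by minimising the negative log marginal likelihood with optim and method = "L-BFGS-B", supplying the analytic gradient. The toy data, the log-parameterisation and all variable names are assumptions made purely for the example.

```r
set.seed(1)
x  <- seq(0, 10, length.out = 50)
y  <- sin(x) + rnorm(length(x), sd = 0.2)   # toy data, purely illustrative
D2 <- outer(x, x, "-")^2                    # matrix of squared distances

## theta = (log length-scale, log signal sd, log noise sd)
nll <- function(theta) {
  ell <- exp(theta[1]); sf <- exp(theta[2]); sn <- exp(theta[3])
  K <- sf^2 * exp(-0.5 * D2 / ell^2) + diag(sn^2, length(y))
  L <- chol(K)                              # K = t(L) %*% L
  alpha <- backsolve(L, forwardsolve(t(L), y))
  0.5 * sum(y * alpha) + sum(log(diag(L))) + 0.5 * length(y) * log(2 * pi)
}

## gradient of the negative log-likelihood w.r.t. the log-parameters:
## dNLL/dtheta_j = 0.5 * tr((K^{-1} - alpha alpha') dK/dtheta_j)
nll_grad <- function(theta) {
  ell <- exp(theta[1]); sf <- exp(theta[2]); sn <- exp(theta[3])
  Kf <- sf^2 * exp(-0.5 * D2 / ell^2)
  K  <- Kf + diag(sn^2, length(y))
  Ki <- chol2inv(chol(K))
  alpha <- Ki %*% y
  W <- Ki - alpha %*% t(alpha)
  c(0.5 * sum(W * (Kf * D2 / ell^2)),          # d/d log(ell)
    0.5 * sum(W * (2 * Kf)),                   # d/d log(sf)
    0.5 * sum(W * diag(2 * sn^2, length(y))))  # d/d log(sn)
}

fit <- optim(par = c(0, 0, log(0.1)), fn = nll, gr = nll_grad,
             method = "L-BFGS-B", lower = rep(-5, 3), upper = rep(5, 3))
exp(fit$par)   # fitted length-scale, signal sd and noise sd
```

Working on the log scale keeps the parameters positive and tends to make the box constraints passed to L-BFGS-B far less delicate.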
Another option is to use derivative-free optimisation. The 2013 review paper by Rios and Sahinidis, "Derivative-free optimization: A review of algorithms and comparison of software implementations", in the Journal of Global Optimization, seems to be your best bet for something exhaustive. Within R, the minqa package provides derivative-free optimisation by quadratic approximation (QA) routines. The package contains some of Powell's most famous "optimisation children": UOBYQA, NEWUOA and BOBYQA (see the short sketch after this paragraph). I have found UOBYQA to be the fastest of the three for toy problems, despite Wikipedia's general advice: "For general usage, NEWUOA is recommended to replace UOBYQA." This is not very surprising: log-likelihoods are smooth functions with well-defined derivatives, so NEWUOA might not enjoy an obvious advantage. Again, this shows that there is no silver bullet.

On that matter, I have played around with some Particle Swarm Optimisation (PSO) and Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithms, included in the R packages hydroPSO and cmaes respectively, but in general, while faster and far more informative than the canned Simulated Annealing (SANN) option in optim, they were not remotely competitive in terms of speed with the QA routines. Notice that estimating the hyper-parameter vector $\theta$ of a log-likelihood function is usually a smooth and (at least locally) convex problem, so stochastic optimisation generally will not offer a great advantage.
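As a hedged sketch of the derivative-free route, the same negative log-likelihood from the earlier snippet (the illustrative nll function, reused here as an assumption) can be minimised with minqa's BOBYQA, which accepts box constraints and needs no gradient:

```r
library(minqa)

## derivative-free fit of the same toy negative log-likelihood
fit_qa <- bobyqa(par = c(0, 0, log(0.1)), fn = nll,
                 lower = rep(-5, 3), upper = rep(5, 3))
exp(fit_qa$par)   # should land close to the L-BFGS-B solution
```

If you do not need bounds, minqa's newuoa and uobyqa can be called the same way with just par and fn.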
To recap: I would suggest using L-BFGS-B with derivative information. If derivative information is hard to obtain (e.g. due to complicated kernel functions), use quadratic approximation routines.