I think this is an open-ended question, because a lot will depend on the actual dataset you are optimising against, how close your first candidate solution $s_0$ is to a local optimum, and whether you are interested in (and able to) supply derivative information.
I have used R's standard optim function, and generally I have found that the L-BFGS-B algorithm is the fastest, or close to the fastest, of the default optimisation algorithms available, provided I supply a derivative function. The authors of the GPML Matlab Code also provide an L-BFGS-B implementation, so I suspect they too found that L-BFGS-B is reasonably competitive in the context of a general application when derivative information is provided.
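To make this concrete, here is a minimal sketch (my own illustrative example, not taken from GPML) of fitting the hyper-parameters of a squared-exponential kernel by minimising the negative log marginal likelihood with optim and method = "L-BFGS-B", supplying the analytic gradient. The toy data, the log-parameterisation and all variable names are assumptions made purely for the example.

```r
set.seed(1)
x  <- seq(0, 10, length.out = 50)
y  <- sin(x) + rnorm(length(x), sd = 0.2)   # toy data, purely illustrative
D2 <- outer(x, x, "-")^2                    # matrix of squared distances

## theta = (log length-scale, log signal sd, log noise sd)
nll <- function(theta) {
  ell <- exp(theta[1]); sf <- exp(theta[2]); sn <- exp(theta[3])
  K <- sf^2 * exp(-0.5 * D2 / ell^2) + diag(sn^2, length(y))
  L <- chol(K)                              # K = t(L) %*% L
  alpha <- backsolve(L, forwardsolve(t(L), y))
  0.5 * sum(y * alpha) + sum(log(diag(L))) + 0.5 * length(y) * log(2 * pi)
}

## gradient of the negative log-likelihood w.r.t. the log-parameters:
## dNLL/dtheta_j = 0.5 * tr((K^{-1} - alpha alpha') dK/dtheta_j)
nll_grad <- function(theta) {
  ell <- exp(theta[1]); sf <- exp(theta[2]); sn <- exp(theta[3])
  Kf <- sf^2 * exp(-0.5 * D2 / ell^2)
  K  <- Kf + diag(sn^2, length(y))
  Ki <- chol2inv(chol(K))
  alpha <- Ki %*% y
  W <- Ki - alpha %*% t(alpha)
  c(0.5 * sum(W * (Kf * D2 / ell^2)),          # d/d log(ell)
    0.5 * sum(W * (2 * Kf)),                   # d/d log(sf)
    0.5 * sum(W * diag(2 * sn^2, length(y))))  # d/d log(sn)
}

fit <- optim(par = c(0, 0, log(0.1)), fn = nll, gr = nll_grad,
             method = "L-BFGS-B", lower = rep(-5, 3), upper = rep(5, 3))
exp(fit$par)   # fitted length-scale, signal sd and noise sd
```

Working on the log scale keeps the parameters positive and tends to make the box constraints passed to L-BFGS-B far less delicate.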
Another option is to use derivative-free optimisation. The 2013 review paper by Rios and Sahinidis, "Derivative-free optimization: A review of algorithms and comparison of software implementations", in the Journal of Global Optimization, seems to be your best bet for something exhaustive. Within R, the minqa package provides derivative-free optimisation by quadratic approximation (QA) routines. The package contains some of Powell's most famous "optimisation children": UOBYQA, NEWUOA and BOBYQA (see the short sketch after this paragraph). I have found UOBYQA to be the fastest of the three for toy problems, despite Wikipedia's general advice: "For general usage, NEWUOA is recommended to replace UOBYQA." This is not very surprising: log-likelihoods are smooth functions with well-defined derivatives, so NEWUOA might not enjoy an obvious advantage. Again, this shows that there is no silver bullet.

On that matter, I have played around with some Particle Swarm Optimisation (PSO) and Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithms, included in the R packages hydroPSO and cmaes respectively, but in general, while faster and far more informative than the canned Simulated Annealing (SANN) option in optim, they were not remotely competitive in terms of speed with the QA routines. Notice that estimating the hyper-parameter vector $\theta$ of a log-likelihood function is usually a smooth and (at least locally) convex problem, so stochastic optimisation generally will not offer a great advantage.
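As a hedged sketch of the derivative-free route, the same negative log-likelihood from the earlier snippet (the illustrative nll function, reused here as an assumption) can be minimised with minqa's BOBYQA, which accepts box constraints and needs no gradient:

```r
library(minqa)

## derivative-free fit of the same toy negative log-likelihood
fit_qa <- bobyqa(par = c(0, 0, log(0.1)), fn = nll,
                 lower = rep(-5, 3), upper = rep(5, 3))
exp(fit_qa$par)   # should land close to the L-BFGS-B solution
```

If you do not need bounds, minqa's newuoa and uobyqa can be called the same way with just par and fn.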
To recap: I would suggest using L-BFGS-B with derivative information. If derivative information is hard to obtain (e.g. due to complicated kernel functions), use quadratic approximation routines.