Is it true that generally L-BFGS may not converge in non-convex settings even if learning rate is really small?
For example here L-BFGS diverges, but there are theoretical guarantees on its local convergence. How can this be explained?
Yes, it is true that the L-BFGS-B algorithm is not guaranteed to converge to the true global minimum in a non-convex setting, even if the learning rate is very small.
Using a quasi-Newton method means that you try to find the optimal $\theta$ through an iterative scheme of the form $\theta_{k+1} = \theta_{k} - \alpha S_k g_k$, where $\theta$ are the parameters you optimise over, $k$ indexes the iteration, $\alpha$ is your learning rate, $S_k$ is the "Hessian-like" matrix associated with the system $A$ you are trying to solve, and $g_k$ is the gradient; usually $S_{0} = I$. (Notice that if $S_k = A^{-1}$ you recover the plain Newton's method, and if $S_k = I$ you recover standard steepest descent.)
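To make the scheme concrete, here is a minimal sketch of that update on a simple quadratic $f(\theta) = \tfrac{1}{2}\theta^\top A \theta$; the matrix `A`, the step count, and `alpha` are all illustrative choices, and $S_k$ is simply held at $I$ (so this particular run is steepest descent, the $S_0 = I$ special case):

```python
import numpy as np

# Illustrative SPD system matrix; A is also the Hessian of the quadratic,
# so S = inv(A) would give Newton's method in one step.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
grad = lambda t: A @ t            # gradient of f(theta) = 0.5 * theta^T A theta

theta = np.array([2.0, -1.0])     # theta_0
S = np.eye(2)                     # S_0 = I; a full BFGS run would update S each step
alpha = 0.1

for k in range(200):
    g = grad(theta)
    theta = theta - alpha * S @ g  # the generic quasi-Newton update theta_{k+1} = theta_k - alpha S_k g_k

print(theta)  # close to the minimiser at the origin
```

A full L-BFGS implementation would, after each step, update $S_k$ from the pair $(\theta_{k+1} - \theta_k,\; g_{k+1} - g_k)$ instead of keeping it fixed.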
Now, as you can see, the learning rate $\alpha$ enters this scheme only through the size of the update to the current solution. Having a very small $\alpha$ only ensures that you rely on the local gradient information more heavily, at the expense of the Hessian information. To draw a parallel with the paper you cite, this is why L-BFGS-B occasionally diverges on the $L_2$-regularised regression problem while plain gradient descent always converges there. A very low $\alpha$ guarantees that you will eventually converge to a local minimum, but at the cost of very slow progress.
Having said all that, non-convexity implies (but does not equate to) the presence of multiple local minima. The scheme above guarantees that you will find one of them for any given $\theta_0$. Converging to a particular minimum, though, does not mean converging to the global minimum; in the presence of many local minima you simply land in the one whose basin of attraction contains your $\theta_0$.

Notice that L-BFGS-B works best on an objective/loss function that is at least locally convex and differentiable. This is why it fails (rather miserably) in the robust regression case: the Hessian information gets muddled up completely and the algorithm gets stuck. This is further emphasised in the authors' nice plot $2(c)$, where you can see that gradient descent just bounces around wildly after a while, because even the first-order gradient information is not very informative. As pointed out in the comments, there are alternative update schemes to BFGS (e.g. SR1) that can handle non-convexity in cases where the standard BFGS updates fail.
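The dependence on $\theta_0$ is easy to demonstrate. Below is a tiny pure-Python sketch on an illustrative double-well objective (not one from the paper): $f(x) = (x^2 - 1)^2 + 0.3x$ has two local minima, and the tilt term makes the one near $x = -1$ the global minimum. The same descent scheme started from $\theta_0 = 2$ lands in the other, non-global, basin:

```python
# Illustrative double-well objective: f(x) = (x^2 - 1)^2 + 0.3 x.
# Its derivative is f'(x) = 4 x (x^2 - 1) + 0.3.
def grad(x):
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x0, alpha=0.01, steps=5000):
    """Plain steepest descent (S = I): converges to the local minimum
    whose basin of attraction contains x0."""
    x = x0
    for _ in range(steps):
        x -= alpha * grad(x)
    return x

print(descend(2.0))    # lands near +0.96: a local, non-global minimum
print(descend(-2.0))   # lands near -1.04: the global minimum
```

Both runs converge, and with a smaller `alpha` they would still converge to the same two points, just more slowly; no choice of learning rate moves the first run out of its basin into the global minimum.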