Is it true that generally L-BFGS may not converge in non-convex settings even if learning rate is really small?
For example here L-BFGS diverges, but there are theoretical guarantees on its local convergence. How can this be explained?
Yes, it is true that the L-BFGS-B algorithm is not guaranteed to converge to the true global minimum in a non-convex setting, even if the learning rate is very small.
Using a quasi-Newton method means that you try to find the optimal $\theta$ through an iterative scheme of the form $\theta_{k+1} = \theta_{k} - \alpha S_k g_k$, where $\theta$ are the parameters you optimise over, $k$ indexes the iteration, $\alpha$ is your learning rate, $S_k$ is the "Hessian-like" matrix associated with the system $A$ you are trying to solve, and $g_k$ is the gradient; usually $S_{0} = I$. (Notice that if $S_k = A^{-1}$ you recover the plain Newton's method, and if $S_k = I$ you recover standard steepest descent.)
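To make the scheme concrete, here is a minimal sketch of that update on a simple quadratic $f(\theta) = \tfrac{1}{2}\theta^\top A \theta$; the matrix `A`, the step count, and `alpha` are all illustrative choices, and $S_k$ is simply held at $I$ (so this particular run is steepest descent, the $S_0 = I$ special case):

```python
import numpy as np

# Illustrative SPD system matrix; A is also the Hessian of the quadratic,
# so S = inv(A) would give Newton's method in one step.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
grad = lambda t: A @ t            # gradient of f(theta) = 0.5 * theta^T A theta

theta = np.array([2.0, -1.0])     # theta_0
S = np.eye(2)                     # S_0 = I; a full BFGS run would update S each step
alpha = 0.1

for k in range(200):
    g = grad(theta)
    theta = theta - alpha * S @ g  # the generic quasi-Newton update theta_{k+1} = theta_k - alpha S_k g_k

print(theta)  # close to the minimiser at the origin
```

A full L-BFGS implementation would, after each step, update $S_k$ from the pair $(\theta_{k+1} - \theta_k,\; g_{k+1} - g_k)$ instead of keeping it fixed.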
Now, as you can see, the learning rate $\alpha$ enters this scheme only through the size of the update to the current solution. Having a very small $\alpha$ only ensures that you rely on the local gradient information more heavily, at the expense of the Hessian information. To draw a parallel with the paper you cite, this is why L-BFGS-B occasionally diverges on the $L_2$-regularised regression problem while plain gradient descent always converges there. A very low $\alpha$ guarantees that you will eventually converge to a local minimum, but at the cost of very slow progress.
Having said all that, non-convexity implies (but does not equate to) the presence of multiple local minima. The scheme above guarantees that you will find one of them for any given $\theta_0$. Converging to a particular minimum, though, does not mean converging to the global minimum; in the presence of many local minima you simply land in the one whose basin of attraction contains your $\theta_0$.

Notice that L-BFGS-B works best on an objective/loss function that is at least locally convex and differentiable. This is why it fails (rather miserably) in the robust regression case: the Hessian information gets muddled up completely and the algorithm gets stuck. This is further emphasised in the authors' nice plot $2(c)$, where you can see that gradient descent just bounces around wildly after a while, because even the first-order gradient information is not very informative. As pointed out in the comments, there are alternative update schemes to BFGS (e.g. SR1) that can handle non-convexity in cases where the standard BFGS updates fail.
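The dependence on $\theta_0$ is easy to demonstrate. Below is a tiny pure-Python sketch on an illustrative double-well objective (not one from the paper): $f(x) = (x^2 - 1)^2 + 0.3x$ has two local minima, and the tilt term makes the one near $x = -1$ the global minimum. The same descent scheme started from $\theta_0 = 2$ lands in the other, non-global, basin:

```python
# Illustrative double-well objective: f(x) = (x^2 - 1)^2 + 0.3 x.
# Its derivative is f'(x) = 4 x (x^2 - 1) + 0.3.
def grad(x):
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x0, alpha=0.01, steps=5000):
    """Plain steepest descent (S = I): converges to the local minimum
    whose basin of attraction contains x0."""
    x = x0
    for _ in range(steps):
        x -= alpha * grad(x)
    return x

print(descend(2.0))    # lands near +0.96: a local, non-global minimum
print(descend(-2.0))   # lands near -1.04: the global minimum
```

Both runs converge, and with a smaller `alpha` they would still converge to the same two points, just more slowly; no choice of learning rate moves the first run out of its basin into the global minimum.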