
I've recently learned gradient descent, and it clearly gets stuck at local minima when applied to non-convex functions.

Can't we just randomly kick the values in between steps when iterating?

Kind of like quantum tunneling. That would drastically increase the probability of reaching the global minimum.
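
A rough sketch of what I mean (the toy objective, kick size, and kick schedule below are all made up for illustration):

```python
import numpy as np

def f(x):
    # Made-up 1-D non-convex objective with many local minima.
    return x**2 + 10 * np.sin(x)

def grad_f(x):
    return 2 * x + 10 * np.cos(x)

def gd_with_kicks(x0, lr=0.01, kick_scale=2.0, n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = best = x0
    for t in range(n_steps):
        x -= lr * grad_f(x)                  # ordinary gradient step
        if t % 100 == 0:                     # occasionally "kick" the iterate
            x += rng.normal(scale=kick_scale)
        if f(x) < f(best):                   # keep the best point seen so far
            best = x
    return best

print(gd_with_kicks(x0=5.0))
```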

asked by A__A (edited by Ferdi)
  • I think there are several methods that deal with this problem. The first one to come to my mind when reading this is simulated annealing; a Google search for simulated annealing might be useful to you. I also like the Wikipedia article, as it explains the method in a simple way and makes the analogy to metallurgy. – Michael R. Chernick Jan 02 '17 at 14:31
  • Depending on your optimization problem (functions used, size, difficulty), you may be able to use a "real" global optimizer, such as BARON, to find the global optimum and avoid heuristics like simulated annealing. Gradient descent should only be used when absolutely necessary ... and not even then. Do you have example problems of interest, or is this a generic question? – Mark L. Stone Jan 02 '17 at 14:34
  • Despite Mark L Stone's comment you might still benefit from learning about simulated annealing. Your problem may be one with many local minima. CV has lots of discussions about simulated annealing. You can do a search to see over 100 posts. – Michael R. Chernick Jan 02 '17 at 14:41
  • @MichaelChernick Simulated annealing has its place, such as on problems for which real global optimization can't be applied (within time, resource, or capability constraints), or on problems for which simulated annealing has been shown to work well, which is certainly not all problems. – Mark L. Stone Jan 02 '17 at 14:54
  • @MarkL.Stone this was a generic question. You're right: if we can implement the gradient algorithm, in almost all cases we also know the normal equation (least-squares solution), which will give a much better result much quicker. – A__A Jan 02 '17 at 19:21
  • I don't know what you're talking about now. You mention the normal equation, which is a terrible way to solve a least squares problem, and which only applies to linear least squares, which is a convex problem (see the sketch after these comments). Nonlinear least squares may be either convex or non-convex, but is not solved by the normal equation; it could potentially be solved via the Karush-Kuhn-Tucker conditions. So what are you talking about? It sounds like you learned about least squares solutions from Andrew Ng videos, which present the two worst solution methods. – Mark L. Stone Jan 02 '17 at 19:52
  • Well, I'm actually reading myself and have barely worked with non-convex functions. I actually mentioned that I'm a newbie in the question but it got edited. – A__A Jan 02 '17 at 20:05
  • Buy the book "Numerical Optimization" by Nocedal and Wright (http://www.springer.com/us/book/9780387303031) and get "Convex Optimization" by Boyd and Vandenberghe (http://stanford.edu/~boyd/cvxbook) for free. Read them both and work some problems, and you'll really know something about optimization. And forget those stupid videos on YouTube. Of course, you need to know some math to get through these books, but you don't have to be able to follow every detail. – Mark L. Stone Jan 02 '17 at 20:43
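
For reference, a minimal illustration of the numerical point raised in the comments above, comparing the normal equations against NumPy's SVD-based least-squares solver on a made-up random problem (the data and problem sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))   # made-up design matrix
b = rng.normal(size=100)        # made-up targets

# Normal equations: forming A^T A squares the condition number,
# which is numerically fragile for ill-conditioned problems.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# SVD-based solver: the numerically preferred route.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # True on this well-conditioned example
```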

1 Answer


{2} mentions two well-known methods:

ITERATIVE HILL-CLIMBING: This method combines hill-climbing with random search. Once one peak has been found, the hill-climbing process is repeated from another randomly selected starting location. Each search is performed in isolation; hence, the method builds no overall picture of the shape of the problem domain.
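
A minimal sketch of this restart scheme, assuming a made-up 1-D objective, a fixed step size, and uniform random restart points:

```python
import numpy as np

def hill_climb(f, x0, step=0.1, max_iters=1000):
    """Greedy local search: move to a better neighbor until stuck."""
    x = x0
    for _ in range(max_iters):
        best = max((x - step, x + step), key=f)
        if f(best) <= f(x):
            return x                        # local peak reached
        x = best
    return x

def iterated_hill_climb(f, n_restarts=20, seed=0):
    """Repeat hill climbing from random starts; return the best peak."""
    rng = np.random.default_rng(seed)
    peaks = [hill_climb(f, rng.uniform(-10, 10)) for _ in range(n_restarts)]
    return max(peaks, key=f)

f = lambda x: np.sin(x) - 0.01 * x**2       # made-up multi-modal objective
print(iterated_hill_climb(f))
```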

SIMULATED ANNEALING: Invented in 1982 by Kirkpatrick, it is essentially a modified version of hill-climbing. Starting from a random point in the search space, a random move is made. If the move takes us to a higher point, it is accepted; otherwise, it is accepted with a probability that decreases over time. The probability starts near 1 and decreases towards zero as time passes. As such, almost any move is accepted at the beginning, but as the probability continues to decrease, the probability of accepting negative moves is lowered. Negative moves are sometimes essential if local maxima are to be escaped, but too many negative moves would lead the search away from the maxima. Again, this method deals with one candidate at a time, so it does not build an overall picture of the search space, and no information from previous moves is used to guide the selection of new moves. This technique has been successful in many applications, for example VLSI circuit layout.
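
A minimal sketch of the scheme just described (the cooling schedule and proposal scale below are assumptions; the acceptance rule is the standard Metropolis criterion):

```python
import numpy as np

def simulated_annealing(f, x0, n_steps=10000, T0=1.0, seed=0):
    """Maximize f: uphill moves are always accepted; downhill moves
    are accepted with a probability that decays as the system cools."""
    rng = np.random.default_rng(seed)
    x = best = x0
    for t in range(1, n_steps + 1):
        T = T0 / t                           # assumed cooling schedule
        cand = x + rng.normal(scale=0.5)     # random move
        delta = f(cand) - f(x)
        if delta > 0 or rng.random() < np.exp(delta / T):
            x = cand                         # accept uphill, or downhill w.p. e^{delta/T}
        if f(x) > f(best):
            best = x
    return best

f = lambda x: np.sin(x) - 0.01 * x**2        # same toy objective as above
print(simulated_annealing(f, x0=8.0))
```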

You can also use an evolutionary algorithm and add some hill climbing (= gradient ascent if your function is differentiable), as done in {1}:

[Figure from {1}: an evolutionary algorithm combined with hill climbing.]
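
A minimal sketch of the combined idea: a simple population loop with a stochastic hill-climbing refinement step. This illustrates the general "memetic" pattern, not the exact algorithm of {1}; the population size, mutation scale, and step counts are all made up:

```python
import numpy as np

def memetic_maximize(f, pop_size=30, n_gens=50, sigma=1.0, seed=0):
    """Evolutionary loop with a local hill-climbing refinement step."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-10, 10, size=pop_size)
    for _ in range(n_gens):
        # Selection: keep the fitter half of the population.
        pop = pop[np.argsort(f(pop))][pop_size // 2:]
        # Variation: Gaussian mutation of the survivors.
        children = pop + rng.normal(scale=sigma, size=pop.size)
        pop = np.concatenate([pop, children])
        # Local search: a few stochastic hill-climbing steps per individual
        # (gradient ascent would also work if f is differentiable).
        for _ in range(5):
            trial = pop + rng.normal(scale=0.1, size=pop.size)
            pop = np.where(f(trial) > f(pop), trial, pop)
    return pop[np.argmax(f(pop))]

f = lambda x: np.sin(x) - 0.01 * x**2        # same toy objective as above
print(memetic_maximize(f))
```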


References:

answered by Franck Dernoncourt