Consider running gradient descent to minimize the training error (empirical risk), denoted $ J(\theta, X, y)$ for data $X, y$ and parameters $ \theta $. Recall the gradient descent update:
$$ \theta^{(t)} := \theta^{(t-1)} - \gamma_{t-1} \frac{\partial J(\theta^{(t-1)}, X, y)}{ \partial \theta^{(t-1)}} $$
where $t$ denotes the iteration.
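For concreteness, here is how I am picturing this update in code (a minimal sketch of my own; the names `J`, `grad_J` and the fixed step size `gamma` are just illustrative choices, not taken from any particular library):

```python
import numpy as np

def gradient_descent(J, grad_J, theta0, gamma, num_iters):
    """Run fixed-step gradient descent and record the cost J at every iterate."""
    theta = theta0.copy()
    costs = [J(theta)]
    for _ in range(num_iters):
        # The update above, with a constant step size gamma_{t-1} = gamma.
        theta = theta - gamma * grad_J(theta)
        costs.append(J(theta))
    return theta, costs
```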
My question is: does there always exist a step size $\gamma_{t-1}$ (whether fixed or a function of the iteration $t$) such that gradient descent never updates the parameters in a way that increases the training error $J$?
My intuition tells me that if $\gamma_{t-1}$ is sufficiently small, gradient descent should (provably) never increase the cost of the function being optimized. The reason I think this is that with a very large step size $\gamma_{t-1}$ we would bounce around the minimum, potentially overshooting it and increasing the cost, whereas a small enough step should avoid this. So my conjecture is that yes, such a step size $\gamma_{t-1}$ should exist, but I wanted to confirm this with the community, or see a link to a proof of this claim (given how widely gradient descent is used, this question has probably already been studied), or even have someone provide a proof if they know one.
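To illustrate the intuition numerically, here is the kind of quick check I have in mind (a sketch on a least-squares cost of my own choosing; as far as I understand, the relevant quantity is the Lipschitz constant $L$ of the gradient, which for least squares is the largest eigenvalue of $X^\top X$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))   # illustrative data, not from the question
y = rng.standard_normal(50)

# Least-squares training error J(theta, X, y) and its gradient.
J = lambda theta: 0.5 * np.sum((X @ theta - y) ** 2)
grad_J = lambda theta: X.T @ (X @ theta - y)

# For this J the gradient is L-Lipschitz with L = largest eigenvalue of X^T X.
L = np.linalg.eigvalsh(X.T @ X).max()

for gamma in [0.5 / L, 1.0 / L, 2.5 / L]:   # the last one exceeds 2/L and can overshoot
    theta = np.zeros(5)
    costs = [J(theta)]
    for _ in range(200):
        theta = theta - gamma * grad_J(theta)
        costs.append(J(theta))
    increases = sum(b > a for a, b in zip(costs, costs[1:]))
    print(f"gamma = {gamma:.5f}: J increased on {increases} of 200 iterations")
```

My expectation is that the first two step sizes never increase $J$, while the last one can, which is the overshooting picture behind my conjecture.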