
Is the following statement true: "Gradient descent is guaranteed to always decrease a loss function"?

I know that if the loss function is convex, then each iteration of gradient descent will result in a smaller loss, but does this statement hold for all loss functions? Could we design a loss function in such a way that performing gradient descent was not optimal?

Shrey
    Imagine the loss function $f(x) = 0$ – shimao Oct 04 '18 at 21:59
  • How is the step size ascertained? – Matthew Gunn Oct 04 '18 at 22:17
  • @MatthewGunn The question does not speak to the value of the step size, so I'm assuming it could be anything. – Shrey Oct 04 '18 at 22:31
  • I think you're asking two separate things. The first is essentially whether GD works for all loss functions, i.e. is there a loss function where GD wouldn't work? In the second question you ask whether there is a loss function where GD isn't optimal. The second is quite easy: with any non-convex loss function, GD is dependent on its starting point. Imagine if that weren't true and GD were the optimal way to decrease any loss function imaginable. The first question is more interesting, but in my opinion you should give a bit more detail (e.g. can the loss function be non-differentiable?) – Djib2011 Oct 04 '18 at 23:14

1 Answer


Whether the loss decreases depends on your step size. The negative gradient is the direction of steepest descent, i.e. the direction in which a sufficiently small step decreases the loss most quickly. A step of gradient descent is given by:

$$\theta_t = \theta_{t-1} -\eta\nabla_\theta\mathcal{L}$$

where $t$ indexes the training iteration, $\theta$ is the parameter, $\eta$ is the step size, and $\mathcal{L}$ is the loss. When you perform gradient descent, you are essentially relying on a linear approximation of the loss around $\theta_{t-1}$, namely $\mathcal{L}(\theta_{t-1} - \eta\nabla_\theta\mathcal{L}) \approx \mathcal{L}(\theta_{t-1}) - \eta\|\nabla_\theta\mathcal{L}\|^2$, which always predicts a decrease. If $\eta$ is too large, this approximation breaks down, and the actual loss $\mathcal{L}(\theta_{t-1} - \eta\nabla_\theta\mathcal{L})$ may be greater than $\mathcal{L}(\theta_{t-1})$, indicating that you overshot the local minimum and increased the loss (see illustration below). Note that it is also possible to overshoot the minimum and still have a decrease in loss (not shown).

[Figure: a one-dimensional loss curve where a large gradient step overshoots the local minimum and lands at a point with higher loss]
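
To make the step-size dependence concrete, here is a minimal sketch (not from the original answer; the quadratic loss $\mathcal{L}(\theta) = \theta^2$ and the two step sizes are assumptions chosen purely for illustration). With $\eta = 0.1$ the loss shrinks every iteration, while with $\eta = 1.2$ each update overshoots the minimum at $\theta = 0$ and the loss grows:

```python
# Gradient descent on the toy loss L(theta) = theta^2, whose gradient is 2*theta.
# A small step size decreases the loss each step; a step size that is too large
# overshoots the minimum and increases the loss.

def loss(theta):
    return theta ** 2

def grad(theta):
    return 2 * theta

def gd_step(theta, eta):
    # theta_t = theta_{t-1} - eta * dL/dtheta
    return theta - eta * grad(theta)

for eta in (0.1, 1.2):  # step sizes chosen for illustration
    theta = 1.0
    print(f"eta = {eta}")
    for t in range(1, 4):
        theta = gd_step(theta, eta)
        print(f"  step {t}: theta = {theta: .3f}, loss = {loss(theta):.3f}")
```

For this particular loss the update simplifies to $\theta_t = (1 - 2\eta)\,\theta_{t-1}$, so the loss decreases only when $|1 - 2\eta| < 1$, i.e. when $0 < \eta < 1$; outside that range every step makes the loss larger.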