
Is the following statement true: "Gradient descent is guaranteed to always decrease a loss function"?

I know that if the loss function is convex, then each iteration of gradient descent will result in a smaller loss, but does this statement hold for all loss functions? Could we design a loss function in such a way that performing gradient descent was not optimal?

Shrey
    Imagine the loss function $f(x) = 0$ – shimao Oct 04 '18 at 21:59
  • How is the step size ascertained? – Matthew Gunn Oct 04 '18 at 22:17
  • @MatthewGunn The question does not speak to the value of the step size, so I'm assuming it could be anything. – Shrey Oct 04 '18 at 22:31
  • I think you're asking two separate things. The first is essentially whether GD works for all loss functions, i.e. is there a loss function where GD wouldn't work? In the second question you ask whether there is a loss function where GD isn't optimal. The second is quite easy: with any non-convex loss function, GD is dependent on its starting point. Imagine if that weren't true and GD were the optimal way to decrease any loss function imaginable. The first question is more interesting, but in my opinion you should give a bit more detail (e.g. can the loss function be non-differentiable?) – Djib2011 Oct 04 '18 at 23:14

1 Answer


Whether the loss decreases depends on your step size. The negative gradient is the direction of steepest descent, i.e. the direction in which a sufficiently small step decreases the loss most quickly. A step of gradient descent is given by:

$$\theta_t = \theta_{t-1} -\eta\nabla_\theta\mathcal{L}$$

where $t$ indexes the training iteration, $\theta$ is the parameter, $\eta$ is the step size, and $\mathcal{L}$ is the loss. When you perform gradient descent, you are essentially relying on a linear approximation of the loss around $\theta_{t-1}$, namely $\mathcal{L}(\theta_{t-1} - \eta\nabla_\theta\mathcal{L}) \approx \mathcal{L}(\theta_{t-1}) - \eta\|\nabla_\theta\mathcal{L}\|^2$, which always predicts a decrease. If $\eta$ is too large, this approximation breaks down, and the actual loss $\mathcal{L}(\theta_{t-1} - \eta\nabla_\theta\mathcal{L})$ may be greater than $\mathcal{L}(\theta_{t-1})$, indicating that you overshot the local minimum and increased the loss (see illustration below). Note that it is also possible to overshoot the minimum and still have a decrease in loss (not shown).

[Figure: a one-dimensional loss curve where a large gradient step overshoots the local minimum and lands at a point with higher loss]
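
To make the step-size dependence concrete, here is a minimal sketch (not from the original answer; the quadratic loss $\mathcal{L}(\theta) = \theta^2$ and the two step sizes are assumptions chosen purely for illustration). With $\eta = 0.1$ the loss shrinks every iteration, while with $\eta = 1.2$ each update overshoots the minimum at $\theta = 0$ and the loss grows:

```python
# Gradient descent on the toy loss L(theta) = theta^2, whose gradient is 2*theta.
# A small step size decreases the loss each step; a step size that is too large
# overshoots the minimum and increases the loss.

def loss(theta):
    return theta ** 2

def grad(theta):
    return 2 * theta

def gd_step(theta, eta):
    # theta_t = theta_{t-1} - eta * dL/dtheta
    return theta - eta * grad(theta)

for eta in (0.1, 1.2):  # step sizes chosen for illustration
    theta = 1.0
    print(f"eta = {eta}")
    for t in range(1, 4):
        theta = gd_step(theta, eta)
        print(f"  step {t}: theta = {theta: .3f}, loss = {loss(theta):.3f}")
```

For this particular loss the update simplifies to $\theta_t = (1 - 2\eta)\,\theta_{t-1}$, so the loss decreases only when $|1 - 2\eta| < 1$, i.e. when $0 < \eta < 1$; outside that range every step makes the loss larger.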