
I just wondered: are there cases where small or very small learning rates in gradient-descent-based optimization are useful?

A large learning rate allows the model to explore a much larger portion of the parameter space per step. A small learning rate, on the other hand, can mean the model takes a long time to converge.

In which cases are small learning rates particularly useful?
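To make the trade-off concrete, here is a minimal sketch (my own illustration, not from the post): plain gradient descent on the 1-D quadratic f(x) = x², where a small step size converges slowly, a larger one converges quickly, and a step size past the stability limit diverges. The objective, starting point, and step counts are arbitrary assumptions.

```python
# Minimal sketch (illustrative assumptions): gradient descent on f(x) = x^2,
# whose gradient is 2x, comparing small, large, and too-large learning rates.

def gradient_descent(lr, x0=5.0, steps=50):
    """Run gradient descent on f(x) = x^2 and return the iterates."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x - lr * 2 * x  # gradient of x^2 is 2x
        history.append(x)
    return history

small = gradient_descent(lr=0.01)     # converges, but slowly
large = gradient_descent(lr=0.9)      # converges quickly (still below the stability limit of 1.0 here)
too_large = gradient_descent(lr=1.1)  # overshoots the minimum and diverges

print(f"small lr final x:     {small[-1]:.4f}")
print(f"large lr final x:     {large[-1]:.4f}")
print(f"too-large lr final x: {too_large[-1]:.4e}")
```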

Gilfoyle
  • If the learning rate is "too large," then the optimization can diverge. https://stats.stackexchange.com/questions/364360/how-can-change-in-cost-function-be-positive/364366#364366 A learning rate that is "small" in absolute terms might be the largest value that doesn't exhibit instability (see the sketch after these comments). – Sycorax Jul 15 '21 at 19:52
  • @Sycorax is right. Some food for thought is that Smith 2017 suggests that increasing the batch size is preferable to decaying the learning rate, but perhaps increasing the batch size isn't feasible in situations where decreasing the learning rate (either a priori or via decay) is possible. – Galen Jul 15 '21 at 20:04
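A rough sketch of Sycorax's point (the quadratic objective and curvature values below are assumptions for illustration, not from the thread): for f(x) = (c/2)x², the gradient descent update x ← x(1 − ηc) is stable only when η < 2/c, so when the curvature c is large, the largest learning rate that does not diverge is already very small in absolute terms.

```python
# Sketch (illustrative assumptions): on f(x) = 0.5 * c * x^2, gradient descent
# x <- x - lr * c * x diverges once lr exceeds 2 / c. The sharper the curvature,
# the smaller (in absolute terms) the largest usable learning rate.

def largest_stable_lr(curvature, safety=0.99):
    """Largest step size that keeps gradient descent on 0.5*c*x^2 from diverging."""
    return safety * 2.0 / curvature

for c in (1.0, 100.0, 10_000.0):
    print(f"curvature {c:>8.1f}: largest stable learning rate ~ {largest_stable_lr(c):.6f}")
```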

0 Answers