Consider running gradient descent to minimize the training error (empirical risk), denoted $ J(\theta, X, y)$ for data $X, y$ and parameters $ \theta $. Recall the gradient descent update:
$$ \theta^{(t)} := \theta^{(t-1)} - \gamma_{t-1} \frac{\partial J(\theta^{(t-1)}, X, y)}{ \partial \theta^{(t-1)}} $$
where $t$ denotes the iteration.
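For concreteness, here is how I am picturing this update in code (a minimal sketch of my own; the names `J`, `grad_J` and the fixed step size `gamma` are just illustrative choices, not taken from any particular library):

```python
import numpy as np

def gradient_descent(J, grad_J, theta0, gamma, num_iters):
    """Run fixed-step gradient descent and record the cost J at every iterate."""
    theta = theta0.copy()
    costs = [J(theta)]
    for _ in range(num_iters):
        # The update above, with a constant step size gamma_{t-1} = gamma.
        theta = theta - gamma * grad_J(theta)
        costs.append(J(theta))
    return theta, costs
```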
My question is: does there always exist a step size $\gamma_{t-1}$ (whether fixed or a function of the iteration $t$) such that gradient descent never updates the parameters in a way that increases the training error $J$?
My intuition tells me that if $\gamma_{t-1}$ is sufficiently small, gradient descent should (provably) never increase the cost of the function being optimized. The reason I think this is that with a very large step size $\gamma_{t-1}$ we would bounce around the minimum, potentially overshooting it and increasing the cost, whereas a small enough step should avoid this. So my conjecture is that yes, such a step size $\gamma_{t-1}$ should exist, but I wanted to confirm this with the community, or see a link to a proof of this claim (given how widely gradient descent is used, this question has probably already been studied), or even have someone provide a proof if they know one.
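To illustrate the intuition numerically, here is the kind of quick check I have in mind (a sketch on a least-squares cost of my own choosing; as far as I understand, the relevant quantity is the Lipschitz constant $L$ of the gradient, which for least squares is the largest eigenvalue of $X^\top X$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))   # illustrative data, not from the question
y = rng.standard_normal(50)

# Least-squares training error J(theta, X, y) and its gradient.
J = lambda theta: 0.5 * np.sum((X @ theta - y) ** 2)
grad_J = lambda theta: X.T @ (X @ theta - y)

# For this J the gradient is L-Lipschitz with L = largest eigenvalue of X^T X.
L = np.linalg.eigvalsh(X.T @ X).max()

for gamma in [0.5 / L, 1.0 / L, 2.5 / L]:   # the last one exceeds 2/L and can overshoot
    theta = np.zeros(5)
    costs = [J(theta)]
    for _ in range(200):
        theta = theta - gamma * grad_J(theta)
        costs.append(J(theta))
    increases = sum(b > a for a, b in zip(costs, costs[1:]))
    print(f"gamma = {gamma:.5f}: J increased on {increases} of 200 iterations")
```

My expectation is that the first two step sizes never increase $J$, while the last one can, which is the overshooting picture behind my conjecture.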