In the Adadelta paper, the first proposed idea (Idea 1) seems to me exactly like RMSprop (here or here), although it is not called that and RMSprop is not referenced. Am I correct?
2 Answers
Referring to the links you pointed out, RMSprop focuses on updating the learning rate $\eta$ at each iteration using an accumulation of squared gradients, $r_{t}=\rho r_{t-1} + (1-\rho)g_{t}^{2}$ (where $g$ is the gradient), and plugging it in to obtain the effective learning rate at step $t$: $\eta_{t}=\frac{\eta}{\sqrt{r_{t}+\epsilon}}$ (where $\epsilon$ is the smoothing constant).
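A minimal sketch of that update in plain NumPy (the function name and the default values of `lr`, `rho` and `eps` are my own choices, and $\epsilon$ is kept inside the square root to match the formula above):

```python
import numpy as np

def rmsprop_step(param, grad, r, lr=0.001, rho=0.9, eps=1e-6):
    """One RMSprop step: accumulate squared gradients, then scale the learning rate."""
    r = rho * r + (1 - rho) * grad**2        # r_t = rho * r_{t-1} + (1 - rho) * g_t^2
    eta_t = lr / np.sqrt(r + eps)            # effective learning rate eta_t = eta / sqrt(r_t + eps)
    param = param - eta_t * grad             # parameter update
    return param, r
```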
On the other hand, Adadelta (concentrating solely on Idea 1) does not focus on updating the learning rate at each step at all. The paper takes the same accumulation of squared gradients, $r_{t}=\rho r_{t-1} + (1-\rho)g_{t}^{2}$, and expresses it as an RMS of the gradient:
$\mathrm{RMS}[g]_{t}=\sqrt{r_{t}+\epsilon}$
and then describes how the parameter update is handled using a learning rate $\eta$ (note that the learning rate here is not step dependent). The update step from Idea 1 in Adadelta is $\Delta x_{t}=-\frac{\eta}{\mathrm{RMS}[g]_{t}}g_{t}$ (where $x_{t}$ is the parameter to be updated).
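A corresponding sketch of Idea 1 under the same assumptions (again, variable names and default values are mine, not the paper's):

```python
import numpy as np

def adadelta_idea1_step(param, grad, r, lr=1.0, rho=0.9, eps=1e-6):
    """One step of Adadelta's Idea 1: divide the gradient by RMS[g]_t."""
    r = rho * r + (1 - rho) * grad**2        # accumulate squared gradients
    rms_g = np.sqrt(r + eps)                 # RMS[g]_t = sqrt(r_t + eps)
    delta = -(lr / rms_g) * grad             # Delta x_t = -(eta / RMS[g]_t) * g_t
    return param + delta, r
```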
The two methods, RMSprop and Adadelta, therefore differ from each other even at Idea 1. Further down (Idea 2), Adadelta shows why a constant learning rate is not important for this method of optimisation at all: the learning rate only matters for the initial parameter updates, and afterwards its role is taken over by an accumulation of past updates (sketched below). This, however, is another discussion, since the OP was only asking about Idea 1 of Adadelta.
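Just to illustrate that last point, here is a rough sketch of the full Adadelta update (Algorithm 1 in the paper), where the RMS of accumulated past updates replaces the constant learning rate; the variable names are my own:

```python
import numpy as np

def adadelta_step(param, grad, r_g, r_dx, rho=0.95, eps=1e-6):
    """One full Adadelta step: RMS[dx]_{t-1} plays the role of the learning rate."""
    r_g = rho * r_g + (1 - rho) * grad**2                        # accumulate squared gradients
    delta = -(np.sqrt(r_dx + eps) / np.sqrt(r_g + eps)) * grad   # Delta x_t = -(RMS[dx]_{t-1} / RMS[g]_t) * g_t
    r_dx = rho * r_dx + (1 - rho) * delta**2                     # accumulate squared updates
    return param + delta, r_g, r_dx
```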
I understand what you mean by "Idea 1 does not focus on updating the learning rate for each step at all", but IF I were to simply stop after Idea 1 and use it to update my weights, would it then be the same as RMSprop? I believe this is also what the OP means. – A.D Dec 14 '16 at 17:22
Yes, you are correct.
Like you, I also arrived at the same conclusion by examining Idea 1 (section 3.1) in the Adadelta paper and the lecture.
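For what it's worth, writing out Idea 1's update makes the identity explicit (a quick derivation using the same notation as the other answer, with $r_{t}=\rho r_{t-1}+(1-\rho)g_{t}^{2}$; implementations only differ on where exactly $\epsilon$ is placed):

$$\Delta x_{t} = -\frac{\eta}{\mathrm{RMS}[g]_{t}}\,g_{t} = -\frac{\eta}{\sqrt{r_{t}+\epsilon}}\,g_{t} = -\eta_{t}\,g_{t},$$

which is exactly the RMSprop step with effective learning rate $\eta_{t}=\frac{\eta}{\sqrt{r_{t}+\epsilon}}$.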
Anyway, here is some more evidence:
Sebastian Ruder wrote in his popular blog post An overview of gradient descent optimization algorithms:
RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad's radically diminishing learning rates. RMSprop in fact is identical to the first update vector of Adadelta that we derived above [...]
("the first update vector" refers to the implementation of Idea 1, which is described earlier in the post.)
Indeed, the Adadelta paper was published in 2012, and the lecture was first given in 2012, so it makes perfect sense that both were unaware of each other, and thus neither referenced the other.
Also, I searched the many comments to Sebastian's post, and I didn't find anyone challenging him about the claim I quoted.
Recently I also made the same claim in an answer on stackoverflow, and wasn't challenged about it. (Though obviously my answer is less visited than Sebastian's post by some orders of magnitude, so this evidence is much weaker.)