Edit: After a couple of days of googling, I have found no reference on using gradient descent (GD) methods to compute the MLE for a fixed distribution, let alone for the case where the parameters may vary over time. This is highly surprising to me. The only material I've found concerns GD methods used for MLE computation in supervised learning, e.g. linear or logistic regression. But those methods do not seem directly applicable to my situation, as they focus on estimating the distribution of a target given features, whereas below there are no explicit features. There is definitely a connection, but I am failing to see it.
Original:
In Auction problem I asked about a problem where I am looking to estimate the parameters of a fixed distribution from data. There I have a parametrized distribution given by a CDF $F(\cdot|\theta)$, and at each step I make a draw by choosing $x$, which gives a success with probability $1 - F(x|\theta)$ and a failure with probability $F(x|\theta)$.
Let's just say I have fixed $x$ and wait until the first time I get a successful trial. My MLE $\hat\theta$ must maximize the log-likelihood $$ L(x, t|\theta) = \log\left(F^{t-1}(x|\theta)\cdot (1 - F(x|\theta))\right) = (t-1)\log F(x|\theta) + \log (1 - F(x|\theta)) $$ where $t$ is the step at which I get a success. What I thought of is that instead of solving $\nabla_\theta L= 0$ I could use $\nabla_\theta L$ as in gradient descent (GD) methods (strictly speaking gradient ascent, since I am maximizing) to update my estimate of the parameters by $\theta \to \theta + \nabla_\theta L\cdot s$ where $s$ is some step size. If the distribution is fixed, then after enough iterations I should be able to converge to the optimal $\hat\theta$. I have noticed, however, that in this case I do not have to wait for the first success to make an update. The reason is that in the situation above, on each failure I would have an update of $$ \theta\to\theta + \nabla_\theta \log F(x|\theta) \cdot s $$ and on each success $$ \theta\to\theta + \nabla_\theta \log (1 - F(x|\theta)) \cdot s $$ which, in case the success happens after $t$ steps, amounts to the same change of $\theta$ as if we had just updated it in one go after $t$ steps. That's assuming that the learning step $s$ stays constant and is small enough that the change in $\theta$ between these updates can be neglected when evaluating the gradients.
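To make the per-trial update concrete, here is a minimal sketch of it. It assumes, purely for illustration, the exponential CDF $F(x|\theta) = 1 - e^{-\theta x}$, which is not part of the original problem; the function names, the step size and the number of trials are likewise my own hypothetical choices. With this $F$, $\nabla_\theta \log(1 - F) = -x$ and $\nabla_\theta \log F = x e^{-\theta x} / (1 - e^{-\theta x})$.

```python
import math
import random


def grad_log_lik(theta, x, success):
    """Gradient in theta of the one-trial log-likelihood.

    Assumes F(x|theta) = 1 - exp(-theta * x), an illustrative choice.
    """
    if success:
        # log(1 - F(x|theta)) = -theta * x
        return -x
    # log F(x|theta) = log(1 - exp(-theta * x))
    e = math.exp(-theta * x)
    return x * e / (1.0 - e)


def online_estimate(true_theta, x, theta0, s, n_steps, seed=0):
    """Apply one update per trial and return the list of iterates."""
    rng = random.Random(seed)
    p_success = math.exp(-true_theta * x)  # 1 - F(x|true_theta)
    theta = theta0
    path = []
    for _ in range(n_steps):
        success = rng.random() < p_success
        theta += s * grad_log_lik(theta, x, success)
        theta = max(theta, 1e-6)  # keep the rate parameter positive
        path.append(theta)
    return path


path = online_estimate(true_theta=2.0, x=0.5, theta0=1.0, s=0.01, n_steps=20000)
tail_mean = sum(path[-5000:]) / 5000
print(f"final iterate: {path[-1]:.3f}, mean of last 5000 iterates: {tail_mean:.3f}")
```

With a constant $s$ the iterates do not settle exactly; they keep fluctuating around the maximizer, which is why the sketch also reports an average over the last iterates.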
I'd like to know more about this procedure. Most likely it is already in use, since it allows one to improve the parameter estimate on the go, and one could also change $x$ at each step. Does it have a name, and what are good references on it? Finally, in case the distribution $F$ may change over time, this procedure provides a way to continually update the estimate of $\theta_t$, although some adjustment of the learning rate $s$ may be needed for that.
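For the time-varying case, the following is a rough sketch of what I have in mind, not a reference implementation. It again assumes the illustrative CDF $F(x|\theta) = 1 - e^{-\theta x}$, and the scenario (a single jump of $\theta_t$ from 1.5 to 2.5), the function names and the value of $s$ are all hypothetical.

```python
import math
import random


def grad_log_lik(theta, x, success):
    """One-trial gradient, assuming F(x|theta) = 1 - exp(-theta * x)."""
    if success:
        return -x  # d/dtheta log(1 - F) = d/dtheta (-theta * x)
    e = math.exp(-theta * x)
    return x * e / (1.0 - e)  # d/dtheta log(1 - exp(-theta * x))


def track(true_thetas, x, theta0, s, seed=1):
    """Constant-step updates against a parameter that changes over time."""
    rng = random.Random(seed)
    theta = theta0
    estimates = []
    for true_theta in true_thetas:
        # Success happens with probability 1 - F(x|theta_t)
        success = rng.random() < math.exp(-true_theta * x)
        theta = max(theta + s * grad_log_lik(theta, x, success), 1e-6)
        estimates.append(theta)
    return estimates


# Hypothetical scenario: theta_t jumps from 1.5 to 2.5 halfway through.
true_thetas = [1.5] * 20000 + [2.5] * 20000
estimates = track(true_thetas, x=0.5, theta0=1.0, s=0.02)
before = sum(estimates[15000:20000]) / 5000
after = sum(estimates[35000:40000]) / 5000
print(f"mean estimate before the jump: {before:.3f}, after: {after:.3f}")
```

The constant step is what lets the estimate follow the jump. A smaller $s$ would give a less noisy estimate but a slower reaction to the change, which is the trade-off I would like to understand better.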