
Edit: After a couple of days of googling, I have found no references on using gradient descent (GD) methods to solve MLE for a fixed distribution, let alone the case where the parameters may vary over time. This is highly surprising to me. The only material I've found covers GD methods for MLE computation in supervised learning, e.g. linear or logistic regression. But those methods do not seem to be directly applicable in my situation, as they focus on estimating the distribution of a target given features, whereas in the problem below there are no explicit features. There's definitely a connection, but I am failing to see it.


Original:

In Auction problem I asked about estimating the parameters of a fixed distribution from data. There I have a parametrized distribution given by a CDF $F(\cdot|\theta)$, and at each step I make a draw by choosing $x$, which gives a success with probability $1 - F(x|\theta)$ and a failure with probability $F(x|\theta)$.

Let's say I fix $x$ and wait until the first time I get a successful trial. My MLE $\hat\theta$ must maximize the log-likelihood $$ L(x, t|\theta) = \log\left(F^{t-1}(x|\theta)\cdot (1 - F(x|\theta))\right) = (t-1)\log F(x|\theta) + \log (1 - F(x|\theta)) $$ where $t$ is the step at which I get a success. My thought was that instead of solving $\nabla_\theta L = 0$ I could use $\nabla_\theta L$ as in gradient descent (GD) methods to update my estimate of the parameters by $\theta \to \theta + \nabla_\theta L\cdot s$, where $s$ is some step size. If the distribution is fixed, then after enough iterations I should converge to the optimal $\hat\theta$. I have noticed, however, that in this case I do not have to wait for the first success to make an update. The reason is: in the situation above, on each failure I would make an update of $$ \theta\to\theta + \nabla_\theta \log F(x|\theta) \cdot s $$ and on each success $$ \theta\to\theta + \nabla_\theta \log (1 - F(x|\theta)) \cdot s $$ which, in case the success happens after $t$ steps, amounts to the same total change of $\theta$ as if we had updated it in one go after $t$ steps (at least approximately, since the incremental gradients are evaluated at the successively updated values of $\theta$). That's assuming the learning step $s$ stays constant.
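
For concreteness, here is a minimal sketch of the per-trial update. The exponential CDF $F(x|\theta) = 1 - e^{-\theta x}$ is a stand-in family chosen purely for illustration, and all the numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in family for illustration: exponential CDF F(x|theta) = 1 - exp(-theta*x)
def grad_log_F(x, theta):       # d/dtheta of log F(x|theta), used on failures
    return x * np.exp(-theta * x) / (1.0 - np.exp(-theta * x))

def grad_log_1mF(x, theta):     # d/dtheta of log(1 - F(x|theta)) = -theta*x, i.e. -x
    return -x

theta_true, theta_hat = 2.0, 0.5   # data-generating parameter vs. initial estimate
x, s = 0.4, 0.05                   # fixed draw level and constant learning step

for _ in range(20000):
    # each trial succeeds with probability 1 - F(x|theta_true) = exp(-theta_true*x)
    success = rng.random() < np.exp(-theta_true * x)
    grad = grad_log_1mF(x, theta_hat) if success else grad_log_F(x, theta_hat)
    theta_hat += s * grad             # per-trial update; no waiting for a success
    theta_hat = max(theta_hat, 1e-6)  # keep theta in the valid range theta > 0

print(theta_hat)   # drifts toward theta_true = 2.0
```

With a constant step $s$ the estimate keeps fluctuating around the optimum; a decaying step would give convergence for a fixed distribution, while a constant step lets the estimate track slowly time-varying parameters.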

I'd like to know more about this procedure. Most likely it is already in use, since it allows one to improve the estimate of the parameters on the go, and one could also change $x$ at each step. Does it have a name, and what are good references on it? Finally, in case the distribution $F$ changes over time, this procedure provides a way to constantly update our estimate of $\theta_t$, for which some adjustment of the learning rate $s$ may be needed.

SBF
  • I think you may have made an error in your question. The function you say you want to maximise includes the gradient operator. Presumably what you mean is that you want to maximise the log-likelihood function (without the presence of the gradient operator) and you are later using that operator to find the critical points of the function? – Ben Apr 26 '23 at 22:47
  • @Ben you’re correct, I’ll fix that in the OP. What do you think about the described approach? – SBF Apr 27 '23 at 06:23

1 Answer


I think you may be overcomplicating the situation here, by failing to recognise the kinds of optimisation problems and standard methods that arise in these cases. As presently written, your optimisation problem is a standard problem that can be solved using standard calculus methods and numerical methods. The partial derivatives of your objective function are:

$$\begin{align} \frac{\partial L}{\partial \theta_k}(x,t|\theta) &= \frac{t-1}{F(x|\theta)} \cdot \frac{\partial F}{\partial \theta_k}(x|\theta) - \frac{1}{1-F(x|\theta)} \cdot \frac{\partial F}{\partial \theta_k}(x|\theta) \\[12pt] &= \bigg[ \frac{t-1}{F(x|\theta)} - \frac{1}{1-F(x|\theta)} \bigg] \frac{\partial F}{\partial \theta_k}(x|\theta) \\[12pt] &= \frac{1}{F(x|\theta) (1-F(x|\theta))} \bigg[ (t-1) (1-F(x|\theta)) - F(x|\theta) \bigg] \frac{\partial F}{\partial \theta_k}(x|\theta) \\[12pt] &= \frac{1}{F(x|\theta) (1-F(x|\theta))} \bigg[ (t-1) - t F(x|\theta) \bigg] \frac{\partial F}{\partial \theta_k}(x|\theta) \\[12pt] &= \frac{t}{F(x|\theta) (1-F(x|\theta))} \bigg[ \frac{t-1}{t} - F(x|\theta) \bigg] \frac{\partial F}{\partial \theta_k}(x|\theta), \\[12pt] \end{align}$$

which means that the gradient of the objective function is:

$$\nabla_\theta L(x,t|\theta) = \frac{t}{F(x|\theta) (1-F(x|\theta))} \bigg[ \frac{t-1}{t} - F(x|\theta) \bigg] \nabla_\theta F(x|\theta).$$

The critical points of the function occur when:

$$F(x|\hat{\theta}) = \frac{t-1}{t} \quad \quad \text{or} \quad \quad \nabla_\theta F(x|\hat{\theta}) = 0.$$

In this particular case the first condition gives you an explicit solution for the critical points, and the second condition can be solved using standard iterative root-finding methods like Newton-Raphson. There is no need for any novel iterative method, and it is not clear that your proposed method would work properly, or that it would outperform existing methods even if it did.
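
To illustrate, suppose for the sake of example that $F$ is the exponential CDF $F(x|\theta) = 1 - e^{-\theta x}$ (my choice, not part of the question). Then the first condition solves explicitly, and the same root can also be found numerically, which is the route to take for families with no closed form:

```python
import numpy as np
from scipy.optimize import newton

x, t = 0.4, 5   # draw level and step of the first success

# Explicit solution of F(x|theta) = (t-1)/t for the exponential example:
# 1 - exp(-theta*x) = (t-1)/t  =>  theta_hat = log(t)/x
theta_explicit = np.log(t) / x

# The same root via Newton-Raphson, for families with no closed form:
g  = lambda th: (1.0 - np.exp(-th * x)) - (t - 1) / t   # F(x|theta) - (t-1)/t
dg = lambda th: x * np.exp(-th * x)                      # dF/dtheta
theta_numeric = newton(g, x0=1.0, fprime=dg)

print(theta_explicit, theta_numeric)   # both approx. 4.024
```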


Let's step back from your specific problem and look at general forms of optimisation problems that you might encounter. In the linked answers to your related question, we find that your problem leads to an optimisation of the form:

$$\text{Maximise } f(\mathbf{x}) \quad \quad \quad \text{for } 0 < x_1 \leqslant \cdots \leqslant x_n < 1.$$

This is an example of a constrained optimisation; there are a number of ways you can solve it.

In one of the answers to the linked question, an analytical solution is derived that first computes the unconstrained solution (assuming that the required ordering holds in this solution) and then asks what happens if the ordering does not hold. This leads to a solution form in which the user finds that certain conditions force adjacent parameters to be equal, and derives an explicit form for the solution using standard calculus methods.

In another of the answers to the linked question (my answer), an alternative method is deployed, whereby the input parameter for the optimisation is written as a smooth function of an underlying unconstrained parameter vector. This allows the optimisation to be rewritten as an unconstrained optimisation, using the chain rule to find the relevant derivatives. The unconstrained optimisation is then solved using ordinary optimisation methods (including numerical methods to iterate towards the solution) and the optimum is then converted back to be stated in terms of the constrained parameter. This latter method is less accurate than an explicit solution, but it is a more general method that can be deployed on a wide range of objective functions.
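
Here is a minimal sketch of this reparameterisation with a stand-in objective of my own; the smooth map below is one of several reasonable choices, and it attains ties $x_k = x_{k+1}$ only in the limit:

```python
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize

# Smooth map from unconstrained z in R^n to 0 < x_1 < ... < x_n < 1:
# build an increasing sequence y_k = z_1 + sum_{2<=j<=k} exp(z_j),
# then squash it into (0, 1) with the logistic function.
def to_constrained(z):
    y = z[0] + np.concatenate(([0.0], np.cumsum(np.exp(z[1:]))))
    return expit(y)

# Stand-in objective f(x); replace with the actual one from the linked question.
def f(x):
    return -np.sum((x - np.linspace(0.2, 0.8, len(x))) ** 2)

n = 4
res = minimize(lambda z: -f(to_constrained(z)), x0=np.zeros(n))  # maximise f
print(to_constrained(res.x))   # ordered optimum, stated back in terms of x
```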

Ben
  • Thanks! I may indeed be overcomplicating things, having very limited practical experience with such problems. Yet I have a problem at hand where solving $\nabla_\theta L = 0$ may not be easy (hence gradient descent comes as a natural way to solve it), or where the parameters may be time-varying (the same gradient descent may also tackle that case). That's what the OP is about. Would it be correct to assume that you would use different methods in my case? – SBF Apr 28 '23 at 05:46
  • I think the key thing to remember here is that this part of statistics is just math --- it is just a constrained or unconstrained optimisation problem, and so you can deploy any mathematical technique that is suitable for the relevant optimisation problem. Statistics does not have any special rules here; it makes use of all the standard optimisation techniques available in all relevant fields of mathematics. – Ben Apr 28 '23 at 08:22
  • This is true indeed, but I'm somewhat confused by the fact that I don't see GD described as an online optimisation method for MLE. Perhaps there are obvious reasons why one should not try that, or why it does not work, or anything of that kind – SBF Apr 28 '23 at 08:31
  • The main reason you don't see GD used much is because in most statistical problems the log-likelihood is twice-differentiable and has a simple (and easily computable) form for this second derivative, which means we can apply Newton's method to get a better iteration that takes account of the curvature of the function. Newton's method is known to converge more quickly than GD when it is available. – Ben Apr 28 '23 at 10:33
  • and what about the case in which parameters may be time-varying? I've actually just implemented GD on time-varying parameters for the problem above and it definitely looks decent – SBF Apr 28 '23 at 10:47
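
To make the Newton comparison in the comments concrete, here is a one-parameter sketch, again using the illustrative exponential CDF $F(x|\theta) = 1 - e^{-\theta x}$ (not part of the original problem), where the second derivative of the log-likelihood has a simple closed form:

```python
import numpy as np

# One-parameter Newton iteration for the log-likelihood
# L(theta) = (t-1)*log F + log(1 - F), with F = 1 - exp(-theta*x).
x, t = 0.4, 5
theta = 1.0
for _ in range(20):
    u = np.exp(-theta * x)                    # u = 1 - F(x|theta)
    dL  = (t - 1) * x * u / (1 - u) - x       # L'(theta)
    d2L = -(t - 1) * x**2 * u / (1 - u)**2    # L''(theta) < 0 for t > 1 (concave)
    theta -= dL / d2L                         # Newton step toward the maximum
print(theta)   # matches the explicit solution log(t)/x ≈ 4.024
```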