28

I wonder whether every maximum (log-)likelihood estimation problem has a maximizer. In other words, is there some distribution and some of its parameters for which the MLE problem has no maximizer?

My question comes from a claim by an engineer that the cost function in MLE (the likelihood or the log-likelihood, I am not sure which was intended) is always concave and therefore always has a maximizer.

Thanks and regards!

Tim
  • 19,445
  • 9
    (+1) Are you sure there are not some qualifications that have gone unstated in your question? As it stands, the engineer's statement is false in so many different ways it's almost hard to know where to begin. :) – cardinal Oct 10 '11 at 00:30
  • @cardinal: I basically wrote down what I heard. But I admit I may have missed something. – Tim Oct 10 '11 at 00:40
  • 7
    Counterexample (convexity): Let $X_1,X_2,\ldots,X_n$ be iid $\mathcal N(0,\sigma^2)$. Though there is a unique MLE, neither the likelihood nor log-likelihood is convex in $\sigma^2$. – cardinal Oct 10 '11 at 03:08
  • 4
    @Tim Logistic regression is a basic example where the MLE does not always exist. In addition, for some link functions the log-likelihood is not concave. –  Jun 01 '12 at 14:30

5 Answers

35

Perhaps the engineer had in mind canonical exponential families: in their natural parametrization, the parameter space is convex and the log-likelihood is concave (see Thm 1.6.3 in Bickel & Doksum's Mathematical Statistics, Volume 1). Also, under some mild technical conditions (basically that the model be "full rank", or equivalently, that the natural parameter be identifiable), the log-likelihood function is strictly concave, which implies there exists a unique maximizer. (Corollary 1.6.2 in the same reference.) [Also, the lecture notes cited by @biostat make the same point.]

Note that the natural parametrization of a canonical exponential family is usually different from the standard parametrization. So, while @cardinal points out that the log-likelihood for the family $\mathcal{N}(\mu,\sigma^2)$ is not convex in $\sigma^2$, it will be concave in the natural parameters, which (for the sufficient statistics $(x, x^2/2)$) are $\eta_1 = \mu / \sigma^2$ and $\eta_2 = -1/\sigma^2$.
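
To spell out why concavity holds in the natural parametrization (the standard argument, sketched here rather than quoted from the reference): write the canonical family as $$p_\eta(x) = h(x)\exp\left\{\eta^\top T(x) - A(\eta)\right\},$$ so that the log-likelihood of an i.i.d. sample is $$\ell(\eta) = \eta^\top \sum_{i=1}^n T(x_i) \;-\; n A(\eta) \;+\; \sum_{i=1}^n \log h(x_i),$$ i.e. a linear function of $\eta$ minus $n$ times the log-partition function $A$. Since $A$ is convex (its Hessian is the covariance matrix of $T(X)$, hence positive semidefinite), $\ell$ is concave, and strictly concave whenever that covariance matrix is positive definite.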

DavidR
  • 1,707
  • 2
    (+1) Nice answer. As hinted to in my comments to the OP, this is the answer I was hoping would be posted (even the counterexample was carefully chosen with this in mind). :) – cardinal Oct 10 '11 at 11:28
  • 2
    Can you show this for the multivariate Gaussian model? – Royi May 26 '17 at 06:22
8

The likelihood function often attains its maximum for the parameter of interest. Nevertheless, sometimes the MLE does not exist, for example for Gaussian mixture distributions or nonparametric models whose likelihood has more than one peak (bi- or multi-modal). I often face this problem when estimating unknown population-genetics parameters, e.g., recombination rates and the effect of natural selection.

Another reason, which @cardinal also points out, is an unbounded parameter space.
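
To see concretely how the MLE can fail to exist for a Gaussian mixture, here is a minimal numerical sketch (simulated data; the equal weights and the grid of $\sigma$ values are arbitrary choices): pinning one component's mean at a data point and letting its variance shrink makes the likelihood diverge, so there is no global maximizer.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=50)  # any sample works

def mixture_loglik(x, mu1, sigma1, mu2, sigma2, w=0.5):
    """Log-likelihood of a two-component Gaussian mixture with fixed weight w."""
    dens = w * norm.pdf(x, mu1, sigma1) + (1 - w) * norm.pdf(x, mu2, sigma2)
    return np.sum(np.log(dens))

# Put one component exactly on a data point and shrink its standard deviation:
# the log-likelihood grows without bound (roughly like -log(sigma1)).
for sigma1 in [1.0, 0.1, 0.01, 0.001]:
    print(sigma1, mixture_loglik(x, mu1=x[0], sigma1=sigma1, mu2=0.0, sigma2=1.0))
```

This is the same degeneracy @cardinal mentions in the comments: the spurious spikes coexist with more reasonable local maxima.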

Moreover, I would recommend the article linked in my comment below (see Section 3 and Fig. 3); it is also a quite useful and handy document on MLE.

Biostat
  • 1,989
  • 3
    I think I must be misunderstanding your stated example. What quadratic functions have more than one peak? – cardinal Oct 10 '11 at 03:12
  • @cardinal: Let me try to explain. Your point about an unbounded parameter space is one reason the likelihood function does not attain its maximum, even in the simple example of the normal distribution. However, my point is from an optimization perspective: there is the well-known problem of local vs. global maxima. I face this problem often in population genetics while estimating recombination rates. Moreover, see Section 3 and Fig. 3 of this article.

    article URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.671&rep=rep1&type=pdf

    – Biostat Oct 10 '11 at 11:34
  • So are you saying "quadratic functions with more than one peak" is a reference to, e.g., a Gaussian mixture model, perhaps? If so, an edit could probably clear up some confusion. – cardinal Oct 10 '11 at 11:37
  • Now it is updated. – Biostat Oct 10 '11 at 12:25
  • 4
    (+1) For the update. Note that in Gaussian mixture models both unbounded likelihood and multiple local maxima are present, in general. To make matters worse, the likelihood becomes unbounded at particularly pathological solutions. In general, multiple maxima may not be as bad of an issue. In some cases, these maxima converge to one another fast enough that picking any of them can still yield a reasonable (even, efficient) estimator of the parameter of interest asymptotically. – cardinal Oct 10 '11 at 13:23
5

Perhaps someone will find the following simple example useful.

Consider flipping a coin once. Let $\theta$ denote the probability of heads. If it is known that the coin can come up either heads or tails, then $\theta \in (0,1)$. Since the set $(0,1)$ is open, the parameter space is not compact. The likelihood for $\theta$ is given by $$ \begin{cases} \theta & \text{heads} \\ 1-\theta & \text{tails} \end{cases} . $$ In neither case is the maximum attained on $(0,1)$: the likelihood is strictly increasing (heads) or strictly decreasing (tails), so its supremum of $1$ is approached only at an excluded endpoint.

mef
  • 3,226
3

I admit I may be missing something, but --

If this is an estimation problem, and the goal is to estimate an unknown parameter, and the parameter is known to come from some closed and bounded set, and the likelihood function is continuous, then by the extreme value theorem there has to exist a value of this parameter that maximizes the likelihood function. In other words, a maximum has to exist. (It need not be unique, but at least one maximum must exist. There is no guarantee that all local maxima will be global maxima, but that isn't a necessary condition for a maximum to exist.)
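
As a toy illustration (a sketch that reuses the single coin-flip likelihood from the answer above, with an arbitrary closed interval standing in for the compact parameter set):

```python
import numpy as np

# Likelihood of theta after observing a single "heads": L(theta) = theta.
theta = np.linspace(0.05, 0.95, 1001)  # a closed, bounded (compact) parameter set
likelihood = theta

print(theta[np.argmax(likelihood)])    # 0.95: a maximizer exists, here at the boundary
# On the open interval (0, 1) the supremum 1 is never attained, so no maximizer exists.
```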

I don't know whether the likelihood function always has to be convex, but that isn't a necessary condition for there to exist a maximum.

If I've overlooked something, I'd welcome hearing what it is that I am missing.

D.W.
  • 6,668
  • 5
    Absent additional assumptions, the statement given regarding maxima is false. For example, if the parameter space is closed and bounded and the likelihood function is continuous in the parameters, then a maximum must exist. Absent either of these additional conditions, the result need not hold. Regarding convexity, it fails even in the most simple and common of examples. :) – cardinal Oct 10 '11 at 02:54
  • 2
    (+1) Boundedness of the parameter space doesn't hold in a lot of simple cases, even. But, for practical purposes, we generally know our parameters are bounded. :) – cardinal Oct 10 '11 at 13:25
0

Re: a comment by @Cardinal.

I suspect that those who referred to Cardinal's comment below the original question missed its subtle mischief: the OP's statement is problematic because it refers to the likelihood as a cost function (which we usually want to minimize) and then talks about concavity and maximization. The Gaussian likelihood is certainly not convex in $\sigma^2$.

Moreover, the log-likelihood is not "always concave": the easiest example is the Student-t distribution, whose log-likelihood is not concave, as is the Gamma log-likelihood when the shape parameter is $<1$ (it is convex in that case).
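
To make the Student-$t$ example concrete (my own sketch, for a single observation $x$, location parameter $\mu$, and known degrees of freedom $\nu$): up to an additive constant, $$\ell(\mu) = -\frac{\nu+1}{2}\log\left(1 + \frac{(x-\mu)^2}{\nu}\right), \qquad \frac{\partial^2 \ell}{\partial \mu^2} = -(\nu+1)\,\frac{\nu-(x-\mu)^2}{\left(\nu+(x-\mu)^2\right)^2},$$ which is positive whenever $(x-\mu)^2 > \nu$, so the log-likelihood is convex in $\mu$ far from the observation and hence not globally concave.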

But there is an even more subtle point: what we actually need is for the log-likelihood to be concave at the stationary point:

The sample log-likelihood of i.i.d. zero-mean Normals is concave at the stationary point. To wit, each observation has density

$$f_x(x) = \frac 1 {\sqrt{2\pi}} \frac 1 \sigma \exp\left\{-\frac 12 \frac{x^2}{\sigma^2}\right\}.$$

To work with $\sigma^2$, set $\sigma^2 \equiv v$ and take the log of the sample likelihood:

$$\log \mathcal L = -\frac{n}{2}\ln(2\pi) -\frac{n}{2} \ln v -\frac 12 \frac{\sum_{i=1}^nx_i^2}{v} $$

$$\frac{\partial \log \mathcal L}{\partial v}= -\frac{n}{2}\frac 1 v + \frac 12 \frac{\sum_{i=1}^nx_i^2}{v^2}$$

$$\frac{\partial^2 \log \mathcal L}{\partial v^2}= \frac{n}{2}\frac 1 {v^2} - \frac{\sum_{i=1}^nx_i^2}{v^3} = \frac{n}{2v^2}\left(1- \frac 2 v \frac 1n \sum_{i=1}^nx_i^2\right).$$

Evidently, this second derivative is neither always negative nor always positive: its sign depends on the sample of $x$'s and on the value of $v$ at which it is evaluated, so the log-likelihood is neither concave nor convex in $v$. But at the stationary point we have $$\hat v = \frac 1n \sum_{i=1}^nx_i^2$$

and inserting this into the second derivative we get

$$\frac{\partial^2 \log \mathcal L}{\partial v^2}\Big|_{v=\hat v}= -\frac{n}{2\hat v^2} <0.$$

So the sample log-likelihood of i.i.d. zero-mean Normals is concave in $\sigma^2$ at the stationary point.
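
A quick numerical check of the algebra above (a sketch; the sample is simulated and the finite-difference step is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=200)  # zero-mean sample, true variance 4
n, s = len(x), np.sum(x**2)

def loglik(v):
    # log-likelihood of i.i.d. N(0, v) as a function of the variance v
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(v) - 0.5 * s / v

v_hat = s / n  # stationary point
h = 1e-3
num_second_deriv = (loglik(v_hat + h) - 2 * loglik(v_hat) + loglik(v_hat - h)) / h**2
print(num_second_deriv, -n / (2 * v_hat**2))  # both negative and approximately equal
```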