
I'm trying to make sense of the following plot portraying the joint negative log-likelihood for the Binomial model $\mathrm{Binomial}(n,p)$ with unknown $n$ and $p$ and a single observation $y=5$. Note that the white part indicates an infinite negative log-likelihood (the likelihood is zero there) because $n$ must be at least $y=5$. Also, I put an arbitrary cutoff at $n=50$.

[Heatmap of the joint negative log-likelihood over $(n, \theta)$.]

The joint likelihood (still evaluated at $y=5$) has a clear maximizer, while the negative log-likelihood does not.

[Heatmap of the joint likelihood over $(n, \theta)$.]

Is it correct to say that this is the case because the joint likelihood is not log-concave?


Julia code to reproduce the plots above.

using Plots, StatsPlots, Distributions, LaTeXStrings, Random
Random.seed!(1994)

Joint likelihood of $n$ and $\theta$:

begin
    ns = 1:50
    θs = range(0., stop=1., length=1000)
    A = zeros(length(ns), length(θs))
    for (i, n) in enumerate(ns), (j, θ) in enumerate(θs)
        A[i, j] = pdf(Binomial(n, θ), 5)
    end
    heatmap(A, xlabel=L"\theta", ylabel=L"n", xformatter=x->"$(x/length(θs))")
    savefig("plots/joint_likelihood.png")
end

Joint negative log-likelihood of $n$ and $\theta$:

begin
    ns = 1:50
    θs = range(0., stop=1., length=1000)
    A = zeros(length(ns), length(θs))
    for (i, n) in enumerate(ns), (j, θ) in enumerate(θs)
        A[i, j] = -loglikelihood(Binomial(n, θ), 5)
    end
    heatmap(A, xlabel=L"\theta", ylabel=L"n", xformatter=x->"$(x/length(θs))")
    savefig("plots/joint_neg_loglikelihood.png")
end

Pietro
  • Using ML with a single observation, as here, often leads to problems. The ML estimate of the two-parameter binomial does not always exist: see the end of my answer, where it is also explained that $n$ cannot be considered a continuous parameter. – Yves May 26 '21 at 17:51

1 Answer


We can use profiling to make our lives easier here: the idea is that, even if we can't easily solve $\max_{n,\theta} L(n, \theta \mid y)$ directly, for each particular $n$ we can find the $\theta$ that maximizes $L$, and then we can optimize that over $n$.

So let $L(n, \theta \mid y) = {n\choose y} \theta^y(1-\theta)^{n-y}$ be the likelihood that we're looking to maximize. The log likelihood is therefore $$ \ell(n,\theta\mid y) = y\log\theta + (n-y)\log(1-\theta) + \log{n\choose y}. $$ For each $n$ we can optimize this over $\theta$ by taking derivatives, which gives the usual MLE of $\hat\theta_n = \frac y n$.
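Spelling out that derivative step (all symbols as above): $$ \frac{\partial \ell}{\partial \theta} = \frac{y}{\theta} - \frac{n-y}{1-\theta} = 0 \quad\Longleftrightarrow\quad y(1-\theta) = (n-y)\theta \quad\Longleftrightarrow\quad \theta = \frac{y}{n}. $$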

This means that the profiled log likelihood is $$ \ell_p(n\mid y) = \ell(n, \hat\theta_n\mid y) = y\log (y/n) + (n-y)\log(1-y/n) + \log {n\choose y} $$ and we want to understand the maximum of this. To recap, if we think of the set of all valid $(n,\theta)$ pairs, this says that within each slice corresponding to a particular $n$ we know the maximum value of $\ell$ or $L$ will occur at $(n, \hat\theta_n)$. If we then optimize over each of these slice extrema, which now just depend on $n$, we will get the global optimum.

Using the convention $0\cdot \log 0 = 0$ we can see that $$ \ell_p(y\mid y) = y \log 1 + 0\cdot \log 0 + \log 1 = 0 $$ and since the original likelihood is discrete, we know that it's bounded above by $1$, which means a log likelihood of zero is as big as we can get. This shows that if we are allowed to vary both $n$ and $\theta$ we will choose to set $n=y$ and $\hat\theta=1$, which makes the particular value we observed as likely as possible, i.e. it would happen with probability $1$.
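If it helps, here's a quick numerical check of this in Julia (just a sketch reusing Distributions.jl from your code; ℓp and the other names are only illustrative):

using Distributions

y = 5
ns = y:50
# profile log-likelihood: plug the profiled MLE θ̂ = y/n into the Binomial log-pmf
ℓp = [logpdf(Binomial(n, y/n), y) for n in ns]

maximum(ℓp)     # ≈ 0.0, i.e. a likelihood of 1
ns[argmax(ℓp)]  # 5, so the maximizer is n = y (with θ̂ = 1)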

That's often the trouble with doing this kind of joint optimization. We're basically able to vary both the mean and the variance and so we can make the particular sample that we got as likely as possible.

I'm not very familiar with Julia so I'm not sure if your plots are correct, but this at least is the behavior you should be seeing, and it looks like your second plot (the joint likelihood) shows this. Also, we know that $(n, y/n)$ is the optimal point for each value of $n$, which explains the $1/x$-looking shape in that plot as well: it starts at the maximum $(y, 1)$ and fades away along the curve $(n, y/n)$.
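If you want to see that ridge on your heatmap explicitly, one option (a sketch, assuming the same heatmap(A) call as in your code, so the x axis is in column-index units of θs; ns_ridge is just an illustrative name) is to overlay the curve $\theta = y/n$ right after the heatmap call:

# overlay θ = y/n; x is rescaled to column-index units to match heatmap(A)
ns_ridge = 5:50
plot!(5 ./ ns_ridge .* length(θs), ns_ridge, color=:white, label=L"\theta = y/n")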

jld
  • Thanks a lot for the thorough reply. So that I understand it better, do you have some references on profiling? I've never seen it before and it looks rather intuitive as a technique. I'd be interested in reading about it.

    Also, can we say that the likelihood is not log-concave? It appears to me that this is the case but I don't have the mathematical knowledge to check this fact :) Thanks a lot in advance!

    – Pietro May 26 '21 at 19:03
  • @Pietro profiling comes up a lot with linear models, where often we can express the MLE of some parameters in terms of other parameters and then our optimization is lower-dimensional. In particular, with mixed models we can express the MLE of the fixed effects in terms of the random-effect variance parameters, so we only need to numerically optimize over those variance parameters to get the full joint MLE. See here for example: https://www.mathworks.com/help/stats/estimating-parameters-in-linear-mixed-effects-models.html – jld May 26 '21 at 19:17
  • Thanks a lot for the pointer and for the explanation - really appreciated, jld! – Pietro May 26 '21 at 19:23
  • @Pietro you’re welcome! I’ll update later today on concavity as I have time – jld May 26 '21 at 19:50
  • Really appreciated, jld - thanks for sharing your knowledge! :) – Pietro May 26 '21 at 20:26
  • jld, if you have time, could you also expand on "since the original likelihood is discrete, we know that it's bounded above by 1"? Thanks a lot in advance! – Pietro May 26 '21 at 21:27
  • @Pietro I can answer the last one real quick: in the discrete case $L(n,\theta\mid y)$ is actually a probability, not a more abstract density, so it’s always between 0 and 1 (even though it won’t necessarily sum/integrate to 1 since it’s a function of the parameters not the data) – jld May 26 '21 at 23:28
  • Thanks a lot for your answer jld! One quick check: when you say $L(n, \theta|y)$ is discrete, do you mean $L(n|\theta, y)$, that is, the distribution that you get after applying profiling (fixing $\theta$ to its MLE given a specific $n$)? – Pietro May 27 '21 at 13:26
  • @Pietro I mean the random variable $Y$ that gives us this likelihood is discrete, so $L(n,\theta\mid y) = P(Y=y\mid n, \theta) \in [0,1]$. This applies to any valid choice of $n$ and $\theta$ so it also applies to the profiled likelihood – jld May 27 '21 at 14:09
  • Thank you very much for the answer @jld - this clarifies! :) – Pietro May 27 '21 at 15:58
  • As I am studying your answer, I have a question regarding one of your previous comments, where you say "$L(n, \theta|y)$ is actually a probability, not a more abstract density, so it’s always between 0 and 1 (even though it won’t necessarily sum/integrate to 1 since it’s a function of the parameters not the data)" - how can the likelihood be interpreted as a probability if it does not sum/integrate to 1 (especially because in theory $n$ ranges over $\{y, y+1, \dots\}$)? Thanks a lot for your time in advance and please do not feel obliged to reply - happy to ask it in another post. Thanks a lot @jld :) – Pietro May 27 '21 at 16:06
  • @Pietro with a discrete RV the likelihood is equal to a probability for each pair $(n,\theta)$ since $L(n,\theta\mid y)=P(Y=y\mid n, \theta)$, but the likelihood is only guaranteed to sum/integrate to 1 over $Y$, since that corresponds to $P(Y \in \mathbb N)=1$. We’d need a prior on the parameters to get something equivalent for them – jld May 28 '21 at 19:35