
My understanding of the multinomial likelihood with four states, $A$, $B$, $C$, and $D$, is that it can be expressed as:

$$\text{Likelihood} = \frac{N!}{n_A!\, n_B!\, n_C!\, n_D!}\; P(A)^{n_A} P(B)^{n_B} P(C)^{n_C} P(D)^{n_D}.$$

where

$N$ = number of trials.

$n_A$ = number of trials resulting in State $A$.

$n_B$ = number of trials resulting in State $B$.

$n_C$ = number of trials resulting in State $C$.

$n_D$ = number of trials resulting in State $D$.

$P(A)$ = probability of State $A$.

$P(B)$ = probability of State $B$.

$P(C)$ = probability of State $C$.

$P(D)$ = probability of State $D$.
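
As a quick sanity check, this formula agrees with R's built-in dmultinom (the counts used here anticipate the data set generated below):

# Compare the hand-rolled log-likelihood with dmultinom()
counts <- c(24, 36, 8, 12)           # nA, nB, nC, nD
probs  <- c(0.30, 0.45, 0.10, 0.15)  # P(A), P(B), P(C), P(D)
manual <- lfactorial(sum(counts)) - sum(lfactorial(counts)) + sum(counts * log(probs))
all.equal(manual, dmultinom(counts, prob = probs, log = TRUE))  # TRUE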

I generate a data set below in R, expressing the four probabilities with a multinomial logit, and then try to recover $N$ by minimizing the negative log-likelihood.

I must be making a mistake somewhere, because my search indicates that $N = 79.5$ instead of the true value of $N = 80$. I am hoping someone can point out what I am doing incorrectly. I am sure it is a silly mistake, but I cannot identify it. I know this is a fairly remedial question. Thank you for any help.

# Generate data set
N  <- 80     # number of trials
pA <- 0.30   # probability of obtaining State A
pB <- 0.45   # probability of obtaining State B
pC <- 0.10   # probability of obtaining State C
pD <- 0.15   # probability of obtaining State D
nA <- N * pA # number of trials resulting in State A
nB <- N * pB # number of trials resulting in State B
nC <- N * pC # number of trials resulting in State C
nD <- N * pD # number of trials resulting in State D

# define multinomial logit link parameters that result in defined probabilities
beta.x = log(-(pA*pC-pA) / (pC^2+(pB+pA-2)*pC+(pA-1)*pB-pA+1-pA*pB))
beta.y = log(-(pB*pC-pB) / (pC^2+(pB+pA-2)*pC+(pA-1)*pB-pA+1-pA*pB))
beta.z = log(((1-pB)*pC) / (((pB+pA-1)*pC+pB^2+(pA-2)*pB-pA+1)-(pA*pC)))

# calculate -log-likelihood for a range of N's
N.trial <- seq((nA+nB+nC+0.01), (N+20), by = 0.01)
llh <- log( (factorial(N.trial) / (factorial(nA) * factorial(nB) * factorial(nC) * factorial((N.trial - (nA+nB+nC))))) *
         (    exp(beta.x) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))^nA *
         (    exp(beta.y) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))^nB *
         (    exp(beta.z) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))^nC *
         (1 - exp(beta.x) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)) - 
              exp(beta.y) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)) - 
              exp(beta.z) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))^(N.trial - (nA+nB+nC)) )
llh <- -1 * llh

# identify number of trials, N, that corresponds to minimum value of -llh
my.data <- data.frame(N.trial = N.trial, llh = llh)
my.data[my.data$llh == min(my.data$llh), ]

3 Answers


From the previous comments, if I understand the model structure correctly, the log-likelihood is given by

$$\log L = \log(n!) - \log(n_a!) - \log(n_b!) - \log(n_c!) - \log((n - n_a - n_b - n_c)!)$$ $$+\, 2 n_a \log(p) + (n_b + n_c)\log(p) + (n_b + n_c)\log(1 - p) + 2(n - n_a - n_b - n_c)\log(1 - p)$$

where $n_a$, $n_b$, and $n_c$ are the known counts and $n$ and $p$ are parameters to be estimated. Under this model the expected values of $n_a$, $n_b$, $n_c$, and $n_d = n - n_a - n_b - n_c$ are $n p^2$, $n p(1-p)$, $n p(1-p)$, and $n(1-p)^2$, respectively.
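
(As a quick illustrative check, the four expected counts sum to $n$ for any $p$, since $p^2 + 2p(1-p) + (1-p)^2 = 1$; the values of $n$ and $p$ below are arbitrary:)

# The four expected counts under the model sum to n
n <- 80; p <- 0.575
expected <- c(A = n*p^2, B = n*p*(1 - p), C = n*p*(1 - p), D = n*(1 - p)^2)
expected
sum(expected)  # 80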

There is an explicit maximum likelihood estimator for $p$ when $n$ is known (each of the $n$ animals presents two capture occasions, and $2n_a + n_b + n_c$ is the total number of captures):

$$\hat{p}=\frac{2n_a+n_b+n_c}{2n}$$

but it does not appear that there is an explicit closed-form solution for $n$. However, the maximum likelihood estimates are easily programmed in R (or just about any other language):

logL = function(parms, na, nb, nc) {
  # Log likelihood as a function of n (= parms[1]) and p (= parms[2])
  n = parms[1]
  p = parms[2]
  2*(n - na - nb - nc)*log(1 - p) + (nb + nc)*log(1 - p) +
    2*na*log(p) + (nb + nc)*log(p) +
    lfactorial(n) - lfactorial(na) - lfactorial(nb) -
    lfactorial(nc) - lfactorial(n - na - nb - nc)
}

mle = function(na, nb, nc) {
  # Find maximum likelihood estimates of n and p
  # Initial estimates
  n0 = (na + nb)*(na + nc)/na   # Lincoln-Petersen estimator
  p0 = (2*na + nb + nc)/(2*n0)  # MLE of p when n is known
  solution = optim(c(n0, p0), logL, na = na, nb = nb, nc = nc,
    lower = c(na + nb + nc, 0), upper = c(Inf, 1), hessian = TRUE,
    control = list(fnscale = -1), method = "L-BFGS-B")

  # Covariances and standard errors
  covmat = -solve(solution$hessian)
  se.nhat = covmat[1, 1]^0.5
  se.phat = covmat[2, 2]^0.5
  corr.nhat.phat = covmat[1, 2]/(se.nhat*se.phat)

  # Return results
  list(nhat = solution$par[1], phat = solution$par[2], se.nhat = se.nhat,
    se.phat = se.phat, corr = corr.nhat.phat, solution = solution)
}

# Example
na = 24 # Number of animals caught in both visits
nb = 36 # Number of animals caught in visit 1 but not visit 2
nc = 8  # Number of animals caught in visit 2 but not visit 1
mle(na, nb, nc)
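
For these example counts the starting values work out to $n_0 = (24+36)(24+8)/24 = 80$ and $p_0 = 92/160 = 0.575$, which optim then refines jointly.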
JimB
  • Thanks, Jim. I will study this. N is usually given as n1 * n2 / m where n1 is the number of animals captured on the first visit, n2 is the number of animals captured on the second visit and m is the number of animals captured on both visits. However, I was trying to use maximum likelihood to estimate N instead of the closed form equation. I will study your reply. Thanks again. – Mark Miller Jan 05 '18 at 04:12
  • I think that's what I did as $n_1=n_a+n_b$, $n_2=n_a+n_c$, and $m=n_a$. – JimB Jan 05 '18 at 04:35
  • Jim, I'm going to accept your answer, although your likelihood and mine are identical and give identical estimates when the same constraints are applied. – Mark Miller Jan 05 '18 at 18:54
  • Thanks. And I'm pretty convinced that there is no closed-form solution for the maximum likelihood estimate of $N$. – JimB Jan 05 '18 at 19:37
  • Are you saying you do not accept N = n1 * n2 / m? I realize we should not have extended discussion in the comments. However, if you do not accept N = n1 * n2 / m I would like to go over this with you, maybe in chat. – Mark Miller Jan 05 '18 at 19:53
  • n1 * n2 / m is definitely the standard (and explicit) estimator. All I'm saying is that it is not the maximum likelihood estimator, as your question was about finding the maximum likelihood estimator. (At least, that's my interpretation of what you were wanting.) – JimB Jan 05 '18 at 20:09
  • I appreciate it, Jim. Let me mull everything over for a few days and I may have some thoughtful questions next week. – Mark Miller Jan 06 '18 at 00:49

I agree with you that there is no round-off issue.

But your answer is different from 80 because there is no reason to believe you should recover the true value by minimizing a likelihood that changes with each candidate $N$: every value of $N$ implies a different $n_D = N - n_A - n_B - n_C$, and therefore a different data set, especially for non-integer values of $N$. Even restricting $N$ to integers, each value gives a different data set, and comparing likelihoods across different data sets is unorthodox at best.

The multinomial logit link parameters are completely unnecessary: the llh calculation can use pA, pB, and pC directly and obtain exactly the same results, as the sketch below shows.
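
(A minimal sketch of that simplification, assuming N, nA, nB, nC, pA, pB, and pC are defined as in the question; lfactorial() replaces factorial() purely for numerical stability:)

# Same grid search as in the question, with the probabilities used directly
N.trial <- seq(nA + nB + nC + 0.01, N + 20, by = 0.01)
llh <- -(lfactorial(N.trial) - lfactorial(nA) - lfactorial(nB) - lfactorial(nC) -
           lfactorial(N.trial - (nA + nB + nC)) +
           nA*log(pA) + nB*log(pB) + nC*log(pC) +
           (N.trial - (nA + nB + nC))*log(1 - pA - pB - pC))
N.trial[which.min(llh)]  # again about 79.5, not 80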

This can be simplified by considering a binomial with $p = 1/3$ and $x = 5$ successes: what value of $N$ maximizes the likelihood? (It would also help to have a description of why one would want, or even be able, to do this: when would we know the probability of success and the number of successes but not the number of trials? I am not saying there is no good reason, only that no rationale is given, and I think one would be helpful.)

Here is some R code to produce a variety of log likelihoods (but for the simplified binomial case):

# Number of successes and success probability (from the simplified example)
x = 5
p = 1/3

# Set a range of values of n
n = (100:200)/10

# Calculate the log likelihood
logL = lfactorial(n) - lfactorial(x) - lfactorial(n - x) + x*log(p) + (n - x)*log(1 - p)

# Value of n that maximizes the log likelihood
n[logL == max(logL)]

The resulting value that maximizes the likelihood is 14.5. Restricting attention to integers, both $N = 14$ and $N = 15$ maximize the likelihood, as the check below shows.
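
(A quick integer-only check of that tie; the tolerance guards against floating-point noise in the comparison:)

# Restrict n to integers and look for ties at the maximum
n.int <- 5:30
logL.int <- lfactorial(n.int) - lfactorial(5) - lfactorial(n.int - 5) +
  5*log(1/3) + (n.int - 5)*log(2/3)
n.int[abs(logL.int - max(logL.int)) < 1e-9]  # 14 15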

JimB
  • Thanks, Jim. I intentionally avoided my rational for this exercise. However, the 4-sided die problem with an unknown number of trials is, to me, identical to the Lincoln-Petersen method of estimating animal abundance. With the LP you never know the number of trials (the true number of animals). You only know the number of animals detected on both visits (State A), the number of animals detected only on the first visit (State B), and the number of animals detected only on the second visit (State C). You never know the number of animals not detected on either visit (State D). – Mark Miller Jan 05 '18 at 01:20
  • With the Lincoln-Petersen you also know the probability of State A, the probability of State B, and the probability of State C. Or you can estimate those probabilities using closed-form equations. My goal here was to program a maximum-likelihood version of the Lincoln-Petersen to estimate animal abundance (the number of trials). – Mark Miller Jan 05 '18 at 01:24
  • If $p$ is the probability of capturing any particular individual on a single visit (and the population does not change, i.e., a closed population), then the probabilities for States A, B, C, and D are $p^2$, $p(1-p)$, $p(1-p)$, and $(1-p)^2$, respectively, but one wouldn't know $p$. So one would be estimating both $N$ and $p$. Or do you really know the value of $p$? – JimB Jan 05 '18 at 02:23

Edit

Rounding error is not the issue. The issue is inherent bias in the likelihood with small sample sizes. I have accepted JimB's answer, although his likelihood and mine are identical and return identical estimates when the same constraints are applied.

Earlier answer

Yesterday I answered my own question; that answer is retained further below. I suggested that the estimated $N$ of $79.496$ differed from the true $N$ of $80$ because of rounding error in the betas. I no longer believe that, because if I remove the betas from the likelihood I still get the incorrect answer of $79.496$, as shown with the following R code. Either my likelihood is wrong or there is some fundamental bias in a model of die rolls. I am baffled as to why I cannot accurately estimate the true $N$ of $80$.

# generate some data
N  <- 80     # number of trials
pA <- 0.30   # probability of obtaining State A
pB <- 0.45   # probability of obtaining State B
pC <- 0.10   # probability of obtaining State C
pD <- 0.15   # probability of obtaining State D
nA <- N * pA # number of trials resulting in State A
nB <- N * pB # number of trials resulting in State B
nC <- N * pC # number of trials resulting in State C
nD <- N * pD # number of trials resulting in State D

# estimate N with log-likelihood
my.function <- function(betas){
     N.est = betas[1]
     nA = 24
     nB = 36
     nC =  8
     pA = 0.30
     pB = 0.45
     pC = 0.10

     # lfactorial(x) = log(factorial(x)) but avoids overflow for large N.est
     llh <- ( lfactorial(N.est) -
              (lfactorial(nA) + lfactorial(nB) + lfactorial(nC) + lfactorial(N.est - (nA+nB+nC))) +
              (nA                   * log(pA) +
               nB                   * log(pB) +
               nC                   * log(pC) +
               (N.est - (nA+nB+nC)) * log(1 - (pA+pB+pC)))  )
     -1 * llh
}

# estimate of N does not match true value of N
my.output <- optim(c((nA+nB+nC)), my.function, method="Brent", lower = 0, upper = 200, hessian=TRUE)
my.output$par
#[1] 79.49601

nlm(my.function, c(80))$estimate
#[1] 79.49597
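
One way to see where 79.496 comes from: treating $N$ as continuous, the maximizer solves a score equation involving digamma functions. A sketch using the counts above, where $n_A + n_B + n_C = 68$ and $P(D) = 0.15$:

# d/dN of the log-likelihood, with lgamma standing in for log(factorial())
score <- function(N) digamma(N + 1) - digamma(N - 68 + 1) + log(0.15)
uniroot(score, c(69, 200))$root
# about 79.496, matching the optim() and nlm() results above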

Here is my answer from yesterday:

I think I have figured out why my search algorithm was returning $N = 79.5$ instead of the true value of $N = 80$. I think it was due solely to rounding error in the betas used in the multinomial logit link.

Originally I obtained the betas using closed-form equations I derived here:

https://math.stackexchange.com/questions/2587203/invert-multinomial-logit-link-with-three-unknown
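
Incidentally, with State D as the reference category, the baseline-category logit inverts to simply $\beta_k = \log(p_k / p_D)$, which reproduces those closed-form values (a quick check):

# Each beta is the log odds of its state relative to the reference state D
log(0.30/0.15)  #  0.6931472 (beta.x)
log(0.45/0.15)  #  1.0986123 (beta.y)
log(0.10/0.15)  # -0.4054651 (beta.z)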

Then I estimated betas via multinomial logistic regression:

N  <- 80     # number of trials
pA <- 0.30   # probability of obtaining State A
pB <- 0.45   # probability of obtaining State B
pC <- 0.10   # probability of obtaining State C
pD <- 0.15   # probability of obtaining State D
nA <- N * pA # number of trials resulting in State A
nB <- N * pB # number of trials resulting in State B
nC <- N * pC # number of trials resulting in State C
nD <- N * pD # number of trials resulting in State D
nA; nB; nC; nD

require(nnet)
my.state <- c(rep('nA', nA), rep('nB', nB), rep('nC', nC), rep('nD', nD))
my.state <- relevel(as.factor(my.state), ref = "nD")
multinom(my.state ~ 1)

# 
# weights:  8 (3 variable)
# initial  value 110.903549 
# final  value 98.827745 
# converged
# Call:
# multinom(formula = my.state ~ 1)
# 
# Coefficients:
#    (Intercept)
# nA   0.6931183
# nB   1.0985900
# nC  -0.4054899
# 
# Residual Deviance: 197.6555 
# AIC: 203.6555 
# 

These multinomial logistic regression betas differ from my closed-form betas in the fifth decimal place, a difference I initially thought was trivial. However, that difference turns out not to be trivial, as shown below.

Next I compared the counts obtained using my closed-form betas against the true counts, and did the same for the counts obtained using the multinomial logistic regression betas. My closed-form betas returned counts closer to the true counts. The differences between the counts obtained with the two sets of betas are small, but again not trivial:

# These betas are from my closed-form equations
beta.x =   0.6931472
beta.y =   1.098612
beta.z =  -0.4054651

# check counts in each class using my closed form betas
(N *      exp(beta.x) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z))) - nA
(N *      exp(beta.y) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z))) - nB
(N *      exp(beta.z) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z))) - nC
(N * (1 - exp(beta.x) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)) -
          exp(beta.y) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)) -
          exp(beta.z) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))) - nD

# These betas are obtained using multinomial logistic regression
mr.beta.x =   0.6931183
mr.beta.y =   1.0985900
mr.beta.z =  -0.4054899

# check counts in each class using multinomial logistic regression betas
(N *      exp(mr.beta.x) / (1 + exp(mr.beta.x) + exp(mr.beta.y) + exp(mr.beta.z))) - nA
(N *      exp(mr.beta.y) / (1 + exp(mr.beta.x) + exp(mr.beta.y) + exp(mr.beta.z))) - nB
(N *      exp(mr.beta.z) / (1 + exp(mr.beta.x) + exp(mr.beta.y) + exp(mr.beta.z))) - nC
(N * (1 - exp(mr.beta.x) / (1 + exp(mr.beta.x) + exp(mr.beta.y) + exp(mr.beta.z)) -
          exp(mr.beta.y) / (1 + exp(mr.beta.x) + exp(mr.beta.y) + exp(mr.beta.z)) -
          exp(mr.beta.z) / (1 + exp(mr.beta.x) + exp(mr.beta.y) + exp(mr.beta.z)))) - nD

If I use the multinomial logistic regression betas in my search algorithm the estimated $N$ is 82:

# These betas are from multinomial logistic regression
beta.x =   0.6931183
beta.y =   1.0985900
beta.z =  -0.4054899

N.trial <- seq((nA+nB+nC+10), (N+2), by = 0.001)

llh.a <- log( (factorial(N.trial) / (factorial(nA) * factorial(nB) * factorial(nC) * factorial((N.trial - (nA+nB+nC))))) *

         (    exp(beta.x) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))^nA *
         (    exp(beta.y) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))^nB *
         (    exp(beta.z) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))^nC *
         (1 - exp(beta.x) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)) - 
              exp(beta.y) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)) - 
              exp(beta.z) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)))^(N.trial - (nA+nB+nC)) )

llh.a <- -1 * llh.a

llh <- (  log(factorial(N.trial)) - 

         (log(factorial(nA)) + log(factorial(nB)) + log(factorial(nC)) + log(factorial((N.trial - (nA+nB+nC))))) +

         (nA                     * log(      exp(beta.x) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z))) +
          nB                     * log(      exp(beta.y) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z))) +
          nC                     * log(      exp(beta.z) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z))) +
          (N.trial - (nA+nB+nC)) * log( (1 - exp(beta.x) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)) - 
                                             exp(beta.y) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z)) -
                                             exp(beta.z) / (1 + exp(beta.x) + exp(beta.y) + exp(beta.z))))) )

llh <- -1 * llh

# identify number of trials, N, that corresponds to minimum value of -llh
my.data <- data.frame(N.trial = N.trial, llh.a = llh.a, llh = llh)
my.data[my.data$llh.a == min(my.data$llh.a), ]
#     N.trial    llh.a      llh
#4001      82 25.27071 25.27071
my.data[my.data$llh   == min(my.data$llh),   ]
#     N.trial    llh.a      llh
#4001      82 25.27071 25.27071

The betas obtained via my closed-form equations allowed the algorithm to return an estimate of $N$ within 0.504 of the true value, while the betas obtained from multinomial logistic regression resulted in an estimate of $N$ that differed from the true value by 2.

This is only one data set. Maybe if I try other data sets the betas returned by the two methods will not differ as much. I suppose it makes sense that my closed-form equations performed better, because no real estimation is involved in obtaining them.

It bothers me a little that rounding or estimation error in the fifth decimal place of the betas can have such a large impact on the estimate of $N$. I do not know how to improve the estimate. However, I suppose a difference of 0.504 from the true value of $N$ is not too bad.

  • If you'd like to have further discussions in a chat room, I'd be happy to do so. I do have to disagree with your wording in the latest edit. While maximum likelihood estimators are many times biased (and can be extremely biased) in such count models, that's not the reason why you don't get a value of 80. You don't get a value of 80 because the process you were performing is not a maximum likelihood process for the model you're trying to fit. What you were doing was assuming all parameters but $N$ were known AND the counts were the expected counts. One needs to average over all possible... – JimB Jan 05 '18 at 21:45
  • arrangements to assess bias (with each arrangement weighted by the probability of occurrence). – JimB Jan 05 '18 at 21:46
  • Thanks, Jim. I would enjoy talking about this more. Although, perhaps I need to think about this some first. I just obtained Chapman (1951) who identified bias in the Lincoln-Petersen with small sample size and I guess I assumed that was what I was seeing. The Chapman paper is in an obscure outlet but I think is cited in the Mark-recapture Wikipedia article. – Mark Miller Jan 05 '18 at 23:24
  • Chapman, D.G. (1951). "Some properties of the hypergeometric distribution with applications to zoological sample censuses". Univ. Calif. Public. Stat. 1, 131-60. – Mark Miller Jan 05 '18 at 23:26