8

Consider this example. Suppose we have three events that occur with probabilities $p_1=p_2=\frac{1}{2}\sin ^2\theta$ and $p_3=\cos ^2\theta$ respectively, and suppose the true value is $\theta _0=\frac{\pi}{2}$. If we run the experiment $n$ times, we will only ever see events 1 and 2 happen. Hence the MLE should be $$\mathrm{arg} \underset{\theta}{\max}\left[ \frac{1}{2}\sin ^2\theta \right] ^{m_1}\left[ \frac{1}{2}\sin ^2\theta \right] ^{m_2},$$ where $m_1$ is the number of times event 1 happened and $m_2$ is the number of times event 2 happened. Obviously, the solution to this optimization problem (which is our MLE) is $\pi/2$, a constant that does not depend on $m_1$ or $m_2$.

But let's see what our Fisher information is; its inverse should give a lower bound on the variance of any unbiased estimator: $$\begin{align} &\frac{\left[ \partial _{\theta}\frac{1}{2}\sin ^2\theta \right] ^2}{\frac{1}{2}\sin ^2\theta}+\frac{\left[ \partial _{\theta}\frac{1}{2}\sin ^2\theta \right] ^2}{\frac{1}{2}\sin ^2\theta}+\frac{\left[ \partial _{\theta}\cos ^2\theta \right] ^2}{\cos ^2\theta} \\ &=2\frac{\left[ \partial _{\theta}\frac{1}{2}\sin ^2\theta \right] ^2}{\frac{1}{2}\sin ^2\theta}+\frac{\left[ \partial _{\theta}\cos ^2\theta \right] ^2}{\cos ^2\theta} \\ &=4\cos ^2\theta +4\sin ^2\theta \\ &=4 \end{align}$$ Hence there is a conflict; where is my understanding wrong?
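A small R sketch (with arbitrary $\theta$ values, away from multiples of $\pi/2$ so that no probability is zero) confirms this value numerically:

# numeric check of the Fisher information sum above
fisher <- function(theta) {
  p  <- c(0.5 * sin(theta)^2, 0.5 * sin(theta)^2, cos(theta)^2)   # p1, p2, p3
  dp <- c(sin(theta) * cos(theta), sin(theta) * cos(theta),       # d p_i / d theta
          -2 * sin(theta) * cos(theta))
  sum(dp^2 / p)
}
sapply(c(0.3, 0.7, 1.2), fisher)   # each value is 4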

Thanks in advance!

Edit: If I calculate the MLE not based on the observed data but instead on all the possible outcomes before we do the experiments, it should be $$\mathrm{arg}\underset{\theta}{\max}\left[ \frac{1}{2}\sin ^2\theta \right] ^{m_1}\left[ \frac{1}{2}\sin ^2\theta \right] ^{m_2}\left[ \cos ^2\theta \right] ^{m_3}.$$ Taking the logarithm gives $$\left( m_1+m_2 \right) \left( 2\ln\sin \theta -\ln 2 \right) +2m_3\ln\cos \theta,$$ and setting the derivative with respect to $\theta$ to zero gives $$2\left( m_1+m_2 \right) \frac{\cos \theta}{\sin \theta}=2m_3\frac{\sin \theta}{\cos \theta}.$$ Hence if the true value is $\theta_0=\pi/2$, then $m_3$ is always $0$, and we always conclude that $\hat\theta=\pi/2$, which has no variance at all. Hence the variance of $\hat\theta$ is $0$, which violates the Cramér–Rao bound.
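A small R sketch of this situation (taking $\theta_0 = \pi/2$, $n = 100$ and $10^4$ repetitions, all arbitrary choices) makes the point concrete:

# at theta_0 = pi/2 we have (p1, p2, p3) = (1/2, 1/2, 0), since cos^2(pi/2) = 0
set.seed(1)
n <- 100
m <- rmultinom(10^4, size = n, prob = c(1/2, 1/2, 0))   # each column is one experiment (m1, m2, m3)
all(m[3, ] == 0)                                        # TRUE: event 3 never occurs
theta_hat <- atan(sqrt((m[1, ] + m[2, ]) / m[3, ]))     # arctan form of the MLE; equals pi/2 whenever m3 = 0
var(theta_hat)                                          # 0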

narip
  • 187
  • 12
  • @SextusEmpiricus Hello, by definition of the Fisher information, it should be $\int_{\mathbb{R}}{\left( \frac{\partial}{\partial \theta}\log f(x;\theta ) \right) ^2}f(x;\theta )\,dx$, and by discretizing it and differentiating the log term we get $\sum_x{\frac{\left[ \frac{\partial}{\partial \theta}p\left( x;\theta \right) \right] ^2}{p\left( x;\theta \right)}}$, which is the formula used in the post. – narip Oct 21 '22 at 00:07
  • @SextusEmpiricus We have $\frac{\partial}{\partial \theta}\log f(x;\theta )=\frac{\frac{\partial}{\partial \theta}f(x;\theta )}{f(x;\theta )}$, where log stands for natural logarithm. Hence $\left[ \frac{\frac{\partial}{\partial \theta}f(x;\theta )}{f(x;\theta )} \right] ^2f(x;\theta )=\frac{\left[ \frac{\partial}{\partial \theta}f(x;\theta ) \right] ^2}{f(x;\theta )}$. – narip Oct 21 '22 at 06:51

4 Answers

4

The first issue you have here is that your likelihood function does not appear to match the description of your sampling mechanism. You say that you only observe events 1 and 2 happen, but if the sample size $n$ is known then this still fixes the number of times that event 3 happens (since $m_1 + m_2 + m_3 = n$). Taking an observation $\mathbf{m} \sim \text{Mu}(n, \mathbf{p})$ gives the likelihood function:

$$L_\mathbf{m}(\theta) = \bigg( \frac{1}{2} \sin^2 (\theta) \bigg)^{m_1 + m_2} \bigg( \cos^2 (\theta) \bigg)^{n - m_1 - m_2},$$

which gives the log-likelihood:

$$\ell_\mathbf{m}(\theta) = \text{const} + (m_1 + m_2) \log (\sin^2 (\theta)) + (n - m_1 - m_2) \log (\cos^2 (\theta)).$$

As you can see, the likelihood function has an extra term that you have not included in your question. We can see that $M_* = M_1+M_2 \sim \text{Bin}(n, \sin^2 (\theta))$ is a sufficient statistic in this problem so the problem is essentially one of binomial inference with a transformed probability parameter. With a bit of calculus it can be shown that the MLE solves:

$$\sin^2 (\hat{\theta}) = \frac{m_1 + m_2}{n} \quad \quad \quad \implies \quad \quad \quad \hat{\theta} = \text{arcsin} \bigg( \sqrt{\frac{m_1 + m_2}{n}} \bigg).$$
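As a quick numerical check (a sketch with arbitrary values $\theta = 0.6$ and $n = 50$), the closed form agrees with direct maximisation of the log-likelihood:

set.seed(42)
theta <- 0.6; n <- 50
m <- rmultinom(1, n, c(0.5 * sin(theta)^2, 0.5 * sin(theta)^2, cos(theta)^2))
loglik <- function(t) (m[1] + m[2]) * log(0.5 * sin(t)^2) + m[3] * log(cos(t)^2)
optimize(loglik, interval = c(1e-6, pi/2 - 1e-6), maximum = TRUE)$maximum   # numerical MLE
asin(sqrt((m[1] + m[2]) / n))                                               # closed form (agrees up to optimiser tolerance)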

This estimator for the parameter is generally biased (it is unbiased in the case where $\sin^2 (\theta) = \tfrac{1}{2}$) so the applicable version of the Cramér–Rao lower bound in this case is the generalisation for biased estimators:

$$\mathbb{V}(\hat{\theta}) \geqslant \frac{[\psi'(\theta)]^2}{I(\theta)} \quad \quad \quad \quad \quad \psi(\theta) \equiv \mathbb{E}(\hat{\theta}).$$

The expectation function is:

$$\begin{align} \psi(\theta) &= \sum_{m=0}^n \text{arcsin} \bigg( \sqrt{\frac{m}{n}} \bigg) \cdot \text{Bin} (m|n,\sin^2 (\theta)) \\[6pt] &= \sum_{m=0}^n {n \choose m} \cdot \text{arcsin} \bigg( \sqrt{\frac{m}{n}} \bigg) \cdot \sin^{2m} (\theta) \cdot \cos^{2(n-m)} (\theta) \\[6pt] &= \frac{\pi}{2} \cdot \sin^{2n} (\theta) + \sum_{m=1}^{n-1} {n \choose m} \cdot \text{arcsin} \bigg( \sqrt{\frac{m}{n}} \bigg) \cdot \sin^{2m} (\theta) \cdot \cos^{2(n-m)} (\theta), \\[6pt] \end{align}$$

where the $m=0$ term vanishes because $\text{arcsin}(0) = 0$. Its derivative (which appears in the bound) is:

$$\begin{align} \psi'(\theta) &= \sum_{m=0}^n {n \choose m} \cdot \text{arcsin} \bigg( \sqrt{\frac{m}{n}} \bigg) \frac{d}{d\theta} \bigg[ \sin^{2m} (\theta) \cdot \cos^{2(n-m)} (\theta) \bigg] \\[6pt] &= n\pi \cdot \sin^{2n-1} (\theta) \cos (\theta) \\[6pt] &\quad + \sum_{m=1}^{n-1} {n \choose m} \cdot \text{arcsin} \bigg( \sqrt{\frac{m}{n}} \bigg) \sin^{2m-1} (\theta) \cos^{2(n-m)-1} (\theta) \bigg[ 2m \cos^2 (\theta) - 2(n-m) \sin^2 (\theta) \bigg] \\[6pt] &= n\pi \cdot \sin^{2n-1} (\theta) \cos (\theta) + 2 \sum_{m=1}^{n-1} {n \choose m} \cdot \text{arcsin} \bigg( \sqrt{\frac{m}{n}} \bigg) \sin^{2m-1} (\theta) \cos^{2(n-m)-1} (\theta) \, (m - n \sin^2 (\theta)). \\[6pt] \end{align}$$

As you can see, this is a more complicated form for the Cramér–Rao lower bound. Nevertheless, it should hold in this problem.
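To see the bound in action, here is a rough R sketch (assuming $\theta = 0.6$ and $n = 20$, both arbitrary) that evaluates $\psi'(\theta)$ by numerical differentiation and compares the bound $[\psi'(\theta)]^2 / I(\theta)$, with $I(\theta) = 4n$, against the simulated variance of the MLE:

set.seed(1)
theta <- 0.6; n <- 20
psi <- function(th) {                                        # E(theta_hat) as a sum over the sufficient statistic
  m <- 0:n
  sum(asin(sqrt(m / n)) * dbinom(m, n, sin(th)^2))
}
eps   <- 1e-6
dpsi  <- (psi(theta + eps) - psi(theta - eps)) / (2 * eps)   # numerical psi'(theta)
bound <- dpsi^2 / (4 * n)                                    # [psi'(theta)]^2 / I(theta)
mstar <- rbinom(10^5, n, sin(theta)^2)                       # M* = M1 + M2 ~ Bin(n, sin^2(theta))
c(bound = bound, simulated_variance = var(asin(sqrt(mstar / n))))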

Ben
  • 124,856
2

Starting from the definition of the Cramer-Rao Lower bound from here.

Suppose $\theta$ is an unknown deterministic parameter which is to be estimated from $n$ independent observations (measurements) of $x$, each from a distribution according to some probability density function $f(x;\theta )$. The variance of any unbiased estimator $\hat{\theta}$ of $\theta$ is then bounded by the reciprocal of the Fisher information $I(\theta )$:

$$\operatorname{var}\left(\hat {\theta }\right)\geq {\frac {1}{I(\theta )}}.$$

The derivation of the maximum likelihood estimator proceeds as the original poster suggested:

setting the derivative with respect to $\theta$ to zero gives

$$2\left( m_1+m_2 \right) \frac{\cos \hat{\theta}}{\sin \hat{\theta}}=2(n-m_1-m_2)\frac{\sin \hat{\theta}}{\cos \hat{\theta}}.$$

Re-arranging gives us

$$\hat{\theta} = \tan^{-1} \sqrt{\frac{m_1+m_2}{n-m_1-m_2}}.$$
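As a sanity check (with hypothetical counts $m_1 = 7$, $m_2 = 9$, $n = 40$), this is the same function of the data as the $\arcsin$ form in the answer above, since $\tan^{-1}\sqrt{a/b} = \arcsin\sqrt{a/(a+b)}$:

m1 <- 7; m2 <- 9; n <- 40               # hypothetical counts
atan(sqrt((m1 + m2) / (n - m1 - m2)))   # arctan form
asin(sqrt((m1 + m2) / n))               # arcsin form; same value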

Although the proof is not straightforward, I have verified this with simulations, and it comes as no surprise based on the functional form of the estimator:

$$\operatorname{E}\left(\hat{\theta}\right) \neq \theta.$$

The MLE for $\theta$ is biased. Therefore, the standard Cramer-Rao lower bound does not apply. There is a generalized lower bound for biased estimators, but I did not compute it.

The full equation is:

$$\operatorname{E}\left(\hat{\theta}\right) = \sum_{m_1 = 0}^{n-1} \sum_{m_2 = 0}^{n-m_1-1} \tan^{-1} \sqrt{\frac{m_1+m_2}{n-m_1-m_2}} {n\choose m_1}p_1^{m_1} (1-p_1)^{n-m_1} {n\choose m_2}p_2^{m_2} (1-p_2)^{n-m_2}$$

where $p_1 = p_2 = \frac{1}{2}\sin^2\theta$ and $m_1$ and $m_2$ are independent, with $m_1 = \sum_i x_{1,i}$, $m_2 = \sum_i x_{2,i}$, and $(x_1, x_2, x_3) \sim \text{Multinomial}\left(\frac{1}{2}\sin^2\theta, \frac{1}{2}\sin^2\theta, \cos^2\theta\right)$.

The variance can be calculated similarly:

$$\operatorname{Var}\left(\hat{\theta}\right) = \operatorname{E}\left(\hat{\theta}^2\right) - \left[\operatorname{E}\left(\hat{\theta}\right)\right]^2.$$

MLE of $\theta$ is consistent

As a side note, the MLE is consistent: as $n \rightarrow \infty,~~\operatorname{E}\left(\hat{\theta}\right) \rightarrow \theta.$ Although this is not a formal proof, you can see heuristically that as $n \rightarrow \infty,$

$$m_1 \rightarrow np_1 = n\left(\frac{1}{2}\right)\sin^2\theta$$

$$m_2 \rightarrow n\left(\frac{1}{2}\right)\sin^2\theta$$

$$n-m_1-m_2 \rightarrow n\cos^2\theta$$

$$E\left(\hat{\theta}\right) \rightarrow \tan^{-1} \sqrt{\frac{\frac{n}{2}\sin^2\theta + \frac{n}{2}\sin^2\theta}{n\cos^2\theta}} = \theta.$$

A note on $\pi / 2$

As the original poster noted, if $\theta = \pi / 2$, then $m_3 = n - m_1 - m_2= 0$ because $\frac{1}{2} \sin^2 \theta = \frac{1}{2}$ and $\cos^2 \theta = 0.$ The estimator $\hat {\theta}$ is undefined for $\theta = \pi / 2.$

R Code

Here is R code showing how to calculate $\operatorname{E}\left(\hat{\theta}\right)$ and $\operatorname{Var}\left(\hat{\theta}\right)$:

> theta <- pi / 4
> n <- 10
> e_theta_hat <- 0
> e_theta_hat_square <- 0
> for (m1 in 0:(n-1))
+  {
+    for (m2 in 0:(n-m1-1))
+    {
+      e_theta_hat <- e_theta_hat + atan(sqrt((m1+m2)/(n-m1-m2)))*dbinom(m1, n, 0.5*(sin(theta))^2)*dbinom(m2, n, 0.5*(sin(theta))^2)
+      e_theta_hat_square <- e_theta_hat_square + (atan(sqrt((m1+m2)/(n-m1-m2))))^2*dbinom(m1, n, 0.5*(sin(theta))^2)*dbinom(m2, n, 0.5*(sin(theta))^2)
+    }
+ }
> e_theta_hat
[1] 0.7662692
> theta
[1] 0.7853982
> 
> e_theta_hat_square - (e_theta_hat)^2
[1] 0.04779768

> theta <- pi / 4
> n <- 1000
> e_theta_hat <- 0
> e_theta_hat_square <- 0
> for (m1 in 0:(n-1))
+  {
+    for (m2 in 0:(n-m1-1))
+    {
+      e_theta_hat <- e_theta_hat + atan(sqrt((m1+m2)/(n-m1-m2)))*dbinom(m1, n, 0.5*(sin(theta))^2)*dbinom(m2, n, 0.5*(sin(theta))^2)
+      e_theta_hat_square <- e_theta_hat_square + (atan(sqrt((m1+m2)/(n-m1-m2))))^2*dbinom(m1, n, 0.5*(sin(theta))^2)*dbinom(m2, n, 0.5*(sin(theta))^2)
+    }
+ }
> e_theta_hat
[1] 0.7853983
> theta
[1] 0.7853982
> 
> e_theta_hat_square - (e_theta_hat)^2
[1] 0.0003755647

R Carnell
  • 5,323
1

The variance of your MLE estimator should be computed while still treating the results as random, i.e.:

$$ \hat{\theta} = \mathrm{arg} \underset{\theta}{\max}\left[ \frac{1}{2}\sin ^2\theta \right] ^{m_1}\left[ \frac{1}{2}\sin ^2\theta \right] ^{m_2} \left[ \cos ^2\theta \right] ^{m_3} $$ $$ Var(\hat{\theta}) = Var \left( \mathrm{arg} \underset{\theta}{\max}\left[ \frac{1}{2}\sin ^2\theta \right] ^{m_1}\left[ \frac{1}{2}\sin ^2\theta \right] ^{m_2} \left[ \cos ^2\theta \right] ^{m_3} \right) = \dots $$

where $m_3$ is the number of observations of event 3. You can't just drop $\left[ \cos ^2\theta \right] ^{m_3}$ just because you didn't observe event 3 in your experiment. If you don't know the value of $\theta$, you must allow for the possibility of event 3 happening.

Even if after doing $n$ experiments you had observed some events of type 3, i.e. $m_3 > 0$, the evaluation of the $\mathrm{arg}\max$ above would still be a constant (since $m_1$, $m_2$, $m_3$ are then fixed numbers). That would not mean the variance of the estimator is $0$; we have just evaluated the estimate according to the results of the experiment, and there is no randomness in it.

What is the variance of the estimator $\hat{\theta}$? To compute it, you need to account for the randomness of the data, i.e. the distribution of $m_1$, $m_2$, $m_3$ given $\theta$. However, what you do know, by the Cramér–Rao theorem, is that the variance of an unbiased estimator is greater than or equal to $1/4$ for a single observation (and $1/(4n)$ for $n$ observations).
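A minimal R sketch of this distinction (with arbitrary choices $\theta = 1$ and $n = 30$, and using the full-likelihood estimator $\hat\theta = \arcsin\sqrt{(m_1+m_2)/n}$): each simulated experiment yields one estimate, and the spread of those estimates across experiments is the variance of the estimator.

set.seed(1)
theta <- 1; n <- 30
one_estimate <- function() {
  m <- rmultinom(1, n, c(0.5 * sin(theta)^2, 0.5 * sin(theta)^2, cos(theta)^2))
  asin(sqrt((m[1] + m[2]) / n))                # a single (non-random) estimate for this dataset
}
theta_hat <- replicate(10^4, one_estimate())   # the estimator, evaluated over many datasets
var(theta_hat)                                 # sampling variance of the estimator
1 / (4 * n)                                    # Cramer-Rao bound with I(theta) = 4 per observation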

druskacik
  • 111
  • Thank you for your answer. But it seems the Wikipedia explanation tends to say the MLE is based on the given observations. See Wikipedia's formula: ${\displaystyle f_{n}(\mathbf {y} ;\theta )=\prod _{k=1}^{n}\,f_{k}^{\mathsf {univar}}(y_{k};\theta )~,}$ i.e. the product of the observed events' probabilities. – narip Oct 18 '22 at 12:45
  • 1
    Don't confuse estimator and estimate. An estimator is a statistic, a function of the data; an estimate is the evaluation of the estimator at given data points. The estimator is random and has a variance; the estimate is not random. The result of the MLE method is an estimate, an evaluation of the estimator function at the points produced by the experiment. See also: https://stats.stackexchange.com/questions/7581/what-is-the-relation-between-estimator-and-estimate – druskacik Oct 18 '22 at 13:05
  • I don't think that the problem is with a wrong computation of the variance of the estimator. When $p_1 = p_2 =1/2$ and $p_3 = 0$ then the estimates will always be $\hat p_1 = \hat p_2 =1/2$ and $ \hat p_3 = 0$ and the variance of the estimator should be zero. The problem is that the Cramer Rao bound can not be applied to this case. – Sextus Empiricus Oct 21 '22 at 09:18
1

TLDR

The factor 1/4 is a correct result. The asymptotic variance of the estimator $\hat\theta = \arcsin\big(\sqrt{\hat p}\big)$, where $\hat p$ is the estimated value of $\sin^2 \theta$, is equal to $1/(4n)$ (and for smaller sample sizes you need to use the bound that is based on a non-zero bias).

R-code demonstration

set.seed(1)
n = 200
p = rbinom(10^6,n,0.25)/n   # simulated estimates of sin^2(theta), here with true value 0.25
t = asin(sqrt(p))           # corresponding estimates of theta
var(t)*n ### 0.2518232 ~ 1/4
var(p)*n ### 0.1873885 ~ 3/16

The problem occurs when $\sin^2 \theta = 0$ or $\sin^2 \theta = 1$, as in your example. The relationship with the Cramer-Rao bound breaks down because the information is infinite for $p=0$ or $p=1$, and under the parameterization $p = \sin(\theta)^2$ (whose inverse has an infinite derivative at those points) the bound becomes an undefined division of zero by zero.


Derivation of the information for a simpler case

To make the equations easier I am going to use only two states instead of three: $p_1 = f(\theta)$ and $p_2 = 1-f(\theta)$.

You apply the formula $$\mathcal{I}(\theta) = \sum_x{{ \overbrace{\left[ \frac{\frac{\partial}{\partial \theta} p\left( x;\theta \right) }{p\left( x;\theta \right) }\right] ^2}^{{logL^\prime}^2}}{p\left( x;\theta \right)}} \qquad \tag{1} $$

The derivative of the log-likelihood is:

$$\log L^\prime = \frac{ m_1 f^\prime(\theta)}{f(\theta)} - \frac{ m_2 f^\prime(\theta)}{1-f(\theta)}$$

The probability is

$$p(x;\theta) = f(\theta)^{m_1} (1-f(\theta))^{m_2}$$

And if we restrict to a single observation $m_1 = 1, m_2 = 0$ or $m_1 = 0, m_2 = 1$ then the sum in formula (1) becomes

$$\mathcal{I}(\theta) = \frac{f^\prime (\theta)^2}{f(\theta)} + \frac{f^\prime (\theta)^2}{1-f(\theta)} = \frac{f^\prime (\theta)^2}{f(\theta) (1-f(\theta))} $$

then

$$\mathcal{I}(\theta)^{-1} = \frac{f(\theta) (1-f(\theta))}{f^\prime (\theta)^2}$$

and with $f(\theta) = \sin^2\theta$ this becomes the same as your result

$$\mathcal{I}(\theta)^{-1} = \frac{ \sin^2\theta \cos^2\theta}{(2 \cos\theta \sin\theta )^2} = \frac{1}{4}$$

But with $f(\theta) = \theta$ you get

$$\mathcal{I}(\theta)^{-1} = \theta(1-\theta)$$

like in this question https://math.stackexchange.com/questions/396982/fisher-information-of-a-binomial-distribution

The two are related by the factor $$\left(\frac{\text{d} \theta}{\text{d} p}\right)^2 = \left(\frac{\text{d}\, \text{sin}^{-1}(p^{0.5})}{\text{d} p}\right)^2 = \frac{1}{4(1-p)p}.$$
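A quick numeric check of this factor (taking $p = 0.3$, an arbitrary value):

p <- 0.3
I_p       <- 1 / (p * (1 - p))             # Fisher information with f(theta) = theta
dtheta_dp <- 1 / (2 * sqrt(p * (1 - p)))   # derivative of asin(sqrt(p)) with respect to p
I_p / dtheta_dp^2                          # equals 4, the information under p = sin^2(theta)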