
I'm trying to understand the difference between the Newton-Raphson technique and the Fisher scoring technique by calculating the first iteration of each method for a Bernoulli sample. (I know that in this case I can calculate $\pi_{mle} = \bar{X}$ explicitly and immediately, but I want to do it iteratively just to understand and see how each method converges.)

Suppose I toss a coin $n=10$ times, the true parameter $\pi_t=0.3$ is unknown to me, and I got 4 heads, so $\bar{X}=0.4$.

The score function (the first derivative of the log-likelihood $\ell(\pi) = n\bar{X}\log\pi + n(1-\bar{X})\log(1-\pi)$) is:

$$u(\pi) = \frac{n\bar{X}}{\pi} - \frac{n(1-\bar{X})}{1-\pi}$$

The second derivative of the log-likelihood, whose negative is the observed Fisher information, is:

$$J(\pi) = -\frac{n\bar{X}}{\pi^2} - \frac{n(1-\bar{X})}{(1-\pi)^2}$$

and the expected Fisher information is:

$$I(\pi) = \frac{n\pi_t}{\pi^2} + \frac{n(1-\pi_t)}{(1-\pi)^2}$$

And note that the expected Fisher information simplifies (to the familiar $n/(\pi_t(1-\pi_t))$) only when we evaluate it at $\pi = \pi_t$, but we don't know where that is...

Now suppose my initial guess is $\pi_0 = 0.6$

Does Newton-Raphson simply go like this:

$$ \pi_1 = \pi_0 - u(\pi_0)/J(\pi_0)$$

?
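
If so, then plugging in my numbers ($n=10$, $\bar{X}=0.4$, $\pi_0=0.6$), the first step would be:

$$u(0.6) = \frac{4}{0.6} - \frac{6}{0.4} \approx -8.333, \qquad J(0.6) = -\frac{4}{0.6^2} - \frac{6}{0.4^2} \approx -48.611,$$

$$\pi_1 = 0.6 - \frac{-8.333}{-48.611} \approx 0.4286$$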

And does Fisher scoring go like this:

$$ \pi_1 = \pi_0 + u(\pi_0)/I(\pi_0)$$

Note that it contains $\pi_t$, which we don't know! And we can't even replace $\pi_t$ with $\pi_{mle}$, since we don't know that either - it's exactly what we're looking for...

Can you please show me these two methods in the most concrete way possible? Thanks!

ihadanny

1 Answer


For Newton-Raphson, yes, we have

$$ \pi_1 = \pi_0 - u(\pi_0)/J(\pi_0).$$

For Fisher scoring, as you mentioned, the expected information $I(\pi)$ involves the unknown true parameter $\pi_t$. However, given the information identity $I(\pi)=-E(J(\pi))=E[u(\pi)^2]$, we can use the sample first derivatives to approximate the expected information:

$$\hat I(\pi_0) = \sum_{i=1}^n u_i(\pi_0)^2,$$

where $u_i(\pi)= \frac{x_i}{\pi} - \frac{1-x_i}{1-\pi}$ and $x_i$ is the indicator of heads on draw $i$. Then

$$ \pi_1 = \pi_0 + u(\pi_0)/\hat I(\pi_0).$$

Note that we need large $n$, since this approximation is based on asymptotic theory.
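
For concreteness, with the numbers from the question ($n=10$, four heads, $\pi_0=0.6$), a quick hand computation gives

$$\hat I(0.6) = \frac{4}{0.6^2} + \frac{6}{0.4^2} \approx 48.611, \qquad u(0.6) = \frac{4}{0.6} - \frac{6}{0.4} \approx -8.333,$$

so $\pi_1 = 0.6 - 8.333/48.611 \approx 0.4286$, the same value as the Newton-Raphson step above (see the update below for why).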

I revised I_hat(pi) in @ihadanny's Python code; now Newton-Raphson and Fisher scoring give identical results.

import random

import numpy as np

# Simulate n coin tosses with a "true" parameter pi_t that the
# estimation procedures below are not allowed to use directly.
pi_t = random.random()
n = 1000
draws = [1 if x < pi_t else 0 for x in np.random.rand(n)]
x_bar = np.mean(draws)

def u(pi):
    # Score: first derivative of the log-likelihood.
    return n*x_bar/pi - n*(1-x_bar)/(1-pi)

def J(pi):
    # Second derivative of the log-likelihood.
    return -n*x_bar/pi**2 - n*(1-x_bar)/((1-pi)**2)

def I_hat(pi):
    # Empirical Fisher information: sum of squared per-draw scores.
    return sum((d/pi - (1-d)/(1-pi))**2 for d in draws)

def Newton(pi):
    return pi - u(pi)/J(pi)

def Fisher(pi):
    return pi + u(pi)/I_hat(pi)

def dance(method_name, method):
    print("starting iterations for: " + method_name)
    pi, prev_pi, i = 0.5, None, 0
    # Both methods converge to the MLE x_bar, not to pi_t, so the printed
    # delta settles at the sampling error |x_bar - pi_t| rather than 0.
    while i == 0 or (abs(pi-pi_t) > 0.001 and abs(pi-prev_pi) > 0.001 and i < 10):
        prev_pi, pi = pi, method(pi)
        i += 1
        print(method_name, i, "delta: ", abs(pi-pi_t))

dance("Newton", Newton)
dance("Fisher", Fisher)

Output:
starting iterations for: Newton
Newton 1 delta:  0.00899203081545
Newton 2 delta:  0.00899203081545
starting iterations for: Fisher
Fisher 1 delta:  0.00899203081545
Fisher 2 delta:  0.00899203081545

Update

This is a special case in which Newton-Raphson and Fisher scoring are identical, because

$$\hat I(\pi)=\sum_{i=1}^n \left(\frac{x_i}{\pi} - \frac{1-x_i}{1-\pi}\right)^2= \frac{\sum_{i=1}^n x_i}{\pi^2} + \frac{n-\sum_{i=1}^n x_i}{(1-\pi)^2} = -J(\pi),$$

which is just standard algebra once you note that $x_i^2 = x_i$ and $x_i(1-x_i)=0$ for binary $x_i$ (so the cross term vanishes).
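
As a quick numerical sanity check of this identity (a minimal sketch; the seed, sample size, and true parameter here are arbitrary choices, not taken from the posts above):

import numpy as np

# Verify that I_hat(pi) equals -J(pi) exactly for binary draws,
# at several values of pi (the identity is exact, not asymptotic).
rng = np.random.default_rng(0)
n = 1000
draws = rng.binomial(1, 0.3, size=n)
x_bar = draws.mean()

for pi in (0.2, 0.5, 0.8):
    I_hat = np.sum((draws/pi - (1 - draws)/(1 - pi))**2)
    neg_J = n*x_bar/pi**2 + n*(1 - x_bar)/(1 - pi)**2
    assert np.isclose(I_hat, neg_J)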

Randel
  • hmm.. thanks a lot for this - makes sense. However, I've implemented your exact statements: http://pastebin.com/m192UYs9 - Newton converges after 1-2 iterations, Fisher doesn't even come close after 10 iterations. Isn't it supposed to be the other way around?? I thought that Fisher is an improvement over Newton... – ihadanny Oct 24 '16 at 08:18
  • 1
    My bad. I revised the answer. Note that the previous answer resulted in $\pi_1 = \pi_0 + 1/u(\pi_0)$. – Randel Oct 24 '16 at 19:50
  • oh great! one last question before I accept - isn't it strange that now both methods give exactly the same results every time I run the code? again - Fisher was supposed to be an improvement - it looks like with your (correct) approximation there's no advantage to using Fisher over Newton, and the two methods are mathematically equivalent :( – ihadanny Oct 25 '16 at 04:43
  • (and don't worry - I'll accept and award the bounty before it expires; it's just that I was really hoping this question would help me understand the fundamental difference between the methods, and currently it hasn't gotten me there - it just looks like mathematical gymnastics of the same thing to me) – ihadanny Oct 25 '16 at 04:46
  • I updated the answer with the proof of why the two methods are identical in this special case. In my understanding, Fisher scoring does not require second derivatives, but it needs large $n$. A good reference: Demidenko, Mixed Models: Theory and Applications with R. – Randel Oct 25 '16 at 15:35