
I know there is an analytic solution to the following problem (OLS). Since I am trying to learn and understand the principles and basics of MLE, I implemented the Fisher scoring algorithm for a simple linear regression model.

$$ y = X\beta + \epsilon\\ \epsilon\sim N(0,\sigma^2) $$

The log-likelihood for $\sigma^2$ and $\beta$ is given by: $$ -\frac{N}{2}\ln(2\pi)-\frac{N}{2}\ln(\sigma^2)-\frac{1}{2\sigma^{2}}(y-X\beta)^{'}(y-X\beta) $$
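As a quick aside (my own addition, not part of the original derivation), this log-likelihood translates directly into a small R helper that can be used to sanity-check the scoring iterations; it assumes the design matrix x already contains the intercept column:

# Minimal sketch of the log-likelihood above (assumes x includes the intercept column)
loglik <- function(theta, y, x) {
  p      <- ncol(x)
  beta   <- theta[1:p]       # regression coefficients
  sigma2 <- theta[p + 1]     # error variance
  r      <- y - x %*% beta   # residuals
  as.numeric(-(nrow(x) / 2) * log(2 * pi) - (nrow(x) / 2) * log(sigma2) -
               crossprod(r) / (2 * sigma2))
}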

To compute the score function $S(\theta)$, where $\theta$ is the vector of parameters $(\beta,\sigma^{2})^{'}$, I take the first partial derivatives with respect to $\beta$ and $\sigma^{2}$: \begin{align} \frac{\partial L}{\partial \beta} &= \frac{1}{\sigma^{2}}(y-X\beta)^{'}X \\[5pt] \frac{\partial L}{\partial \sigma^2} &= -\frac{N}{\sigma^{2}}+\frac{1}{2\sigma^{4}}(y-X\beta)^{'}(y-X\beta) \end{align}

Then the Fisher scoring algorithm is implemented as: $$ \theta_{j+1} = \theta_{j} - \left(S(\theta_{j})S(\theta_{j})^{'}\right)^{-1}S(\theta_{j}) $$

Please note that the following code is a very naive implementation (no stopping rule, etc.):

library(MASS)
x <- matrix(rnorm(1000), ncol = 2)
y <- 2 + x %*% c(1,3) + rnorm(500)

fisher.scoring <- function(y, x, start = runif(ncol(x)+1)){
    n <- nrow(x)
    p <- ncol(x)
    theta <- start
    score <- rep(0, p+1)
    for (i in 1:1e5){
        # betas
        score[1:p] <- (1/theta[p+1]) * t((y - x%*%theta[1:p])) %*% x
        # sigma
        score[p+1] <- -(n/theta[p+1]) + (1/2*theta[p+1]^2) * crossprod(y - x %*% theta[1:p])
        # new
        theta <- theta - MASS::ginv(tcrossprod(score)) %*% score
    }
    return(theta)
}

# Gives the correct result 
lm.fit(cbind(1,x), y)$coefficients
# Does not give the correct result
fisher.scoring(y, cbind(1,x))
# Even if you start with the correct values
fisher.scoring(y, cbind(1,x), start=c(2,1,3,1))

My Question

What did I miss? Where is my mistake?

Thomas
  • A first glimpse shows $\frac{\partial L}{\partial\sigma^{2}}=-\frac{N}{{\color{red}2}\sigma^{2}}+\frac{1}{2\sigma^{4}}(y-X\beta)^{'}(y-X\beta)$. And $\theta_{j+1} = \theta_{j} {\color{red}+} \left(S(\theta_{j})S(\theta_{j})^{'}\right)^{-1}S(\theta_{j})$, though people would use a sum of squared first derivatives for each observation, not a squared sum. An example is here. – Randel Oct 10 '15 at 15:44

2 Answers


I have fixed the code according to the suggestions by @Randel. Now it works, except $\sigma^2$ takes a very long time to converge.

Here is the code:

fisher.scoring <- function(y, x, start = runif(ncol(x)+1)){
  n <- nrow(x)
  p <- ncol(x)
  theta <- start
  score <- rep(0, p+1)
  for (i in 1:1000){
    # betas
    score[1:p] <- (1/theta[p+1]) * t((y - x%*%theta[1:p])) %*% x
    # sigma
    score[p+1] <- -(n/(2*theta[p+1])) + 
                   (1/(2*theta[p+1]^2)) * 
                   crossprod(y - x %*% theta[1:p])
    # new
    hessMat <- matrix(0,ncol=p+1,nrow = p+1)
    for(j in 1:n)
    {
      # Estimate derivative of likelihood for each observation
      estVec <- c((1/theta[p+1]) * 
                t((y[j] - x[j,]%*%theta[1:p])) %*% x[j,],
                -(n/(2*theta[p+1])) + (1/(2*theta[p+1]^2)) *
                crossprod(y[j] - x[j,] %*% theta[1:p]))
      # Add them up as suggested to get an estimate of the Hessian.
      hessMat <- hessMat + estVec%*%t(estVec)
    }
    theta <- theta + MASS::ginv(hessMat) %*% score
  }
  return(theta)
}

Now when I run the code I get the following:

> lm.fit(cbind(1,x), y)$coefficients
       x1        x2        x3 
2.0136134 0.9356782 2.9666921 
> fisher.scoring(y, cbind(1,x))
          [,1]
[1,] 2.0136126
[2,] 0.9356782
[3,] 2.9666917
[4,] 0.6534185

I ran it again with 5000 iterations instead of 1000 and then I get:

> fisher.scoring(y, cbind(1,x))
          [,1]
[1,] 2.0136133
[2,] 0.9356782
[3,] 2.9666920
[4,] 0.8962295

Hope this helps! $\sigma^2$ seems to be converging very slowly; I do not know why. I guess that is material for another question!

EDIT: Here you can see the convergence of the parameters. The number of iterations is on a log scale. The regression coefficients take 30-40 iterations to converge, although the $\beta_1$ parameter overshoots and then comes back down, which I was not expecting to see. The $\sigma^2$ parameter converges rather quickly while the other parameters are still converging, and then its convergence slows down a lot. I have no idea at the moment why this happens.

[Figure: convergence of the parameters over iterations (log-scaled x-axis)]

EDIT2: There was an error in the update for $\sigma^2$. See Randel's answer for the correction.

Gumeo
  • Is there a way to solve a linear equation instead of explicitly calculating a matrix inverse in your last line of the outer loop? I believe this is almost always a better idea when possible. – Matthew Drury Oct 11 '15 at 15:08
  • Yes @MatthewDrury, you can use the solve function (a small illustration follows this comment thread). I assume the code would be somewhat faster. I did not want to make too many changes to the OP's code, to keep the changes clear. – Gumeo Oct 11 '15 at 15:15
  • Fair enough. While I'm at it, thanks to both you and the poster for the extremely clear and tidy code. – Matthew Drury Oct 11 '15 at 15:18
  • @MatthewDrury you are welcome :) – Gumeo Oct 11 '15 at 15:19
  • My experiment shows that the estimate for $\sigma^2$ does not change much across iterations. As in your graph, it only changes by about 0.05, much less than $\beta$. So it stays almost fixed at the starting value. – Randel Oct 11 '15 at 15:42
  • @Randel I assume the behaviour is very dependent on the initial values for the parameters. But when you try this, do you not observe this slowed down convergence of $\sigma^2$ when the other parameters have converged? – Gumeo Oct 11 '15 at 15:46
  • @Randel I tried to run it again with 20K iterations and it still does not converge. But the plot looks similar to the one in this post. – Gumeo Oct 11 '15 at 15:58
  • Yes, the only thing that seems strange to me is the bias of $\sigma^2$. It seems to converge if the definition of convergence is that the change between iterations is small, but the problem is that it does not converge to the true value. If you use a larger ylim, you will see there is not much change. – Randel Oct 11 '15 at 16:19
  • It appears that your 'Fisher scoring' is really gradient descent. It's a known fact that it doesn't always converge well (or even converge). There may be nothing you can do. – meh Nov 07 '17 at 14:54
  • @aginensky take a look at the next answer, it has the solution. – Gumeo Nov 07 '17 at 14:55
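As an aside, here is a small self-contained sketch of the solve-versus-ginv point raised by @MatthewDrury in the comments above (a toy example, not from the thread; the matrix A and vector b merely stand in for hessMat and score):

# Toy illustration: solving a linear system gives the same step as an explicit
# (pseudo-)inverse when the matrix is non-singular, without forming the inverse.
set.seed(1)
X <- matrix(rnorm(20), 5, 4)
A <- crossprod(X)                    # small positive-definite stand-in for hessMat
b <- rnorm(4)                        # stand-in for score
step_ginv  <- MASS::ginv(A) %*% b    # approach used in the answer above
step_solve <- solve(A, b)            # solve the linear system directly instead
all.equal(c(step_ginv), step_solve)  # agree up to numerical error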

The short answer is that there is a bug in @guðmundur-einarsson's code.

For each observation, the score function for $\sigma^2$ is $$\frac{\partial L_i}{\partial \sigma^2} = -\frac{1}{2\sigma^{2}}+\frac{1}{2\sigma^{4}}(y_i-X_i\beta)^{'}(y_i-X_i\beta),$$ not $\frac{\partial L_i}{\partial \sigma^2} = -\frac{{\color{red}N}}{2\sigma^{2}}+\frac{1}{2\sigma^{4}}(y_i-X_i\beta)^{'}(y_i-X_i\beta)$. So just replace n in estVec with 1 (a corrected sketch of the function is given below, after the figures). Although it's not apparent, there are some clues.

  • The estimate of $\sigma^2$ weirdly does not change much through the iterations, although it shows some trend when ylim is automatically set to a small range. This small change is because each move is divided by about $N$.
  • As noted by @guðmundur-einarsson in the related question:

The convergence is rather fast in the beginning and then stagnates and continues very slowly when the regression parameters converge.

When $\beta$ has not yet converged (at around 4 on the x-axis in the figures above), $\sigma^2$ depends on $\beta$, so it changes faster compared to after $\beta$ has converged. The log scale of the iteration axis also seems to produce some artificial effect.

  • The reason why $\beta$ can still converge with the bug is that its closed-form solution does not depend on $\sigma^2$ at all (the formula is recalled just below).
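For reference, the standard closed-form least-squares solution is $$\hat{\beta} = (X^{'}X)^{-1}X^{'}y,$$ which indeed involves no $\sigma^2$.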

The following figures are based on 100 iterations, with the random seed set to 1. [Figures: convergence of the parameters over 100 iterations]
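For completeness, here is a sketch of the corrected function: the code from the answer above with the single change of using 1 instead of n in the per-observation score for $\sigma^2$. The function name and iteration count are arbitrary choices, and, like the original, it has no stopping rule or step-size control.

# Sketch of the corrected Fisher scoring (1 instead of n in the per-observation
# score for sigma^2); uses y and x simulated in the question.
fisher.scoring.fixed <- function(y, x, start = runif(ncol(x) + 1), iter = 100){
  n <- nrow(x)
  p <- ncol(x)
  theta <- start
  score <- rep(0, p + 1)
  for (i in 1:iter){
    res <- y - x %*% theta[1:p]
    # total score for the betas
    score[1:p] <- (1 / theta[p + 1]) * t(res) %*% x
    # total score for sigma^2
    score[p + 1] <- -(n / (2 * theta[p + 1])) +
                     (1 / (2 * theta[p + 1]^2)) * crossprod(res)
    # empirical Fisher information: sum of outer products of per-observation scores
    hessMat <- matrix(0, ncol = p + 1, nrow = p + 1)
    for (j in 1:n){
      estVec <- c((1 / theta[p + 1]) * res[j] * x[j, ],
                  -(1 / (2 * theta[p + 1])) +                # 1, not n
                   (1 / (2 * theta[p + 1]^2)) * res[j]^2)
      hessMat <- hessMat + estVec %*% t(estVec)
    }
    theta <- theta + MASS::ginv(hessMat) %*% score
  }
  theta
}

# With the simulated data from the question:
fisher.scoring.fixed(y, cbind(1, x), iter = 100)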

The lesson I learned from this is to keep debugging, although that is relatively easy in this small example.

Randel
  • Thanks for pointing this out, @Randel! I deleted the other post to further reduce confusion for anyone who might run into this. – Gumeo Oct 14 '15 at 07:27