
I am getting different results (close, but not exactly the same) from R's glm and from manually solving the logistic regression optimization. Could anyone tell me where the problem is?

Does BFGS not converge? Is it a numerical problem with finite precision?

Thanks

# logistic regression without an intercept
fit <- glm(factor(vs) ~ hp + wt - 1, data = mtcars, family = binomial())

# manually write the logistic loss and use BFGS to minimise it
x <- as.matrix(mtcars[, c("hp", "wt")])   # same predictors as in the glm fit
y <- ifelse(mtcars$vs == 1, 1, -1)        # recode the response as +/- 1

lossLogistic <- function(w) {
  L <- log(1 + exp(-y * (x %*% w)))       # per-observation logistic loss
  sum(L)
}

opt <- optim(c(1, 1), lossLogistic, method = "BFGS")

[screenshot of the output: the glm coefficients and the optim solution, which are close but not identical]
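For reference (a small addition, not in the original post), the two sets of estimates that the screenshot presumably showed can be printed side by side:

rbind(glm = coef(fit), BFGS = opt$par)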

Haitao Du

1 Answer


Short answer: Optimise harder.

Your loss function is fine; there are no numerical issues there. For instance, you can easily check that:

all.equal( lossLogistic(coef(fit)), as.numeric(-logLik(fit)), 
           check.attributes = FALSE)
[1] TRUE

What happens is that you assume optim's BFGS implementation can do as well as a routine that uses actual gradient information; remember that Fisher scoring is essentially a Newton-Raphson routine. BFGS converged (opt$convergence equals 0), but the best BFGS could do was not the best attainable: because you did not provide a gradient function, the routine had to approximate the gradient numerically. If you used a better optimisation procedure that could exploit more gradient-like information, you would get the same results (an explicit-gradient variant is sketched after the example below). Here, because the log-likelihood is actually a very well-behaved function, I can even use a quadratic approximation procedure to "fake" gradient information.

library(minqa)
optQ <- minqa::uobyqa(c(1, 1), lossLogistic)
all.equal( optQ$par, coef(fit), check.attributes = FALSE)
[1] TRUE
all.equal( optQ$fval, as.numeric(-logLik(fit)), 
           check.attributes = FALSE)
[1] TRUE

It works.
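As a further illustration (a sketch of mine, not part of the original answer), you can also supply the analytic gradient of the same loss directly to optim via its gr argument; with exact gradient information BFGS should agree with the glm estimates up to the optimiser's tolerance (you may need to pass a modest tolerance to all.equal).

# analytic gradient of sum(log(1 + exp(-y * (x %*% w))))
gradLogistic <- function(w) {
  m <- y * (x %*% w)                      # margins y_i * x_i' w
  as.vector(-t(x) %*% (y * plogis(-m)))   # plogis(-m) = exp(-m) / (1 + exp(-m))
}

optG <- optim(c(1, 1), lossLogistic, gr = gradLogistic, method = "BFGS")
all.equal(optG$par, coef(fit), check.attributes = FALSE)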

usεr11852
  • Thank you very much for the great answer. I also learned minqa from you! – Haitao Du Aug 14 '16 at 12:21
  • Hmmm. I had it in my brain that BFGS requires a gradient as input. Does it work out some approximation if you don't supply it? – Matthew Drury Aug 14 '16 at 16:19
  • 1
    BFGS does require gradient information but this information will be numerically approximated when not provided to optim. To quote optim's doc: "If (grad) is NULL, a finite-difference approximation will be used." I checked the code in optim.c where the finite differences are computed; it is a standard first-order centred difference scheme. Nice for something quick and dirty but clearly no match for a quadratic approximation routine. – usεr11852 Aug 14 '16 at 17:08
  • @hxd1011: I learned about it reading through the R-sig-ME e-mail list (ME stands for mixed effects). It was one of the major things that changed between the 0.999... and the 1.x-y versions of lme4. It started offering bobyqa and Nelder-Mead as optimisation options; this was quite contrary to the approach of lme where it used a combination of E-M steps followed by N-R iterations. – usεr11852 Aug 14 '16 at 17:24
  • @MatthewDrury: Thanks for pointing out the ambiguity regarding the current use of gradient information by optim's BFGS. I hope the minor edit and the comment following your observation clear things up. – usεr11852 Aug 14 '16 at 17:40
  • 2 years later, I learned details about glm in this great answer! – Haitao Du May 07 '18 at 20:01
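Following up on the finite-difference remark in the comments, here is a quick numerical check (my sketch, not part of the thread): a central-difference approximation of the loss gradient can be compared against the analytic gradLogistic defined above.

# central finite differences, one coordinate at a time
numGrad <- function(f, w, eps = 1e-6) {
  sapply(seq_along(w), function(j) {
    e <- numeric(length(w)); e[j] <- eps
    (f(w + e) - f(w - e)) / (2 * eps)
  })
}

w0 <- c(1, 1)
all.equal(numGrad(lossLogistic, w0), gradLogistic(w0), tolerance = 1e-4)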