4

I have implemented two solvers for training neural networks: one is based on stochastic gradient descent (SGD), the other on the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm.

I have read a lot of material and found that it is common to use SGD rather than BFGS, but in my experiments BFGS performs better than SGD.
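For reference, here is a minimal sketch of the kind of comparison I have in mind (not my actual solvers, just an illustration using scikit-learn's MLPClassifier, whose `lbfgs` solver is a limited-memory variant of BFGS):

```python
# Illustrative comparison only: scikit-learn's MLPClassifier exposes both an
# SGD solver and an L-BFGS solver (a limited-memory variant of BFGS).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for solver in ("sgd", "lbfgs"):
    clf = MLPClassifier(hidden_layer_sizes=(50,), solver=solver,
                        max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    # On a small dataset like this, the quasi-Newton solver often scores higher.
    print(solver, clf.score(X_test, y_test))
```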

Can anyone tell me why people prefer SGD to BFGS?

Glen_b
  • 282,281
maple
  • 299

2 Answers

4

Neural networks are successful when you have huge training sets. In that setting training time is the main bottleneck, and SGD is much faster than batch methods (and, unlike BFGS, requires essentially no extra memory); see the papers of Léon Bottou. So I suspect you are seeing good performance on a toy problem, which is not where neural nets excel.
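As a rough back-of-the-envelope sketch of the memory point (the network size here is an assumption for illustration, not a measurement): plain BFGS maintains a dense p x p approximation to the inverse Hessian, while SGD only keeps the parameter vector and a minibatch gradient.

```python
# Rough memory comparison for a network with p parameters (assumed size).
p = 1_000_000  # e.g. a modest network with one million weights

bfgs_hessian_bytes = p * p * 8  # dense float64 inverse-Hessian approximation
sgd_extra_bytes = p * 8         # one extra gradient vector for SGD

print(f"BFGS inverse-Hessian approx: {bfgs_hessian_bytes / 1e12:.1f} TB")
print(f"SGD extra state:             {sgd_extra_bytes / 1e6:.1f} MB")
```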

seanv507
  • 6,743
  • How about replacing gradient descent with BFGS in the optimization of each minibatch? And in fact, my training set is 20 GB, so I think it is big enough. – maple Aug 27 '15 at 07:37
0

Do all of the nodes in your network perform smooth operations? One reason for using a gradient-based method rather than a quasi-Newton method is non-differentiability, which occurs with many common activation functions such as ReLU, and also with L1 regularization.
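For instance (a toy illustration, not taken from your setup): ReLU(x) = max(0, x) has different one-sided derivatives at x = 0, which breaks the smoothness assumptions behind quasi-Newton curvature estimates.

```python
# One-sided finite differences of ReLU at 0 disagree, showing the kink.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

h = 1e-6
left_slope = (relu(0.0) - relu(-h)) / h    # ~0: slope approaching from the left
right_slope = (relu(h) - relu(0.0)) / h    # ~1: slope approaching from the right
print(left_slope, right_slope)             # the two do not agree at 0
```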

Also, I believe there are arguments against using quasi-Newton methods in online training, but I don't know enough about that to say more.