4

I have implemented two solvers for training neural networks: one is based on stochastic gradient descent (SGD), the other on the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm.

I have read a lot of material and found that it is common to use SGD rather than BFGS, but in my experiments BFGS performs better than SGD.
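For reference, here is a minimal sketch of the kind of comparison I have in mind (not my actual solvers, just an illustration using scikit-learn's MLPClassifier, whose `lbfgs` solver is a limited-memory variant of BFGS):

```python
# Illustrative comparison only: scikit-learn's MLPClassifier exposes both an
# SGD solver and an L-BFGS solver (a limited-memory variant of BFGS).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for solver in ("sgd", "lbfgs"):
    clf = MLPClassifier(hidden_layer_sizes=(50,), solver=solver,
                        max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    # On a small dataset like this, the quasi-Newton solver often scores higher.
    print(solver, clf.score(X_test, y_test))
```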

Can anyone tell me why people prefer SGD to BFGS?

Glen_b
  • 282,281
maple
  • 299

2 Answers

4

Neural networks are successful when you have huge training sets. In that setting training time is the main bottleneck, and SGD is much faster than batch methods (and, unlike BFGS, requires essentially no extra memory); see the papers of Léon Bottou. So I suspect you are seeing good performance on a toy problem, which is not where neural nets excel.
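As a rough back-of-the-envelope sketch of the memory point (the network size here is an assumption for illustration, not a measurement): plain BFGS maintains a dense p x p approximation to the inverse Hessian, while SGD only keeps the parameter vector and a minibatch gradient.

```python
# Rough memory comparison for a network with p parameters (assumed size).
p = 1_000_000  # e.g. a modest network with one million weights

bfgs_hessian_bytes = p * p * 8  # dense float64 inverse-Hessian approximation
sgd_extra_bytes = p * 8         # one extra gradient vector for SGD

print(f"BFGS inverse-Hessian approx: {bfgs_hessian_bytes / 1e12:.1f} TB")
print(f"SGD extra state:             {sgd_extra_bytes / 1e6:.1f} MB")
```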

seanv507
  • 6,743
  • How about replacing gradient descent with BFGS in the optimization of each minibatch? And in fact, my training set is 20 GB, so I think it is big enough. – maple Aug 27 '15 at 07:37
0

Do all of the nodes in your network perform smooth operations? One reason for using a gradient-based method rather than a quasi-Newton method is non-differentiability, which occurs with many common activation functions such as ReLU, and also with L1 regularization.
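For instance (a toy illustration, not taken from your setup): ReLU(x) = max(0, x) has different one-sided derivatives at x = 0, which breaks the smoothness assumptions behind quasi-Newton curvature estimates.

```python
# One-sided finite differences of ReLU at 0 disagree, showing the kink.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

h = 1e-6
left_slope = (relu(0.0) - relu(-h)) / h    # ~0: slope approaching from the left
right_slope = (relu(h) - relu(0.0)) / h    # ~1: slope approaching from the right
print(left_slope, right_slope)             # the two do not agree at 0
```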

Also, I believe there are arguments against using quasi-Newton methods in online training, but I don't know enough about that to say more.