Are there any tutorials or guides on conventional wisdom for designing neural networks? For example, how do you pick:
- the number of layers or number of units per layer?
- an activation function?
- a step size?
- a regularization parameter?
- the minibatch size?
I think the 3rd one (the step size) should be something like $1/L$, where $L$ is the Lipschitz constant of the gradient of the (convex) loss on the minibatch, but I'm not entirely sure how that goes. (In practice I'm just using $\beta / |B|$, where $\beta$ is a fixed constant less than 1 and $|B|$ is the minibatch size, but from my few empirical experiments it looks like the right step size also depends on the number of layers / units per layer? See the sketch below for exactly what I'm doing.)
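To be concrete, here's a minimal sketch of the heuristic I'm using, on a toy least-squares problem with plain SGD (the data, `beta`, and `batch_size` are made-up stand-ins for my setup):

```python
import numpy as np

# Toy linear regression with squared loss, just to illustrate the heuristic.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)

w = np.zeros(20)
beta = 0.1              # my fixed constant, < 1
batch_size = 32
lr = beta / batch_size  # the heuristic: step size = beta / |B|

for step in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the *summed* (not averaged) minibatch squared loss;
    # with lr = beta / |B| this is equivalent to a step of size beta
    # along the averaged minibatch gradient.
    grad = Xb.T @ (Xb @ w - yb)
    w -= lr * grad

print(np.mean((X @ w - y) ** 2))  # final full-data mean squared error
```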
I know we can use cross-validation for the 4th one (the regularization parameter), but conventional wisdom would also be helpful there; a sketch of what I have in mind is below.
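For reference, this is the kind of cross-validation loop I mean, using scikit-learn's `GridSearchCV` with ridge regression as a stand-in for weight decay in a network (the grid and `cv=5` are arbitrary choices on my part):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=200)

# Sweep the regularization strength on a log-spaced grid and pick the
# value with the best 5-fold cross-validation score.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-4, 2, 13)},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```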
Thanks!