5

Given a difficult learning task (e.g. high dimensionality, inherent data complexity), deep neural networks become hard to train. To ease many of these problems one might:

  1. Normalize and handpick quality data
  2. Choose a different training algorithm (e.g. RMSprop instead of plain gradient descent)
  3. Pick a cost function with steeper gradients (e.g. cross-entropy instead of MSE)
  4. Use a different network structure (e.g. convolutional layers instead of fully connected feedforward layers)

I have heard that there are clever ways to initialize the weights better. For example, you can choose the magnitude of the initial weights more carefully; Glorot and Bengio (2010) suggest:

  • for hyperbolic tangent units: sample a Uniform(-r, r) with $r = \sqrt{\frac{6}{N_{in} + N_{out}}}$
  • for sigmoid units: sample a Uniform(-r, r) with $r = 4\sqrt{\frac{6}{N_{in} + N_{out}}}$ (sketched in code below)
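For concreteness, this is roughly what I mean (a minimal NumPy sketch; the function name and layer sizes are just made up for illustration):

```python
import numpy as np

def init_uniform(n_in, n_out, activation="tanh"):
    """Sample an (n_in, n_out) weight matrix from Uniform(-r, r),
    with r chosen as suggested by Glorot and Bengio (2010)."""
    r = np.sqrt(6.0 / (n_in + n_out))
    if activation == "sigmoid":
        r *= 4.0  # sigmoid units use the factor-4 variant
    return np.random.uniform(-r, r, size=(n_in, n_out))

W1 = init_uniform(784, 100, activation="tanh")    # hidden layer (sizes made up)
W2 = init_uniform(100, 10, activation="sigmoid")  # output layer
```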

Is there any consistent way of initializing the weights better?

mdewey

4 Answers

3

Recently, Batch Normalization was introduced largely for this purpose. Please see the paper by Ioffe and Szegedy (2015), "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
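Roughly, batch normalization standardizes each layer's pre-activations over the mini-batch and then rescales them with learnable parameters, which makes training much less sensitive to how the weights were initialized. A minimal NumPy sketch of the forward pass (training mode only; the variable names are mine):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features) pre-activations; gamma, beta: learnable (features,) vectors."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize
    return gamma * x_hat + beta            # scale and shift

x = np.random.randn(32, 100) * 5 + 3       # badly scaled activations
gamma, beta = np.ones(100), np.zeros(100)
y = batchnorm_forward(x, gamma, beta)      # ~zero mean, ~unit variance per feature
```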

  • I am using this already. Is it enough by itself, or can it be further improved? – Joonatan Samuel Mar 28 '16 at 14:57
  • There were some extensions, but I think this is the most popular one. I don't remember the exact names of the extensions. I think this should be enough; you can also use a higher learning rate while optimizing. – user52705 Mar 28 '16 at 15:15
  • I have seen adaptive weight optimization algorithms work a lot better. But thanks a lot! – Joonatan Samuel Mar 28 '16 at 15:21
3

The paper 'All you need is a good init' (Mishkin and Matas, 2015) is a good, relatively recent article about initialization in deep learning. What I liked about it is that:

  1. It has a short and effective literature survey on init methods, references included.
  2. It achieves very good results on CIFAR-10 without too many bells and whistles; a rough sketch of the core idea follows below.
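As I understand it, the core of their procedure is: pre-initialize each layer with an orthonormal matrix, then rescale it so that the layer's output variance on a mini-batch is close to 1. A rough NumPy sketch of that idea (my own code, not the authors'):

```python
import numpy as np

def orthonormal(n_in, n_out):
    """Orthonormal initialization via SVD of a Gaussian matrix."""
    a = np.random.randn(n_in, n_out)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == (n_in, n_out) else vt

def variance_scale(W, x, tol=0.05, max_iter=10):
    """Rescale W so that var(x @ W) is approximately 1 on the batch x."""
    for _ in range(max_iter):
        var = np.var(x @ W)
        if abs(var - 1.0) < tol:
            break
        W /= np.sqrt(var)
    return W

x = np.random.randn(128, 256)   # a mini-batch of inputs to this layer
W = orthonormal(256, 256)
W = variance_scale(W, x)
```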
rhadar
3

As far as I know, the two formulas you gave are pretty much the standard initialization. I did a literature review on this a while ago; please see my linked answer.

amoeba
Franck Dernoncourt
1

Weight initialization depends on the activation function being used. Glorot and Bengio (2010) derived a method for initializing weights based on the assumption that the activations are linear. Their method results in the formula: \begin{align} W \sim U \left[ -\frac{\sqrt 6}{\sqrt {n_{i} + n_{i+1}}}, \frac{\sqrt 6}{\sqrt {n_{i} + n_{i+1}}} \right] \end{align}

The weights are initialized from a uniform distribution, where $n_{i}$ represents the $\text{fan in}$ and $n_{i+1}$ represents the $\text{fan out}$ of the layer.

He et al. (2015) repeated the derivation with ReLUs as the activation function and obtained the weight initialization formula:

\begin{align} W_l \sim \mathcal N \left({\Large 0}, \sqrt{\frac{2}{n_l}} \right). \end{align}

The weights are drawn from a zero-mean Gaussian distribution whose standard deviation (std) is $\sqrt{\frac{2}{n_l}}$, where $n_l$ is the fan-in of layer $l$.
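A minimal NumPy sketch of the He initialization above (the function name is mine):

```python
import numpy as np

def he_normal(n_in, n_out):
    """He et al. (2015) initialization for ReLU layers:
    zero-mean Gaussian with std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / n_in)
    return np.random.normal(0.0, std, size=(n_in, n_out))

W = he_normal(512, 256)
print(W.std())  # should be close to sqrt(2/512) = 0.0625
```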

A more comprehensive series of articles covering the mathematics behind weight initialization can be found here.