5

Given a difficult learning task (e.g. high dimensionality, inherent data complexity), deep neural networks become hard to train. To ease many of these problems one might:

  1. Normalize and handpick quality data
  2. Choose a different training algorithm (e.g. RMSprop instead of plain gradient descent)
  3. Pick a cost function with steeper gradients (e.g. cross-entropy instead of MSE)
  4. Use a different network structure (e.g. convolutional layers instead of fully connected feedforward layers)

I have heard that there are clever ways to initialize the weights better. For example, you can choose the magnitude of the initial weights more carefully; Glorot and Bengio (2010) suggest:

  • for hyperbolic tangent units: sample a Uniform(-r, r) with $r = \sqrt{\frac{6}{N_{in} + N_{out}}}$
  • for sigmoid units: sample a Uniform(-r, r) with $r = 4\sqrt{\frac{6}{N_{in} + N_{out}}}$ (sketched in code below)
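For concreteness, this is roughly what I mean (a minimal NumPy sketch; the function name and layer sizes are just made up for illustration):

```python
import numpy as np

def init_uniform(n_in, n_out, activation="tanh"):
    """Sample an (n_in, n_out) weight matrix from Uniform(-r, r),
    with r chosen as suggested by Glorot and Bengio (2010)."""
    r = np.sqrt(6.0 / (n_in + n_out))
    if activation == "sigmoid":
        r *= 4.0  # sigmoid units use the factor-4 variant
    return np.random.uniform(-r, r, size=(n_in, n_out))

W1 = init_uniform(784, 100, activation="tanh")    # hidden layer (sizes made up)
W2 = init_uniform(100, 10, activation="sigmoid")  # output layer
```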

Is there any consistent way of initializing the weights better?

mdewey

4 Answers

3

Recently, Batch Normalization was introduced largely for this purpose. Please see the paper by Ioffe and Szegedy (2015), "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
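Roughly, batch normalization standardizes each layer's pre-activations over the mini-batch and then rescales them with learnable parameters, which makes training much less sensitive to how the weights were initialized. A minimal NumPy sketch of the forward pass (training mode only; the variable names are mine):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features) pre-activations; gamma, beta: learnable (features,) vectors."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize
    return gamma * x_hat + beta            # scale and shift

x = np.random.randn(32, 100) * 5 + 3       # badly scaled activations
gamma, beta = np.ones(100), np.zeros(100)
y = batchnorm_forward(x, gamma, beta)      # ~zero mean, ~unit variance per feature
```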

  • I am using this already. Is it enough by itself, or can it be further improved? – Joonatan Samuel Mar 28 '16 at 14:57
  • There were some extensions, but I think this is the most popular one. I don't remember the exact names of the extensions. I think this should be enough; you can also use a higher learning rate while optimizing. – user52705 Mar 28 '16 at 15:15
  • I have seen adaptive weight optimization algorithms work a lot better. But thanks a lot! – Joonatan Samuel Mar 28 '16 at 15:21
3

The paper 'All you need is a good init' (Mishkin and Matas, 2015) is a good, relatively recent article about initialization in deep learning. What I liked about it is that:

  1. It has a short and effective literature survey on init methods, references included.
  2. It achieves very good results on CIFAR-10 without too many bells and whistles; a rough sketch of the core idea follows below.
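As I understand it, the core of their procedure is: pre-initialize each layer with an orthonormal matrix, then rescale it so that the layer's output variance on a mini-batch is close to 1. A rough NumPy sketch of that idea (my own code, not the authors'):

```python
import numpy as np

def orthonormal(n_in, n_out):
    """Orthonormal initialization via SVD of a Gaussian matrix."""
    a = np.random.randn(n_in, n_out)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == (n_in, n_out) else vt

def variance_scale(W, x, tol=0.05, max_iter=10):
    """Rescale W so that var(x @ W) is approximately 1 on the batch x."""
    for _ in range(max_iter):
        var = np.var(x @ W)
        if abs(var - 1.0) < tol:
            break
        W /= np.sqrt(var)
    return W

x = np.random.randn(128, 256)   # a mini-batch of inputs to this layer
W = orthonormal(256, 256)
W = variance_scale(W, x)
```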
rhadar
3

As far as I know, the two formulas you gave are pretty much the standard initialization. I did a literature review on this a while ago; please see my linked answer.

amoeba
Franck Dernoncourt
1

Weight initialization depends on the activation function being used. Glorot and Bengio (2010) derived a method for initializing weights based on the assumption that the activations are linear. Their method results in the formula: \begin{align} W \sim U \left[ -\frac{\sqrt 6}{\sqrt {n_{i} + n_{i+1}}}, \frac{\sqrt 6}{\sqrt {n_{i} + n_{i+1}}} \right] \end{align}

The weights are initialized from a uniform distribution, where $n_{i}$ represents the $\text{fan in}$ and $n_{i+1}$ represents the $\text{fan out}$ of the layer.

He et al. (2015) repeated the derivation with ReLUs as the activation function and obtained the weight initialization formula:

\begin{align} W_l \sim \mathcal N \left({\Large 0}, \sqrt{\frac{2}{n_l}} \right). \end{align}

The weights are drawn from a zero-mean Gaussian distribution whose standard deviation (std) is $\sqrt{\frac{2}{n_l}}$, where $n_l$ is the fan-in of layer $l$.
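A minimal NumPy sketch of the He initialization above (the function name is mine):

```python
import numpy as np

def he_normal(n_in, n_out):
    """He et al. (2015) initialization for ReLU layers:
    zero-mean Gaussian with std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / n_in)
    return np.random.normal(0.0, std, size=(n_in, n_out))

W = he_normal(512, 256)
print(W.std())  # should be close to sqrt(2/512) = 0.0625
```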

A more comprehensive series of articles covering the mathematics behind weight initialization can be found here.