
In TensorFlow's implementation of ResNet, I see that they use the variance scaling initializer; I've also seen that the Xavier initializer is popular. I don't have much experience with this, so which is better in practice?

– Hanamichi
    For a more detailed explanation of Xavier initialization, you can visit this link: https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/ It includes a proper derivation of Xavier initialization and the intuition behind it. – Himanshu Singh Nov 11 '18 at 00:19

3 Answers


Historical perspective

Xavier initialization, originally proposed by Xavier Glorot and Yoshua Bengio in "Understanding the difficulty of training deep feedforward neural networks", is a weight-initialization technique that tries to make the variance of a layer's outputs equal to the variance of its inputs. This idea turned out to be very useful in practice. Naturally, the initialization depends on the layer's activation function, and in their paper Glorot and Bengio considered the logistic sigmoid activation, which was the default choice at the time.

Later on, the sigmoid activation was surpassed by ReLU, because it helped mitigate the vanishing/exploding gradients problem. Consequently, a new initialization technique appeared that applied the same idea (balancing the variance of the activations) to this new activation function. It was proposed by Kaiming He et al. in "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", and it is now often referred to as He initialization.

In TensorFlow, He initialization is implemented in the variance_scaling_initializer() function (which is, in fact, a more general initializer, but performs He initialization by default), while the Xavier initializer is, logically, xavier_initializer().

Summary

In summary, the main difference for machine learning practitioners is the following (a short code sketch is given after the list):

  • He initialization works better for layers with ReLU activation.
  • Xavier initialization works better for layers with sigmoid activation.
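
A minimal sketch of what this looks like in practice, using the current tf.keras initializers as an assumed stand-in for the TF 1.x functions named above (layer sizes are arbitrary examples):

```python
import tensorflow as tf

# Match the initializer to the activation of each layer.
model = tf.keras.Sequential([
    # ReLU layer -> He initialization
    tf.keras.layers.Dense(256, activation="relu",
                          kernel_initializer=tf.keras.initializers.HeNormal()),
    # Sigmoid layer -> Xavier/Glorot initialization
    tf.keras.layers.Dense(1, activation="sigmoid",
                          kernel_initializer=tf.keras.initializers.GlorotUniform()),
])
```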
– Maxim

Variance scaling is just a generalization of Xavier: http://tflearn.org/initializations/. They both operate on the principle that the scale of the gradients should be similar throughout all layers. Xavier is probably safer to use since it has withstood the experimental test of time; trying to pick your own parameters for variance scaling might inhibit training or cause your network to not learn at all.
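
To make the "generalization" concrete, here is a small sketch using the tf.keras initializers as an assumed stand-in for the TFLearn ones linked above: Xavier/Glorot uniform is just variance scaling with scale 1.0, fan-averaged mode, and a uniform distribution.

```python
import numpy as np
import tensorflow as tf

fan_in, fan_out = 512, 256

# Xavier/Glorot uniform expressed as a variance-scaling initializer ...
xavier_like = tf.keras.initializers.VarianceScaling(
    scale=1.0, mode="fan_avg", distribution="uniform")
# ... and the dedicated Glorot initializer for comparison.
glorot = tf.keras.initializers.GlorotUniform()

w1 = xavier_like((fan_in, fan_out)).numpy()
w2 = glorot((fan_in, fan_out)).numpy()

# Both should have a standard deviation close to sqrt(2 / (fan_in + fan_out)).
print(w1.std(), w2.std(), np.sqrt(2.0 / (fan_in + fan_out)))
```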

– liangjy

To understand the difference between these initializations, we need to understand what is going on inside the neural network (NN) during forward and backward propagation, and how to manage the neuron output signals and the backpropagated gradients.

We may focus on the NN architecture, but as Andrej Karpathy says in his video, training a NN is a delicate balancing act of looking after the statistics of the weights and signals on the forward and backward paths through the network, to prevent vanishing and exploding signals and gradients. Please see Deep Learning AI - The importance of effective initialization.

Then, Andrej Karpathy's Building makemore Part 3: Activations & Gradients, BatchNorm and Lecture 6 | Training Neural Networks I are the lectures to learn from. NN - 18 - Weight Initialization 1 - What not to do? gives an easier-to-follow explanation of Lecture 6 | Training Neural Networks I.

We need to prevent the neuron (and its activation) output signal from diminishing to 0. If the neuron output $y_i$ is 0, it becomes the next layer's input $x_{i+1}$, so $y_{i+1}=x_{i+1}@W_{i+1}^T$ will be 0 as well. Hence the NN just keeps forwarding 0 all the way up, and it will not learn.

[Diagram: the output signals of higher (further right) layers diminish to 0.]

We also need to prevent the neuron (and its activation) output signal from exploding. If the neuron output signal $y_{i}=x_{i}@W_{i}^T \rightarrow \infty$, what gradient should be used to update $x_i$ and $W_i$?

[Diagram: the output signals of higher layers explode.]
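
A quick NumPy sketch (not part of the original derivation) makes both failure modes concrete: weights that are too small drive the signal toward 0, and weights that are too large blow it up.

```python
import numpy as np

def forward(weight_std, n_layers=10, D=512, batch=1_000, seed=0):
    """Push a random batch through n_layers linear layers y = x @ W.T."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, D))
    for _ in range(n_layers):
        W = rng.standard_normal((D, D)) * weight_std
        h = h @ W.T
    return h

# Each layer scales the signal's std by roughly weight_std * sqrt(D).
print(np.abs(forward(0.01)).mean())   # tiny (~1e-7): the signal has vanished
print(np.abs(forward(1.0)).mean())    # huge (~1e13): the signal has exploded
```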

The first step toward avoiding such vanishing or explosion is to make sure the variances of $x_i$ and $W_i$ are 1.0. Because the variance of the product $y_{i}=x_{i}@W_{i}^T$ of two such normal distributions will be $D$, where $D$ is the dimension of $x$ and $W$, we need to scale the weights by $\frac{1}{\sqrt{D}}$ to keep the variance of $y_{i}$ at 1.0. See Variance of product of multiple independent random variables. This is what Xavier initialization does for symmetric activations. For an asymmetric activation such as ReLU, which zeroes out half of the signal, the surviving variance is only $\frac{D}{2}$, so He initialization scales by $\sqrt{\frac{2}{D}}$ instead.
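
Below is a small NumPy check of this argument (a sketch, not part of the original answer), using the same $y = x@W^T$ convention as the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, D = 10_000, 512

x = rng.standard_normal((batch, D))            # unit-variance inputs
W = rng.standard_normal((D, D))                # unit-variance weights

print((x @ W.T).var())                         # ~D (512): variance blows up

W_xavier = W / np.sqrt(D)                      # Xavier-style 1/sqrt(D) scaling
print((x @ W_xavier.T).var())                  # ~1.0: variance is preserved

# With ReLU, half of the signal is zeroed out, so He scaling uses sqrt(2/D).
h = x
for _ in range(10):
    W_he = rng.standard_normal((D, D)) * np.sqrt(2.0 / D)
    h = np.maximum(h @ W_he.T, 0.0)
print((h ** 2).mean())                         # stays ~1.0 across 10 ReLU layers
```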

If the network gets deeper or uses different activations, there are more elements to consider, and it becomes difficult to manage the variances of $x$ and $W$ by initialization alone. Batch Normalization can be regarded as the first successful attempt to manage this dynamically at each NN layer, but it can also be seen as a hack that may backfire. In Let's build GPT: from scratch, in code, spelled out., Andrej Karpathy says that no one likes the Batch Normalization layer and people want to remove it; he also says that it introduces many bugs and that he has shot himself in the foot with it.

Hence, how to manage the statistics inside a neural network is still an active research area.

– mon