
In TensorFlow's implementation of ResNet, I see that they use the variance scaling initializer; I've also seen that the Xavier initializer is popular. I don't have much experience with this, so which is better in practice?

– Hanamichi
    For a more detailed explanation of Xavier initialization, you can visit this link: https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/ It includes a proper derivation of Xavier initialization and the intuition behind it. – Himanshu Singh Nov 11 '18 at 00:19

3 Answers


Historical perspective

Xavier initialization, originally proposed by Xavier Glorot and Yoshua Bengio in "Understanding the difficulty of training deep feedforward neural networks", is a weight-initialization technique that tries to make the variance of a layer's outputs equal to the variance of its inputs. This idea turned out to be very useful in practice. Naturally, the initialization depends on the layer's activation function, and in their paper Glorot and Bengio considered the logistic sigmoid activation, which was the default choice at the time.

Later on, the sigmoid activation was surpassed by ReLU, because it helped mitigate the vanishing/exploding gradients problem. Consequently, a new initialization technique appeared that applied the same idea (balancing the variance of the activations) to this new activation function. It was proposed by Kaiming He et al. in "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", and it is now often referred to as He initialization.

In TensorFlow, He initialization is implemented in the variance_scaling_initializer() function (which is, in fact, a more general initializer, but performs He initialization by default), while the Xavier initializer is, logically, xavier_initializer().

Summary

In summary, the main difference for machine learning practitioners is the following (a short code sketch is given after the list):

  • He initialization works better for layers with ReLU activation.
  • Xavier initialization works better for layers with sigmoid activation.
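
A minimal sketch of what this looks like in practice, using the current tf.keras initializers as an assumed stand-in for the TF 1.x functions named above (layer sizes are arbitrary examples):

```python
import tensorflow as tf

# Match the initializer to the activation of each layer.
model = tf.keras.Sequential([
    # ReLU layer -> He initialization
    tf.keras.layers.Dense(256, activation="relu",
                          kernel_initializer=tf.keras.initializers.HeNormal()),
    # Sigmoid layer -> Xavier/Glorot initialization
    tf.keras.layers.Dense(1, activation="sigmoid",
                          kernel_initializer=tf.keras.initializers.GlorotUniform()),
])
```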
– Maxim

Variance scaling is just a generalization of Xavier: http://tflearn.org/initializations/. They both operate on the principle that the scale of the gradients should be similar throughout all layers. Xavier is probably safer to use since it has withstood the experimental test of time; trying to pick your own parameters for variance scaling might inhibit training or cause your network to not learn at all.
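
To make the "generalization" concrete, here is a small sketch using the tf.keras initializers as an assumed stand-in for the TFLearn ones linked above: Xavier/Glorot uniform is just variance scaling with scale 1.0, fan-averaged mode, and a uniform distribution.

```python
import numpy as np
import tensorflow as tf

fan_in, fan_out = 512, 256

# Xavier/Glorot uniform expressed as a variance-scaling initializer ...
xavier_like = tf.keras.initializers.VarianceScaling(
    scale=1.0, mode="fan_avg", distribution="uniform")
# ... and the dedicated Glorot initializer for comparison.
glorot = tf.keras.initializers.GlorotUniform()

w1 = xavier_like((fan_in, fan_out)).numpy()
w2 = glorot((fan_in, fan_out)).numpy()

# Both should have a standard deviation close to sqrt(2 / (fan_in + fan_out)).
print(w1.std(), w2.std(), np.sqrt(2.0 / (fan_in + fan_out)))
```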

– liangjy

To understand the difference between these initializations, we need to understand what is going on inside the neural network (NN) during forward and backward propagation, and how to manage the neuron output signals and the backpropagated gradients.

We may focus on the NN architecture, but as Andrej Karpathy says in his video, training a NN is a delicate balancing act of looking after the statistics of the weights and signals on the forward and backward paths through the network, to prevent vanishing and exploding signals and gradients. Please see Deep Learning AI - The importance of effective initialization.

Then, Andrej Karpathy's Building makemore Part 3: Activations & Gradients, BatchNorm and Lecture 6 | Training Neural Networks I are the lectures to learn from. NN - 18 - Weight Initialization 1 - What not to do? gives an easier-to-follow explanation of Lecture 6 | Training Neural Networks I.

We need to prevent the neuron (and its activation) output signal from diminishing to 0. If the neuron output $y_i$ is 0, it becomes the next layer's input $x_{i+1}$, so $y_{i+1}=x_{i+1}@W_{i+1}^T$ will be 0 as well. Hence the NN just keeps forwarding 0 all the way up, and it will not learn.

[Diagram: the output signals of higher (further right) layers diminish to 0.]

We also need to prevent the neuron (and its activation) output signal from exploding. If the neuron output signal $y_{i}=x_{i}@W_{i}^T \rightarrow \infty$, what gradient should be used to update $x_i$ and $W_i$?

[Diagram: the output signals of higher layers explode.]
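
A quick NumPy sketch (not part of the original derivation) makes both failure modes concrete: weights that are too small drive the signal toward 0, and weights that are too large blow it up.

```python
import numpy as np

def forward(weight_std, n_layers=10, D=512, batch=1_000, seed=0):
    """Push a random batch through n_layers linear layers y = x @ W.T."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, D))
    for _ in range(n_layers):
        W = rng.standard_normal((D, D)) * weight_std
        h = h @ W.T
    return h

# Each layer scales the signal's std by roughly weight_std * sqrt(D).
print(np.abs(forward(0.01)).mean())   # tiny (~1e-7): the signal has vanished
print(np.abs(forward(1.0)).mean())    # huge (~1e13): the signal has exploded
```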

The first step toward avoiding such vanishing or explosion is to make sure the variances of $x_i$ and $W_i$ are 1.0. Because the variance of the product $y_{i}=x_{i}@W_{i}^T$ of two such normal distributions will be $D$, where $D$ is the dimension of $x$ and $W$, we need to scale the weights by $\frac{1}{\sqrt{D}}$ to keep the variance of $y_{i}$ at 1.0. See Variance of product of multiple independent random variables. This is what Xavier initialization does for symmetric activations. For an asymmetric activation such as ReLU, which zeroes out half of the signal, the surviving variance is only $\frac{D}{2}$, so He initialization scales by $\sqrt{\frac{2}{D}}$ instead.
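
Below is a small NumPy check of this argument (a sketch, not part of the original answer), using the same $y = x@W^T$ convention as the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, D = 10_000, 512

x = rng.standard_normal((batch, D))            # unit-variance inputs
W = rng.standard_normal((D, D))                # unit-variance weights

print((x @ W.T).var())                         # ~D (512): variance blows up

W_xavier = W / np.sqrt(D)                      # Xavier-style 1/sqrt(D) scaling
print((x @ W_xavier.T).var())                  # ~1.0: variance is preserved

# With ReLU, half of the signal is zeroed out, so He scaling uses sqrt(2/D).
h = x
for _ in range(10):
    W_he = rng.standard_normal((D, D)) * np.sqrt(2.0 / D)
    h = np.maximum(h @ W_he.T, 0.0)
print((h ** 2).mean())                         # stays ~1.0 across 10 ReLU layers
```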

If the network gets deeper or uses different activations, there are more elements to consider, and it becomes difficult to manage the variances of $x$ and $W$ by initialization alone. Batch Normalization can be regarded as the first successful attempt to manage this dynamically at each NN layer, but it can also be seen as a hack that may backfire. In Let's build GPT: from scratch, in code, spelled out., Andrej Karpathy says that no one likes the Batch Normalization layer and people want to remove it; he also says that it introduces many bugs and that he has shot himself in the foot with it.

Hence, how to manage the statistics inside a neural network is still an active research area.

– mon