
The purpose of weight initialization in a neural network is to keep the variance of each layer's output close to 1.0, and the appropriate initialization depends on the computation the layer performs.

Initializing a weight matrix W with Xavier initialization for the matrix multiplication X @ W.T in the self-attention of the Transformer architecture means sampling from $N(\mu=0,\sigma=\frac{1}{\sqrt{D}})$, i.e. using the standard deviation $\frac{1}{\sqrt{D}}$, so that the product has variance 1.0, provided the dimensions of X and W are both D and X follows a standard normal distribution.
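
A quick NumPy sketch of this claim (my own check, assuming X has unit variance and W uses the Xavier-style standard deviation $\frac{1}{\sqrt{D}}$):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768  # hidden dimension, as in BERT-base

# X follows a standard normal distribution; W is sampled with std 1/sqrt(D)
X = rng.normal(loc=0.0, scale=1.0, size=(10_000, D))
W = rng.normal(loc=0.0, scale=1.0 / np.sqrt(D), size=(D, D))

# Each output element is a sum of D terms, each with variance 1/D,
# so the product's variance comes out close to 1.0.
Y = X @ W.T
print(Y.var())            # ~1.0
print(1.0 / np.sqrt(D))   # ~0.036, the Xavier-style standard deviation
```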

The hidden dimension D of the Transformer-based BERT is 768, so $\sigma$ is expected to be $\frac{1}{\sqrt{768}} \approx 0.036$. But BertConfig says it uses 0.02. Where does the 0.02 come from?

initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
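
For a concrete comparison, a quick check (assuming the Hugging Face transformers package is available):

```python
import math
from transformers import BertConfig

# Xavier-style standard deviation for D = 768
print(1 / math.sqrt(768))              # ~0.0361

# Default standard deviation used by BERT's truncated normal initializer
print(BertConfig().initializer_range)  # 0.02
```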

mon

1 Answer


According to the article Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention [Zhang et al., 2019], the Transformer architecture suffers from poor convergence due to vanishing gradients caused by the interaction between residual connections and layer normalization, and adopting a much smaller standard deviation (0.02) for initialization helps mitigate this issue. From section 5 of that article (emphasis mine):

The gradient norm is preserved better through the self-attention layer than the encoder-decoder attention, which offers insights on the successful training of the deep Transformer in BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), where encoder-decoder attention is not involved. However, results in Table 1 also suggests that the self-attention sublayer in the encoder is not strong enough to counteract the gradient loss in the feedforward sublayer. **That is why BERT and GPT adopt a much smaller standard deviation (0.02) for initialization**, [...].
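
In practice this means every weight matrix is drawn from a normal (or truncated normal) distribution with the fixed standard deviation 0.02 rather than the Xavier-style $\frac{1}{\sqrt{D}}$. A minimal PyTorch sketch of that initialization pattern (a simplified stand-in, not the exact Hugging Face code):

```python
import torch.nn as nn

def init_bert_weights(module, initializer_range=0.02):
    """Initialize weights the BERT way: a small fixed std instead of 1/sqrt(D)."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        # BERT uses a (truncated) normal with std = initializer_range = 0.02
        module.weight.data.normal_(mean=0.0, std=initializer_range)
    if isinstance(module, nn.Linear) and module.bias is not None:
        module.bias.data.zero_()

# usage: model.apply(init_bert_weights) on any nn.Module
```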

P.S.: this answer is the same as my answer to the cross-posted question on Data Science SE.

noe