80

Both batch norm and layer norm are common normalization techniques for neural network training.

I am wondering why transformers primarily use layer norm.

  • 2
    Doesn't batch norm also cause look ahead issues? – Fine-Tuning Sep 22 '23 at 04:36
  • LayerNorm in Transformer applies standard normalization just on the last dimension of inputs, mean = x.mean(-1, keepdim=True), std = x.std(-1, keepdim=True), which operates on the embedding feature of one single token, see class LayerNorm definition at Annotated Transformer. Note that a causal mask is applied before LayerNorm. – Kuo Mar 19 '24 at 06:24
  • Caveat: the answer is not right about the details of the LayerNorm computation in Transformers, though the general explanation is OK. It also misses the causality problem that other kinds of normalization would entail. – Kuo Mar 19 '24 at 06:35

4 Answers

58

It seems that it has been the standard to use batchnorm in CV tasks and layernorm in NLP tasks. The original Attention Is All You Need paper tested only NLP tasks, and thus used layernorm. Even with the rise of transformers in CV applications, layernorm is still the standard choice, so I'm not completely certain about the pros and cons of each. But I do have some personal intuitions -- which I'll admit aren't grounded in theory, but which I'll nevertheless try to elaborate on in the following.

Recall that in batchnorm, the mean and variance statistics used for normalization are calculated across all elements of all instances in a batch, for each feature independently. By "element" and "instance," I mean "word" and "sentence" respectively for an NLP task, and "pixel" and "image" for a CV task. On the other hand, for layernorm, the statistics are calculated across the feature dimension, for each element and instance independently (source). In transformers, it is calculated across all features and all elements, for each instance independently. This illustration from this recent article conveys the difference between batchnorm and layernorm:

[Figure: cube diagram contrasting which dimensions batch norm and layer norm compute their normalization statistics over (batch, sequence, and feature axes); from the PowerNorm paper]

(in the case of transformers, where the normalization stats are calculated across all features and all elements for each instance independently, in the image that would correspond to the left face of the cube being colored blue.)
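
As a rough sketch of the difference (my own illustration, not from the answer), here is how the two sets of statistics can be computed on a (batch, seq_len, features) tensor, following the standard PyTorch convention where nn.LayerNorm normalizes over the last dimension:

    # Sketch only: contrasts which axes batch norm and layer norm pool over.
    # Shapes are hypothetical; eps matches PyTorch's default of 1e-5.
    import torch

    x = torch.randn(32, 128, 512)  # (batch, seq_len, features)

    # Batch norm: one mean/var per feature, pooled over the batch and sequence dims.
    bn_mean = x.mean(dim=(0, 1), keepdim=True)                 # shape (1, 1, 512)
    bn_var = x.var(dim=(0, 1), unbiased=False, keepdim=True)
    x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)

    # Layer norm as in nn.LayerNorm(512): one mean/var per token, pooled over
    # the feature dimension only.
    ln_mean = x.mean(dim=-1, keepdim=True)                     # shape (32, 128, 1)
    ln_var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)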

Now onto the reasons why batchnorm is less suitable for NLP tasks. In NLP tasks, the sentence length often varies -- thus, if using batchnorm, it would be unclear what the appropriate normalization constant (the total number of elements to divide by during normalization) should be. Different batches would have different normalization constants, which leads to instability during the course of training. According to the paper that provided the image linked above, "statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented." (The paper proposes an improvement upon batchnorm for use in transformers, called PowerNorm, which improves performance on NLP tasks compared to either batchnorm or layernorm.)
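
To make that fluctuation concrete, here is a small sketch of mine (assuming padded variable-length batches; not code from the PowerNorm paper) showing that the statistics batch norm would use are estimated from a very different number of tokens from one batch to the next:

    import torch

    def batch_stats(batch, lengths):
        # Per-feature mean/var over the non-padded positions only.
        mask = torch.arange(batch.size(1))[None, :] < lengths[:, None]  # (B, T)
        valid = batch[mask]                                             # (n_tokens, F)
        return valid.mean(0), valid.var(0, unbiased=False)

    b1, b2 = torch.randn(4, 10, 8), torch.randn(4, 10, 8)
    m1, _ = batch_stats(b1, torch.tensor([10, 9, 8, 10]))  # mostly long sentences
    m2, _ = batch_stats(b2, torch.tensor([3, 2, 4, 2]))    # mostly short sentences
    # m1 is estimated from 37 tokens, m2 from only 11, so the effective
    # normalization constant (and hence the statistics) varies across batches.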

Another intuition is that in the past (before Transformers), RNN architectures were the norm. Within recurrent layers, it is again unclear how to compute the normalization statistics. (Should you also consider previous words that passed through the recurrent layer?) Thus it's much more straightforward to normalize each word independently of the others in the same sentence. Of course this reason does not apply to transformers, since computation on a word in a transformer has no time dependency on previous words, and thus you can normalize across the sentence dimension too (in the picture above that would correspond to the entire left face of the cube being colored blue).

It may also be worth checking out instance normalization and group normalization; I'm no expert on either, but apparently each has its merits.

Erica
  • 712
  • 9
    Why is the layer norm different from that in this article? – Lerner Zhang Jul 25 '21 at 14:02
  • 3
    The layer norm as shown here seems to be actually a (transposed?) instance norm. – HappyFace Nov 29 '21 at 12:43
  • 10
    Layernorm in transformers is actually done exactly how it is shown in the diagram, therefore, the statement: "In transformers, it is calculated across all features and all elements, for each instance independently" - is wrong. And the next sentence is wrong as well: "(in the case of transformers, where the normalization stats are calculated across all features and all elements for each instance independently, in the image that would correspond to the left face of the cube being colored blue.)" – MichaelSB Dec 09 '22 at 01:33
  • 4
    In fact, layernorm in transformers is identical to instance normalization. I suspect it's only called "layernorm" because previously that name made sense for RNNs, but in transformers, calling it 'instance norm' would be more appropriate, imo. – MichaelSB Dec 09 '22 at 01:46
  • 2
    I think that the answer is upvoted mainly because of the figure. But TBH it does not explain it correctly, rather introduces more confusion. – hans Mar 05 '23 at 21:57
  • @MichaelSB for CV, the layout of LayerNorm is just a planar layer, if you view H and W combined as one dimension and C as another. – Kuo Mar 18 '24 at 19:05
  • @LernerZhang They are the same. In the Group Norm paper, the Layer Norm figure is expanded into two dimensions for a single training case, which is exactly the operation defined in the original Layer Norm paper. – Kuo Mar 19 '24 at 06:56
31

A lesser-known issue with batch norm is how hard it is to parallelize batch-normalized models. Since there is a dependence between elements of the batch, additional synchronization is needed across devices. While this is not an issue for most vision models, which tend to be trained on a small set of devices, transformers really suffer from this problem, as they rely on large-scale setups to counter their quadratic complexity. In this regard, layer norm provides some degree of normalization while incurring no batch-wise dependence.
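
For illustration (a sketch of mine, not from the answer), PyTorch makes this dependence explicit: getting exact batch statistics across devices requires converting BN layers to SyncBatchNorm, which adds communication per normalization layer, whereas LayerNorm needs nothing extra:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(512, 512),
        nn.BatchNorm1d(512),   # statistics depend on the batch shard on each device
        nn.ReLU(),
    )

    # Under DistributedDataParallel, exact batch statistics require an explicit
    # conversion that synchronizes mean/var across devices (extra communication):
    # model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # LayerNorm statistics are per-sample, so nothing needs to be synchronized,
    # no matter how the batch is sharded across devices.
    layer_norm = nn.LayerNorm(512)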

user67275
  • 1,097
  • It's true that this is a complication and not so nice, but in practice BN is not synchronized, i.e. it is completely data parallel. The BN group size (which equals the per-worker batch size) is treated as a hyperparameter of BN itself and not of distributed training. Cf. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" by Goyal et al. – Sebastian Hoffmann Jan 09 '23 at 11:19
6

BatchNorm was a choice made by early ConvNet designs, which primarily targeted vision. NLP did not follow suit, preferring LayerNorm instead. Whether LayerNorm would also be a better choice for ConvNets and vision is investigated in the 2022 paper [1]. Alongside other changes, it observes that the "ConvNet model does not have any difficulties training with LN; in fact, the performance is slightly better."
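
For reference, a minimal sketch of mine (following the channels-last recipe described in [1], not code from the paper) of how LayerNorm can be applied inside a ConvNet, normalizing over the channel dimension at each spatial location:

    import torch
    import torch.nn as nn

    x = torch.randn(8, 64, 32, 32)   # (N, C, H, W) feature map, hypothetical sizes
    ln = nn.LayerNorm(64)            # normalizes over the channel dimension

    # Permute to channels-last, normalize, and permute back.
    y = ln(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)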

[1]: Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s (arXiv:2201.03545). arXiv.

cwin
  • 61
2

If you want to normalize over a slice of the data that contains all of the features but spans only a single row of the dataframe, with only a small group of such rows dispatched per batch, layer norm is the choice.

For transformers such normalization is efficient, as the relevance (attention) matrix can be created in one go over all the entities.

The first answer explains this very well for both modalities [text and image].