
Transformers commonly use layer normalization, as explained in the post "Why do transformers use layer norm instead of batch norm?".

One of the arguments in that post is that batch normalization is not used in Transformers because sentence length might vary in a given batch.

However, group normalization also works on a single input (doesn't require a batch). Would it be possible to use group normalization instead of layer normalization in a Transformer?

Foobar
  • Original paper on group norm: https://eccv2018.org/openaccess/content_ECCV_2018/papers/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.pdf. A useful-looking article comparing batch/layer/instance/group norms: https://towardsdatascience.com/what-is-group-normalization-45fe27307be7 (not specifically about transformers, though, so not sure if it answers your question) – Darren Cook Jul 15 '22 at 09:40

1 Answer


Transformer data is B x N x D, where B is the batch size, N is the maximum sentence length in the batch, and D is the model (embedding) dimension. There is no equivalent of the channel dimension you get in image data (B x C x W x H).
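To make the shapes concrete, here is a minimal PyTorch sketch (the tensor sizes and number of groups are made up for illustration): LayerNorm in a transformer normalizes each token's D-dimensional vector, whereas GroupNorm expects an explicit channel dimension at position 1, as in image data.

```python
import torch
import torch.nn as nn

# Transformer activations: B x N x D (illustrative sizes)
x_text = torch.randn(8, 128, 512)
# Image activations: B x C x W x H (illustrative sizes)
x_img = torch.randn(8, 64, 32, 32)

# LayerNorm normalizes over the last dimension (D), separately for each token
ln = nn.LayerNorm(512)
print(ln(x_text).shape)  # torch.Size([8, 128, 512])

# GroupNorm expects the channel dimension at position 1 and splits it into groups;
# num_channels (64) must be divisible by num_groups (8)
gn = nn.GroupNorm(num_groups=8, num_channels=64)
print(gn(x_img).shape)   # torch.Size([8, 64, 32, 32])
```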

GroupNorm splits the channel dimension into groups and computes the mean and variance of each group. The PyTorch GroupNorm documentation says:

num_channels must be divisible by num_groups

As num_channels is effectively 1 for a transformer, 1 is also the only possible value for num_groups, which makes GroupNorm pointless here.
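As a sanity check (a hedged sketch, not part of the original answer): if you do treat each token as a single-channel input, GroupNorm in its only legal configuration, num_groups=1 with num_channels=1, collapses to exactly what LayerNorm already does per token. The affine parameters are disabled in both layers for a like-for-like comparison.

```python
import torch
import torch.nn as nn

B, N, D = 8, 128, 512  # illustrative sizes
x = torch.randn(B, N, D)

# LayerNorm: normalize each token's D-dimensional vector
ln = nn.LayerNorm(D, elementwise_affine=False)

# GroupNorm with num_channels=1 forces num_groups=1; reshaping each token into a
# single-channel "signal" of length D, it normalizes the same D values per token
gn = nn.GroupNorm(num_groups=1, num_channels=1, affine=False)

out_ln = ln(x)
out_gn = gn(x.reshape(B * N, 1, D)).reshape(B, N, D)
print(torch.allclose(out_ln, out_gn, atol=1e-5))  # True: identical normalization
```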

Additionally: even if you could, is there any reason to prefer it over LayerNorm? Answers at the link in the question mention that LayerNorm works well with distributed models (https://stats.stackexchange.com/a/511253/5503), and another references https://arxiv.org/abs/2201.03545, which finds that LayerNorm works just as well as BatchNorm in a ConvNet.

Darren Cook