Transformers commonly use layer normalization, as explained here: "Why do transformers use layer norm instead of batch norm?"
One of the arguments in that post is that batch normalization is not used in Transformers because sentence length might vary in a given batch.
However, group normalization also works on a single input (it doesn't require a batch). Would it be possible to use group normalization instead of layer normalization in a Transformer? A small sketch of what I have in mind is below.
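For concreteness, here is a minimal PyTorch sketch of what I mean by "using group normalization in a Transformer": applying `nn.GroupNorm` per token over groups of the feature dimension, in the same place where a Transformer block would apply `nn.LayerNorm`. The shapes and the choice of `num_groups = 8` are just illustrative assumptions on my part, not taken from any paper.

```python
import torch
import torch.nn as nn

# Illustrative shapes; num_groups = 8 is an arbitrary assumption.
batch, seq_len, d_model, num_groups = 2, 5, 512, 8

x = torch.randn(batch, seq_len, d_model)

# Standard Transformer choice: LayerNorm normalizes each token's
# d_model features, independent of the batch and of other tokens.
layer_norm = nn.LayerNorm(d_model)
y_ln = layer_norm(x)  # shape (batch, seq_len, d_model)

# GroupNorm also needs no batch statistics. nn.GroupNorm expects
# input of shape (N, C, *), so flatten (batch, seq_len) into N to
# normalize each token over groups of d_model // num_groups features.
group_norm = nn.GroupNorm(num_groups, d_model)
y_gn = group_norm(x.reshape(-1, d_model)).reshape(batch, seq_len, d_model)

print(y_ln.shape, y_gn.shape)  # both torch.Size([2, 5, 512])
```

As far as I can tell, with `num_groups = 1` this per-token GroupNorm reduces to LayerNorm over `d_model` (both have per-feature affine parameters), so layer norm looks like a special case of what I'm asking about.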