Transformers use plain matrices to transform input embeddings into queries, keys, and values, which is halfway to a fully connected dense layer (just add a bias and an activation). So why don't transformers use dense layers for encoding the input into Query, Key, and Value?
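To make the distinction concrete, here is a minimal PyTorch sketch (sizes and variable names are illustrative, not taken from the post): the standard Q/K/V projections are bias-free linear maps, while a "dense layer" in the question's sense would add a bias and a nonlinearity.

```python
import torch
import torch.nn as nn

d_model, d_head = 512, 64          # illustrative dimensions
x = torch.randn(10, d_model)       # a sequence of 10 token embeddings

# Q/K/V projections as plain matrix multiplications: no bias, no activation.
W_q = nn.Linear(d_model, d_head, bias=False)
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)
q, k, v = W_q(x), W_k(x), W_v(x)

# What the question calls a dense layer: the same projection plus a bias and an activation.
dense_q = nn.Sequential(nn.Linear(d_model, d_head, bias=True), nn.ReLU())
q_dense = dense_q(x)
```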
The following is about the same question and provides a suitable answer: https://ai.stackexchange.com/questions/40252/why-are-biases-typically-not-used-in-attention-mechanism/40256#40256 – Robin van Hoorn Mar 13 '24 at 14:11
1 Answer
I suppose the main reason is that transformers were proposed as a more parallelizable alternative to RNNs, one that is quicker to compute. Keeping the projections as plain matrix multiplications serves exactly that goal, and the lack of intermediate activations is compensated for by the sheer number of attention heads, each focusing on different parts of the input sequence.
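As a rough sketch of the "just big matrix multiplications" point (shapes are illustrative): scaled dot-product attention reduces to two batched matrix products and a softmax, all of which can be computed for every position in the sequence at once.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # One batched matrix product scores every query against every key simultaneously,
    # so the whole sequence is processed in parallel, unlike an RNN's step-by-step loop.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Illustrative shapes: batch of 2 sequences, 10 tokens each, head dimension 64.
q = torch.randn(2, 10, 64)
k = torch.randn(2, 10, 64)
v = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(q, k, v)   # shape (2, 10, 64)
```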
ImotVoksim