Transformers use plain matrices to transform input embeddings into queries, keys, and values, which is halfway to a fully connected dense layer (just add a bias and an activation). So why don't transformers use dense layers for encoding the input into Query, Key, and Value?
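To make the distinction concrete, here is a minimal PyTorch sketch (sizes and variable names are illustrative, not taken from the post): the standard Q/K/V projections are bias-free linear maps, while a "dense layer" in the question's sense would add a bias and a nonlinearity.

```python
import torch
import torch.nn as nn

d_model, d_head = 512, 64          # illustrative dimensions
x = torch.randn(10, d_model)       # a sequence of 10 token embeddings

# Q/K/V projections as plain matrix multiplications: no bias, no activation.
W_q = nn.Linear(d_model, d_head, bias=False)
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)
q, k, v = W_q(x), W_k(x), W_v(x)

# What the question calls a dense layer: the same projection plus a bias and an activation.
dense_q = nn.Sequential(nn.Linear(d_model, d_head, bias=True), nn.ReLU())
q_dense = dense_q(x)
```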
The following is about the same question and provides a suitable answer: https://ai.stackexchange.com/questions/40252/why-are-biases-typically-not-used-in-attention-mechanism/40256#40256 – Robin van Hoorn Mar 13 '24 at 14:11
1 Answer
I suppose the main reason is that transformers were proposed as a more parallelizable alternative to RNNs, one that is quicker to compute. Keeping the projections as plain matrix multiplications serves exactly that goal, and the lack of intermediate activations is compensated for by the sheer number of attention heads, each focusing on different parts of the input sequence.
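As a rough sketch of the "just big matrix multiplications" point (shapes are illustrative): scaled dot-product attention reduces to two batched matrix products and a softmax, all of which can be computed for every position in the sequence at once.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # One batched matrix product scores every query against every key simultaneously,
    # so the whole sequence is processed in parallel, unlike an RNN's step-by-step loop.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Illustrative shapes: batch of 2 sequences, 10 tokens each, head dimension 64.
q = torch.randn(2, 10, 64)
k = torch.randn(2, 10, 64)
v = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(q, k, v)   # shape (2, 10, 64)
```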
ImotVoksim