
As described in "Attention Is All You Need" and related work, the positional encoding is added to the embedded word vector at the input. My knee-jerk reaction is that this would muddle the "signal" of the word vector: because the sum does not preserve the original word vector, this additive noise could, for instance, make one word look identical to a different word at a different position: $w_a + p_a = x = w_b + p_b$.
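To make the collision worry concrete, here is a minimal sketch (NumPy; randomly drawn vectors stand in for trained word embeddings, so it is purely illustrative) that builds the sinusoidal encoding from the paper and measures how close any other word at any other position comes to reproducing $w_a + p_a$. An exact collision would require $w_b - w_a = p_a - p_b$ to hold in every one of the $d_{model}$ dimensions:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(max_len)[:, None]                               # (max_len, 1)
    freqs = np.power(10000.0, -np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

rng = np.random.default_rng(0)
d_model, vocab, max_len = 128, 200, 64
W = rng.normal(size=(vocab, d_model))   # toy "word embeddings" (assumption, not trained)
P = sinusoidal_pe(max_len, d_model)

# x = w_a + p_a for one word at one position.
x = W[0] + P[5]

# A collision would need some *other* word b at some position to give w_b + p_b == x.
# Report how close the nearest impostor actually gets.
dists = np.linalg.norm((W[1:, None, :] + P[None, :, :]) - x, axis=-1)
print("typical embedding norm :", np.linalg.norm(W, axis=-1).mean())
print("nearest impostor dist  :", dists.min())
```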

Is performance (keeping the input dimension smaller) the main reason to add the positional information rather than concatenating it, or is adding also theoretically sound?
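For context on what "concatenating" would entail, here is a minimal sketch of the two wirings (toy NumPy arrays; the projection matrix `W_proj` is a hypothetical addition, not something from the paper). Adding keeps the width at $d_{model}$, so residual connections line up unchanged, while concatenating doubles the width and needs a projection (or wider layers everywhere) to restore it:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 128, 10
tok = rng.normal(size=(seq_len, d_model))   # token embeddings (toy)
pos = rng.normal(size=(seq_len, d_model))   # positional encodings (toy)

# Option A: add. Width stays d_model, so residual connections need no changes.
x_add = tok + pos                            # (seq_len, d_model)

# Option B: concatenate. Width becomes 2*d_model, so either every downstream
# weight matrix grows or a learned projection maps back down to d_model.
x_cat = np.concatenate([tok, pos], axis=-1)  # (seq_len, 2*d_model)
W_proj = rng.normal(size=(2 * d_model, d_model)) / np.sqrt(2 * d_model)  # hypothetical
x_cat_proj = x_cat @ W_proj                  # (seq_len, d_model)

print(x_add.shape, x_cat.shape, x_cat_proj.shape)
```

Note that `x_cat @ W_proj` splits into `tok @ W_proj[:d_model] + pos @ W_proj[d_model:]`, i.e. concatenation followed by a linear layer is itself just a sum of two linearly transformed inputs, which is one way to see why plain addition is not as lossy as it first looks.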

Edit: I think one source of confusion in all of this is whether the model adopts a pre-trained word embedder, such as Word2Vec, or trains its embedder from scratch. Wouldn't adding the positional encoding to a pre-trained embedder throw off the attention dot product?
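Whether the dot product actually gets "thrown off" can be probed directly. A minimal sketch, assuming unit-norm random vectors as a stand-in for frozen Word2Vec embeddings and random (untrained) query/key projections, both of which are assumptions; it just reports how much the attention logits move when the sinusoidal encoding is added, without claiming that a trained model would behave the same way:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Same encoding as in the first sketch above.
    pos = np.arange(max_len)[:, None]
    freqs = np.power(10000.0, -np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

rng = np.random.default_rng(0)
d_model, seq_len = 128, 16

# Stand-in for frozen pre-trained vectors (e.g. Word2Vec), normalised to unit length.
E = rng.normal(size=(seq_len, d_model))
E /= np.linalg.norm(E, axis=-1, keepdims=True)
P = sinusoidal_pe(seq_len, d_model)

# Toy (random) query/key projections; in a real Transformer these are learned
# jointly with everything else, which is the usual counter-argument here.
Wq = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
Wk = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

def attn_logits(X):
    return (X @ Wq) @ (X @ Wk).T / np.sqrt(d_model)

print("mean |logit| without PE:", np.abs(attn_logits(E)).mean())
print("mean |change| with PE  :", np.abs(attn_logits(E + P) - attn_logits(E)).mean())
```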

SuaveSouris
  • You may want to do some calculations to understand just how little the positional embeddings move the word embeddings to see if that helps you intuitively. – David Hoelzer Jan 04 '24 at 14:54
  • Range from PE would be [-1,1] which I imagine is also the word embed range, no? – SuaveSouris Jan 04 '24 at 15:02
  • I'd say elegance is the main reason, in conjunction with the evidence that the addition works well: if you concatenated, you'd need another dimensionality-reducing transformation to make the subsequent residual connections work. Also check this related question; I like the orthogonality argument in the first answer, and the second answer is my own attempt at an explanation ;) – Chillston Jan 04 '24 at 19:27
  • How is this question different from "Why are embeddings added, not concatenated?" that @Chillston linked above? – Eponymous Jan 06 '24 at 20:28
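Following up on the calculation suggested in the comments above: the sinusoidal encoding has a fixed norm of $\sqrt{d_{model}/2}$ (each sine/cosine pair contributes $\sin^2 + \cos^2 = 1$), so how much it "moves" a word embedding depends entirely on the embedding's scale. The paper multiplies the embedding weights by $\sqrt{d_{model}}$ before the encoding is added, and the sketch below (toy Gaussian embedding row, an assumption) compares the two regimes without asserting what a trained model's embedding scale would actually be:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Same encoding as in the sketches above.
    pos = np.arange(max_len)[:, None]
    freqs = np.power(10000.0, -np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

rng = np.random.default_rng(0)
d_model = 512
w = rng.normal(scale=1.0 / np.sqrt(d_model), size=d_model)  # toy embedding row (assumption)
p = sinusoidal_pe(100, d_model)[50]                          # encoding at position 50

for scale, label in [(1.0, "raw embedding"),
                     (np.sqrt(d_model), "embedding * sqrt(d_model) as in the paper")]:
    ws = scale * w
    cos = ws @ (ws + p) / (np.linalg.norm(ws) * np.linalg.norm(ws + p))
    print(f"{label}: |w|={np.linalg.norm(ws):5.1f}  |p|={np.linalg.norm(p):5.1f}  "
          f"cos(w, w+p)={cos:.3f}")
```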

0 Answers