
As described in "Attention Is All You Need" and related work, the positional encoding is added to the embedded word vector at the input. My knee-jerk reaction is that this would muddle the "signal" of the word vector: because the sum does not preserve the original word vector, this additive noise could, for instance, make one word look identical to a different word at a different position: $w_a + p_a = x = w_b + p_b$.
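To make the collision worry concrete, here is a minimal sketch (NumPy; randomly drawn vectors stand in for trained word embeddings, so it is purely illustrative) that builds the sinusoidal encoding from the paper and measures how close any other word at any other position comes to reproducing $w_a + p_a$. An exact collision would require $w_b - w_a = p_a - p_b$ to hold in every one of the $d_{model}$ dimensions:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(max_len)[:, None]                               # (max_len, 1)
    freqs = np.power(10000.0, -np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

rng = np.random.default_rng(0)
d_model, vocab, max_len = 128, 200, 64
W = rng.normal(size=(vocab, d_model))   # toy "word embeddings" (assumption, not trained)
P = sinusoidal_pe(max_len, d_model)

# x = w_a + p_a for one word at one position.
x = W[0] + P[5]

# A collision would need some *other* word b at some position to give w_b + p_b == x.
# Report how close the nearest impostor actually gets.
dists = np.linalg.norm((W[1:, None, :] + P[None, :, :]) - x, axis=-1)
print("typical embedding norm :", np.linalg.norm(W, axis=-1).mean())
print("nearest impostor dist  :", dists.min())
```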

Is performance (keeping the input dimension smaller) the main reason to add the positional information rather than concatenating it, or is adding also theoretically sound?
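For context on what "concatenating" would entail, here is a minimal sketch of the two wirings (toy NumPy arrays; the projection matrix `W_proj` is a hypothetical addition, not something from the paper). Adding keeps the width at $d_{model}$, so residual connections line up unchanged, while concatenating doubles the width and needs a projection (or wider layers everywhere) to restore it:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 128, 10
tok = rng.normal(size=(seq_len, d_model))   # token embeddings (toy)
pos = rng.normal(size=(seq_len, d_model))   # positional encodings (toy)

# Option A: add. Width stays d_model, so residual connections need no changes.
x_add = tok + pos                            # (seq_len, d_model)

# Option B: concatenate. Width becomes 2*d_model, so either every downstream
# weight matrix grows or a learned projection maps back down to d_model.
x_cat = np.concatenate([tok, pos], axis=-1)  # (seq_len, 2*d_model)
W_proj = rng.normal(size=(2 * d_model, d_model)) / np.sqrt(2 * d_model)  # hypothetical
x_cat_proj = x_cat @ W_proj                  # (seq_len, d_model)

print(x_add.shape, x_cat.shape, x_cat_proj.shape)
```

Note that `x_cat @ W_proj` splits into `tok @ W_proj[:d_model] + pos @ W_proj[d_model:]`, i.e. concatenation followed by a linear layer is itself just a sum of two linearly transformed inputs, which is one way to see why plain addition is not as lossy as it first looks.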

Edit: I think one source of confusion in all of this is whether the model adopts a pre-trained word embedder, such as Word2Vec, or trains its embedder from scratch. Wouldn't adding the positional encoding to a pre-trained embedder throw off the attention dot product?
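Whether the dot product actually gets "thrown off" can be probed directly. A minimal sketch, assuming unit-norm random vectors as a stand-in for frozen Word2Vec embeddings and random (untrained) query/key projections, both of which are assumptions; it just reports how much the attention logits move when the sinusoidal encoding is added, without claiming that a trained model would behave the same way:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Same encoding as in the first sketch above.
    pos = np.arange(max_len)[:, None]
    freqs = np.power(10000.0, -np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

rng = np.random.default_rng(0)
d_model, seq_len = 128, 16

# Stand-in for frozen pre-trained vectors (e.g. Word2Vec), normalised to unit length.
E = rng.normal(size=(seq_len, d_model))
E /= np.linalg.norm(E, axis=-1, keepdims=True)
P = sinusoidal_pe(seq_len, d_model)

# Toy (random) query/key projections; in a real Transformer these are learned
# jointly with everything else, which is the usual counter-argument here.
Wq = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
Wk = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

def attn_logits(X):
    return (X @ Wq) @ (X @ Wk).T / np.sqrt(d_model)

print("mean |logit| without PE:", np.abs(attn_logits(E)).mean())
print("mean |change| with PE  :", np.abs(attn_logits(E + P) - attn_logits(E)).mean())
```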

SuaveSouris
  • You may want to do some calculations to understand just how little the positional embeddings move the word embeddings to see if that helps you intuitively. – David Hoelzer Jan 04 '24 at 14:54
  • Range from PE would be [-1,1] which I imagine is also the word embed range, no? – SuaveSouris Jan 04 '24 at 15:02
  • I'd say elegance is the main reason, in conjunction with the evidence that the addition works well: if you concatenated, you'd need another dimensionality-reducing transformation to make the subsequent residual connections work. Also check this related question; I like the orthogonality argument in the first answer, and the second answer is my own attempt at an explanation ;) – Chillston Jan 04 '24 at 19:27
  • How is this question different from "Why are embeddings added, not concatenated?" that @Chillston linked above? – Eponymous Jan 06 '24 at 20:28
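Following up on the calculation suggested in the comments above: the sinusoidal encoding has a fixed norm of $\sqrt{d_{model}/2}$ (each sine/cosine pair contributes $\sin^2 + \cos^2 = 1$), so how much it "moves" a word embedding depends entirely on the embedding's scale. The paper multiplies the embedding weights by $\sqrt{d_{model}}$ before the encoding is added, and the sketch below (toy Gaussian embedding row, an assumption) compares the two regimes without asserting what a trained model's embedding scale would actually be:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Same encoding as in the sketches above.
    pos = np.arange(max_len)[:, None]
    freqs = np.power(10000.0, -np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

rng = np.random.default_rng(0)
d_model = 512
w = rng.normal(scale=1.0 / np.sqrt(d_model), size=d_model)  # toy embedding row (assumption)
p = sinusoidal_pe(100, d_model)[50]                          # encoding at position 50

for scale, label in [(1.0, "raw embedding"),
                     (np.sqrt(d_model), "embedding * sqrt(d_model) as in the paper")]:
    ws = scale * w
    cos = ws @ (ws + p) / (np.linalg.norm(ws) * np.linalg.norm(ws + p))
    print(f"{label}: |w|={np.linalg.norm(ws):5.1f}  |p|={np.linalg.norm(p):5.1f}  "
          f"cos(w, w+p)={cos:.3f}")
```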

0 Answers