
As this question says:

In scaled dot-product attention, we scale the dot products by dividing them by the square root of the key dimension $d_k$:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The reason given is that this constrains the distribution of the dot products (the attention scores) to have a standard deviation of 1.

My question is: why don't we do the same after multiplying by $V$ (the values), for the same reason?
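For context, here is a minimal NumPy sketch (not from any particular implementation) of the scaling step, assuming unit-variance queries and keys and an arbitrary $d_k = 64$:

```python
# Illustrative sketch: why the scores are divided by sqrt(d_k) before softmax.
import numpy as np

rng = np.random.default_rng(0)
d_k = 64                                  # key/query dimension (assumed)
Q = rng.standard_normal((128, d_k))       # unit-variance queries
K = rng.standard_normal((128, d_k))       # unit-variance keys

scores = Q @ K.T                          # raw dot products
print(scores.std())                       # ~ sqrt(d_k) = 8: grows with d_k
print((scores / np.sqrt(d_k)).std())      # ~ 1: scaling restores unit std
```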

Peyman

1 Answer


Because what attention does is control how much of the information in $V$ to use, based on weights computed from the similarity between $Q$ and $K$.

When we multiply the attention weights by $V$, we are taking a weighted sum of the vectors in $V$ to get a new matrix that better represents $Q$ contextually within $V$. There is no need for variance normalization, since this is the resulting representation of the attention layer itself, rather than something we use as a probability distribution. This is in contrast to the product $QK^T$, to which we then apply a softmax to obtain pseudo-probabilities over the keys.
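As a rough illustration of this point, the sketch below (plain NumPy, with arbitrary shapes chosen for the example) shows that the softmax already normalizes each row of weights to sum to 1, so multiplying by $V$ is just a convex combination of the value vectors and needs no further scaling:

```python
# Illustrative sketch: the softmax output already sums to 1 per query,
# so weights @ V is a weighted average of value vectors.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 32                       # assumed toy dimensions
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

weights = softmax(Q @ K.T / np.sqrt(d_k))     # pseudo-probabilities
print(weights.sum(axis=1))                    # [1. 1. 1. 1. 1.]

output = weights @ V                          # weighted sum of rows of V
# Each output row lies in the convex hull of the rows of V, so its scale is
# already bounded by the scale of V -- no extra normalization is applied.
print(output.shape)                           # (5, 32)
```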