
As this question says:

In scaled dot-product attention, we scale the dot products by dividing them by the square root of the key dimension $d_k$:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The reason given is that this constrains the distribution of the dot products (the attention scores) to have a standard deviation of 1.

My question is: why don't we do the same after multiplying by $V$ (the values), for the same reason?
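For context, here is a minimal NumPy sketch (not from any particular implementation) of the scaling step, assuming unit-variance queries and keys and an arbitrary $d_k = 64$:

```python
# Illustrative sketch: why the scores are divided by sqrt(d_k) before softmax.
import numpy as np

rng = np.random.default_rng(0)
d_k = 64                                  # key/query dimension (assumed)
Q = rng.standard_normal((128, d_k))       # unit-variance queries
K = rng.standard_normal((128, d_k))       # unit-variance keys

scores = Q @ K.T                          # raw dot products
print(scores.std())                       # ~ sqrt(d_k) = 8: grows with d_k
print((scores / np.sqrt(d_k)).std())      # ~ 1: scaling restores unit std
```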

Peyman

1 Answer


Because what attention does is control how much of the information in $V$ to use, based on weights computed from the similarity between $Q$ and $K$.

When we multiply the attention weights by $V$, we are taking a weighted sum of the vectors in $V$ to get a new matrix that better represents $Q$ contextually within $V$. There is no need for variance normalization, since this is the resulting representation of the attention layer itself, rather than something we use as a probability distribution. This is in contrast to the product $QK^T$, to which we then apply a softmax to obtain pseudo-probabilities over the keys.
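As a rough illustration of this point, the sketch below (plain NumPy, with arbitrary shapes chosen for the example) shows that the softmax already normalizes each row of weights to sum to 1, so multiplying by $V$ is just a convex combination of the value vectors and needs no further scaling:

```python
# Illustrative sketch: the softmax output already sums to 1 per query,
# so weights @ V is a weighted average of value vectors.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 32                       # assumed toy dimensions
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

weights = softmax(Q @ K.T / np.sqrt(d_k))     # pseudo-probabilities
print(weights.sum(axis=1))                    # [1. 1. 1. 1. 1.]

output = weights @ V                          # weighted sum of rows of V
# Each output row lies in the convex hull of the rows of V, so its scale is
# already bounded by the scale of V -- no extra normalization is applied.
print(output.shape)                           # (5, 32)
```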