Concerning the transformer model, a mask is used to replace the attention scores at padded positions with a large negative value (e.g. -1e9) before the softmax, so that those positions receive (almost) zero weight in the subsequent matrix multiplication with the value tensor. Regarding the masking, I have 3 short questions and would appreciate it if you could clarify those (a short sketch of what I mean is below the list):
- Are the attention scores the only place (besides the loss) where masks are needed, or should the input be masked out as well? I am asking because I see implementations where the linear layers for the query, key and values are created with `bias=False`.
- Is the reason for setting `bias=False` to have zeros preserved in the output of those layers, or is there a different explanation?
- Should a `padding_idx` be used when learning word embeddings in order to zero out the padded tokens?
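Here is a minimal sketch of the masking step I am referring to (in PyTorch, loosely following the Annotated Transformer; the function name, the shapes and the -1e9 constant are just my assumptions about a typical implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    # mask: broadcastable to the score shape; 1/True for real tokens, 0/False for padding
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Masked scores are set to a large negative value *before* the softmax,
        # so they end up with ~0 weight in the matmul with the value tensor.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```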
I guess there is then no need for a `padding_idx`, since it has no purpose: the gradients will not flow back to the embedding weights for the padded tokens anyway, due to the `key_padding_mask` and `attn_mask`, right? – beginneR Dec 18 '19 at 20:18
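A small toy check of that claim (assuming PyTorch's `nn.MultiheadAttention` and `nn.Embedding`; the vocabulary size, dimensions and token ids below are made up, and the loss is assumed to exclude the padded positions):

```python
import torch
import torch.nn as nn

vocab_size, d_model, pad_idx = 100, 16, 0
emb = nn.Embedding(vocab_size, d_model)            # deliberately no padding_idx
attn = nn.MultiheadAttention(d_model, num_heads=4, bias=False)

tokens = torch.tensor([[5, 7, 9, 2],
                       [5, 7, pad_idx, pad_idx]])  # second sequence is padded
key_padding_mask = tokens == pad_idx               # (batch, seq), True = padded

x = emb(tokens).transpose(0, 1)                    # (seq, batch, d_model)
out, _ = attn(x, x, x, key_padding_mask=key_padding_mask)
out = out.transpose(0, 1)                          # (batch, seq, d_model)

# "mask the loss": only the non-padded positions contribute
loss = out[~key_padding_mask].sum()
loss.backward()

# The embedding row for the pad token receives no gradient, even without
# padding_idx, because masked keys/values get zero attention weight and the
# padded query positions are excluded from the loss.
print(emb.weight.grad[pad_idx].abs().max())        # tensor(0.)
```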