Concerning the transformer model, a mask is used to replace the attention scores at padded positions with a large negative value (e.g. -1e9) before the softmax, so that those positions receive (almost) zero weight in the subsequent matrix multiplication with the value tensor. Regarding the masking, I have 3 short questions and would appreciate it if you could clarify those (a short sketch of what I mean is below the list):
- Are the attention scores the only place (besides the loss) where masks are needed, or should the input be masked out as well? I am asking because I see implementations where the linear layers for the query, key and values are created with `bias=False`.
- Is the reason for setting `bias=False` to have zeros preserved in the output of those layers, or is there a different explanation?
- Should a `padding_idx` be used when learning word embeddings in order to zero out the padded tokens?
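Here is a minimal sketch of the masking step I am referring to (in PyTorch, loosely following the Annotated Transformer; the function name, the shapes and the -1e9 constant are just my assumptions about a typical implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    # mask: broadcastable to the score shape; 1/True for real tokens, 0/False for padding
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Masked scores are set to a large negative value *before* the softmax,
        # so they end up with ~0 weight in the matmul with the value tensor.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```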
I guess there is then no need for a `padding_idx`, since it has no purpose: the gradients will not flow back to the embedding weights for the padded tokens anyway, due to the `key_padding_mask` and `attn_mask`, right? – beginneR Dec 18 '19 at 20:18
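A small toy check of that claim (assuming PyTorch's `nn.MultiheadAttention` and `nn.Embedding`; the vocabulary size, dimensions and token ids below are made up, and the loss is assumed to exclude the padded positions):

```python
import torch
import torch.nn as nn

vocab_size, d_model, pad_idx = 100, 16, 0
emb = nn.Embedding(vocab_size, d_model)            # deliberately no padding_idx
attn = nn.MultiheadAttention(d_model, num_heads=4, bias=False)

tokens = torch.tensor([[5, 7, 9, 2],
                       [5, 7, pad_idx, pad_idx]])  # second sequence is padded
key_padding_mask = tokens == pad_idx               # (batch, seq), True = padded

x = emb(tokens).transpose(0, 1)                    # (seq, batch, d_model)
out, _ = attn(x, x, x, key_padding_mask=key_padding_mask)
out = out.transpose(0, 1)                          # (batch, seq, d_model)

# "mask the loss": only the non-padded positions contribute
loss = out[~key_padding_mask].sum()
loss.backward()

# The embedding row for the pad token receives no gradient, even without
# padding_idx, because masked keys/values get zero attention weight and the
# padded query positions are excluded from the loss.
print(emb.weight.grad[pad_idx].abs().max())        # tensor(0.)
```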