The paper *Attention Is All You Need* describes the transformer architecture, which consists of an encoder and a decoder.
However, it wasn't clear to me which cost function is minimized when training such an architecture.
Consider a translation task, for example, where, given an English sentence $x_{english} = [x_0, x_1, x_2, \dots, x_m]$, the transformer decodes it into a French sentence $x_{french}' = [x_0', x_1', \dots, x_n']$. Let's say the true label is $y_{french} = [y_0, y_1, \dots, y_p]$.
What is the objective function of the transformer? Is it the MSE between $x_{french}'$ and $y_{french}$? And does it include any weight regularization terms?
About the `out_logits` variable there, in the line that you're linking us to, i.e. `tf.reshape(out_logits, [-1, VOCAB_SIZE])`, the cross-entropy seems to be computed between two vectors and not a matrix and a vector (but I didn't really execute that code and print the shapes of those tensors/variables). It may be a good idea to do that, just to confirm the exact shapes. – nbro Dec 10 '20 at 10:21
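To make the shape question concrete, here is a minimal sketch of that kind of shape check. The names `out_logits` and `VOCAB_SIZE` mirror the linked code, but the batch size, sequence length, and vocabulary size are made up purely for illustration, and the loss call is just TensorFlow's standard sparse cross-entropy, not necessarily the exact line from the linked implementation.

```python
import tensorflow as tf

# Illustrative shapes only (assumptions, not taken from the linked code).
BATCH_SIZE, SEQ_LEN, VOCAB_SIZE = 2, 5, 100

# Stand-in for the decoder output: one logit vector over the vocabulary
# per target position.
out_logits = tf.random.normal([BATCH_SIZE, SEQ_LEN, VOCAB_SIZE])
# Stand-in for the ground-truth target token ids.
targets = tf.random.uniform([BATCH_SIZE, SEQ_LEN], maxval=VOCAB_SIZE, dtype=tf.int32)

# The reshape flattens batch and time into one axis.
flat_logits = tf.reshape(out_logits, [-1, VOCAB_SIZE])   # (BATCH_SIZE * SEQ_LEN, VOCAB_SIZE)
flat_targets = tf.reshape(targets, [-1])                  # (BATCH_SIZE * SEQ_LEN,)
print(flat_logits.shape, flat_targets.shape)              # (10, 100) (10,)

# Cross-entropy per position: each row of flat_logits is compared against
# one integer class id, then averaged into a scalar training loss.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=flat_targets, logits=flat_logits)
print(loss.shape)                                          # (10,)
print(tf.reduce_mean(loss))                                # scalar
```

Running this shows that, after the reshape, the logits form a matrix of shape `(BATCH_SIZE * SEQ_LEN, VOCAB_SIZE)` while the targets flatten to a vector, with one cross-entropy value per target position.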