
As I understand it, GPT-2 and BERT use Byte-Pair Encoding, which is a subword encoding. Since special start/end tokens such as <|startoftext|> and <|endoftext|> are used, I would imagine the encoder should encode each of them as one single piece.

However, when I use the pytorch BertTokenizer, it seems the encoder also splits these tokens into pieces. Is this the correct behaviour?

from pytorch_pretrained_bert import BertTokenizer, cached_path
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False) 
tokenizer.tokenize('<s> This is a sentence <|endoftext|>')

The results are:

['<',
 's',
 '>',
 'This',
 'is',
 'a',
 'sentence',
 '<',
 '|',
 'end',
 '##oft',
 '##ex',
 '##t',
 '|',
 '>']
Kevin Ling

1 Answer


BERT was not trained with these kinds of special tokens, so the tokenizer does not expect them and therefore splits them like any other piece of normal text; keeping them will probably harm the obtained representations. You should remove these special tokens from the input text.
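
For illustration, once those tokens are removed, the same tokenizer produces clean WordPiece output. This is a minimal sketch using the same pytorch_pretrained_bert setup as in the question; the expected output assumes the standard bert-base-cased vocabulary:

from pytorch_pretrained_bert import BertTokenizer

# Same tokenizer as in the question
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

# Plain text only, with the unsupported <s> / <|endoftext|> markers stripped
print(tokenizer.tokenize('This is a sentence'))
# expected: ['This', 'is', 'a', 'sentence']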

In the case of GPT-2, OpenAI trained it only with <|endoftext|>, and this token has to be added after tokenization; some people mistakenly add it to the raw text before tokenization, where the tokenizer splits it apart, leading to problems. <|startoftext|> is specific to the library gpt-2-simple.
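
If you do need <|endoftext|> for GPT-2, a minimal sketch of adding it after tokenization is shown below. It assumes the GPT2Tokenizer shipped with pytorch_pretrained_bert, whose BPE vocabulary (the encoder dict) contains <|endoftext|> as a single entry:

from pytorch_pretrained_bert import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Tokenize the plain text first...
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('This is a sentence'))

# ...then append the id of <|endoftext|>, which exists as a single entry in the
# BPE vocabulary (tokenizing the literal string would split it into pieces).
ids.append(tokenizer.encoder['<|endoftext|>'])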

noe
  • Or we can also extend the vocab to add the new tokens, depending on the task? – Aditya Jan 13 '20 at 16:36
  • Extending the vocabulary of an already trained model is normally not a good idea (apart from being technically challenging due to the differences in tensor sizes). Also, your examples of special tokens don't add anything new, so I see no point in trying hard to keep them. – noe Jan 13 '20 at 22:26
  • From what I read in some technical blogs, they typically add these tokens as sentence separators, so I am confused now. – Kevin Ling Jan 14 '20 at 03:46
  • Or, as another question: other than the normal punctuation, is there any way I can add a special sentence separator? – Kevin Ling Jan 14 '20 at 03:47
  • BERT natively supports receiving 2 sentences separated by the token [SEP], but this is used for the next sentence classification task. – noe Jan 14 '20 at 06:43
  • @KevinLing please link to the posts you are referring to if you want to have further feedback on their approach. – noe Jan 14 '20 at 06:44
  • A bit off topic, ncasas: what are your views on adding auxiliary tokens (like SEP, CLS, etc.) to encode the presence of some relevant feature into BERT? e.g. [is_something_specific_present], etc. – Aditya Jan 14 '20 at 16:24
  • 1
    @Aditya if you are finetuning BERT on data that use those special tokens for such specific purposes, it may work. If you are taking BERT's weights as is and expect that using those tokens in different ways from what BERT was trained on, I would not expect good results. – noe Jan 15 '20 at 10:33
  • @ncasas Here is the post: https://minimaxir.com/2019/09/howto-gpt2/. Also, the library gpt-2-simple uses tokens like <|startoftext|>; any thoughts on that? – Kevin Ling Jan 18 '20 at 11:50
  • I updated the answer with information regarding the blog post you linked. – noe Jan 20 '20 at 09:46