I think the tokenizer vocabulary size in LLMs involves two trade-offs:
- The bigger the tokens, the less frequently each of them appears in the training data (see the sketch after this list).
- The more tokens you have, the more parameters you dedicate to the input embedding and output layers.
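
To make the first trade-off concrete, here is a minimal sketch, assuming the Hugging Face `tokenizers` library and a throwaway random corpus (the corpus and vocabulary sizes are placeholders I chose for illustration; substitute real text for meaningful numbers). It trains BPE tokenizers at a few vocabulary sizes and reports how often the average token type occurs when the same corpus is re-tokenized:

```python
import random
from collections import Counter

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus of random "words" so the vocabulary does not saturate immediately;
# replace with real text for meaningful numbers.
random.seed(0)
corpus = [
    " ".join(
        "".join(random.choices("abcdefghij", k=random.randint(2, 8)))
        for _ in range(20)
    )
    for _ in range(2000)
]

for vocab_size in (100, 300, 1000):
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(corpus, trainer)

    # Count how often each token type occurs when re-tokenizing the corpus.
    counts = Counter()
    for text in corpus:
        counts.update(tokenizer.encode(text).tokens)

    mean_occurrences = sum(counts.values()) / len(counts)
    print(
        f"vocab_size={vocab_size:>5}: {len(counts)} token types used, "
        f"mean occurrences per type = {mean_occurrences:,.1f}"
    )
```

As the vocabulary grows, the same text is split into fewer, longer tokens spread across more types, so the mean count per type drops and the tail tokens are seen less often during training.
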
I'm looking for a chart showing the effect of tokenizer vocabulary size on some quality metric. It could be a chart with a fixed total number of parameters, so that both trade-offs affect quality. Or it could be a chart where the architecture is fixed and the number of parameters grows with vocabulary size, so that only the first trade-off shows.
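
For concreteness, here is a minimal back-of-the-envelope sketch of the parameter accounting behind those two chart setups; the hidden size, the total budget, and the vocabulary sizes are assumptions picked for illustration, not values from any paper:

```python
d_model = 4096      # assumed hidden size
total_budget = 7e9  # assumed total parameter budget for the fixed-size setup
fixed_body = 6.5e9  # assumed non-embedding parameters for the fixed-architecture setup

for vocab_size in (32_000, 64_000, 128_000, 256_000):
    # Untied input embedding plus output (unembedding) matrix.
    vocab_params = 2 * vocab_size * d_model

    # Setup 1: total parameters fixed, so the transformer body shrinks
    # as the vocabulary grows.
    body_if_fixed_total = total_budget - vocab_params

    # Setup 2: architecture fixed, so total parameters grow with the vocabulary.
    total_if_fixed_body = fixed_body + vocab_params

    print(
        f"vocab={vocab_size:>7,}: vocab layers {vocab_params / 1e9:.2f}B | "
        f"fixed-total body {body_if_fixed_total / 1e9:.2f}B | "
        f"fixed-architecture total {total_if_fixed_body / 1e9:.2f}B"
    )
```
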
The best I could find is this chart from *Impact of Tokenization on Language Models: An Analysis for Turkish*. But it only shows quality increasing with vocabulary size. Surely at some point increasing the vocabulary starts to have a negative effect on quality?
