LLaMa was released in several sizes, with 7B, 13B, 33B, and 65B parameters. These values look a little weird, because they are very close to powers of two (8, 16, 32, 64) that would be more conventionally considered “round numbers” in software. Why were these specific numbers chosen?
Can you please provide a reference that explains what these "round numbers" are in "software"? I've been programming for a few years and I've never encountered such an expression, which, of course, doesn't mean that it doesn't exist or doesn't define a valid concept. – nbro Jun 06 '23 at 10:07
1 Answer
At first I thought they were just leaving some headroom by using slightly-less-than-power-of-two parameter counts, but 33 and 65 actually break that pattern. I still think it is memory-usage related, though, and that they aimed to have the full system fit in 16, 32, 64, etc. GB of memory.
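As a rough check (my own back-of-the-envelope sketch, assuming 2 bytes per parameter for fp16 weights and ignoring the decoding cache and runtime overhead), the raw weight footprints come out like this:

    # Rough fp16 weight footprint per model size (2 bytes per parameter).
    # Weights only; no decoding cache or runtime overhead included.
    param_counts = {"7B": 7e9, "13B": 13e9, "33B": 33e9, "65B": 65e9}

    for name, n_params in param_counts.items():
        gb = n_params * 2 / 1e9  # 2 bytes per fp16 parameter, in decimal GB
        print(f"{name}: ~{gb:.0f} GB of weights")
    # 7B: ~14 GB, 13B: ~26 GB, 33B: ~66 GB, 65B: ~130 GB

The 7B figure matches the 14GB mentioned in the FAQ quoted below.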
There are some hints at their GitHub repository: https://github.com/facebookresearch/llama/blob/main/FAQ.md#3
Accounting for 14GB of memory for the model weights (7B model), this leaves 16GB available for the decoding cache which stores 2 * 2 * n_layers * max_batch_size * max_seq_len * n_heads * head_dim bytes.
With default parameters, this cache was about 17GB (2 * 2 * 32 * 32 * 1024 * 32 * 128) for the 7B model.
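For reference, here is that cache formula written out in Python with the default values quoted above (a sketch; the variable names follow the FAQ, not the actual repository code):

    # Decoding (KV) cache size from the FAQ formula:
    # 2 (keys and values) * 2 (bytes per fp16 value) * n_layers
    #   * max_batch_size * max_seq_len * n_heads * head_dim
    n_layers = 32
    max_batch_size = 32
    max_seq_len = 1024
    n_heads = 32
    head_dim = 128

    cache_bytes = 2 * 2 * n_layers * max_batch_size * max_seq_len * n_heads * head_dim
    print(f"{cache_bytes / 1e9:.1f} GB")  # ~17.2 GB, matching the "about 17GB" above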
– NikoNyrh