For Chinese characters, Jun Da's dataset implies that 1566 characters give 95% coverage, 2285 characters give 98% coverage, and 2838 characters give 99% coverage. I'm wondering if there's a similar result for words.

Question: How many words give 98% coverage in Mandarin?

It seems substantially more challenging to compute for words, since it's hard to define what a "word" is. I did some Googling, but didn't immediately find anything concrete. This webpage says HSK6's 5000 words are enough for between 77.2% and 98.2% coverage, a range so broad as to be almost useless (perhaps this reflects the difficulty of the question).

This LingQ post gives

1000 words cover 73.0%
2000 words cover 82.2%
5000 words cover 91.64%

citing a Japanese book. This Reddit post says 7000 words give 95% coverage (derived from the SUBTLEX-CH corpus). The Redditor goes on to state that 16,000 words are enough for 98% coverage; it's not clear where this claim comes from. This is the best approximation I have for now.

This list gives 95% coverage for the first 10000 simplified Chinese "1-grams".
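
For reference, here is how I understand the figures above to be computed, as a minimal sketch: given per-word token counts from some segmented corpus, sort them from most to least frequent and find the smallest number of words whose combined count reaches the target share of all tokens. The function name `words_for_coverage` and the toy counts are my own, invented for illustration; they are not taken from any of the sources above, and real results would depend heavily on the corpus and on how it is segmented into words.

```python
from itertools import accumulate


def words_for_coverage(counts, target):
    """Return the smallest number of top-frequency words whose
    combined token count is at least `target` (a fraction in (0, 1])
    of all tokens. `counts` holds one token count per distinct word."""
    ranked = sorted(counts, reverse=True)  # most frequent first
    total = sum(ranked)
    if total == 0:
        return 0  # empty corpus: nothing to cover
    for n, running in enumerate(accumulate(ranked), start=1):
        if running / total >= target:
            return n
    return len(ranked)


# Toy counts, made up purely for illustration (not real corpus data).
toy_counts = [50, 30, 10, 5, 3, 2]
print(words_for_coverage(toy_counts, 0.95))
```

With a real frequency list such as SUBTLEX-CH, the same function would be called on the frequency column with `target=0.98`.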

Rebecca J. Stones
    'since it's hard to define what a "word" is'. It usually is, but since Standard Chinese has no morphology, I think it would be acceptable to use "lexical item" as a definition of "word". The trouble is that this solves only one side of the question, because you still need a good electronic dictionary and a representative corpus to do the actual computational work. – Tsundoku Nov 23 '22 at 12:37
