For Chinese characters, Jun Da's dataset implies that 1566 characters give 95% coverage, 2285 characters give 98% coverage, and 2838 characters give 99% coverage. I'm wondering if there's a similar result for words.
Question: How many words give 98% coverage in Mandarin?
This seems substantially harder to compute for words, since it's hard to define what a "word" is. I did some Googling, but didn't immediately find anything concrete. This webpage says HSK6's 5000 words are enough for between 77.2% and 98.2% coverage, a range so broad as to be almost useless (perhaps this reflects the difficulty of the question).
This LingQ post, citing a Japanese book, gives:
1000 words cover 73.0%
2000 words cover 82.2%
5000 words cover 91.64%
This Reddit post says 7000 words give 95% coverage (derived from the SUBTLEX-CH corpus). The Redditor goes on to state that 16,000 words are enough for 98% coverage, but it's not clear where this claim comes from. This is the best approximation I have for now.
This list gives 95% coverage for the first 10000 simplified Chinese "1-grams".
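Once a segmentation is fixed and a word-frequency list is available, the number of words needed for a given coverage is just a cumulative sum over the counts sorted from most to least frequent. Here is a minimal sketch of that calculation, assuming the list can be loaded as plain occurrence counts. The function name `words_needed` and the toy counts are hypothetical and only for illustration; they are not taken from any of the sources above.

```python
from itertools import accumulate


def words_needed(counts, target):
    """Return the smallest number of top-frequency words whose combined
    share of all tokens reaches `target` (a fraction such as 0.98)."""
    ranked = sorted(counts, reverse=True)  # most frequent first
    total = sum(ranked)
    for n, running in enumerate(accumulate(ranked), start=1):
        if running >= target * total:
            return n
    return len(ranked)  # only reached when the list is empty


# Toy counts for illustration only; NOT real corpus data.
toy_counts = {"的": 50, "是": 20, "我": 15, "不": 10, "了": 5}

for target in (0.6, 0.9, 0.98):
    print(target, words_needed(toy_counts.values(), target))
```

The answer depends entirely on how the corpus was segmented into words and on what kind of text it contains, which may partly explain why the figures quoted above differ so much.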