
I am currently trying to build a vocabulary for BoW-vector generation from a set of 200k scientific abstracts.

I already do some basic filtering of tokens: lowercasing, stop-word removal, stemming, dropping tokens shorter than 2 characters, leaving out tokens that can be converted to a number, and so on. Still, I count more than 121k distinct tokens, which seems like a lot to me.
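For reference, my filtering is roughly equivalent to the sketch below (simplified; the `abstracts` iterable is just a stand-in for my actual data):

```python
# Minimal sketch of the filtering described above, using NLTK.
# Requires: nltk.download("punkt"), nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def is_number(token):
    """Return True if the token can be converted to a number."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def filter_tokens(text):
    """Lowercase, tokenize, and apply the basic filters listed above."""
    kept = []
    for tok in word_tokenize(text.lower()):
        if len(tok) < 2:        # drop very short tokens
            continue
        if tok in STOP_WORDS:   # drop stop words
            continue
        if is_number(tok):      # drop anything that parses as a number
            continue
        kept.append(STEMMER.stem(tok))
    return kept

# Distinct tokens over all abstracts (placeholder iterable):
# vocab = set()
# for abstract in abstracts:
#     vocab.update(filter_tokens(abstract))
```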

As I am quite new to all this, I am wondering whether there are any guidelines for how big such a vocabulary should be on average, maybe even depending on the field the texts come from.

Wolfone

1 Answer


I don't think there's a definitive answer for this; it will depend on your particular domain. Here's how I go about it:

  1. The commonly used core of the English language is roughly 20,000 words, so I use that as a baseline.
  2. I expand this number to account for some common misspellings.
  3. Does my data contain special tokens like emojis? Emojis can still convey meaning, so I expand my vocabulary to include de-emoji'd text.
  4. Does my data contain specialized text such as scientific and/or academic terms? If so, I expand my baseline number accordingly.

Finally, you can always check your token index to see how many words fall out of vocabulary. If that number seems reasonable enough to proceed, move forward; otherwise, expand your baseline number a little more.
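As a rough illustration of that last check, here's one way to cap the vocabulary at a baseline size and measure what gets left out (the 20,000 cutoff and the `tokenized_docs` variable are just placeholders, not something specific to your data):

```python
# Sketch of the out-of-vocabulary check described above.
# `tokenized_docs` is assumed to be a list of token lists, one per document.
from collections import Counter

def oov_report(tokenized_docs, vocab_size=20000):
    """Keep the `vocab_size` most frequent tokens and report what is left out."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}

    total_tokens = sum(counts.values())
    oov_tokens = sum(c for tok, c in counts.items() if tok not in vocab)

    return {
        "distinct_tokens": len(counts),
        "oov_types": len(counts) - len(vocab),         # distinct words left out
        "oov_token_share": oov_tokens / total_tokens,  # share of occurrences left out
    }
```

If the share of token occurrences that falls outside the capped vocabulary is small, the baseline is probably fine; otherwise bump it up and re-check.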

I_Play_With_Data
  • Thanks for your answer. Working up or down from some language-specific baseline sounds reasonable. Thinking about that, it would be nice to have a database where common per-language statistics are aggregated. – Wolfone Jan 25 '19 at 13:33
  • @Wolfone that would indeed be helpful. I don't know this for a fact, but I think some of the NLTK documentation speaks to this, or you can at least infer it from the support they add for foreign languages. – I_Play_With_Data Jan 25 '19 at 19:26