
I am currently trying to build a vocabulary for BoW-vector generation from a set of 200k scientific abstracts.

I already do some basic filtering of tokens: lowercasing, stop-word removal, stemming, dropping tokens shorter than 2 characters, leaving out tokens that can be converted to a number, and so on. Still, I count more than 121k distinct tokens, which seems like a lot to me.
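For reference, my filtering is roughly equivalent to the sketch below (simplified; the `abstracts` iterable is just a stand-in for my actual data):

```python
# Minimal sketch of the filtering described above, using NLTK.
# Requires: nltk.download("punkt"), nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def is_number(token):
    """Return True if the token can be converted to a number."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def filter_tokens(text):
    """Lowercase, tokenize, and apply the basic filters listed above."""
    kept = []
    for tok in word_tokenize(text.lower()):
        if len(tok) < 2:        # drop very short tokens
            continue
        if tok in STOP_WORDS:   # drop stop words
            continue
        if is_number(tok):      # drop anything that parses as a number
            continue
        kept.append(STEMMER.stem(tok))
    return kept

# Distinct tokens over all abstracts (placeholder iterable):
# vocab = set()
# for abstract in abstracts:
#     vocab.update(filter_tokens(abstract))
```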

As I am quite new to all this, I am wondering whether there are any guidelines for how big such a vocabulary should be on average, maybe even depending on the field the texts come from.

Wolfone

1 Answer


I don't think there's a definitive answer for this; it will depend on your particular domain. Here's how I go about it:

  1. The commonly used core of the English language is roughly 20,000 words, so I use that as a baseline.
  2. I expand this number to account for some common misspellings.
  3. Does my data contain special tokens like emojis? Emojis can still convey meaning, so I expand my vocabulary to include de-emoji'd text.
  4. Does my data contain specialized text such as scientific and/or academic terms? If so, I expand my baseline number accordingly.

Finally, you can always check your token index to see how many words fall out of vocabulary. If that number seems reasonable enough to proceed, move forward; otherwise, expand your baseline number a little more.
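As a rough illustration of that last check, here's one way to cap the vocabulary at a baseline size and measure what gets left out (the 20,000 cutoff and the `tokenized_docs` variable are just placeholders, not something specific to your data):

```python
# Sketch of the out-of-vocabulary check described above.
# `tokenized_docs` is assumed to be a list of token lists, one per document.
from collections import Counter

def oov_report(tokenized_docs, vocab_size=20000):
    """Keep the `vocab_size` most frequent tokens and report what is left out."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}

    total_tokens = sum(counts.values())
    oov_tokens = sum(c for tok, c in counts.items() if tok not in vocab)

    return {
        "distinct_tokens": len(counts),
        "oov_types": len(counts) - len(vocab),         # distinct words left out
        "oov_token_share": oov_tokens / total_tokens,  # share of occurrences left out
    }
```

If the share of token occurrences that falls outside the capped vocabulary is small, the baseline is probably fine; otherwise bump it up and re-check.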

I_Play_With_Data
  • Thanks for your answer. Working up or down from some language-specific baseline sounds reasonable. Thinking about that, it would be nice to have a database where common per-language statistics are aggregated. – Wolfone Jan 25 '19 at 13:33
  • @Wolfone that would indeed be helpful. I don't know this for a fact, but I think some of the NLTK documentation speaks to this, or you can at least infer it from the support they add for foreign languages. – I_Play_With_Data Jan 25 '19 at 19:26