I am currently trying to build a vocabulary for BoW-vector generation from a set of 200k scientific abstracts.
I already do some basic token filtering: lowercasing, stop-word removal, stemming, dropping tokens shorter than 2 characters, discarding tokens that can be parsed as numbers, and so on. Still, I count more than 121k distinct tokens, which seems like a lot to me.
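For reference, here is a minimal sketch of the kind of filtering I mean (NLTK is just a stand-in for whatever tokenizer/stemmer is used, and `abstracts` is assumed to be the list of 200k raw abstract texts):

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def is_number(token):
    # Drop tokens that can be converted to a number
    try:
        float(token)
        return True
    except ValueError:
        return False

def preprocess(abstract):
    tokens = word_tokenize(abstract.lower())              # lowercasing + tokenization
    tokens = [t for t in tokens if t not in stop_words]   # stop-word removal
    tokens = [t for t in tokens if len(t) >= 2]           # drop tokens shorter than 2 chars
    tokens = [t for t in tokens if not is_number(t)]      # drop numeric tokens
    return [stemmer.stem(t) for t in tokens]              # stemming

vocabulary = set()
for abstract in abstracts:   # abstracts: the 200k raw texts (placeholder)
    vocabulary.update(preprocess(abstract))

print(len(vocabulary))       # comes out at > 121k distinct tokens
```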
As I am quite new to all this, I am wondering whether there are guidelines for how big such a vocabulary should be on average, perhaps even depending on the field the texts come from.