I have a corpus with 6,040,592 tokens and 309,074 types (distinct words). Given this information, is it possible to determine the optimal size of the bag-of-words vectors used to represent phrases?
I am using a data structure like this:
{'contains(The)': True, 'contains(waste)': False, 'contains(lot)': False, ...}
To represent this:
The movie is lovely.
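For reference, here is a minimal sketch of how such a feature dictionary can be built over a fixed vocabulary (the naive tokenizer and the tiny example vocabulary are my own assumptions, not part of the setup above):

    import string

    def phrase_features(phrase, vocabulary):
        # Naive tokenization: split on whitespace and strip punctuation,
        # so 'lovely.' matches the vocabulary entry 'lovely'.
        tokens = {tok.strip(string.punctuation) for tok in phrase.split()}
        return {'contains({})'.format(w): (w in tokens) for w in vocabulary}

    vocabulary = ['The', 'waste', 'lot', 'movie', 'lovely']
    print(phrase_features('The movie is lovely.', vocabulary))
    # {'contains(The)': True, 'contains(waste)': False, 'contains(lot)': False,
    #  'contains(movie)': True, 'contains(lovely)': True}

The vector size is then simply `len(vocabulary)`, which is what I am trying to choose.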
Could Zipf's law help determine how many words to include in the model?
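One practical way to apply Zipf's law here (a sketch under my own assumptions, not a definitive answer): since type frequencies are roughly Zipfian, a small number of high-frequency types covers most of the token occurrences, so you could take the smallest vocabulary that covers, say, 95% of all tokens and use its size as the vector length. The file name `corpus.txt` and the 0.95 threshold below are hypothetical.

    from collections import Counter

    def vocab_size_for_coverage(tokens, coverage=0.95):
        # Count type frequencies and walk down the rank-frequency list
        # (Zipf: a few top-ranked types dominate) until the requested
        # fraction of all tokens is covered.
        counts = Counter(tokens)
        total = sum(counts.values())
        running = 0
        for rank, (word, count) in enumerate(counts.most_common(), start=1):
            running += count
            if running / total >= coverage:
                return rank
        return len(counts)

    # Hypothetical usage: corpus.txt is assumed to be one whitespace-tokenized file.
    with open('corpus.txt') as f:
        tokens = f.read().split()
    print(vocab_size_for_coverage(tokens, coverage=0.95))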
For tokens: `wc -w corpus`; for types: `cat corpus | tr " " "\n" | sort | uniq -c | wc -l`
– alemol Mar 01 '16 at 21:27