Tokenizing multiword expressions using MWETokenizer

Asked Sep 15 '21 at 20:14

Active Sep 15 '21 at 21:44

Viewed 80 times

I have a list of multiword expressions stored in a pandas Series in the format ('first word', 'second word'), and I want to add them all to MWETokenizer, following the post "How to treat a phrase containing stopwords as a single token with Python nltk.tokenize". I passed the first element to the constructor and then tried to iterate over the rest of the Series to add the remaining ones.

Here's the code:

from nltk.tokenize import MWETokenizer

# Seed the tokenizer with the first expression
mwetokenizer = MWETokenizer([('bite', 'bullet')], separator='_')

# Add the remaining expressions, starting from index 1
for i in range(1, len(MWE_series)):
    mwetokenizer.add_mwe((MWE_series[i]))

I don't get an error and the code runs, but only the first MWE ('bite', 'bullet') is recognized; the ones added inside the loop with mwetokenizer.add_mwe((MWE_series[i])) are ignored. How can I fix this problem?
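The post does not show what MWE_series actually contains, so the cause can only be guessed at. Two things are worth checking. First, add_mwe expects a tuple (or list) of single tokens; if an entry is a string such as "('bite', 'bullet')", it is stored character by character and never matches. Second, matching is done token by token, so a component like 'first word' that contains a space cannot match input that was split into single words. The sketch below covers both cases under those assumptions. normalize_mwe and build_tokenizer are hypothetical helper names, not part of NLTK.

```python
import ast


def normalize_mwe(entry):
    """Return one stored entry as a tuple of single-word tokens.

    Hypothetical helper, not part of NLTK.
    """
    # Assumption: an entry may be the string form of a tuple,
    # e.g. "('bite', 'bullet')", as happens after a CSV round trip.
    if isinstance(entry, str):
        try:
            entry = ast.literal_eval(entry)
        except (ValueError, SyntaxError):
            # Not a Python literal: treat it as a plain phrase.
            entry = (entry,)
    if not isinstance(entry, (tuple, list)):
        entry = (entry,)
    # Assumption: a component may itself hold several words,
    # so split each one on whitespace.
    tokens = []
    for part in entry:
        tokens.extend(str(part).split())
    return tuple(tokens)


def build_tokenizer(entries, separator="_"):
    """Build an MWETokenizer from any iterable of entries (hypothetical helper)."""
    from nltk.tokenize import MWETokenizer  # needs nltk installed

    tokenizer = MWETokenizer(separator=separator)
    for entry in entries:
        mwe = normalize_mwe(entry)
        if len(mwe) > 1:  # a single token is not a multiword expression
            tokenizer.add_mwe(mwe)
    return tokenizer
```

Iterating over the Series directly, as in build_tokenizer(MWE_series), avoids the manual counter. Note also that tokenize expects a list of tokens rather than a raw string, for example tokenizer.tokenize(text.split()).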

edited Sep 15 '21 at 21:44

user7045034

0 Answers