Removing single letter stopwords without removing the letter from words containing it

Question

I am trying to remove stopwords from my text.

I have tried using the code below.

import re
from nltk.corpus import stopwords
sw = stopwords.words("english")
my_text = 'I love coding'
my_text = re.sub("|".join(sw), "", my_text)
print(my_text)

Expected result: love coding. Actual result: I l cng (since 'o' and 've' are both found in the stopwords list "sw").

How can I get the expected result?

Possible duplicate: https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python (Aldric, Jul 25 '19 at 14:58)

score 0 · Answer 1 · answered Jul 25 '19 at 14:59

You need to replace words, not characters:

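from itertools import filterfalse
from nltk.corpus import stopwords
sw = stopwords.words("english")
my_text = 'I love coding'
my_words = my_text.split()  # naive split into words
# Drop every word that appears in the stopword list (whole-word comparison)
no_stopwords = ' '.join(filterfalse(sw.__contains__, my_words))
# 'I' is kept here because the comparison is case-sensitive and the list holds 'i'
print(no_stopwords)  # I love coding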

You should also worry about splitting sentences, case sensitivity, etc.

There are libraries that do this properly, since this is a common, non-trivial problem.
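To illustrate "replace words, not characters" with the asker's original regex approach: the pattern can be kept if every stopword is anchored at word boundaries. This is only a minimal sketch. The short `sw` list is a hand-written stand-in for `stopwords.words("english")`, and `remove_stopwords` is a hypothetical helper name, not part of NLTK.

```python
import re

# Stand-in for nltk's stopwords.words("english"); an assumption for this sketch.
sw = ["i", "me", "my", "o", "ve", "d", "in", "don", "don't"]

# Try longer alternatives first, so "don't" is matched before "don".
alternatives = sorted(map(re.escape, sw), key=len, reverse=True)
# \b anchors each alternative at word boundaries, so the "o" inside "love" is left alone.
pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

def remove_stopwords(text):
    stripped = pattern.sub("", text)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(stripped.split())

print(remove_stopwords("I love coding"))  # love coding
```

The `re.IGNORECASE` flag handles the case-sensitivity concern mentioned above. Sentence splitting and punctuation are still not handled here.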

score 0 · Accepted Answer · answered Jul 25 '19 at 15:01

Split the sentence into words first, then drop the words that are stop words:

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests
sentence = 'I love coding'
# Lowercase first so that 'I' matches the stopword 'i'
words = [i for i in sentence.lower().split() if i not in stop]
print(words)            # ['love', 'coding']
print(" ".join(words))  # love coding
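Lowercasing the whole sentence also lowercases the words that are kept. If the original casing matters, one option is to lowercase only for the comparison. The sketch below assumes a tiny hand-written `stop` set in place of NLTK's list and a simple regex tokenizer; `drop_stopwords` is a hypothetical helper name.

```python
import re

# Stand-in for set(stopwords.words('english')); an assumption for this sketch.
stop = {"i", "me", "my", "o", "ve", "d", "in", "the", "and"}

def drop_stopwords(sentence):
    # Tokens are runs of letters, digits, or apostrophes; punctuation is discarded.
    tokens = re.findall(r"[A-Za-z0-9']+", sentence)
    # Compare in lowercase, but keep each token's original casing.
    return [t for t in tokens if t.lower() not in stop]

print(" ".join(drop_stopwords("I love Python, and I love coding.")))
# love Python love coding
```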
