
I am trying to remove stopwords from my text.

I have tried using the code below.

import re
from nltk.corpus import stopwords
sw = stopwords.words("english")
my_text='I love coding'
my_text=re.sub("|".join(sw),"",my_text)
print(my_text)

Expected result: love coding. Actual result: I l cng (since 'o' and 've' are both found in the stopwords list "sw").

How can I get the expected result?

threxx
    Possible duplicate: https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python – Aldric Jul 25 '19 at 14:58

2 Answers


You need to replace words, not characters:

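from itertools import filterfalse
from nltk.corpus import stopwords

sw = stopwords.words("english")
my_text = 'I love coding'
my_words = my_text.split()  # naive whitespace split into words
# Keep only the words that are not in the stopword list.
# The comparison is case-sensitive and the NLTK list is lowercase, so 'I' survives here.
no_stopwords = ' '.join(filterfalse(sw.__contains__, my_words))
print(no_stopwords)  # I love coding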

You should also worry about splitting sentences, case sensitivity, etc.

There are libraries that do this properly, since this is a common, non-trivial problem.
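
To illustrate the case-sensitivity and splitting concerns, here is a minimal sketch that uses only the standard library. The names STOPWORDS and remove_stopwords are made up for this example, and the tiny stopword set is just a stand-in for set(stopwords.words("english")) so that the snippet runs without the NLTK corpus. It is not a replacement for a real tokenizer.

```python
import re

# Hypothetical stand-in for set(stopwords.words("english")), kept tiny so the
# sketch runs without downloading the NLTK corpus.
STOPWORDS = {"i", "me", "my", "the", "a", "is", "and", "o", "ve"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords, comparing case-insensitively and skipping punctuation."""
    # \w+(?:'\w+)? extracts word tokens and keeps contractions like "don't" whole
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    # Lowercase only for the comparison, so the surviving words keep their case
    return " ".join(t for t in tokens if t.lower() not in stopwords)

print(remove_stopwords("I love coding"))  # love coding
```

Lowercasing only for the membership test keeps the original capitalization of the remaining words, which a plain sentence.lower() would lose.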

Reut Sharabani

Split the sentence into words first, then filter out the stop words:

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))  # a set gives fast membership tests
sentence = 'I love coding'
# Lowercase first, because the NLTK stopword list is all lowercase
words = [w for w in sentence.lower().split() if w not in stop]
print(words)            # ['love', 'coding']
print(" ".join(words))  # love coding
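
If you would rather keep the re.sub approach from the question, one possible fix is to anchor the pattern at word boundaries so that it only matches whole words. This is a rough sketch: the short sw list below is a hypothetical stand-in for stopwords.words("english"), and strip_stopwords is a name made up for the example.

```python
import re

# Hypothetical short list standing in for stopwords.words("english")
sw = ["i", "me", "my", "o", "ve", "d", "in"]

# \b anchors each alternative at word boundaries, so 'o' no longer matches
# inside 'love'; re.escape guards against regex metacharacters in the list
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, sw)) + r")\b", re.IGNORECASE)

def strip_stopwords(text):
    cleaned = pattern.sub("", text)
    # Collapse the whitespace left behind by the removed words
    return " ".join(cleaned.split())

print(strip_stopwords("I love coding"))  # love coding
```

Note that an apostrophe counts as a word boundary, so contractions such as "I've" are matched piece by piece and leave the apostrophe behind. Splitting into tokens first is usually the more robust choice.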
Sundeep Pidugu