Finding all occurrences of a list of words in a text using Python Regex

Question

I need to find all occurrences of a list of words in a text using regex. For example, given the words:

words = {'i', 'me', 'my'}

and some

text = 'A book is on the table. I have a book on the table. My book is on the table. There is my book on the table.'

it should return result = ["I", "My", "my"]

I'm using this:

re.findall(r"'|'.join(words)", text, flags=re.IGNORECASE)

But it's returning an empty list.

Also if I use this:

re.findall(r"(?=("+'|'.join(words)+r"))", text, flags=re.IGNORECASE)

returns:

['i', 'I', 'My', 'i', 'i', 'my']

which is incorrect.

https://stackoverflow.com/questions/54481198/python-match-multiple-substrings-in-a-string — AMC, Feb 08 '20 at 21:07
Please be more specific about what the issue is. Which part are you struggling with? — AMC, Feb 08 '20 at 21:07

score 1 · Answer 1 · answered Feb 08 '20 at 18:30

There is a problem in the way you define the regex. You are not joining the words: the call to join sits inside the string literal, so the pattern is the literal text "'|'.join(words)", which results in no matches.

>>> x = r"'|'.join(words)"
>>> x
"'|'.join(words)"

You can rewrite it as

>>> re.findall(r"\b({})\b".format('|'.join(words)), text, flags=re.IGNORECASE)
['I', 'My', 'my']

Note that \b here is a word boundary: it matches the empty string at the beginning or end of a word, which is needed in order to match only full words.
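As a minimal sketch of the same idea in script form, the snippet below wraps the word-boundary pattern in a helper. The function name find_words is hypothetical, and the use of re.escape and sorted is an addition not present in the answer above: escaping is assumed to be wanted in case a word contains regex metacharacters, and sorting only makes the pattern deterministic when words is a set.

```python
import re


def find_words(words, text):
    """Return every whole-word, case-insensitive occurrence of any word."""
    # re.escape is an assumption beyond the original answer: it keeps
    # characters such as "." or "+" in a word from acting as regex syntax.
    alternation = "|".join(re.escape(w) for w in sorted(words))
    # \b on both sides restricts matches to whole words.
    pattern = re.compile(r"\b(?:{})\b".format(alternation), flags=re.IGNORECASE)
    return pattern.findall(text)


words = {"i", "me", "my"}
text = ("A book is on the table. I have a book on the table. "
        "My book is on the table. There is my book on the table.")

print(find_words(words, text))
```

With the question's input this should print ['I', 'My', 'my'].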

score 1 · Answer 2 · answered Feb 08 '20 at 18:37

import re

# Wrap each word in \b so only whole words match, then OR them together.
pattern = re.compile('|'.join(map(lambda x: '\\b' + x + '\\b', words)),
                     flags=re.IGNORECASE)
pattern.findall(text)

Putting \b on either side of words keeps "I" from matching things like "is".
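To make the effect of \b concrete, here is a small comparison on a made-up sentence (the sample string is an assumption for illustration, not taken from the question). It searches for the single word "i" with and without boundaries.

```python
import re

# Hypothetical sample sentence: "i" appears as a word once and as a
# letter inside four other words.
sample = "I think this is mine"

# Without boundaries, every "i" or "I" in the string matches.
unbounded = re.findall(r"i", sample, flags=re.IGNORECASE)

# With \b on both sides, only the standalone word matches.
bounded = re.findall(r"\bi\b", sample, flags=re.IGNORECASE)

print(unbounded)
print(bounded)
```

The unbounded search picks up the letter inside "think", "this", "is" and "mine" as well, while the bounded one keeps only the leading "I".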


Michael Lorton


score 1 · Accepted Answer · answered Feb 08 '20 at 18:41

Regexes are very interesting. As a rule of thumb, I write regexes that I will still be able to read N days in the future. This is how I would do it:

import re

words = ["I", "am", "my"]

text = ("A book is on the table. I have a book on the table. My book is on the table. There is my book on the table.")

# get values from my list; they can be preceded or followed by a non-word character, e.g. "Is it I?"

pattern = r'\W.*?({})\W.*?'.format('|'.join(words))

s = re.findall(pattern, text, flags=re.IGNORECASE)

print(s)
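One thing that may be worth checking with a pattern of this shape is how it behaves when a listed word starts the text or appears as the tail of a longer word. The sketch below runs the pattern from this answer next to a \b-based variant on a made-up sample string; the sample text, the variable names, and the use of re.escape are assumptions for illustration only.

```python
import re

words = ["I", "am", "my"]

# Hypothetical sample: "My" starts the string, while "Hi" and "Sam"
# merely end in a listed word.
sample = "My book. Hi, Sam is here."

# Pattern as written in the answer above.
lazy_pattern = r'\W.*?({})\W.*?'.format('|'.join(words))

# Word-boundary variant, shown for comparison only.
boundary_pattern = r'\b({})\b'.format('|'.join(map(re.escape, words)))

lazy_result = re.findall(lazy_pattern, sample, flags=re.IGNORECASE)
boundary_result = re.findall(boundary_pattern, sample, flags=re.IGNORECASE)

print(lazy_result)
print(boundary_result)
```

On this sample the lazy pattern appears to skip the leading "My" (nothing non-word precedes it) and to pick up the endings of "Hi" and "Sam", whereas the boundary variant returns only "My". On the text from the question both approaches give the same three matches.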
