I need to find all occurrences of a list of words in a text using regex. For example, given the words:

words = {'i', 'me', 'my'}

and some

text = 'A book is on the table. I have a book on the table. My book is on the table. There is my book on the table.'

the result should be ["I", "My", "my"]

I'm using this:

re.findall(r"'|'.join(words))", text,flags=re.IGNORECASE))

But it's returning an empty list.

Also if I use this:

re.findall(r"(?=("+'|'.join(words)+r"))", text, flags=re.IGNORECASE))

returns:

['i', 'I', 'My', 'i', 'i', 'my']

which is incorrect.

tripleee
Hossein
  • https://stackoverflow.com/questions/54481198/python-match-multiple-substrings-in-a-string – AMC Feb 08 '20 at 21:07
  • Please be more specific about what the issue is. Which part are you struggling with? – AMC Feb 08 '20 at 21:07

3 Answers

There is a problem in the way you define the regex. You are not joining the words: because the whole expression sits inside the string literal, the pattern is the literal text "'|'.join(words)", which leads to no matches.

>>> x = r"'|'.join(words)"
>>> x
"'|'.join(words)"

You can rewrite it as

>>> re.findall(r"\b({})\b".format('|'.join(words)), text, flags=re.IGNORECASE)
['I', 'My', 'my']

Note that \b here is a word boundary: it matches the empty string at the beginning or end of a word, which is needed in order to match only full words.
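As a minimal sketch of the same idea, the helper below wraps the pattern in a function. The name find_words is hypothetical and not from the question, and the call to re.escape is an addition that assumes the word list might contain regex metacharacters:

```python
import re

def find_words(words, text):
    """Return every whole-word, case-insensitive occurrence of `words` in `text`."""
    # Escape each word so characters such as '.' or '?' are matched literally.
    alternation = "|".join(re.escape(w) for w in words)
    # \b on both sides restricts matches to whole words.
    pattern = re.compile(r"\b(?:{})\b".format(alternation), flags=re.IGNORECASE)
    return pattern.findall(text)

words = {"i", "me", "my"}
text = ("A book is on the table. I have a book on the table. "
        "My book is on the table. There is my book on the table.")
print(find_words(words, text))  # ['I', 'My', 'my']
```

The non-capturing group (?:...) makes findall return the whole match rather than a group, which gives the same result here.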

abc
re.compile('|'.join(map(lambda x: '\\b' + x + '\\b', words)),
           flags=re.IGNORECASE).findall(text)

Putting \b on either side of words keeps "I" from matching things like "is".
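To make the effect of \b concrete, here is a small comparison using the words and text from the question. The variable names without_boundaries and with_boundaries are illustrative only:

```python
import re

words = {"i", "me", "my"}
text = ("A book is on the table. I have a book on the table. "
        "My book is on the table. There is my book on the table.")

alternation = "|".join(words)

# Without \b, the 'i' inside every "is" also counts as a match.
without_boundaries = re.findall(alternation, text, flags=re.IGNORECASE)

# With \b on both sides, only whole words match.
with_boundaries = re.findall(r"\b(?:{})\b".format(alternation), text,
                             flags=re.IGNORECASE)

print(without_boundaries)
print(with_boundaries)
```

The first list contains the extra 'i' entries reported in the question; the second contains only the three whole-word matches.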

Michael Lorton
Regexes are very interesting. As a rule of thumb, I write a regex that I will still be able to read N days in the future. This is how I would do it:

import re

words = ["I", "am", "my"]

text = ("A book is on the table. I have a book on the table. My book is on the table. There is my book on the table.")

# Get values from my list; they can be preceded or followed by a non-word character, e.g. Is it I?

pattern = r'\W.*?({})\W.*?'.format('|'.join(words))

s = re.findall(pattern, text, flags=re.IGNORECASE)

print(s)
Prayson W. Daniel