Regex for word exclusion in python

Question

I have a regular expression '[\w_-]+' which allows alphanumeric characters, underscores, and hyphens.

I have a set of words in a Python list which I don't want to allow:

listIgnore = ['summary', 'config']

What changes need to be made in the regex?

P.S: I am new to regex

possible duplicate http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word — korylprince, Nov 07 '13 at 06:04

score 3 · Answer 1 · answered Nov 07 '13 at 06:25

>>> import re
>>> line = "This is a line containing a summary of config changes"
>>> listIgnore = ['summary', 'config']
>>> patterns = "|".join(listIgnore)
>>> # \b anchors at a word start; the negative lookahead rejects words that begin with an ignored word
>>> print(re.findall(r'\b(?!(?:' + patterns + r'))[\w_-]+', line))
['This', 'is', 'a', 'line', 'containing', 'a', 'of', 'changes']

score 2 · Accepted Answer · answered Nov 07 '13 at 06:20

This question intrigued me, so I set about finding an answer:

'^(?!summary)(?!config)[\w_-]+$'

Now this only works if you want to match the regex against a complete string:

>>> re.match(r'^(?!summary)(?!config)[\w_-]+$', 'config_test')  # returns None, so nothing is printed
>>> re.match(r'^(?!summary)(?!config)[\w_-]+$', 'confi_test')
<_sre.SRE_Match object at 0x21d34a8>

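So to use your list, just add one (?!<word here>) for each word after the ^ in your regex. These are called negative lookaheads. Note that, as written, each one rejects any string that starts with the word (config_test above), not only the exact word. Here's some good info.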
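Rather than typing each lookahead by hand, the pattern could be assembled from the list. This is only a sketch: build_exclusion_pattern and its whole_word_only flag are names made up for illustration, and re.escape is added on the assumption that the words may one day contain regex metacharacters.

```python
import re

listIgnore = ['summary', 'config']

def build_exclusion_pattern(words, whole_word_only=False):
    """Hypothetical helper: one negative lookahead per excluded word.

    With whole_word_only=True each lookahead ends in '$', so only the
    exact word is rejected; otherwise any string starting with the word
    is rejected, as in the hand-written pattern.
    """
    suffix = '$' if whole_word_only else ''
    lookaheads = ''.join(
        '(?!{}{})'.format(re.escape(word), suffix) for word in words
    )
    return re.compile('^' + lookaheads + r'[\w_-]+$')

prefix_pattern = build_exclusion_pattern(listIgnore)
exact_pattern = build_exclusion_pattern(listIgnore, whole_word_only=True)

print(prefix_pattern.pattern)
print(bool(prefix_pattern.match('config_test')))  # False: starts with 'config'
print(bool(exact_pattern.match('config_test')))   # True: not exactly 'config'
print(bool(exact_pattern.match('config')))        # False: exactly 'config'
```

The whole_word_only variant is probably closer to what the question asks for, since it still accepts names that merely begin with an excluded word.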

If you're trying to match within a string (i.e. without the ^ and $), then I'm not sure it's possible with this pattern. Without the anchors, the regex simply skips the first character and matches the rest of the excluded word, for example ummary for summary.

Obviously, the more exclusions you add, the more inefficient it will get. There are probably better ways to do it.
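For what it's worth, one way the unanchored case might be handled is to pin the match to the start of a token with a lookbehind, so the engine cannot resume in the middle of an excluded word. The sketch below assumes a token is a run of [\w-] characters and that only exact excluded words should be dropped; the sample line is made up for illustration.

```python
import re

listIgnore = ['summary', 'config']
alternatives = '|'.join(re.escape(word) for word in listIgnore)

# (?<![\w-])               the match must begin at the start of a token
# (?!(?:...)(?![\w-]))     the token must not be exactly an excluded word
# [\w-]+                   the token itself
pattern = re.compile(
    r'(?<![\w-])(?!(?:' + alternatives + r')(?![\w-]))[\w-]+'
)

line = "a summary of config changes in config_test and summary-2"
tokens = pattern.findall(line)
print(tokens)
# ['a', 'of', 'changes', 'in', 'config_test', 'and', 'summary-2']
```

Because the lookbehind fails everywhere inside a token, fragments such as ummary are never returned.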

Probably, filtering all found values - like in thefourtheye's answer - will be more effective (re may be a memory-crunching bitch) — volcano, Nov 07 '13 at 06:27
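A minimal sketch of the filtering idea from the comment above: match every token with the original character class, then drop the unwanted ones in plain Python. The set conversion is an assumption made for faster lookups, not something the comment specifies.

```python
import re

listIgnore = ['summary', 'config']
ignored = set(listIgnore)  # set membership tests are cheap

line = "This is a line containing a summary of config changes"

# Find every token first, then filter out the excluded words.
words = [word for word in re.findall(r'[\w_-]+', line) if word not in ignored]
print(words)
# ['This', 'is', 'a', 'line', 'containing', 'a', 'of', 'changes']
```

This keeps the regex simple and only rejects exact matches, so a token such as config_test would be kept.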
