
I'm working with a Python 2.7.2 script to find lists of words inside of a text file that I'm using as a master word list.

I am calling the script in a terminal window, inputting any number of regular expressions, and then running the script.

So, if I pass in the two regular expressions "^.....$" and ".*z" it will print every five letter word that contains at least one "z".

What I am trying to do is add another regular expression to EXCLUDE a character from the strings. I would like to print out all words that have five letters, a "z", but -not- a "y".

Here is the code:

import re
import sys

def read_file_to_set(filename):
    words = None
    with open(filename) as f:
        words = [word.lower() for word in f.readlines()]
    return set(words)

def matches_all(word, regexes):
    for regex in regexes:
        if not regex.search(word):
            return False
    return True

if len(sys.argv) < 3:
    print "Needs a source dictionary and a series of regular expressions"
else:
    source = read_file_to_set(sys.argv[1])
    regexes = [re.compile(arg, re.IGNORECASE)
               for arg in sys.argv[2:]]
    for word in sorted(source):
        if matches_all(word.rstrip(), regexes):
            print word,

What modifiers can I put onto the regular expressions that I pass into the program to allow for me to exclude certain characters from the strings it prints?

If that isn't possible, what needs to be implemented in the code?

user1251007
Zack Cruise

2 Answers


Specifying a character that must not match is done like this (this matches any single character except a lowercase letter):

[^a-z]

So to match a string that does not contain "y", the regex is: ^[^y]*$

Character by character explanation:

^ means "beginning" if it comes at the start of the regex. Similarly, $ means "end" if it comes at the end. [abAB] matches any one of the characters listed inside the brackets, and a hyphen between two characters specifies a range. For example, this matches any hex character (upper or lower case): [a-fA-F0-9]

* means 0 or more of the previous expression. As the first character inside [], ^ has a different meaning: it means "not". So [^a-fA-F0-9] matches any non-hex character.

When you put a pattern between ^ and $, you force the regex to match the string exactly (nothing before or after the pattern). Combine all these facts:

^[^y]*$ means a string that consists of exactly 0 or more characters that are not 'y'. (To do something more interesting, you could check for strings that contain no digits: ^[^0-9]*$)
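
To tie this back to the question, here is a minimal sketch of the three patterns working together. It reuses the idea of the asker's matches_all function; the word list is made up for illustration and stands in for the master word list file.

```python
import re

def matches_all(word, regexes):
    # A word passes only if every pattern finds a match in it.
    return all(regex.search(word) for regex in regexes)

# Five letters, at least one "z", and no "y" anywhere.
patterns = [r"^.....$", r".*z", r"^[^y]*$"]
regexes = [re.compile(p, re.IGNORECASE) for p in patterns]

# Hypothetical sample words standing in for the master word list.
words = ["crazy", "dozen", "pizza", "zebra", "hello", "zoo"]
selected = [w for w in words if matches_all(w, regexes)]
print(selected)
```

With the asker's script, the same effect should come from passing "^[^y]*$" as one more quoted argument after the other two patterns.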

piojo
  • This is exactly what I was looking for! Would you mind explaining what each character in the "^[^y]*$" is doing? I am only beginning to work with computer programming / Python and I have seen each of those characters in the documentation for regexes but could not think to combine them like that for that result. – Zack Cruise Nov 12 '13 at 09:28
  • This is a very helpful explanation, thank you. Can you explain how to combine two regex 'NOT'/exclude statements? For example, what would it look like if you wanted to match a string that does not contain y AND also does not contain q? – EazyC Jan 20 '16 at 10:08
  • @EazyC If the exclusion is characters, it's just [^yq]*. If it's full strings, it's actually quite a bit harder. I don't know how offhand, but I think you can do it with negative lookaheads/lookbehinds. Regular expressions are about matching characters, but they aren't as powerful when it comes to *not* matching characters. Hence, some regex engines don't even support lookaheads/lookbehinds. (The difference is that the match isn't about the current character, but about potential future/previous characters.) So to do a search for "not this string and not that string", you need negative lookahead/lookbehind. – piojo Feb 18 '16 at 11:17
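
A small sketch of the character case from the comment above: listing both characters inside one negated class excludes either of them. The sample words are hypothetical.

```python
import re

# Matches only strings that contain neither "y" nor "q".
no_y_or_q = re.compile(r"^[^yq]*$", re.IGNORECASE)

# Hypothetical sample words for illustration.
words = ["zebra", "quiz", "crazy", "dozen"]
kept = [w for w in words if no_y_or_q.search(w)]
print(kept)
```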

You can accomplish this with negative lookarounds. This isn't a task that regexes are particularly fast at, but it does work. To match every string that does not contain the substring foo, you can use:

>>> my_regex = re.compile(r'^((?!foo).)*$', flags = re.I)
>>> print my_regex.match(u'IMatchJustFine')
<_sre.SRE_Match object at 0x1034ea738>
>>> print my_regex.match(u'IMatchFooFine')
None

As others have pointed out, if you're only excluding a single character, then a simple negated character class will suffice. Longer and more complex negative matches need this lookaround approach.
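
As a rough sketch of how this extends to more than one excluded substring, lookaheads can be chained at the start of the pattern. The second pattern and the "bar" substring are illustrative assumptions, not part of the original example, and the sketch assumes single-line input such as individual words.

```python
import re

# (?!foo) is a negative lookahead: at each position it asserts that
# "foo" does not start there, and only then does "." consume one character.
no_foo = re.compile(r"^((?!foo).)*$", re.IGNORECASE)

# An alternative form checks once from the start of the string; chaining
# lookaheads excludes several substrings at the same time.
no_foo_or_bar = re.compile(r"^(?!.*foo)(?!.*bar).*$", re.IGNORECASE)

samples = ["IMatchJustFine", "IMatchFooFine", "NoBarHere"]
for s in samples:
    print(s, bool(no_foo.match(s)), bool(no_foo_or_bar.match(s)))
```

Either pattern can be passed to the asker's script as one more argument, since the script only requires that every pattern finds a match in the word.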

VooDooNOFX