I'd like to match strings like:

45 meters?
45, meters?
45?
45 ?

but not strings like:

45 meters you?
45 you  ?
45, and you?

In both cases the question mark must be at the end. So, essentially I want to exclude all those strings containing the word "you".

I've tried the following regex:

'\d+.*(?!you)\?$'

but it also matches the strings in the second group (probably because of the .*).

f_ficarola

2 Answers

You could try this regex to match all lines that do not contain the string you and that end with ?:

^(?!.*you).*\?$

Explanation:

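This regex uses a negative lookahead. Anchored at the start of the line, (?!.*you) asserts that the string you does not appear anywhere in the line; if it does, the match fails immediately. Otherwise .*\?$ matches the whole line, which must end with ?. The net effect is that every line ending in ? is matched except those containing you.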
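
As a quick check, here is a minimal Python sketch that runs the pattern against the sample strings from the question. The list names (wanted, unwanted) and the word-boundary variant word_pattern are my own additions for illustration, not part of the answer; the variant assumes you only want to reject the standalone word you rather than any substring such as "young".

```python
import re

# Pattern from the answer above: reject any line containing "you",
# then require the line to end with "?".
pattern = re.compile(r"^(?!.*you).*\?$")

wanted = ["45 meters?", "45, meters?", "45?", "45 ?"]
unwanted = ["45 meters you?", "45 you  ?", "45, and you?"]

# Keep only the strings the pattern accepts.
matched = [s for s in wanted + unwanted if pattern.search(s)]
print(matched)

# Assumed variant (not in the original answer): \b limits the exclusion
# to the standalone word "you", so "45 young trees?" is still accepted.
word_pattern = re.compile(r"^(?!.*\byou\b).*\?$")
print(bool(pattern.search("45 young trees?")))       # False
print(bool(word_pattern.search("45 young trees?")))  # True
```

Note that neither pattern requires the line to start with digits; prepend \d+ after the lookahead if that matters for your input.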

Avinash Raj

There's a neat trick to exclude some matches from a regex, which you can use here:

>>> import re
>>> corpus = """
... 45 meters?
... 45?
... 45 ?
... 45 meters you?
... 45 you  ?
... 45, and you?
... """
>>> pattern = re.compile(r"\d+[^?]*you|(\d+[^?]*\?)")
>>> re.findall(pattern, corpus)
['45 meters?', '45?', '45 ?', '', '', '']

The downside is that you get empty matches when the exclusion kicks in, but those are easily filtered out:

>>> list(filter(None, re.findall(pattern, corpus)))  # list() is needed on Python 3, where filter() returns an iterator
['45 meters?', '45?', '45 ?']

How it works:

The trick is that we only pay attention to captured groups. The left-hand side of the alternation, \d+[^?]*you (or "digits followed by non-?-characters followed by 'you'"), matches what you don't want, and because it isn't captured we simply discard it. Only if the left-hand side doesn't match is the right-hand side, (\d+[^?]*\?) (or "digits followed by non-?-characters followed by '?'"), tried, and that one is captured.
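
If you'd rather not filter out empty strings afterwards, the same idea can be sketched with re.finditer, keeping a match only when the capturing branch took part in it. The helper name wanted_matches is hypothetical, introduced here purely for illustration.

```python
import re

# Same pattern as above: the left branch consumes the unwanted text
# without capturing it; only the right branch fills group 1.
pattern = re.compile(r"\d+[^?]*you|(\d+[^?]*\?)")

def wanted_matches(text):
    """Return only the matches produced by the capturing branch."""
    # group(1) is None whenever the left (exclusion) branch matched.
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

corpus = "45 meters?\n45?\n45 ?\n45 meters you?\n45 you  ?\n45, and you?\n"
print(wanted_matches(corpus))
```

Checking group(1) against None makes the intent explicit: a match counts only if the right-hand alternative produced it.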

Zero Piraeus