(Note: not a duplicate of "Why can't you use repetition quantifiers in zero-width look behind assertions"; see end of post.)

I'm trying to write a grep -P (Perl) regex that matches B, when it is not preceded by A -- regardless of whether there is intervening whitespace.

So, I tried this negative lookbehind, and tested it in regex101.com:

(?<!A)\s*B

This causes "AB" not to be matched, which is good, but "A B" does result in a match, which is not what I want.

I am not exactly sure why this is. It has something to do with the fact that \s* matches the empty string "", so you could say that there are infinitely many matches of \s* between A and B. But why does this affect "A B" but not "AB"?

Is the following regex a proper solution, and if so, why exactly does it fix the problem?

(?<![A\s])\s*B

I posted this before and it was incorrectly marked as a duplicate question. The variable-length thing I'm looking for is part of the match, not part of the negative lookbehind itself -- so this is quite different from the other question. Yes, I could put the \s* inside the negative lookbehind, but I haven't done so (and doing so is not supported, as the other question explains). Also, I am particularly interested in why the alternate regex I posted above works, since I know it works but I'm not exactly sure why. The other question did not help answer that.

std_answ

1 Answer

But why does this affect "A B" but not "AB"?

Regexes match at a position, which it is helpful to think of as being between characters. In "A B" there is a position (after the space and before the B) where (?<!A) succeeds (because there isn't an A immediately preceding; there's a space instead), and \s*B succeeds (\s* matches the empty string, and B matches B), so the entire pattern succeeds.

In "AB" there is no such position. The only place where \s*B can match (immediately before the B) is also immediately after the A, so (?<!A) cannot succeed there. No position satisfies both, so the pattern as a whole can't succeed.
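The position-by-position reasoning above can be checked with a small sketch. It uses Python's re module rather than grep -P, on the assumption that the two engines treat this lookbehind and \s* the same way; the names NAIVE and naive_span are made up for illustration.

```python
import re

# The original pattern from the question: B not *immediately* preceded by A.
NAIVE = re.compile(r"(?<!A)\s*B")

def naive_span(text):
    """Return the (start, end) offsets of the first match, or None."""
    m = NAIVE.search(text)
    return m.span() if m else None

for text in ["AB", "A B"]:
    # In "A B" the match starts at offset 2: after the space, before the B,
    # with \s* matching the empty string.
    print(repr(text), "->", naive_span(text))
```

The span reported for "A B" covers only the B itself, which shows that the space was never consumed by \s*; the engine simply started the match after it.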

Is the following regex a proper solution, and if so, why exactly does it fix the problem?

(?<![A\s])\s*B

This works because (?<![A\s]) cannot succeed immediately after an A or after a whitespace character. So now the lookbehind forbids any match position that has whitespace directly before it. If there is any whitespace before the B, it has to be consumed by the \s* portion of the pattern, and the match position must be before it. If that position also doesn't have an A before it, the lookbehind can succeed and the pattern as a whole can match.
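As a rough check of that explanation, here is a sketch that runs the altered pattern over a few inputs. Again this is Python's re module standing in for grep -P, and FIXED and fixed_span are hypothetical names chosen for the example.

```python
import re

# The altered pattern: the match may not start right after an A
# or right after a whitespace character.
FIXED = re.compile(r"(?<![A\s])\s*B")

def fixed_span(text):
    """Return the (start, end) offsets of the first match, or None."""
    m = FIXED.search(text)
    return m.span() if m else None

for text in ["AB", "A B", "A   B", "A\tB", "C B", "B"]:
    # For "C B" the match starts at offset 1, so \s* consumes the space.
    print(repr(text), "->", fixed_span(text))
```

In the "C B" case the reported span includes the space, because the only allowed starting position is the one before the whitespace run. If only the B is wanted in the output, the whitespace would need to be excluded some other way, for example with a capture group around B.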

This is a trick that's made possible by the fact that \s is a fixed-width pattern that matches at every position inside of a non-empty \s* match. It can't be extended to the general case of any pattern between the (non-)A and the B.

hobbs
  • Makes sense, thanks! Re: your first point: It took me a minute to realize that "so (? – std_answ Mar 29 '17 at 23:05
  • To summarize, for anyone who's reading this and confused: For the original regex, "A B" is a tricky case because there's a potential match position before B where the \s* acts like an empty string and there's a preceding space, rather than a preceding A, so the negative lookbehind doesn't forbid a match. To fix this, the altered regex makes sure that only match positions not directly after spaces can be considered. – std_answ Mar 29 '17 at 23:05
  • @wdep1 valid point! I changed "match" to "succeed" which is hopefully clearer (a negative lookaround succeeds by not matching anything). – hobbs Mar 30 '17 at 03:38