
Why does this regex work in Python but not in Ruby:

/(?<!([0-1\b][0-9]|[2][0-3]))/

It would be great to hear an explanation, and also how to work around it in Ruby.

EDIT w/ the whole line of code:

re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)

Basically, I'm trying to add '\n' when there is a colon and it is not a time.

mrzasa
echan00

3 Answers


Ruby's regex engine (Onigmo) doesn't allow capturing groups in negative lookbehinds. If you need grouping, use a non-capturing group (?:...) instead:

[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/

Docs:

 (?<!subexp)        negative look-behind

                     Subexp of look-behind must be fixed-width.
                     But top-level alternatives can be of various lengths.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In negative look-behind, capturing group isn't allowed,
                     but non-capturing group (?:) is allowed.
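
For comparison, here is a minimal Python sketch of the difference. Python's re module compiles the capturing form from the question without complaint, and the non-capturing form suggested for Ruby finds the same positions. The sample string and the colon_positions helper are made up for illustration, and a colon is appended to the lookbehind so there is something to match.

```python
import re

# Capturing group inside a negative lookbehind: accepted by Python's re.
capturing = re.compile(r'(?<!([0-1\b][0-9]|[2][0-3])):')
# Same lookbehind with a non-capturing group: the form Ruby accepts as well.
non_capturing = re.compile(r'(?<!(?:[0-1\b][0-9]|[2][0-3])):')

def colon_positions(rx, s):
    """Return the start index of every match of rx in s (hypothetical helper)."""
    return [m.start() for m in rx.finditer(s)]

sample = "at 10:30 see note: ok 23:59 x:"
print(colon_positions(capturing, sample))
print(colon_positions(non_capturing, sample))
```

Both calls report only the colons that are not preceded by a two-digit hour, so switching to (?:...) should not change what the lookbehind rejects.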

Learned from this answer.

mrzasa
  • I saw the capture groups causing error in lookbehinds, but when I change it to non-capture groups it times out. –  Jul 18 '19 at 22:19
  • Should `(?<=aaa(?:b|cd))` be `(?<=aaa(b|cd))` that is not allowed ? –  Jul 18 '19 at 22:20
  • What is the string it times out on? Maybe there is some excessive backtracking – mrzasa Jul 18 '19 at 22:21
  • It times out on `(? –  Jul 18 '19 at 22:23
  • What is the string that you're matching with this regex? – mrzasa Jul 18 '19 at 22:26
  • It works fast on my computer: `[10] pry(main)> /(? 0` – mrzasa Jul 18 '19 at 22:30
  • I guess from that answer, the non-capture group in this `(?<=aaa(?:b|cd))` is not the _top level_ but it should be for `(?<=(?:a|bc))` right ? But, if only that's allowed, then you don't need `(?: )` ever at all. –  Jul 18 '19 at 22:32
  • That's why I don't trust online ruby testers. –  Jul 18 '19 at 22:32
  • I'm gonna have to rate Ruby as having a bizarro-world regex engine. haha –  Jul 18 '19 at 22:36

According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this restriction is common among regex engines, not all of them treat it as an error, which is why Python's re and Ruby's Onigmo behave differently here.

Now, as for your regex, it does not work correctly in either Ruby or Python: the \b inside a character class in a Python or Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when a word boundary follows an optional non-word char (here, the final \.?), then if that char is present in the string, a word char must appear immediately to its right for the boundary to match. The word boundary must therefore be moved to right after m, before \.?.
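
Both points are easy to check in Python. This is only a small sketch; the test strings and variable names are invented for illustration.

```python
import re

# Inside a character class, \b is the BACKSPACE char (\x08), not a word boundary.
backspace_in_class = re.fullmatch(r'[\b]', '\x08') is not None
# So [0-1\b][0-9] does NOT mean "a lone digit at a word boundary".
lone_digit_matches = re.search(r'[0-1\b][0-9]', ' 9') is not None

# Word boundary placed after the optional dot: the trailing dot is left out.
boundary_last = re.search(r'p\.?m\.?\b', '5 p.m. next').group()
# Word boundary placed right after m: the trailing dot is included.
boundary_after_m = re.search(r'p\.?m\b\.?', '5 p.m. next').group()

print(backspace_in_class, lone_digit_matches)  # True False
print(boundary_last, boundary_after_m)         # p.m p.m.
```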

Another flaw with the current approach is that lookbehinds are not the best tool for excluding certain contexts, as here. E.g. you can't account for a variable amount of whitespace between the time digits and am / pm. It is better to match and capture the contexts you do not want to touch, and just match those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
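
The same keep-or-replace idea could be applied to the colon from the question. The sketch below is my guess at that intent, not a tested drop-in: the TIME_OR_COLON pattern, the function name and the sample string are all assumptions. Alternative 1 captures a whole time string (kept as is), and alternative 2 matches any other colon (replaced with ':\n').

```python
import re

# Hypothetical pattern: group 1 = a time such as 10:30, 9:05 pm or 23:59 a.m.
# (kept); the second alternative is any colon that is not part of such a time.
TIME_OR_COLON = re.compile(
    r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9](?:\s*[ap]\.?m\b\.?)?)|:',
    re.I,
)

def break_after_non_time_colons(s):
    """Keep time strings untouched; turn every other colon into ':' + newline."""
    return TIME_OR_COLON.sub(lambda m: m.group(1) or ':\n', s)

sample = 'Note: meet at 10:30 pm, agenda: budget'
print(repr(break_after_non_time_colons(sample)))
# 'Note:\n meet at 10:30 pm, agenda:\n budget'
```

Because the time alternative is tried first at each position, the colon inside 10:30 is consumed as part of group 1 and never reaches the bare-colon alternative.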

Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
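
As a rough illustration of that merging (the suffix list and names are made up), a single character class plus ? quantifiers covers every am/pm spelling that the six alternatives list. Note the merged form is slightly more permissive, since it would also accept 'am.' and 'pm.'.

```python
import re

# The six spelled-out alternatives from the question, without the capture groups.
spelled_out = re.compile(r'(?:am|pm|a\.m|p\.m|a\.m\.|p\.m\.)', re.I)
# Merged form: a or p, optional dot, m, optional dot.
merged = re.compile(r'[ap]\.?m\.?', re.I)

suffixes = ['am', 'pm', 'a.m', 'p.m', 'a.m.', 'p.m.', 'AM', 'P.M.']
all_covered = all(
    spelled_out.fullmatch(s) is not None and merged.fullmatch(s) is not None
    for s in suffixes
)
print(all_covered)  # True
```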

Pattern details:

  • \b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - first alternative:
    • \b - word boundary
    • ((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
      • (?:[01]?[0-9]|2[0-3]) - either an optional 0 or 1 followed by any digit, or 2 followed by a digit from 0 to 3
      • :[0-5][0-9] - : and then a number from 00 to 59
      • \s* - zero or more whitespace chars
      • [pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
  • | - or
  • \b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot

Python fixed solution:

import re
text = 'am pm  P.M.  10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)

Ruby solution:

text = 'am pm  P.M.  10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }

Output:

"\n \n  \n  10:56pm 10:43 a.m."
Wiktor Stribiżew

@mrzasa has certainly identified the problem.

But, taking a guess that your intent is to replace a non-time colon with ':\n',
it could be done like this, I guess. It does a little whitespace trimming as well.

(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)

PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n

Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n

Readable version

 (?i)
 (?<!
      \b [01] [0-9] 
 )
 (?<!
      \b [2] [0-3] 
 )
 (                             # (1 start)
      [^\S\r\n]* 
      :
 )                             # (1 end)
 [^\S\r\n]* 
 (?!
      [0-5] [0-9] 
      (?: [ap] \.? m \b \.? )?
 )
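
A minimal sketch of running that readable version from Python, under a few assumptions of mine: the inline (?i) is replaced by the re.I flag, re.X enables the free-spacing layout, and the constant name and sample string are made up.

```python
import re

# Same pattern as the readable version, with flags passed to re.compile
# instead of the inline (?i).
NON_TIME_COLON = re.compile(r'''
    (?<! \b [01] [0-9] )          # not preceded by an hour 00-19
    (?<! \b [2] [0-3] )           # not preceded by an hour 20-23
    ( [^\S\r\n]* : )              # (1) horizontal whitespace, then the colon
    [^\S\r\n]*                    # horizontal whitespace after the colon (dropped)
    (?!
        [0-5] [0-9]               # not followed by minutes 00-59
        (?: [ap] \.? m \b \.? )?  # and an optional am/pm suffix
    )
''', re.I | re.X)

sample = 'Note: see 10:30pm and key: value'
result = NON_TIME_COLON.sub(r'\1\n', sample)
print(repr(result))
# 'Note:\nsee 10:30pm and key:\nvalue'
```

The whitespace after each replaced colon is consumed outside group 1, which is where the small trim comes from.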