I want to split a string at the boundaries between runs of repeated characters in Python. I wrote this regex:

(?<=(.))(?!\\1)

I expect "aaab447777BBBBbbb" to be split into ['aaa', 'b', '44', '7777', 'BBBB', 'bbb'].

I used the same regex in Java and got the desired result. Unfortunately, this does not work in Python. When I try

re.split('(?<=(.))(?!\\1)', string)

the result is

['aaa', 'a', 'b', 'b', '44', '4', '7777', '7', 'BBBB', 'B', 'bbb', 'b', '']

When I run

re.findall('(?<=(.))(?!\\1)', string)

it returns

['a', 'b', '4', '7', 'B', 'b']

Why does Python handle this regular expression differently from Java, and how can I solve the problem in Python?

Flimzy
Igor Gindin

2 Answers

If you're open to non-regex solutions, this is a perfect application for itertools.groupby:

>>> from itertools import groupby
>>> [''.join(g) for k, g in groupby('aaab447777BBBBbbb')]
['aaa', 'b', '44', '7777', 'BBBB', 'bbb']
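
As a minimal sketch of how this works, the same idea can be wrapped in a function. The name split_runs is only illustrative, not something from the standard library. groupby yields a key and an iterator for each run of equal consecutive items, so the key and the run length are available as well:

```python
from itertools import groupby

def split_runs(text):
    """Split text into runs of identical consecutive characters."""
    # groupby yields (key, group_iterator) pairs; the key is the repeated character
    return ["".join(group) for _key, group in groupby(text)]

runs = split_runs("aaab447777BBBBbbb")
print(runs)  # ['aaa', 'b', '44', '7777', 'BBBB', 'bbb']

# The key and the group length give a run-length view of the same data
pairs = [(key, len(list(group))) for key, group in groupby("aaab447777BBBBbbb")]
print(pairs)  # [('a', 3), ('b', 1), ('4', 2), ('7', 4), ('B', 4), ('b', 3)]
```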
Cory Kramer

The "why" is simple: in Python, as in Perl, Ruby, and JavaScript, using a capture group in the pattern passed to split means that whatever the group captures is included in the returned array. This is useful when you want to allow multiple delimiters but still be able to tell which one was used in each position. It does, however, mean that you get extra results if you're trying to do something fancy like your example. Your regex has to capture each repeated character in order to detect its repetition, but split can't tell that those captures aren't for its benefit, so it includes those single-character strings in the returned array.
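
To illustrate the capture-group behavior on a simpler input, here is a small sketch. The sample string "a,b;c" is made up for the demonstration:

```python
import re

text = "a,b;c"

# Without a capture group, the delimiters are discarded
plain = re.split(r"[,;]", text)
print(plain)  # ['a', 'b', 'c']

# With a capture group, each matched delimiter is returned too
captured = re.split(r"([,;])", text)
print(captured)  # ['a', ',', 'b', ';', 'c']

# A non-capturing group (?:...) groups without adding to the result
non_capturing = re.split(r"(?:,|;)", text)
print(non_capturing)  # ['a', 'b', 'c']
```

A non-capturing group is not an option for the regex in the question, because the backreference \1 needs a real capture.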

This result is completely predictable, though. The returned list contains each section you want, followed by the single character that the section repeats. So you can always take just the even-indexed elements to get your desired result, plus a trailing empty string that comes from the zero-width match at the end of the input:

>>> re.split(r'(?<=(.))(?!\1)', string)[::2]
['aaa', 'b', '44', '7777', 'BBBB', 'bbb', '']

(It's a good idea to use "raw" strings (r'...') for regexes so you don't have to double all your backslashes and quadruple all your backslashed backslashes...)
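
A short sketch that puts these pieces together, assuming Python 3.7 or later (where re.split splits on empty matches) and using the sample string from the question. It checks that the raw and escaped spellings are the same pattern, then filters out the trailing empty string:

```python
import re

string = "aaab447777BBBBbbb"

# The raw string and the escaped ordinary string are the same pattern
assert r"(?<=(.))(?!\1)" == "(?<=(.))(?!\\1)"

# Keep every second element to skip the captured single characters
parts = re.split(r"(?<=(.))(?!\1)", string)[::2]
print(parts)  # ['aaa', 'b', '44', '7777', 'BBBB', 'bbb', '']

# Drop the empty string produced by the zero-width match at the end of the input
runs = [part for part in parts if part]
print(runs)  # ['aaa', 'b', '44', '7777', 'BBBB', 'bbb']
```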

But that combination of positive lookbehind and negative lookahead with split seems overly complex for what you're doing here; that's the sort of thing you normally only resort to in Java when trying to emulate the "capture delimiters" behavior from these other languages. I think something like this would be easier to understand:

>>> [m[0] for m in re.finditer(r'(.)\1*', string)]
['aaa', 'b', '44', '7777', 'BBBB', 'bbb']
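
One caveat with this pattern: by default "." does not match a newline, so newline characters in the input would be silently skipped. A possible variant, sketched here with the hypothetical names RUN_PATTERN and split_runs_regex, compiles the pattern with re.DOTALL so that every character lands in some run:

```python
import re

# (.) captures one character and \1* extends the match over its repeats.
# re.DOTALL lets "." match newlines as well, so no character is skipped.
RUN_PATTERN = re.compile(r"(.)\1*", re.DOTALL)

def split_runs_regex(text):
    """Return the runs of identical consecutive characters in text."""
    return [match.group(0) for match in RUN_PATTERN.finditer(text)]

print(split_runs_regex("aaab447777BBBBbbb"))  # ['aaa', 'b', '44', '7777', 'BBBB', 'bbb']
print(split_runs_regex("aa\n\nb"))  # ['aa', '\n\n', 'b']
```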
Mark Reed