4

I'm learning Python's regular expressions. The following works as I expected:

>>> import re
>>> re.split(r'\s+|:', 'find   a str:s2')
['find', 'a', 'str', 's2']

But when I change + to *, the output looks strange to me:

>>> re.split(r'\s*|:', 'find  a str:s2')
['find', 'a', 'str:s2']

How is such a pattern interpreted in Python?

Martijn Pieters
Deqing

2 Answers

9

The 'side effect' you are seeing is that re.split() will only split on matches that are longer than 0 characters.

The \s*|: pattern matches either zero or more whitespace characters or a :, whichever comes first. But zero whitespace characters match everywhere. The split is made only at those locations where at least one whitespace character matched.

Because the \s* pattern matches every time a character is considered for splitting, the next option : is never considered.
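To see the alternation order at work without involving re.split() at all, here is a minimal sketch that anchors a match at a chosen index using the compiled pattern's match(string, pos) method. The sample strings and variable names are my own, picked only for illustration.

```python
import re

star_first = re.compile(r'\s*|:')
colon_first_pat = re.compile(r':|\s*')

text = 'str:s2'  # illustrative string; index 3 is the ':'

# At index 3 the first alternative (\s*) succeeds with an empty match,
# so the second alternative (:) is never tried.
m = star_first.match(text, 3)
empty_first = (m.group(), m.span())

# With the alternatives swapped, ':' is tried first and consumes one character.
m = colon_first_pat.match(text, 3)
colon_first = (m.group(), m.span())

# Where whitespace is actually present, \s* produces a non-empty match.
m = star_first.match('a  b', 1)
spaces = (m.group(), m.span())

print(empty_first)   # ('', (3, 3))
print(colon_first)   # (':', (3, 4))
print(spaces)        # ('  ', (1, 3))
```

The empty match at the colon is why that position never yields a ':' separator when \s* is listed first.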

Splitting on non-empty matches is called out explicitly in the re.split() documentation:

Note that split will never split a string on an empty pattern match.

If you reverse the pattern, : is considered, as it is the first choice:

>>> re.split(r':|\s*', 'find  a str:s2')
['find', 'a', 'str', 's2']
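If the goal is simply to split on whitespace runs or colons, a pattern that cannot match the empty string sidesteps the issue entirely. This also matters because Python 3.7 and later do split on empty matches, so patterns containing \s* give different results there. The sketch below assumes the separators are runs of whitespace or a colon; the character-class variant is my own suggestion rather than something from the question.

```python
import re

text = 'find  a str:s2'

# Every alternative consumes at least one character, so no empty matches occur.
by_alternation = re.split(r'\s+|:', text)

# Suggested alternative: one character class covering both separators.
# Note that adjacent separators such as ': ' are treated as a single split point.
by_class = re.split(r'[\s:]+', text)

print(by_alternation)  # ['find', 'a', 'str', 's2']
print(by_class)        # ['find', 'a', 'str', 's2']
```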
Martijn Pieters
-4

If you meant to do an "or" in your matching, then you have to do something like this: re.split('(\s*|:)', 'find a str:s2'). In short: "+" means "at least one character", while "*" means any number of characters (or none).
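For reference, here is a small sketch of what wrapping the pattern in capturing parentheses does in re.split(): the matched separators are returned as part of the result list. It uses \s+ rather than \s* (my substitution) so that the output does not depend on how empty matches are handled.

```python
import re

text = 'find a str:s2'

# Capturing parentheses make re.split() include the separators in the result.
with_group = re.split(r'(\s+|:)', text)

# Without the group, only the pieces between separators are returned.
without_group = re.split(r'\s+|:', text)

print(with_group)     # ['find', ' ', 'a', ' ', 'str', ':', 's2']
print(without_group)  # ['find', 'a', 'str', 's2']
```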

nochkin