I have several strings that I want to split on spaces, except where the space is inside parentheses.

For example

sentence = "blah (blah2 (blah3))|blah4 blah5"

should produce

["blah", "(blah2 (blah3))|blah4", "blah5"]

I've tried:

re.split(r"\s+(?=[^()]*(?:\(|$))", sentence)

but it produces:

['blah', '(blah2', '(blah3))|blah4', 'blah5']
Optimus

2 Answers

As said in the comments, this can't be done with the standard re module, because it has no recursion and therefore cannot keep track of arbitrarily nested parentheses.
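
To see why the lookahead from the question is not enough, here is a small sketch using only the standard re module (the variable names are illustrative). It lists the positions where the pattern would cut the string. The lookahead only checks that the next parenthesis is an opening one, which is also true for the space inside the outer group:

```python
import re

sentence = "blah (blah2 (blah3))|blah4 blah5"
pattern = re.compile(r"\s+(?=[^()]*(?:\(|$))")

# Each match is a position where re.split() would cut the string.
split_points = [m.start() for m in pattern.finditer(sentence)]
print(split_points)

# The cut at index 11 sits inside "(blah2 (blah3))": the next parenthesis
# after it is "(", so the lookahead succeeds even though the depth is 1 there.
print(pattern.split(sentence))
```

The pattern has no way to know how many parentheses are still open at a given position, which is what a counter (or a recursive pattern) provides.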

An alternative would be some good old string processing with nesting count on parentheses:

def parenthesis_split(sentence, separator=" ", lparen="(", rparen=")"):
    nb_brackets = 0
    sentence = sentence.strip(separator)  # get rid of leading/trailing seps

    l = [0]  # indexes where a new token starts
    for i, c in enumerate(sentence):
        if c == lparen:
            nb_brackets += 1
        elif c == rparen:
            nb_brackets -= 1
        elif c == separator and nb_brackets == 0:
            l.append(i)
        # handle malformed string (closing parenthesis with no opening one)
        if nb_brackets < 0:
            raise ValueError("Syntax error: unexpected closing parenthesis")

    l.append(len(sentence))
    # handle missing closing parentheses
    if nb_brackets > 0:
        raise ValueError("Syntax error: missing closing parenthesis")

    tokens = (sentence[i:j].strip(separator) for i, j in zip(l, l[1:]))
    # drop the empty items that consecutive separators would otherwise produce
    return [t for t in tokens if t]

print(parenthesis_split("blah (blah2 (blah3))|blah4 blah5"))

result:

['blah', '(blah2 (blah3))|blah4', 'blah5']

l contains the indexes in the string where a space that is not protected by parentheses occurs, plus the start and the end of the string. At the end, the result list is built by slicing the string between consecutive indexes.

Note the strip() at the end, which removes the separator left at the start of each slice, and the strip() at the start, which removes leading/trailing separators that would otherwise create empty items in the returned list.
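
The same nesting-count idea can also be sketched without storing indexes, by accumulating characters into the current token. This is only an illustrative variant, not the function above: the name split_outside_parens is hypothetical, and the parentheses are hard-coded as ( and ).

```python
def split_outside_parens(text, sep=" "):
    """Split text on sep, ignoring separators nested inside parentheses."""
    tokens, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unexpected closing parenthesis")
        if ch == sep and depth == 0:
            # Top-level separator: close the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if depth > 0:
        raise ValueError("missing closing parenthesis")
    if current:
        tokens.append("".join(current))
    return tokens


print(split_outside_parens("blah (blah2 (blah3))|blah4 blah5"))
# ['blah', '(blah2 (blah3))|blah4', 'blah5']
```

Because a token is only emitted when it is non-empty, repeated or leading/trailing separators need no extra stripping in this variant.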

Jean-François Fabre

While it is true that the re module cannot handle recursion, the PyPI regex module can (to some extent). Just to show how advanced regex can work, here is a two-regex approach: one regex validates that the parentheses are balanced, and the other extracts the tokens:

>>> import regex
>>> sentence = "blah (blah2 (blah3))|blah4 blah5"
>>> reg_extract = regex.compile(r'(?:(\((?>[^()]+|(?1))*\))|\S)+')
>>> reg_validate = regex.compile(r'^[^()]*(\((?>[^()]+|(?1))*\)[^()]*)+$')
>>> res = []
>>> if reg_validate.fullmatch(sentence):
...     res = [x.group() for x in reg_extract.finditer(sentence)]
...
>>> print(res)
['blah', '(blah2 (blah3))|blah4', 'blah5']

Extraction regex details: matches 1 or more occurrences of

  • (\((?>[^()]+|(?1))*\)) - Capturing group 1 matching an opening (, then 0+ occurrences of either 1+ chars other than ( and ) (with [^()]+) or (|) a recursion of the whole capturing group 1 pattern (with (?1)), and then a closing )
  • | - or
  • \S - a non-whitespace char

Validation regex details:

  • ^ - start of string
  • [^()]* - 0+ chars other than ( and )
  • ( - Group 1 capturing 1 or more occurrences of:
    • \( - opening ( symbol
    • (?>[^()]+|(?1))* - 0+ occurrences of
      • [^()]+ - 1+ chars other than ( and )
      • | - or
      • (?1) - Subroutine call recursing Group 1 subpattern (recursion)
    • \) - a closing )
    • [^()]* - 0+ chars other than ( and )
  • )+ - end of Group 1
  • $ - end of string
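
For readers who cannot install the regex module, the balance check done by the validation regex can be approximated with a plain counter. This is a sketch rather than an exact equivalent: the helper name is_balanced is hypothetical, and unlike the validation regex above (whose Group 1 must match at least once), it also accepts strings that contain no parentheses at all.

```python
def is_balanced(text):
    """Return True if every ( in text has a matching ) and vice versa."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before its opening one.
                return False
    # Any parenthesis still open at the end makes the string unbalanced.
    return depth == 0


print(is_balanced("blah (blah2 (blah3))|blah4 blah5"))  # True
print(is_balanced("blah (blah2 (blah3)|blah4"))         # False
```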
Wiktor Stribiżew