19

Say I have a string like this, where items are separated by commas but there may also be commas within items that have parenthesized content:

(EDIT: Sorry, forgot to mention that some items may not have parenthesized content)

"Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

How can I split the string on only those commas that are NOT within parentheses? i.e.:

["Water", "Titanium Dioxide (CI 77897)", "Black 2 (CI 77266)", "Iron Oxides (CI 77491, 77492, 77499)", "Ultramarines (CI 77007)"]

I think I'd have to use a regex, perhaps something like this:

([(]?)(.*?)([)]?)(,|$)

but I'm still trying to make it work.

bard

6 Answers

36

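Use a negative lookahead to match only the commas that are not inside parentheses: the pattern below matches a comma (plus any following whitespace) unless the text after it reaches a ) before any other parenthesis. Splitting the input string on the matched commas gives the desired output.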

,\s*(?![^()]*\))

DEMO

>>> import re
>>> s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.split(r',\s*(?![^()]*\))', s)
['Water', 'Titanium Dioxide (CI 77897)', 'Black 2 (CI 77266)', 'Iron Oxides (CI 77491, 77492, 77499)', 'Ultramarines (CI 77007)']
Avinash Raj
  • regex101.com strikes again! :) (I just commented [here](https://stackoverflow.com/questions/29361673/python-re-sub-how-to-use-it#comment90091512_29361707) about it too an hour ago) – Shadi Jul 27 '18 at 17:17
  • I have a similar problem but this doesn't work for me because there are inner parentheses. For example, "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492(w), 77499), Ultramarines (CI 77007)" – maynak Feb 19 '19 at 14:45
  • 1
    This doesn't work for matching parenthesis however, try this: `s="b.buildPlanPHID,coalesce(concat('D', r.Id), concat('D',c.revisionID), concat('D', d.revisionID)) as revision_id ,d.Id as diff_id"` which should break it into 3 tokens, but it creates more. – Allen Wang May 23 '19 at 02:59
  • 1
    Yep, this won't work on strings that contain parentheses nested more than one level deep. – Avinash Raj Jun 25 '20 at 02:34
  • Was searching for a while and this is the only regex solution that worked for me – TLC Aug 24 '21 at 17:02
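
The limitation raised in the comments above can be reproduced with a minimal sketch. It applies the same pattern to a shortened version of the question's string and to a shortened version of the nested example from maynak's comment. The variable names are illustrative only and are not part of the original answer.

```python
import re

# Pattern from the answer above: a comma (plus trailing whitespace) that is
# not followed by ")" before any other parenthesis.
pattern = re.compile(r',\s*(?![^()]*\))')

# One level of parentheses: the commas inside "(...)" are left alone.
flat = "Water, Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

# Nested parentheses: the comma before "77492(w)" is followed by "(" first,
# so the lookahead no longer protects it.
nested = "Water, Iron Oxides (CI 77491, 77492(w), 77499), Ultramarines (CI 77007)"

flat_parts = pattern.split(flat)
nested_parts = pattern.split(nested)

print(flat_parts)    # 3 items, as intended
print(nested_parts)  # 4 items: "Iron Oxides (...)" is cut in two
```

For input with nested parentheses, a depth-counting approach is likely the safer choice.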
1

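You can do it using just str.replace and str.split: replace each ), with ) followed by a marker that does not occur in the text, then split on that marker. Any marker works in place of //. Note that this only splits after a closing parenthesis, so it assumes every item ends with parenthesized content.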

a = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
a = a.replace('),', ')//').split('//')
print(a)

Output:

['Titanium Dioxide (CI 77897)', ' Black 2 (CI 77266)', ' Iron Oxides (CI 77491, 77492, 77499)', ' Ultramarines (CI 77007)']
Vishnu Upadhyay
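
As an illustrative check (not part of the original answer), here is the same replace-and-split idea applied to the question's full string, which starts with an item that has no parentheses. The names `full` and `parts` are made up for this sketch, and the `strip()` call is an addition to drop the leading spaces.

```python
# The question's full string, including "Water", which has no parentheses.
full = ("Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), "
        "Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)")

# Same technique as above: mark every ")," and split on the marker.
parts = [p.strip() for p in full.replace('),', ')//').split('//')]

print(parts)
# "Water" stays glued to the next item, because the comma after it
# does not follow a closing parenthesis.
```

So this approach only fits input where every item ends with a parenthesized part.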
1

I believe I have a simpler regexp for this:

rx_comma = re.compile(r",(?![^(]*\))")
result = rx_comma.split(string_to_split)

Explanation of the regexp:

  • Match a , that:
  • Is NOT followed by:
    • Any run of characters that contains no ( and ends with )

It will not work for nested parentheses, such as a,b(c,d(e,f)). If you need that, one possible solution is to go through the result of the split and merge any string that has an opening parenthesis without a matching closing one with the strings that follow it :), like:

"a"
"b(c" <- no closing, merge this 
"d(e" <- no closing, merge this
"f))"
Marcin
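
The merge idea described above could look roughly like the following sketch. It is one possible reading of the suggestion, not the author's code: for simplicity it splits on every comma with str.split rather than with the regex, and the function name `split_and_merge` is hypothetical. It also assumes the parentheses in the input are balanced.

```python
def split_and_merge(text):
    """Split on every comma, then re-join pieces until parentheses balance."""
    merged = []
    pending = []  # pieces collected while an opening parenthesis is still unclosed

    for piece in text.split(","):
        pending.append(piece)
        candidate = ",".join(pending)
        # Once every "(" seen so far has its ")", the candidate is a complete item.
        if candidate.count("(") <= candidate.count(")"):
            merged.append(candidate.strip())
            pending = []

    # Unbalanced input: keep whatever is left instead of dropping it.
    if pending:
        merged.append(",".join(pending).strip())

    return merged


print(split_and_merge("a,b(c,d(e,f))"))
```

Counting parentheses on every merge is quadratic in the worst case, which should be fine for short strings like these.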
0

This version seems to work with nested parentheses, brackets ([] or <>), and braces:

def split_top(string, splitter, openers="([{<", closers=")]}>", whitespace=" \n\t"):
    ''' Splits strings at occurrence of 'splitter' but only if not enclosed by brackets.
        Removes all whitespace immediately after each splitter.
        This assumes brackets, braces, and parens are properly matched - may fail otherwise '''

    outlist = []
    outstring = []

    depth = 0

    for c in string:
        if c in openers:
            depth += 1
        elif c in closers:
            depth -= 1

            if depth < 0:
                raise SyntaxError()

        if not depth and c == splitter:
            outlist.append("".join(outstring))
            outstring = []
        else:
            if len(outstring):
                outstring.append(c)
            elif c not in whitespace:
                outstring.append(c)

    outlist.append("".join(outstring))

    return outlist

Use it like this:

s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

split = split_top(s, ",") # splits on commas

It's probably not the fastest thing ever, I know.

nerdfever.com
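
To see how depth counting copes with the nested input from Allen Wang's comment further up, here is a compact, parentheses-only sketch of the same idea. The function name `split_top_level` is hypothetical, and unlike the version above it strips whitespace on both sides of each item and does no error checking.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring any sep that sits inside parentheses."""
    parts, current, depth = [], [], 0

    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1

        if ch == sep and depth == 0:
            # Top-level separator: close the current item.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)

    parts.append("".join(current).strip())
    return parts


s = ("b.buildPlanPHID,coalesce(concat('D', r.Id), concat('D',c.revisionID), "
     "concat('D', d.revisionID)) as revision_id ,d.Id as diff_id")
tokens = split_top_level(s)
print(len(tokens))  # 3
print(tokens)
```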
-1

Try the regex

[^()]*\([^()]*\),?

code:

>>> import re
>>> x = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.findall(r"[^()]*\([^()]*\),?", x)
['Titanium Dioxide (CI 77897),', ' Black 2 (CI 77266),', ' Iron Oxides (CI 77491, 77492, 77499),', ' Ultramarines (CI 77007)']

See how the regex works: http://regex101.com/r/pS9oV3/1

nu11p01n73R
-1

Using regex, this can be done easily with the findall function.

import re
s = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
re.findall(r"\w.*?\(.*?\)", s) # returns what you want

Use http://www.regexr.com/ if you want to understand regex better, and here is the link to the Python documentation: https://docs.python.org/2/library/re.html

EDIT: I modified the regex to also accept items without parentheses: \w[^,(]*(?:\(.*?\))?

asimoneau
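
As a small sketch of the edited pattern in use, here it is run against the question's full string, which includes "Water" with no parentheses. The names `item_pattern` and `items` are illustrative only. Like the other single-level regexes in this thread, it is not expected to handle nested parentheses.

```python
import re

# Edited pattern from the answer above: a word character, then anything up to
# a comma or "(", optionally followed by one parenthesized group.
item_pattern = re.compile(r"\w[^,(]*(?:\(.*?\))?")

s = ("Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), "
     "Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)")

items = item_pattern.findall(s)
print(items)
```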