How to split by commas that are not within parentheses?

Question

Say I have a string like this, where items are separated by commas but there may also be commas within items that have parenthesized content:

(EDIT: Sorry, forgot to mention that some items may not have parenthesized content)

"Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

How can I split the string only on those commas that are NOT within parentheses? I.e.:

["Water", "Titanium Dioxide (CI 77897)", "Black 2 (CI 77266)", "Iron Oxides (CI 77491, 77492, 77499)", "Ultramarines (CI 77007)"]

I think I'd have to use a regex, perhaps something like this:

([(]?)(.*?)([)]?)(,|$)

but I'm still trying to make it work.

Can you show what you have attempted so far? – C.B. Oct 29 '14 at 14:50

Avinash Raj · Accepted Answer · 2014-10-29T15:27:09.757

36

Use a negative lookahead to match only the commas that are not inside parentheses. Splitting the input string on the matched commas gives the desired output.

,\s*(?![^()]*\))

DEMO

>>> import re
>>> s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.split(r',\s*(?![^()]*\))', s)
['Water', 'Titanium Dioxide (CI 77897)', 'Black 2 (CI 77266)', 'Iron Oxides (CI 77491, 77492, 77499)', 'Ultramarines (CI 77007)']
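
For readers who find the one-line pattern hard to parse, here is the same regex written with `re.VERBOSE` so each piece carries a comment. This is only an annotated sketch of the pattern above; the names `TOP_LEVEL_COMMA` and `split_items` are made up for illustration, and the single-level-parentheses limitation still applies.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE for readability.
# TOP_LEVEL_COMMA and split_items are illustrative names, not from the answer.
TOP_LEVEL_COMMA = re.compile(
    r"""
    ,           # a comma
    \s*         # plus any whitespace that follows it
    (?!         # provided that what comes next is NOT
      [^()]*    #   a run of characters other than parentheses
      \)        #   followed by a closing parenthesis
    )
    """,
    re.VERBOSE,
)


def split_items(text):
    """Split text on commas that are not followed by an unopened ')'."""
    return TOP_LEVEL_COMMA.split(text)


items = split_items(
    "Water, Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
)
print(items)
```

The lookahead only inspects the text after the comma: if a `)` turns up before any `(`, the comma is assumed to sit inside a parenthesized group and is skipped.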

edited Oct 29 '14 at 15:27

answered Oct 29 '14 at 15:19

Avinash Raj

regex101.com strikes again! :) (I just commented [here](https://stackoverflow.com/questions/29361673/python-re-sub-how-to-use-it#comment90091512_29361707) about it too an hour ago) – Shadi Jul 27 '18 at 17:17
I have a similar problem but this doesn't work for me because there are inner parentheses. For example, "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492(w), 77499), Ultramarines (CI 77007)" – maynak Feb 19 '19 at 14:45
This doesn't work for matching parenthesis however, try this: `s="b.buildPlanPHID,coalesce(concat('D', r.Id), concat('D',c.revisionID), concat('D', d.revisionID)) as revision_id ,d.Id as diff_id"` which should break it into 3 tokens, but it creates more. – Allen Wang May 23 '19 at 02:59
Yep, this won't work on strings that contain parentheses nested more than one level deep. – Avinash Raj Jun 25 '20 at 02:34
Was searching for a while and this is the only regex solution that worked for me – TLC Aug 24 '21 at 17:02

Vishnu Upadhyay · Answer 2 · 2014-10-30T06:04:30.080

1

You can just do it using str.replace and str.split. You may use any delimiter that does not occur in the string to replace ),.

a = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
a = a.replace('),', ')//').split('//')
print(a)

Output:

['Titanium Dioxide (CI 77897)', ' Black 2 (CI 77266)', ' Iron Oxides (CI 77491, 77492, 77499)', ' Ultramarines (CI 77007)']

edited Oct 30 '14 at 06:04

answered Oct 29 '14 at 14:59

Vishnu Upadhyay

Where is the string `water` ? – Avinash Raj Oct 29 '14 at 15:23
@AvinashRaj ohh! i just missed it in my string. – Vishnu Upadhyay Oct 30 '14 at 05:41
This solution does not split off items that do not end on a parenthesis (like `Water` in the example), so the string is incorrectly split. – Victor Le Pochat Dec 04 '19 at 10:28

Marcin · Answer 3 · 2020-11-24T23:14:46.633

I believe I have a simpler regexp for this:

import re

rx_comma = re.compile(r",(?![^(]*\))")
result = rx_comma.split(string_to_split)

Explanation of the regexp:

Match , that:
Is NOT followed by:
- A list of characters ending with ), where:
- A list of characters between , and ) does not contain (

It will not work with nested parentheses, like a,b(c,d(e,f)). If you need that, a possible solution is to go through the result of the split and, whenever a string has an opening parenthesis without a matching closing one, merge it with the strings that follow :), like:

"a"
"b(c" <- no closing, merge this
"d(e" <- no closing, merge this
"f))"

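The split-then-merge idea described above could be sketched roughly as follows. This is an assumption about how such a merge might look, not code from the answer: the function name `split_outside_parens` is hypothetical, it splits on every comma first and then glues pieces back together until the parentheses balance, and it only tracks round parentheses.

```python
def split_outside_parens(text):
    """Split on every comma, then merge pieces until '(' and ')' balance.

    Hypothetical helper illustrating the merge idea; assumes parentheses
    in the input are properly matched.
    """
    parts = []
    pending = []  # pieces collected while a parenthesis is still open
    depth = 0     # number of currently unclosed '('

    for piece in text.split(","):
        pending.append(piece)
        depth += piece.count("(") - piece.count(")")
        if depth == 0:
            # Balanced again: restore the commas removed by split().
            parts.append(",".join(pending).strip())
            pending = []

    if pending:
        # Unbalanced input: keep the leftover rather than dropping it.
        parts.append(",".join(pending).strip())
    return parts


print(split_outside_parens("a,b(c,d(e,f))"))
```

Because it counts parentheses instead of relying on a lookahead, this handles any nesting depth, at the cost of an extra pass over the pieces.
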
score 0 · Answer 4 · answered Mar 18 '21 at 21:33

This version seems to work with nested parentheses, brackets ([] or <>), and braces:

def split_top(string, splitter, openers="([{<", closers=")]}>", whitespace=" \n\t"):
    ''' Splits strings at occurrence of 'splitter' but only if not enclosed by brackets.
        Removes all whitespace immediately after each splitter.
        This assumes brackets, braces, and parens are properly matched - may fail otherwise '''

    outlist = []
    outstring = []

    depth = 0

    for c in string:
        if c in openers:
            depth += 1
        elif c in closers:
            depth -= 1

            if depth < 0:
                raise SyntaxError()

        if not depth and c == splitter:
            outlist.append("".join(outstring))
            outstring = []
        else:
            if len(outstring):
                outstring.append(c)
            elif c not in whitespace:
                outstring.append(c)

    outlist.append("".join(outstring))

    return outlist

Use it like this:

s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

split = split_top(s, ",") # splits on commas

It's probably not the fastest thing ever, I know.

score -1 · Answer 5 · answered Oct 29 '14 at 14:57

Try the regex

[^()]*\([^()]*\),?

code:

>>> import re
>>> x = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.findall(r"[^()]*\([^()]*\),?", x)
['Titanium Dioxide (CI 77897),', ' Black 2 (CI 77266),', ' Iron Oxides (CI 77491, 77492, 77499),', ' Ultramarines (CI 77007)']

See how the regex works: http://regex101.com/r/pS9oV3/1

asimoneau · Answer 6 · 2014-10-29T16:24:32.637

Using regex, this can be done easily with the findall function.

import re
s = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
re.findall(r"\w.*?\(.*?\)", s) # returns what you want

Use http://www.regexr.com/ if you want to understand regex better, and here is the link to the Python documentation: https://docs.python.org/2/library/re.html

EDIT: I modified the regex string to accept content without parentheses: \w[^,(]*(?:\(.*?\))?
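
As a quick check, the modified pattern can be run against the full string from the question, including the `Water` item that has no parentheses. This is only a sketch; `ITEM_RE` is an illustrative name, and the pattern assumes each item starts with a word character and contains at most one non-nested parenthesized group.

```python
import re

# The modified pattern from the EDIT above; ITEM_RE is an illustrative name.
# \w          first character of an item
# [^,(]*      the rest of the item name, up to a comma or '('
# (?:\(.*?\))? an optional parenthesized group (non-nested)
ITEM_RE = re.compile(r"\w[^,(]*(?:\(.*?\))?")

s = ("Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), "
     "Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)")

items = ITEM_RE.findall(s)
print(items)
```

Since the group is non-capturing, `findall` returns the whole match for each item rather than just the parenthesized part.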
