Split string into list if separator is not enclosed

Question

I'm working on a simple wiki engine, and I am wondering if there is an efficient way to split a string into a list based on a separator, but only if that separator is not enclosed with double square brackets or double curly brackets.

So, a string like this:

"|Row 1|[[link|text]]|{{img|altText}}|"

Would get converted to a list like this:

['Row 1', '[[link|text]]', '{{img|altText}}']
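
For illustration, a plain `str.split` shows why the separator needs context. This is only a sketch of the problem, using the example string above; the variable names are placeholders:

```python
row = "|Row 1|[[link|text]]|{{img|altText}}|"

# str.split treats every "|" as a separator, including the ones
# inside [[...]] and {{...}}, so links and images get cut in half.
naive = row.split("|")
print(naive)
# ['', 'Row 1', '[[link', 'text]]', '{{img', 'altText}}', '']
```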

EDIT: Removed the spaces from the example string, since they were causing confusion.

In your example, you use the separator `'|'` within the double curly/square brackets and at the beginning/end of the string, and the separator `' | '` otherwise. Are the spaces part of the separator or can you not assume anything about them? — mdml, Oct 05 '13 at 21:14
There are tons of MediaWiki parsers out there for Python: http://www.mediawiki.org/wiki/Alternative_parsers — Blender, Oct 05 '13 at 21:14
@mtitan8: The spaces were only added for readability, and I've removed them. — Zauber Paracelsus, Oct 05 '13 at 21:21
And I'm not writing a MediaWiki parser, I'm writing a customized parser for Creole because CreoleParser does not handle utf-8 gracefully, due to its dependency on Genshi. — Zauber Paracelsus, Oct 05 '13 at 21:23
I tried to adapt the regexp in http://stackoverflow.com/questions/4780728/regex-split-string-preserving-quotes to the bracket quoting here, but I can't get it to `re.compile()` with `re.VERBOSE`. — Erik Kaplun, Oct 05 '13 at 21:36
Since there is an empty string before the first `|` and after the last `|`, the result would be `['', 'Row 1', '[[link|text]]', '{{img|altText}}', '']`, wouldn't it? — Tim Pietzcker, Oct 05 '13 at 22:01

score 3 · Accepted Answer · answered Oct 05 '13 at 22:10

You can use `re.split` with two negative lookaheads:

import re

def split_special(subject):
    return re.split(r"""
        \|           # Match |
        (?!          # only if it's not possible to match...
         (?:         # the following non-capturing group:
          (?!\[\[)   # that doesn't contain two square brackets
          .          # but may otherwise contain any character
         )*          # any number of times,
         \]\]        # followed by ]]
        )            # End of first lookahead. Now the same thing for braces:
        (?!(?:(?!\{\{).)*\}\})""",
        subject, flags=re.VERBOSE)

Result:

>>> s = "|Row 1|[[link|text|df[sdfl|kj]|foo]]|{{img|altText|{|}|bar}}|"
>>> split_special(s)
['', 'Row 1', '[[link|text|df[sdfl|kj]|foo]]', '{{img|altText|{|}|bar}}', '']

Note the leading and trailing empty strings - they need to be there because they do exist before your first and after your last | in the test string.
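
If every row is known to start and end with the separator, the two outer empty strings can be dropped after splitting. A minimal sketch, assuming the same lookahead pattern as above written in compact form; `SEPARATOR` and `split_row` are hypothetical names, not part of the answer:

```python
import re

# Compact form of the pattern above: a "|" that is not followed by "]]"
# (with no "[[" in between) and not followed by "}}" (with no "{{" in between).
SEPARATOR = re.compile(r"\|(?!(?:(?!\[\[).)*\]\])(?!(?:(?!\{\{).)*\}\})")

def split_row(row):
    """Split a row and drop the empty cells produced by the outer pipes."""
    cells = SEPARATOR.split(row)
    if cells and cells[0] == "":
        cells = cells[1:]      # empty string before the first "|"
    if cells and cells[-1] == "":
        cells = cells[:-1]     # empty string after the last "|"
    return cells

print(split_row("|Row 1|[[link|text]]|{{img|altText}}|"))
# ['Row 1', '[[link|text]]', '{{img|altText}}']
```

Only the outermost empty strings are removed, so an intentionally empty cell in the middle of a row (as in `"|a||b|"`) is kept.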

Works perfectly, thanks! Also, I was aware of the leading and trailing empty strings in the list, and was actually pulling them prior to parsing. — Zauber Paracelsus, Oct 05 '13 at 22:28

score 1 · Answer 2 · answered Oct 05 '13 at 23:25

Tim's expression is elaborate, but you can usually greatly simplify "split" expressions by converting them to "match" ones:

import re
s = "|Row 1|[[link|text|df[sdfl|kj]|foo]]|{{img|altText|{|}|bar}}|"

print(re.findall(r'\[\[.+?\]\]|{{.+?}}|[^|]+', s))

# ['Row 1', '[[link|text|df[sdfl|kj]|foo]]', '{{img|altText|{|}|bar}}']
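
The "match" form is not a drop-in replacement for the "split" form in every case. The sketch below uses the same pattern on two inputs that are not in the question (both are assumptions for illustration) to show where the results can differ:

```python
import re

pattern = r'\[\[.+?\]\]|{{.+?}}|[^|]+'

# Empty cells vanish, because [^|]+ needs at least one character.
empty_cell = re.findall(pattern, "a||b")
print(empty_cell)
# ['a', 'b']

# A cell with text before the link is cut at the inner "|": the match
# starts at "s", where only the [^|]+ alternative can apply.
mixed_cell = re.findall(pattern, "|see [[link|text]]|x|")
print(mixed_cell)
# ['see [[link', 'text]]', 'x']
```

If cells can be empty or can mix plain text with links, the lookahead-based split is the safer choice.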

score -2 · Answer 3 · answered Oct 05 '13 at 21:13

Is it possible to have input like `Row 1|[`? If the separator is always surrounded by spaces, as in your example above, you can do

s.split(" | ")

@Tommy: how about `"Bla|Bla|Bla"`? – Erik Kaplun Oct 05 '13 at 21:20
@ErikAllik: notice that in my answer I *asked* whether it was possible to have `|` with no surrounding spaces. Moreover, my answer said "if" it's always surrounded by spaces, so I did not attempt to handle your `"Bla|Bla|Bla"` case. – Tommy Oct 05 '13 at 22:44
