I'm working on a simple wiki engine, and I'm wondering whether there is an efficient way to split a string into a list on a separator, but only where that separator is not enclosed in double square brackets or double curly braces.

So, a string like this:

"|Row 1|[[link|text]]|{{img|altText}}|"

Would get converted to a list like this:

['Row 1', '[[link|text]]', '{{img|altText}}']

EDIT: Removed the spaces from the example string, since they were causing confusion.

BenMorel
Zauber Paracelsus
  • In your example, you use the separator `'|'` within the double curly/square brackets and at the beginning/end of the string, and the separator `' | '` otherwise. Are the spaces part of the separator or can you not assume anything about them? – mdml Oct 05 '13 at 21:14
  • There are tons of MediaWiki parsers out there for Python: http://www.mediawiki.org/wiki/Alternative_parsers – Blender Oct 05 '13 at 21:14
  • @mtitan8: The spaces were only added for readability, and I've removed them. – Zauber Paracelsus Oct 05 '13 at 21:21
  • And I'm not writing a MediaWiki parser, I'm writing a customized parser for Creole because CreoleParser does not handle utf-8 gracefully, due to its dependency on Genshi. – Zauber Paracelsus Oct 05 '13 at 21:23
  • Tried to figure out how to adapt the regexp in http://stackoverflow.com/questions/4780728/regex-split-string-preserving-quotes to the quoting here but failing to `re.compile()` with `re.VERBOSE`. – Erik Kaplun Oct 05 '13 at 21:36
  • Since there is an empty string before the first `|` and after the last `|`, the result would be `['', 'Row 1', '[[link|text]]', '{{img|altText}}', '']`, wouldn't it? – Tim Pietzcker Oct 05 '13 at 22:01
  • Aye, though that's taken care of prior to parsing. – Zauber Paracelsus Oct 05 '13 at 22:29

3 Answers


You can use re.split() with two negative lookaheads:

import re

def split_special(subject):
    return re.split(r"""
        \|           # Match |
        (?!          # only if it's not possible to match...
         (?:         # the following non-capturing group:
          (?!\[\[)   # that doesn't contain two square brackets
          .          # but may otherwise contain any character
         )*          # any number of times,
         \]\]        # followed by ]]
        )            # End of first lookahead. Now the same thing for braces:
        (?!(?:(?!\{\{).)*\}\})""",
        subject, flags=re.VERBOSE)

Result:

>>> s = "|Row 1|[[link|text|df[sdfl|kj]|foo]]|{{img|altText|{|}|bar}}|"
>>> split_special(s)
['', 'Row 1', '[[link|text|df[sdfl|kj]|foo]]', '{{img|altText|{|}|bar}}', '']

Note the leading and trailing empty strings - they need to be there because they do exist before your first and after your last | in the test string.
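
If you would rather not see those empty strings at all, one option is to drop the outer pipes before splitting. The following is only a sketch: `CELL_SEPARATOR` and `split_row` are hypothetical names, and it assumes a row has at most one leading and one trailing `|` to remove.

```python
import re

# Same pattern as above, compiled once so it can be reused for every row.
CELL_SEPARATOR = re.compile(r"""
    \|                          # a pipe...
    (?!(?:(?!\[\[).)*\]\])      # ...not followed by ]] without a [[ in between
    (?!(?:(?!\{\{).)*\}\})      # ...nor by }} without a {{ in between
    """, re.VERBOSE)


def split_row(row):
    """Hypothetical helper: drop one outer pipe on each side, then split."""
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|"):
        row = row[:-1]
    return CELL_SEPARATOR.split(row)


print(split_row("|Row 1|[[link|text]]|{{img|altText}}|"))
# ['Row 1', '[[link|text]]', '{{img|altText}}']
```

Removing exactly one pipe per side, rather than calling `row.strip("|")`, keeps genuinely empty cells at the edges of a row.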

Tim Pietzcker
  • Works perfectly, thanks! Also, I was aware of the leading and trailing empty strings in the list, and was actually pulling them prior to parsing. – Zauber Paracelsus Oct 05 '13 at 22:28

Tim's expression is elaborate, but you can usually greatly simplify "split" expressions by converting them to "match" ones:

import re
s = "|Row 1|[[link|text|df[sdfl|kj]|foo]]|{{img|altText|{|}|bar}}|"

print(re.findall(r'\[\[.+?\]\]|\{\{.+?\}\}|[^|]+', s))

# ['Row 1', '[[link|text|df[sdfl|kj]|foo]]', '{{img|altText|{|}|bar}}']
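
One caveat with the match-based pattern: a cell that mixes plain text with a link, such as `intro [[link|text]] outro`, is still cut at the inner `|`, because `[^|]+` starts matching before the `[[` is reached. A possible variant, offered only as a sketch (`CELL` and `find_cells` are made-up names), wraps the alternatives in a repeated non-capturing group so that one cell can contain any mix of bracketed spans and ordinary characters:

```python
import re

# Assumed variant: a cell is one or more of "a [[...]] span", "a {{...}} span",
# or any single character other than the separator.
CELL = re.compile(r"(?:\[\[.+?\]\]|\{\{.+?\}\}|[^|])+")


def find_cells(row):
    """Return the non-empty cells of a row (hypothetical helper)."""
    return CELL.findall(row)


print(find_cells("|intro [[link|text]] outro|{{img|altText}}|"))
# ['intro [[link|text]] outro', '{{img|altText}}']
```

Like the `findall` call above, this skips empty cells entirely, so it is not a drop-in replacement for a split if empty cells matter to you.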
georg

Is it possible to have `Row 1|[`, that is, a separator with no surrounding spaces? If the separator is always surrounded by spaces, as in your example above, you can do:

s.split(" | ")
Tommy
  • @Tommy: how about `"Bla|Bla|Bla"`? – Erik Kaplun Oct 05 '13 at 21:20
  • @ErikAllik notice that in my answer I *asked* if it was possible to have | with no surrounding spaces. Moreover, my answer said "if" it's always surrounded by spaces, so I did not attempt to handle your blah case. – Tommy Oct 05 '13 at 22:44