
I'm trying to split a string on a variety of delimiter characters while keeping each delimiter as its own array element. For example, say I want to split the string:

if (x>1) return x * fact(x-1);

using '(', '>', ')', '*', '-', ';' and '\s' as delimiters. I want the output to be the following string array: {"if", "(", "x", ">", "1", ")", "return", "x", "*", "fact", "(", "x", "-", "1", ")", ";"}

The regex I'm using so far is split("(?=(\\w+(?=[\\s\\+\\-\\*/<(<=)>(>=)(==)(!=)=;,\\.\"\\(\\)\\[\\]\\{\\}])))")

which splits at each word character regardless of whether it is followed by one of the delimiters. For example

test + 1

outputs {"t","e","s","t+","1"} instead of {"test+", "1"}

Why does it split at each character even if that character is not followed by one of my delimiters? Also, is a regex that does this even possible in Java? Thank you.

Paul Bellora
user1731199
  • possible duplicate of [Is there a way to split strings with String.split() and include the delimiters?](http://stackoverflow.com/questions/275768/is-there-a-way-to-split-strings-with-string-split-and-include-the-delimiters) and [How to split a string, but also keep the delimiters?](http://stackoverflow.com/questions/2206378/how-to-split-a-string-but-also-keep-the-delimiters) – Paul Bellora Nov 14 '12 at 05:49

3 Answers


Well, you can use lookaround to split at points between characters without consuming the delimiters:

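(?<=[()>*;\s-])|(?=[()>*;\s-])

This creates a split point before and after each delimiter character. The hyphen goes last in the character class so that it is a literal -; written as *-; it would be a range from * to ; that also covers the digits, so a number like 10 would be split into 1 and 0. You might need to remove superfluous whitespace elements from the resulting array, though.

Quick PowerShell test (| marks the split points):

PS Home:\> 'if (x>1) return x * fact(x-1);' -split '(?<=[()>*;\s-])|(?=[()>*;\s-])' -join '|'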
if| |(|x|>|1|)| |return| |x| |*| |fact|(|x|-|1|)|;|
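
Since the question is about Java, here is a minimal sketch of the same lookaround split with String.split, followed by a pass that drops the whitespace-only elements. The class name SplitKeepDelims and the helper tokenize are made up for illustration, and the character class places the hyphen last so that it is a literal - rather than a range.

```java
import java.util.ArrayList;
import java.util.List;

class SplitKeepDelims {
    // Zero-width split points: right after a delimiter, or right before one.
    // The hyphen is last in each class so it is a literal '-'.
    private static final String SPLIT_POINTS = "(?<=[()>*;\\s-])|(?=[()>*;\\s-])";

    // Hypothetical helper: split at the lookaround points, then discard
    // the elements that consist only of whitespace.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.split(SPLIT_POINTS)) {
            if (!piece.trim().isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [if, (, x, >, 1, ), return, x, *, fact, (, x, -, 1, ), ;]
        System.out.println(tokenize("if (x>1) return x * fact(x-1);"));
    }
}
```

Filtering after the split keeps the regex simple; folding the whitespace handling into the pattern is possible but makes it harder to read.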
Joey

How about this pattern?

(\w+)|([\p{P}\p{S}])
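
This pattern describes the tokens themselves rather than the gaps between them, so presumably it is meant to be used with a Matcher and find() instead of split(). A rough sketch under that assumption follows; the class name MatchTokens and the helper tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTokens {
    // A token is either a run of word characters or a single
    // punctuation/symbol character; whitespace matches neither and is skipped.
    private static final Pattern TOKEN = Pattern.compile("(\\w+)|([\\p{P}\\p{S}])");

    // Hypothetical helper: collect every match in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [if, (, x, >, 1, ), return, x, *, fact, (, x, -, 1, ), ;]
        System.out.println(tokenize("if (x>1) return x * fact(x-1);"));
    }
}
```

Note that each punctuation or symbol character becomes its own token, so a two-character operator such as >= would come out as > and =.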
Sina Iravanian

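To answer your question, "Why?", it's because your entire expression is a lookahead assertion. A lookahead matches a position between characters and consumes nothing, so split() cuts at every position where the assertion is true. "One or more word characters followed by a delimiter" is true before every character of a word, not just before its first, so the word is cut apart letter by letter.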
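
The effect can be reproduced with a cut-down version of the pattern. The sketch below is an illustration only: the shortened delimiter class [\s+], the class name LookaheadSplitDemo, and the helper splitOnLookahead are not from the question, and the result assumes Java 8 or later, where a zero-width match at index 0 does not produce a leading empty string.

```java
import java.util.Arrays;
import java.util.List;

class LookaheadSplitDemo {
    // Simplified stand-in for the question's regex: a lookahead for
    // "word characters followed by whitespace or '+'".
    private static final String LOOKAHEAD_ONLY = "(?=\\w+(?=[\\s+]))";

    // Hypothetical helper that returns the split result as a list.
    static List<String> splitOnLookahead(String input) {
        return Arrays.asList(input.split(LOOKAHEAD_ONLY));
    }

    public static void main(String[] args) {
        // The assertion holds before 't', 'e', 's' and the second 't',
        // because each position is followed by word characters and then a space.
        // prints [t, e, s, t + 1]
        System.out.println(splitOnLookahead("test + 1"));
    }
}
```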

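Also, you cannot group within character classes. Inside [...] the parentheses are literal characters, so (<=) does not mean "the operator <="; it just adds (, <, = and ) to the class as four separate characters.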
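
A small check makes the difference visible. This is a sketch with made-up names (CharClassDemo, classMatches, alternationMatches); the alternation shown is one possible way to express multi-character operators, not necessarily what the asker needs.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Inside a character class, '(' '<' '=' ')' are four literal members,
    // and the class as a whole matches exactly one character.
    static boolean classMatches(String s) {
        return Pattern.matches("[(<=)]", s);
    }

    // Multi-character operators need alternation outside the class,
    // with the longer alternatives listed first.
    static boolean alternationMatches(String s) {
        return Pattern.matches("<=|>=|==|!=|[<>=]", s);
    }

    public static void main(String[] args) {
        System.out.println(classMatches("("));        // true: '(' is a member of the class
        System.out.println(classMatches("<="));       // false: the class matches a single character
        System.out.println(alternationMatches("<=")); // true
    }
}
```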

slackwing