I'm trying to write a regular expression for Java that matches if there is a semicolon that does not have two (or more) leading '-' characters.

I'm only able to get the opposite working: A semicolon that has at least two leading '-' characters.

([\-]{2,}.*?;.*)

But I need something like

([^([\-]{2,})])*?;.*

I'm somehow not able to express 'not at least two - characters'.

Here are some examples I need to evaluate with the expression:

; -- a           : should match
-- a ;           : should not match
-- ;             : should not match
--;              : should not match
-;-              : should match
---;             : should not match
-- semicolon ;   : should not match
bla ; bla        : should match
bla              : should not match (; is mandatory)
-;--;            : should match (the first occurring semicolon must not have two or more consecutive leading '-')
Ben Voigt
Richard
  • How many semicolons can be in string? Is string like `-;--;` correct? – Pshemo Jul 21 '14 at 15:04
  • Also do we want to forbid only leading `-`? What about strings like `x--;`? – Pshemo Jul 21 '14 at 15:06
  • @Pshemo The first one has to match (I updated my question accordingly). The second one must not match, just to keep things simple. Otherwise I would need to write a complete parser, and that's not the intention of my small application. – Richard Jul 21 '14 at 15:13

5 Answers

It seems that this regex matches what you want:

String regex = "[^-]*(-[^-]+)*-?;.*";

Explanation: matches() will accept a string that:

  • [^-]* can start with any number of non-dash characters
  • (-[^-]+)*-?; is a bit tricky: before we match ; we need to make sure that no - is immediately followed by another -, so:
    • (-[^-]+)* each - has at least one non-dash character after it
    • -? or a single - is placed right before ;
  • ;.* if the earlier conditions were fulfilled, we can accept ; and any characters after it (.*).
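
As a quick way to check the pattern against the examples from the question, here is a minimal sketch. The class name DashSemicolonDemo and the helper accepts() are made up for illustration; only the regex itself comes from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex string is taken from the answer.
class DashSemicolonDemo {
    // No "--" may appear before the semicolon that gets matched.
    static final Pattern NO_DOUBLE_DASH = Pattern.compile("[^-]*(-[^-]+)*-?;.*");

    static boolean accepts(String line) {
        // matches() requires the whole input to match the pattern.
        return NO_DOUBLE_DASH.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Sample inputs copied from the question.
        String[] samples = {"; -- a", "-- a ;", "-- ;", "--;", "-;-", "---;",
                            "-- semicolon ;", "bla ; bla", "bla", "-;--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

Precompiling the Pattern once is a design choice for repeated use; String.matches(regex) with the same regex string should behave the same way for a one-off check.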

A more readable version, though probably a little slower:

((?!--)[^;])*;.*

Explanation:

To make sure that the string contains a ; we can use .*;.* with matches().
But we need to add some conditions on the characters before the first ;.

So to make sure that the matched ; is the first one, we can write the regex as

[^;]*;.*

which means:

  • [^;]* zero or more non semicolon characters
  • ; first semicolon
  • .* zero or more of any characters (actually . can't match line separators like \n or \r)

So now all we need to do is make sure that the character matched by [^;] is not part of --. To do so we can use look-around mechanisms, for instance:

  • (?!--)[^;] before matching [^;], (?!--) checks that the next two characters are not --; in other words, the character matched by [^;] can't be the first - in a -- pair
  • [^;](?<!--) after matching [^;], (?<!--) checks that the two characters just consumed are not --; in other words, the character matched by [^;] can't be the last - in a -- pair.
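
The two look-around fragments above can be compared side by side with a small sketch. The lookahead regex is the one given in this answer; the full lookbehind regex ([^;](?<!--))*;.* is my assumption of how the second fragment would be plugged into the same shape. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the lookahead and lookbehind variants.
class LookaroundDemo {
    // From the answer: each character before the first ';' must not start a "--".
    static final Pattern LOOKAHEAD = Pattern.compile("((?!--)[^;])*;.*");
    // Assumed composition: each character before the first ';' must not end a "--".
    static final Pattern LOOKBEHIND = Pattern.compile("([^;](?<!--))*;.*");

    static boolean viaLookahead(String line) {
        return LOOKAHEAD.matcher(line).matches();
    }

    static boolean viaLookbehind(String line) {
        return LOOKBEHIND.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"; -- a", "--;", "-;-", "---;", "bla ; bla", "bla", "-;--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" lookahead=" + viaLookahead(s)
                    + " lookbehind=" + viaLookbehind(s));
        }
    }
}
```

Both variants stop consuming at the first semicolon because of [^;], so the look-around only ever inspects the text in front of it.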
Pshemo

You need a negative lookahead!

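This regex rejects any string that starts with your original match pattern:

(?!-{2,}.*?;.*).*?;.*

It matches a string which contains a semicolon, as long as the string does not begin with two or more dashes followed later by a semicolon. Note that the lookahead is evaluated only once, at the start of the match, so a -- that appears later in the string (as in x--;) is not rejected.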
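
To see where this lookahead does and does not apply, here is a minimal sketch using matches(). The class name StartOnlyLookaheadDemo and the helper accepts() are hypothetical; the regex is the one from this answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the regex string is taken from this answer.
class StartOnlyLookaheadDemo {
    static final Pattern PATTERN = Pattern.compile("(?!-{2,}.*?;.*).*?;.*");

    static boolean accepts(String line) {
        // With matches(), the negative lookahead is tested at index 0 only.
        return PATTERN.matcher(line).matches();
    }

    public static void main(String[] args) {
        // The first four come from the question; "x--;" comes from the comments.
        String[] samples = {"; -- a", "-- a ;", "---;", "-;--;", "x--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

The inputs listed in the question all behave as requested, but x--; is accepted, which the asker's comment says must not match.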

Adam Yost

How about using this regex in Java:

[^;]*;(?<!--[^;]{0,999};).*

The only caveat is that it works only when there are at most 999 characters between -- and ;
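
A minimal sketch of how this bounded lookbehind could be exercised from Java. The class name BoundedLookbehindDemo and the helper accepts() are hypothetical; the regex is the one given above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the regex string is taken from this answer.
class BoundedLookbehindDemo {
    // After consuming the first ';', look back for "--" followed by up to 999 non-';' characters.
    static final Pattern PATTERN = Pattern.compile("[^;]*;(?<!--[^;]{0,999};).*");

    static boolean accepts(String line) {
        return PATTERN.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"; -- a", "-- a ;", "--;", "-;-", "bla ; bla", "bla", "-;--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

The {0,999} bound is there because Java's lookbehind needs a pattern with a known maximum length, so an unbounded [^;]* is not an option inside it.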

anubhava
  • Thanks, @anubhava. I would prefer a solution without groups. But in general your solution works. – Richard Jul 21 '14 at 15:16
  • Hi anubhava, I have created a question almost dedicated to you! lol. I really like that technique. Could you check it? http://stackoverflow.com/questions/24808793/regex-technique-to-disallow-variable-length-lookbehind-using-or/ – Federico Piazza Jul 21 '14 at 15:21

How about just splitting the string along -- and if there are two or more sub strings, checking if the last one contains a semicolon?
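
One way the splitting idea could look in Java is sketched below. Note that this is a variation, not a literal transcription: it checks the part before the first -- for a semicolon, which is my reading of the examples in the question (for instance -;--; must match). The class and method names are hypothetical.

```java
// Hypothetical demo class for a non-regex-matching variation of the splitting idea.
class SplitCheckDemo {
    static boolean semicolonBeforeDoubleDash(String line) {
        // Limit 2 keeps everything after the first "--" in one piece,
        // so element 0 is exactly the text in front of the first "--".
        String beforeFirstDoubleDash = line.split("--", 2)[0];
        return beforeFirstDoubleDash.contains(";");
    }

    public static void main(String[] args) {
        String[] samples = {"; -- a", "-- a ;", "--;", "-;-", "bla ; bla", "bla", "-;--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + semicolonBeforeDoubleDash(s));
        }
    }
}
```

If there is no -- at all, split returns the whole line as the only element, so the check reduces to "does the line contain a semicolon".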

Edwin Buck
  • I would prefer a solution where I can call just .matches(), because all other statements in my class work this way. Just for reading purposes. – Richard Jul 21 '14 at 15:18
  • @RichardW. That's fine; however, if it is for readability's sake, some of these answers are far less readable (if you care about _understandability_) than two or three calls which make the task obvious. – Edwin Buck Jul 21 '14 at 15:56

I think this is what you're looking for:

^(?:(?!--).)*;.*$

In other words, match from the start of the string (^), zero or more characters (.*) followed by a semicolon. But replacing the dot with (?:(?!--).) causes it to match any character unless it's the beginning of a two-hyphen sequence (--).

If performance is an issue, you can exclude the semicolon as well, so it never has to backtrack:

^(?:(?!--|;).)*;.*$

EDIT: I just noticed your comment that the regex should work with the matches() method, so I padded it out with .*. The anchors aren't really necessary, but they do no harm.
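
For reference, a minimal sketch of calling both patterns through matches(). The class name TemperedDotDemo and its two helpers are hypothetical; the regex strings are the two given in this answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; both regex strings are taken from this answer.
class TemperedDotDemo {
    // Any character is allowed before the ';' unless it starts a "--".
    static final Pattern BASIC = Pattern.compile("^(?:(?!--).)*;.*$");
    // Same idea, but the repeated part also stops at the first ';'.
    static final Pattern NO_BACKTRACK = Pattern.compile("^(?:(?!--|;).)*;.*$");

    static boolean acceptsBasic(String line) {
        return BASIC.matcher(line).matches();
    }

    static boolean acceptsNoBacktrack(String line) {
        return NO_BACKTRACK.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"; -- a", "-- a ;", "--;", "-;-", "bla ; bla", "bla", "-;--;"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" basic=" + acceptsBasic(s)
                    + " noBacktrack=" + acceptsNoBacktrack(s));
        }
    }
}
```

Since matches() already anchors at both ends, the ^ and $ are redundant here, as the answer notes; they would matter if the same pattern were used with find().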

Alan Moore