2

Java's Matcher is the engine that performs match operations on a character sequence by interpreting a Pattern (a regular expression). This class has two well-known operations:

  • Matcher.find() which scans the input sequence looking for the next subsequence that matches the pattern.
  • Matcher.matches() which attempts to match the entire input sequence against the pattern.

In other words, find() should be used to match a substring whereas matches() should be used to match the entire input. This got me thinking that using find() with a regex like ^[a-z]+$ is equivalent to using matches() with a regex like [a-z]+, so I went ahead and tested that.

import java.util.List;
import java.util.regex.Pattern;

public class Main
{
    public static void main(String[] args) {
        Pattern sub = Pattern.compile("[a-z]+");
        Pattern all = Pattern.compile("^[a-z]+$");
        List<String> tests = List.of("", "  ", "a", "A", "abc", "a\r", 
                                     "a\r\n", "a\n", " a", "\na", "\ra\n", 
                                     "\r\na", "\na");
        for (String test : tests) {
            boolean matchesSub = sub.matcher(test).matches();
            boolean matchesAll = all.matcher(test).find();
            System.out.printf("%s\t%s\t%s", format(test), matchesSub, matchesAll);
            System.out.println();
        }
    }

    private static String format(String input) {
        return input.replace("\r", "\\r").replace("\n", "\\n");
    }
}

Which produced the following output:

        false   false
        false   false
a       true    true
A       false   false
abc     true    true
a\r     false   true
a\r\n   false   true
a\n     false   true
 a      false   false
\na     false   false
\ra\n   false   false
\r\na   false   false
\na     false   false

Interestingly enough, this test fails for a\r, a\r\n and a\n:

  • using matches() with [a-z]+ on these cases produces false. Apparently the line break at the end is counted as a character, failing the test.
  • using find() with ^[a-z]+$ on these cases produces true. Apparently the line break at the end is ignored, passing the test.

This only holds true when the line break is at the end, not at the beginning though, as \r\na is treated the same by both methods.

What's going on?

Stephen C
Martin Devillers
  • Of course the results are different, and there is no bug. `$` matches either the end of string or the location before the final `\n` which is the last char in the string. `matches("[a-z]+")` = `find("^[a-z]+\\z")` and not `find("^[a-z]+$")`. – Wiktor Stribiżew Nov 19 '21 at 13:12
  • @WiktorStribiżew There _IS_ a bug. This behaviour isn't documented. I'm tempted to say that the bug is in the docs and not in the regex impl (similar to how until very recently, `for (int x[] : arrayOfIntArrays)` compiled and ran as you expected, but the JavaLangSpec doesn't mention it - bug in the docs, not bug in `javac`, fixed now). – rzwitserloot Nov 19 '21 at 13:33
  • **Duplicate of [Whats the difference between \z and \Z in a regular expression and when and how do I use it?](https://stackoverflow.com/questions/2707870/whats-the-difference-between-z-and-z-in-a-regular-expression-and-when-and-how)** – Wiktor Stribiżew Nov 19 '21 at 13:33
  • It is not a bug and is a common regex knowledge for any Perl-originated regex flavor. – Wiktor Stribiżew Nov 19 '21 at 13:34
  • Thank you, I understand this is common Regex knowledge for you. When I was running my tests I was going by the Java doc, which makes no mention of this special trailing newline handling for `$`. Looking around online, most resources don't mention this. .NET does a better job at describing this: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference#anchors – Martin Devillers Nov 19 '21 at 16:15

2 Answers

3

^ and $ mean different things depending on which mode you're running your regexp in. See the Pattern.MULTILINE flag's javadoc.

In any case, ^ and $ never consume anything.

The way regex engines work is that every item in the regexp either matches or fails to match, and as part of matching, items usually consume characters.

Think of it as a cursor that, like your text cursor, always sits in between characters. The engine starts the cursor at the beginning of the input and walks through your regexp from left to right; each item in the pattern either matches or fails and usually, but not always, moves the cursor forward.

^ and $ can match or fail, but they cannot move the cursor. In that way they're the same as e.g. \b (matches on a 'word break') or (positive/negative) look-(ahead/behind). The relevant trickery here is that for the matches() case, every character must be consumed: the matching process must end with the cursor at the very end. Your pattern can only consume lowercase letters (it only advances the cursor over lowercase letters), so the moment your string contains any other character (even a single \r or \n, in any position), it can't possibly match; there is no way to consume these non-lowercase characters.
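To make the cursor picture concrete, here is a small sketch that reports where a match starts and ends. The class and method names are mine and purely illustrative; only Pattern and Matcher are real JDK API. With find(), $ succeeds in front of the trailing \n without consuming it, so the match ends at index 3; matches() fails on the same input because nothing in the pattern can consume the \n.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the question's code.
class AnchorWidthDemo {

    // Returns {start, end} of the first match found, or null if there is none.
    static int[] firstMatchSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        int[] span = firstMatchSpan("[a-z]+$", "abc\n");
        // The match covers "abc" only: $ matched before the \n but did not consume it.
        System.out.println(span[0] + ".." + span[1]); // prints 0..3

        // matches() needs the cursor to reach the very end, and the \n is never consumed.
        System.out.println(Pattern.compile("[a-z]+$").matcher("abc\n").matches()); // prints false
    }
}
```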

With find(), on the other hand, you don't need to consume all characters; you merely need some substring to match, that is all.

Which brings us to the question: which positions in the string count as matching ^, and which count as matching $? The answer partly depends on whether MULTILINE mode is on. It's off in your code snippet; you can turn it on by compiling your regexes with Pattern.compile(patternString, Pattern.MULTILINE), or by putting (?m) inside your regexp string. A (?xyz) group enables or disables flags from the point where it appears in your pattern string and has no other effect: it always matches and consumes nothing, which is regexp-engine-ese for 'doesn't do anything whatsoever'.
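As a minimal sketch of those two ways of switching MULTILINE on (the pattern ^b$ and the input are made up for illustration, and the class name is hypothetical), the flag argument and the inline (?m) should behave identically:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; pattern and input are illustrative only.
class MultilineFlagDemo {

    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a\nb";
        Pattern plain = Pattern.compile("^b$");                      // MULTILINE off
        Pattern flagged = Pattern.compile("^b$", Pattern.MULTILINE); // flag argument
        Pattern inline = Pattern.compile("(?m)^b$");                 // inline flag

        System.out.println(found(plain, input));   // prints false: ^ only matches at index 0
        System.out.println(found(flagged, input)); // prints true: ^ also matches after the \n
        System.out.println(found(inline, input));  // prints true: same as the flag argument
    }
}
```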

Even the UNIX_LINES flag has an effect on this: with UNIX_LINES mode on, only \n is considered a line terminator, and in MULTILINE mode ^ and $ match at every line terminator (^ just after it, $ just before it).
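UNIX_LINES also changes the non-MULTILINE case from the question. The following sketch (class and method names are hypothetical) runs a$ with find() against three endings, once with default flags and once with Pattern.UNIX_LINES; as far as I can tell, only the \n ending is still accepted under UNIX_LINES.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; MULTILINE is off in both columns.
class UnixLinesDemo {

    // True if "a$" is found in the input when compiled with the given flags.
    static boolean findsTrailingA(String input, int flags) {
        return Pattern.compile("a$", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"a\n", "a\r", "a\r\n"}) {
            boolean byDefault = findsTrailingA(input, 0);
            boolean unixLines = findsTrailingA(input, Pattern.UNIX_LINES);
            String shown = input.replace("\r", "\\r").replace("\n", "\\n");
            System.out.println(shown + "\t" + byDefault + "\t" + unixLines);
        }
        // Prints:
        // a\n     true    true
        // a\r     true    false
        // a\r\n   true    false
    }
}
```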

In MULTILINE mode, find() succeeds for every one of your examples that contains a line consisting only of lowercase letters, including the ones with a leading line break such as \na. ^ is 'true' when the cursor is at start-of-input (i.e. before the first character), or when it sits in between a line terminator and the thing that immediately follows it, as long as that thing isn't the end of the entire input. Both \r and \n count (because UNIX_LINES is off).
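A quick way to check the MULTILINE rule against some of the question's inputs is sketched below (the class and method names are hypothetical). Inputs that contain an all-lowercase line are found, even with a leading line break; inputs such as " a" or "A" still fail, because no line in them consists only of lowercase letters.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class reusing the question's pattern with MULTILINE switched on.
class MultilineQuestionInputs {

    private static final Pattern LINE = Pattern.compile("^[a-z]+$", Pattern.MULTILINE);

    static boolean findsLowercaseLine(String input) {
        return LINE.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String input : List.of("a\n", "\na", "\ra\n", "\r\na", " a", "A", "")) {
            String shown = input.replace("\r", "\\r").replace("\n", "\\n");
            System.out.println(shown + "\t" + findsLowercaseLine(input));
        }
        // The first four inputs print true; " a", "A" and "" print false.
    }
}
```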

But you're not in MULTILINE mode, so what in the blazes is going on?

What's going on is that the docs are wrong, as @MartinDevillers' excellent digging around for the relevant bug entries shows.

The docs are only slightly wrong. Specifically, the regex engine is a little more intelligent than this rather rote statement from the javadoc of the regular expression package:

By default these expressions only match at the beginning and the end of the entire input sequence.

And that's just plain hogwash. It's more intelligent than that: $ also matches when your cursor is in between a character and exactly one newline, where any of \r, \n, and \r\n counts as 'one newline', as long as that one newline is the final thing in the entire input. In other words, given the following (the spaces aren't real; I'm making room to show where cursors can be, which is only between chars, so I can put a marker below each position that matches):

" h e l l o \r \n "
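           ^     ^

The matching system considers $ matched at either of the ^ markers: just before the trailing \r\n, and at the very end of the input. It does not match in between the \r and the \n, because that pair counts as a single newline. Let's test that theory: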

Pattern p = Pattern.compile("hello$");
System.out.println(p.matcher("hello\r\n\n").find());
System.out.println(p.matcher("hello\r\n").find());
System.out.println(p.matcher("hello\r").find());
System.out.println(p.matcher("hello\n").find());
System.out.println(p.matcher("hello\n\n").find());

This prints false, true, true, true, false. The middle three all end in a character (or characters) considered 'a single newline' on at least one major OS: \n is POSIX/Unix/macOS, \r\n is Windows, and \r is classic Mac OS, which I don't think ever ran a JVM and which nobody uses anymore, but it's still considered 'a newline' by most rules, for grandfathering reasons I guess.
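If you want to see the exact cursor positions rather than just true/false, a sketch like the following lists every index at which a bare $ matches with default flags. The class and method names are hypothetical. On my reading of the Pattern javadoc, which treats \r\n as one line terminator, no match is reported in between the \r and the \n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; lists the positions where "$" matches (MULTILINE off).
class DollarPositions {

    static List<Integer> positions(String input) {
        Matcher m = Pattern.compile("$").matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start()); // zero-width match, so start() == end()
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positions("hello"));     // prints [5]
        System.out.println(positions("hello\n"));   // prints [5, 6]
        System.out.println(positions("hello\r\n")); // prints [5, 7]
        System.out.println(positions("hello\n\n")); // prints [6, 7]
    }
}
```

Index 5 is the position right after "hello", which is why hello$ is found in the second and third inputs but not in the last one.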

That's all you're missing here.

CONCLUSION:

The docs are slightly wrong, and $ is smarter than merely 'matches at very end of input'; it acknowledges that input sometimes has a stray newline hanging off the end, and $ won't get confused by this. matches(), however, will get confused by a dangling newline at the very end: it has to consume everything or it isn't considered a match.

rzwitserloot
  • Thank you so much! This was really boggling my mind, but your explanation makes so much sense. I didn't know `$` treats a dangling newline at the end specially. This also explains why having a newline at the beginning of the input is treated differently than at the end. Thank you! – Martin Devillers Nov 19 '21 at 13:36
  • Somehow I missed how you got to the conclusion that the documentation is wrong and not the implementation. Normally, it’s the other way round, if the code does something different than specified, it’s code which is wrong. – Holger Nov 19 '21 at 14:08
  • @Holger For any mismatch between docs and implementation, you never know which one is wrong, so you use your common sense. It's quite a stretch to think that the code entirely by accident ends up accepting a newline. Thus someone wrote that intentionally, probably, and it makes sense. – rzwitserloot Nov 19 '21 at 14:12
  • [To cite Ian Graves](https://bugs.openjdk.java.net/browse/JDK-8218146?focusedCommentId=14409669#comment-14409669) who works on this bug: “*The issue appears to reveal a funny interplay between `find()` and acceptable anchor modes in the regular expression matcher. … What needs further analysis, though, is if this is acceptable behavior that needs to be clarified in the docs or if this is a bug that needs to result in a behavior change.*” It doesn’t sound as confident as your claim and the mere existence of behavior is not a proof that it was intentional, as otherwise, nothing is a bug. – Holger Nov 19 '21 at 14:30
0

As @WiktorStribiżew answered in his comment, matches() with [a-z]+ is NOT equivalent to find() with ^[a-z]+$; it is, however, equivalent to find() with ^[a-z]+\\z. This is because $ treats a single trailing newline as a special case: it ignores it. \z is not so forgiving.
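The three variants can be compared side by side with a sketch like this one (class and method names are hypothetical; the inputs are taken from the question). The matches() column and the \z column should agree on every input, while the $ column differs exactly where there is a single trailing line break.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class comparing matches(), find() with $, and find() with \z.
class EndAnchorEquivalence {

    private static final Pattern WHOLE = Pattern.compile("[a-z]+");
    private static final Pattern DOLLAR = Pattern.compile("^[a-z]+$");
    private static final Pattern LOWER_Z = Pattern.compile("^[a-z]+\\z");

    static boolean viaMatches(String s) { return WHOLE.matcher(s).matches(); }
    static boolean viaFindDollar(String s) { return DOLLAR.matcher(s).find(); }
    static boolean viaFindLowerZ(String s) { return LOWER_Z.matcher(s).find(); }

    public static void main(String[] args) {
        for (String s : List.of("", "a", "abc", "a\r", "a\r\n", "a\n", "\na")) {
            String shown = s.replace("\r", "\\r").replace("\n", "\\n");
            System.out.println(shown + "\t" + viaMatches(s)
                    + "\t" + viaFindDollar(s) + "\t" + viaFindLowerZ(s));
        }
        // For a\r, a\r\n and a\n this prints false, true, false;
        // for every other input all three columns agree.
    }
}
```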

This behavior isn't documented clearly in the official Java documentation. Moreover, there's an open bug report in the JDK, currently under investigation, that deals specifically with the $ anchor, trailing newlines and the find() method. Also, judging by these other older reports, the behavior is confusing at the very least: JDK-8218146, JDK-8059325, JDK-8058923, JDK-8049849, JDK-8043255.

Finally, this behavior is not the same in all RegEx implementations:

In all major engines except JavaScript, if the string has one final line break, the $ anchor can match there. For instance, in apple\n, e$ matches the final e.

Martin Devillers
  • Curiously, from that list of bugs you mentioned only the first one is open. So, I'm not sure this justifies the argument that the conditions in the question are indeed a bug. – Edwin Dalorzo Nov 19 '21 at 13:32
  • @EdwinDalorzo all four reports were assigned to the same person who closed the first three and still insists on this behavior not to be a bug on the fourth. More interesting is [this newer bug report](https://bugs.openjdk.java.net/browse/JDK-8218146) which has been assigned to a different person and has status “in progress”. – Holger Nov 19 '21 at 13:40
  • Thank you both for your replies. I've taken out the "this is a bug!" statement from my answer as the situation is more subtle than that. I did some more digging and, for instance, found out that JavaScript doesn't do this handling. .NET does, but then doesn't treat `\r` as a line break. Tricky stuff. – Martin Devillers Nov 19 '21 at 16:30