Extract text from HTML source using regular expressions in Java

Question

I would like to extract text from an HTML page using regular expressions. Here is my code:

String regExp = "<h3 class=\"field-content\"><a[^>]*>(\\w+)</a></h3>";
Pattern regExpMatcher = Pattern.compile(regExp, Pattern.UNICODE_CHARACTER_CLASS);

String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3><h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
Matcher m = regExpMatcher.matcher(example);
while (m.find()) {
    System.out.println(m.group(1));
}

I would like to get the values Проба 1 and Проба 2. However, I only get the first value, Проба 1. What is the problem?

Don't use regex for this. Use a HTML parser like [JSoup](http://jsoup.org/) — Luiggi Mendoza, Jun 09 '13 at 21:07
It is for my school project and I have to use regular expressions... — vikifor, Jun 09 '13 at 21:08
Do not use regular expressions for parsing html: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Micha Wiedenmann, Jun 09 '13 at 21:11
If you can't use JSoup (or a real HTML parser) for this and you know there will always be ` ` handle it using plain String operations, not RegEx. — Luiggi Mendoza, Jun 09 '13 at 21:11
@MichaWiedenmann from the link: *Even [Jon Skeet](http://stackoverflow.com/users/22656/jon-skeet) cannot parse HTML using regular expressions.* this sentence made my day :). — Luiggi Mendoza, Jun 09 '13 at 21:12

score 5 · Accepted Answer · answered Jun 09 '13 at 21:20

It is blasphemy to use regex + HTML. But if you really want to be cursed then here it is (you have been warned):

String regExp = "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>";
                                                       ^updated part

Since Проба 1 and Проба 2 also contain spaces, you need to add \\s to the character class in your pattern.
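As a rough sketch of how the updated pattern might be used end to end, the snippet below wraps it in a small helper. The class name `FieldContentExtractor` and the method `extract` are made up for illustration; the pattern, the flag, and the sample input are taken from the question and this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class FieldContentExtractor {

    // Pattern from the answer: the capture group accepts word characters and whitespace.
    // UNICODE_CHARACTER_CLASS makes \w cover non-ASCII letters such as Cyrillic.
    private static final Pattern LINK_TEXT = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the link text of every matching <h3 class="field-content"> element.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
                + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        for (String text : extract(example)) {
            System.out.println(text);
        }
    }
}
```

Note that this character class still rejects link text containing punctuation (commas, hyphens, quotes), so it only fits input as simple as the sample.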

answered Jun 09 '13 at 21:20

Pshemo


If you talk about blasphemy, you should not play devil's advocate, now should you? :-) – Micha Wiedenmann Jun 09 '13 at 21:20
It isn't blasphemy, it is sacrilege. :) – Casimir et Hippolyte Jun 09 '13 at 21:22
I know I am playing with fire here but there is no fun without risk }:-> – Pshemo Jun 09 '13 at 21:23
@vikifor that is one of the reasons to use tools designed for such tasks like http://jsoup.org/. – Pshemo Jun 09 '13 at 21:31

score 1 · Answer 2 · answered Jun 09 '13 at 21:25

To discover the power of the dark side, you can try this pattern:

<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>

Don't forget to set the UNICODE_CASE flag beforehand.
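A minimal sketch of this pattern in use follows. The class name `NegatedClassExtractor` and the method `extract` are hypothetical, and the sample input is the one from the question. In this sketch the pattern is compiled without any flags: as far as I can tell, the negated class `[^<]` already accepts any character other than `<`, Cyrillic included, and `UNICODE_CASE` only changes how case-insensitive matching behaves. Treat that as an observation about this example, not a correction of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class NegatedClassExtractor {

    // Pattern from the answer: capture everything up to the next '<'.
    // No flags are passed here (an assumption of this sketch).
    private static final Pattern LINK_TEXT = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>");

    // Returns the link text of every matching <h3 class="field-content"> element.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
                + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        for (String text : extract(example)) {
            System.out.println(text);
        }
    }
}
```

Unlike `[\\w\\s]+`, the negated class also accepts punctuation in the link text, but it still fails if the anchor contains nested tags.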

answered Jun 09 '13 at 21:25

Casimir et Hippolyte

