I have this piece of HTML:

</TABLE>
<HR>
<font size="+1"> Method and apparatus for re-sizing and zooming images by operating directly
     on their digital transforms
</font><BR>

and I am trying to capture the text inside the font tag. This is my regex:

    Regex regex = new Regex("</TABLE><HR><font size=\"+1\">(?<title>.*?)</font><BR>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    Match match = regex.Match(data);

    string title = match.Groups["title"].Value;

However, I get an empty title. Can anybody tell me what I am missing?

Alan Moore
Jack
  • A regex is the wrong tool for this. Regexes cannot parse HTML (or XML) with any degree of reliability. Use an HTML parser, and see [this question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Richard Aug 12 '12 at 11:38
  • @Richard: I understand this. However the website that I want to parse has a fixed structure and so I want to use Regex itself. – Jack Aug 12 '12 at 11:40

1 Answer

Your regex:

new Regex("</TABLE><HR><font size=\"+1\">(?<title>.*?)</font><BR>"

doesn't match your input, because + is a quantifier in regex. The pattern "+1 means "one or more double quotes followed by 1", not a literal plus sign.

Based on your input string, what you really want is to escape the plus sign:

new Regex("</TABLE><HR><font size=\"\\+1\">(?<title>.*?)</font><BR>"
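
If you'd rather not escape metacharacters by hand, `Regex.Escape` can do it for a literal piece of the pattern. A minimal sketch (the class and method names here are hypothetical, only for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralPattern
{
    // Hypothetical helper: builds a regex that matches the opening
    // font tag literally, with the + escaped by Regex.Escape.
    public static Regex OpeningFontTag()
    {
        string literal = "<font size=\"+1\">";
        return new Regex(Regex.Escape(literal), RegexOptions.IgnoreCase);
    }
}
```

Note that `Regex.Escape` also escapes spaces, which is harmless here as long as `RegexOptions.IgnorePatternWhitespace` is not set.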

Also, your input has line breaks between the tags, while your pattern expects the tags to follow each other directly. The pattern has to consume those line breaks too, so this may be even closer to what you're trying to do:

new Regex("</TABLE>.*<HR>.*<font size=\"\\+1\">(?<title>.*?)</font>.*<BR>"
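
Putting it together, here is a minimal sketch of a complete version. It swaps the `.*` between tags for `\s*`, which only consumes whitespace and so cannot run past unrelated markup; that substitution, the `TitleExtractor` class name, and the choice to return null on no match are my assumptions, not something from your code:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // \+ matches a literal plus sign; \s* absorbs the line breaks and
    // spaces between the tags that the original pattern did not allow for.
    private static readonly Regex TitleRegex = new Regex(
        @"</TABLE>\s*<HR>\s*<font size=""\+1"">(?<title>.*?)</font>\s*<BR>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the trimmed title, or null when the pattern does not match.
    public static string ExtractTitle(string data)
    {
        Match match = TitleRegex.Match(data);
        return match.Success ? match.Groups["title"].Value.Trim() : null;
    }
}
```

Checking `match.Success` lets you tell "no match" apart from a title that really is empty, which `match.Groups["title"].Value` alone cannot do.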
Joachim Isaksson
  • Thanks. But I didn't understand why you added .* for the line breaks. Wouldn't the dot already match everything with RegexOptions.Singleline? – Jack Aug 12 '12 at 12:20
  • @Jack RegexOptions.Singleline only *changes the meaning of the dot (.) so it matches every character (instead of every character except \n).* In other words, you still need to match a linefeed with . or .* to ignore it. – Joachim Isaksson Aug 12 '12 at 12:22