1

I'm parsing some HTML using regex, and I want to match lines that start with a word rather than an HTML tag, while also stripping the leading white space. Using C# regex, my first pattern was:

pattern = @"^\s*([^<])";

which attempts to consume all the leading white space and then capture any non-'<' characters. Unfortunately, if the line is all white space before the first '<', the regex backtracks and returns the last white space character before the '<'. I would like the match to fail in that case.
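The behaviour described above can be reproduced with a small sketch. The pattern is the one from the question; the class name `FirstPatternDemo`, the helper `TryCapture`, and the sample line are hypothetical and only there for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstPatternDemo
{
    // The pattern from the question, unchanged.
    private const string Pattern = @"^\s*([^<])";

    // Hypothetical helper: reports whether the line matches and what group 1 captured.
    public static bool TryCapture(string line, out string captured)
    {
        Match m = Regex.Match(line, Pattern);
        captured = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        // \s* first consumes all three spaces and [^<] then fails on '<',
        // so the engine gives one space back and [^<] captures that space.
        bool matched = TryCapture("   <div>", out string captured);
        Console.WriteLine($"{matched} [{captured}]"); // True [ ]
    }
}
```

The match succeeds because `\s*` is allowed to match fewer characters than it could, which is exactly the case that should be rejected.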

Any ideas?

Jérôme
Patrick
  • Can I refer you to [my answer](http://stackoverflow.com/questions/792679/need-help-writing-regular-expression-html-parsing/792686#792686) to another similar question? – Brian Agnew Apr 27 '09 at 10:18
  • HTML parsing with regex has been discussed a lot. Refer to this post: [Using regular expressions to parse HTML: why not?](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – Jérôme Apr 27 '09 at 10:16

2 Answers

3

Don't use regular expressions to parse HTML. It's a really bad idea and, at best, your code will be flaky. Whatever your language/platform is, you'll have a fully functional HTML parser available. Just use that.

There is no way a regular expression can correctly handle all the cases of escaping, entity use and so on.

cletus
1

Asked the question too soon; I just worked out this:

pattern = @"^\s*((?!\s)[^<]+)";
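A minimal sketch of how this pattern behaves on a few sample lines. The negative lookahead `(?!\s)` stops the capture group from starting on a white space character, so `\s*` can no longer give one back to satisfy `[^<]+`. The class name `LookaheadPatternDemo`, the helper `TryCapture`, and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadPatternDemo
{
    // The pattern from this answer, unchanged.
    private const string Pattern = @"^\s*((?!\s)[^<]+)";

    // Hypothetical helper: reports whether the line matches and what group 1 captured.
    public static bool TryCapture(string line, out string captured)
    {
        Match m = Regex.Match(line, Pattern);
        captured = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string[] samples = { "  Hello <b>world</b>", "   <div>", "<p>text</p>" };
        foreach (string line in samples)
        {
            bool matched = TryCapture(line, out string captured);
            Console.WriteLine(matched ? $"match: [{captured}]" : "no match");
        }
        // Prints:
        // match: [Hello ]
        // no match
        // no match
    }
}
```

Note that the capture still includes any white space that sits directly before the '<' (the trailing space in `Hello ` above); calling `TrimEnd()` on the captured value is one way to drop it if that matters.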

Thanks for the feedback about regex and HTML; I'll bear it in mind for the future. I'm writing a utility program to make a few pages multi-language (i.e. add asp:literals for hardcoded text, etc.). I think regex is sufficient for this purpose, but if there are better tools please let me know (web stuff isn't my area...).

Patrick