I am using Web-Harvest (http://web-harvest.sourceforge.net/), the open source web scraping tool.

The regex I am trying to use contains "<" and ">" characters (because I am trying to strip out all HTML tags that come in). This causes a problem because the configuration file is XML, and the content of its elements must consist of well-formed character data or markup.

I need to escape the regex somehow, but can't figure out how.

Any ideas?

kburns
  • HTML parsing is a solved problem. Consider whether you actually need to reinvent a solution using a regex. A mandatory SO link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – jasso Feb 10 '11 at 21:08

1 Answer

To make the regular expression well-formed XML, try replacing < with &lt; and > with &gt;. Similarly, if you have an & in your regular expression, you will need to replace it with &amp;. The XML parser converts these entities back to the original characters before the regex is used.
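
As a rough illustration of that round trip, here is a minimal Python sketch (Python is used only for demonstration; it is not part of Web-Harvest). The regex `<[^>]+>` and the `<pattern>` element name are assumptions made up for this example, not taken from the question or from the Web-Harvest configuration format.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Illustrative regex matching any HTML tag (an assumption, not from the question).
tag_regex = r"<[^>]+>"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(tag_regex)

# Hypothetical <pattern> element standing in for whichever config element holds the regex.
config_xml = f"<pattern>{escaped}</pattern>"

# Parsing the XML unescapes the entities, so the consumer sees the original regex.
recovered = ET.fromstring(config_xml).text

# The recovered pattern behaves exactly like the unescaped one.
stripped = re.sub(recovered, "", "<p>Hello <b>world</b></p>")
```

Here `escaped` is the text you would paste into the XML file, and `recovered` is what the tool would see after parsing it.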

Also, I'd suggest you use an HTML parser instead of a regular expression for this task.
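
To show what the parser-based approach might look like, here is a small sketch using Python's standard `html.parser` module. It is only an illustration of the idea, not a Web-Harvest feature; the `TextExtractor` class and `strip_tags` function are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are never stored.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def strip_tags(html):
    """Return the text of an HTML fragment with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


# The quoted ">" inside the attribute would trip up a naive <[^>]+> regex.
result = strip_tags('<p title="a > b">Hello <b>world</b></p>')
```

Because the parser tracks quoting inside attributes, it avoids the kind of edge case that a pattern like `<[^>]+>` gets wrong.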

Mark Byers