Creating a regex with special characters in Web Harvest

Question

I am using Web Harvest (http://web-harvest.sourceforge.net/), the open source web scraping tool.

The regex I am trying to use contains "<" and ">" characters (because I am trying to strip out all HTML tags that come in). This causes a problem: the configuration is XML, and the parser complains that the content of the elements must consist of well-formed character data or markup.

I need to somehow escape the regex, but can't figure out how.

Any ideas?

HTML parsing is a solved problem. Consider whether you actually need to reinvent a solution using a regex. A mandatory SO link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 - jasso, Feb 10 '11 at 21:08

score 1 · Answer 1 · answered Feb 10 '11 at 20:17

To make the regular expression well-formed XML, try replacing < with &lt; and > with &gt;. Similarly, if you have an & in your regular expression, you will need to replace it with &amp;.
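
To see why the escaping is harmless to the regex itself, here is a small sketch in Python (not Web Harvest, just an illustration of how an XML parser behaves). It assumes the pattern lives in an element named regexp-pattern and that the goal is the simple tag-stripping pattern <[^>]*>; both the fragment and the helper names are made up for the example. The parser decodes the entities before the text reaches the regex engine, so the engine sees the original characters. A CDATA section, which the answer above does not mention, is shown as an alternative that should give the same result.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical config fragment: the pattern <[^>]*> written with XML entities.
ESCAPED = "<regexp-pattern>&lt;[^&gt;]*&gt;</regexp-pattern>"
# The same pattern inside a CDATA section, where no escaping is needed.
CDATA = "<regexp-pattern><![CDATA[<[^>]*>]]></regexp-pattern>"


def pattern_from_xml(fragment: str) -> str:
    """Return the element text as the XML parser delivers it (entities decoded)."""
    return ET.fromstring(fragment).text


def strip_tags(html: str, pattern: str) -> str:
    """Remove every match of the pattern from the input string."""
    return re.sub(pattern, "", html)


escaped_pattern = pattern_from_xml(ESCAPED)
cdata_pattern = pattern_from_xml(CDATA)

print(escaped_pattern)  # <[^>]*>
print(strip_tags("<p>Hello <b>world</b></p>", escaped_pattern))  # Hello world
```

Both forms hand the same string to the regex engine, so which one to use in the configuration file is mostly a matter of readability.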

Also I'd suggest you use an HTML parser instead of a regular expression for this task.
