0

I have text that may contain HTML islands.

Example:

qwwdeadaskdfdaskjfhbsdfkf<a href="/cookbook/modifying-data/set-attributes">Set attribute values</a>gfkjgfkjrgjgjgjgjgroggjrog <b>jsoup</b>sdflkjsdfsfklsfklfjsfkljsfljsf<a href="/apidocs/org/jsoup/Jsoup.html#parse(java.lang.String)" title="Parse HTML into a Document.">Jsoup.parse(String html)</a>skgjdfgkjdfgkldfjgdfkgljdfg

How can I extract those HTML fragments?

gen_Eric
  • 214,658
  • 40
  • 293
  • 332
balderman
  • 21,028
  • 6
  • 30
  • 43

2 Answers2

0

Java supports both DOM and SAX parsing for XML, however they both require the document to be well-formed. Therefore your example would not be parsed. There is a project called NekoHTML (http://nekohtml.sourceforge.net/) that supports scanning non well-formed HTML.

LINEMAN78
  • 2,544
  • 16
  • 19
0

I do exactly what you are asking -- find HTML fragments in a chunk of text -- by wrapping an enclosing tag around the text then using a java.xml.parsers.DocumentBuilder to create a DOM tree.

The basic idea (and omitting much) is just

String fragment = "<wrap_node>" + orig_text + "</wrap_node>";
Document d = builder.parse(fragment);

If tags aren't well-formed... missing end, improper nesting, etc. ... this won't work, but this works for me because I want to reject anything malformed.

Stephen P
  • 13,862
  • 2
  • 43
  • 65