0

I have been using Cobra until now because of how easy it is, but unfortunately it failed on a few of my test cases. Can anyone suggest a tried-and-tested library?

I've tried Cobra's built-in parser and HTMLCleaner without any luck.

Legend
  • Judging by your last question, the problem isn't with the "XPath evaluator". You were using `XPathFactory.newInstance()`, which creates the stock Java evaluator that works on any XML document loaded in a DOM model (as an instance of `Document`). Cobra itself isn't an XPath evaluator - it's an HTML parser which produces a `Document`, and it did that wrong in your case. So what you actually want is a "good Java HTML parser", not a "good Java XPath evaluator". – Pavel Minaev Nov 26 '09 at 23:55
  • Oops... sorry. I've revised my question... I'm just going nuts with all the HTML in front of my eyes... – Legend Nov 27 '09 at 00:05
  • I'm sure this same question was on SO earlier this week... – DisgruntledGoat Nov 27 '09 at 00:36
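
To illustrate the point in the first comment: the stock evaluator from `XPathFactory.newInstance()` works on any DOM `Document`, whatever produced it. Here is a minimal sketch using only the JDK and assuming well-formed input. The class name `XPathOnDom` and its two helper methods are hypothetical, made up for this example. For real-world HTML, the `Document` would have to come from one of the HTML parsers suggested in the answers rather than from `DocumentBuilder`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathOnDom {
    // Hypothetical helper: parses markup that is already well-formed XML.
    // An HTML parser would take this role for messy real-world pages.
    static Document parseWellFormed(String markup) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(markup)));
    }

    // Hypothetical helper: runs the stock JDK XPath evaluator against a DOM
    // Document and returns the string value of the first matching node.
    static String evaluate(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String markup =
            "<html><body><a href=\"http://example.com/\">link</a></body></html>";
        Document doc = parseWellFormed(markup);
        System.out.println(evaluate(doc, "//a/@href"));
    }
}
```

The evaluator never sees the original markup, only the `Document`, so if the parser builds a wrong tree, the XPath results will be wrong as well.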

5 Answers

4

TagSoup is really great when dealing with crappy HTML/XHTML.

Jericho (and NekoHTML) are also good at parsing invalid HTML.

TagSoup and Jericho: tried and tested. NekoHTML: feedback from a trusted source.

Pascal Thivent
1

Take a look at Saxon (no, I'm not involved in any way with the product, just a satisfied user).

Jim Garrison
1

Mozilla HTML Parser looks rather interesting. By definition, it's supposed to be as good as the Gecko engine itself, which is likely to cover your needs.

Pavel Minaev
1

[Answering the title - the overall question and comments are not consistent]

JTidy (http://jtidy.sourceforge.net/) is a port of Dave Raggett's HTMLTidy. It's very useful, though I think development may have slowed or ceased.

peter.murray.rust
1

I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)

Ms2ger