2

I tried several methods to find out which part of an HTML string is invalid:

$dom->loadHTML($badHtml);
$tidy->cleanRepair();
simplexml_load_string($badHtml);

None of them makes it clear which part of the HTML is invalid. Maybe an extra config option for one of them can fix that. Any ideas?

I need this to manually fix HTML input from users. I don't want to rely on automated processes.

johnlemon

2 Answers

3

I'd try loading the offending HTML into a DOMDocument (as you are already doing) and then using SimpleXML to serialize the repaired markup. You should then be able to run a quick diff between the original and the repaired output to see where the errors are.

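// Silence the warnings that loadHTML() raises for malformed markup.
error_reporting(0);

// Deliberately broken sample: mis-nested tags and a misspelled closing tag.
$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';

$doc = new DOMDocument();
$doc->encoding = 'UTF-8';

// The parser repairs the markup while building the DOM tree.
$doc->loadHTML($badHTML);

// Serialize the repaired tree so it can be diffed against $badHTML.
$goodHTML = simplexml_import_dom($doc)->asXML();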
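
As a complement to diffing, the parser itself can report where it ran into trouble. This is a minimal sketch using `libxml_use_internal_errors()` and `libxml_get_errors()`, a different technique from the diff described above. The sample string is the same one as before. Which errors are reported, and their exact wording, depends on the libxml2 version PHP was built against, so treat the output as a hint rather than a full validation report.

```php
<?php
// Collect libxml parse errors in memory instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';

$doc = new DOMDocument();
$doc->loadHTML($badHTML);

// Each entry is a LibXMLError with a message and the line/column
// at which the parser detected the problem.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf(
        "line %d, column %d: %s\n",
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Empty the error buffer so later parses start clean.
libxml_clear_errors();
```

The reported line and column refer to positions in the input string, which may be enough to locate the fragment to fix by hand.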
Nev Stokes
1

You can compare the cleaned and the original (bad) versions with PHP Inline-Diff, found in an answer to that Stack Overflow question.
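
If a full diff library is more than you need, a rough first pass could look like the sketch below. It is not PHP Inline-Diff. `firstDifference()` is a hypothetical helper written for this example, and it only reports the first byte offset at which the original and the repaired markup diverge. It assumes the repaired markup is taken from the children of `<body>`, so that the doctype and wrapper elements added by `loadHTML()` do not count as differences. Non-ASCII input may be re-encoded as entities on output, which would also show up as a difference.

```php
<?php
// Hypothetical helper: byte offset of the first difference, or null if equal.
function firstDifference(string $a, string $b): ?int
{
    $limit = min(strlen($a), strlen($b));
    for ($i = 0; $i < $limit; $i++) {
        if ($a[$i] !== $b[$i]) {
            return $i;
        }
    }
    // Same prefix: the strings differ only if one of them is longer.
    return strlen($a) === strlen($b) ? null : $limit;
}

libxml_use_internal_errors(true); // keep parser warnings quiet

$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';

$doc = new DOMDocument();
$doc->loadHTML($badHTML);

// Serialize only the children of <body>, skipping the added wrapper markup.
$cleanHTML = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $cleanHTML .= $doc->saveHTML($child);
    }
}

$offset = firstDifference($badHTML, $cleanHTML);
if ($offset !== null) {
    echo "First difference at byte $offset: ", substr($badHTML, $offset, 20), "\n";
}
```

Once the first mismatch is fixed by hand, rerunning the check moves on to the next one.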

jcubic