Is it possible to strip everything that is not valid HTML markup (including comments) from a variable without regex? Is there a hidden function that works like strip_tags but does the opposite?

$var = "<html>" .
"<head>" .
'<script src="something"></script>' .
"<script>document.write('Hello');</script>" .
"<p>Some text</p>" .
"<!-- Comment -->" .
"Random text not in any markup." .
"</html>";

After processing, I would want $var to contain:

<html>
<head>
<script src="something"></script>
<script>document.write('Hello');</script>
<p>Some text</p>
</html>
Dan
  • http://htmlpurifier.org/ – ryder Feb 22 '13 at 14:48
  • DOMDocument should let you locate comment nodes and remove them. – GordonM Feb 22 '13 at 15:06
  • The HTML code in your example is invalid both before and after the processing (`<head>` not closed; no `<body>` tag at all, etc.) – SDC Feb 22 '13 at 15:28
  • Your "random text not in any markup" is inside the `html` tag... – nhahtdh Feb 22 '13 at 16:24
  • @SDC - Thanks but it's just an example of what I want. I'm not insisting on standards compliant, semantically correct markup. – Dan Feb 22 '13 at 16:27
  • @nhahtdh - ditto - I think you know what I mean. – Dan Feb 22 '13 at 16:28
  • @josef-drabek - Thanks - I'll give it a shot. – Dan Feb 22 '13 at 16:29
  • @Dan - my point was that if you want to process HTML, an HTML parser like PHP's built-in DOMDocument class will do the trick for you. But if you're expecting it to work with code like your example, it won't (or it might just about work, but badly), and nor will anything else I could suggest. Regex *certainly* isn't the answer here; arbitrary HTML code is *waaay* too complex to parse with regex (see [this famous answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for more on that) – SDC Feb 22 '13 at 16:33
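
The DOMDocument route suggested in the comments could look roughly like the sketch below. It is an illustration, not a tested answer: the function name `stripCommentsAndLooseText` is made up, and it assumes that "text not in any markup" means text sitting directly under `<html>`, `<head>` or `<body>`. Because DOMDocument repairs markup while parsing, the output is normalized (a doctype is typically added, missing tags are filled in) rather than a byte-for-byte copy of the input.

```php
<?php
// Hypothetical helper: removes comment nodes and text that is a direct
// child of <html>, <head> or <body>. Text inside other elements is kept.
function stripCommentsAndLooseText(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//comment() | //html/text() | //head/text() | //body/text()';

    // Copy the matches into an array first, then detach each node.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $doc->saveHTML();
}

// Sample input (assumed to be reasonably well-formed).
$var = '<html><head><title>Demo</title></head><body>'
     . '<p>Some text</p>'
     . '<!-- Comment -->'
     . 'Random text not in any markup.'
     . '</body></html>';

echo stripCommentsAndLooseText($var);
```

With input as loose as the example in the question, the parser decides where the stray text ends up, so the result may differ from what is wanted. Whitespace between top-level tags is also removed, since it is a text node too.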

0 Answers