
Given a DOMDocument constructed with a stylesheet that contains an emoji character like so:

$dom = new DOMDocument();
$dom->loadHTML( "<!DOCTYPE html><html><head><meta charset=utf-8><style>span::before{ content: \"⚡️\"; }</style></head><body><span></span></body></html>" );

I've found some strange behavior when serializing the DOM back out to HTML.

If I do $dom->saveHTML( $dom->documentElement ) then I get (as desired):

<html><head><meta charset="utf-8">
<style>span::before{ content: "⚡️"; }</style>
</head><body><span></span></body></html>

However, if I instead do $dom->saveHTML() to save the entire document I get (erroneously):

<html><head><meta charset="utf-8">
<style>span::before{ content: "&#9889;&#65039;"; }</style>
</head><body><span></span></body></html>

Notice how the emoji “⚡️” is encoded as the HTML entities &#9889;&#65039; inside the stylesheet. Browsers do not like this: character references are not decoded inside a style element, so the value is treated as a literal string. A CSS escape such as \26A1 would have to be used instead.
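For reference, here is a minimal sketch of what producing such CSS escapes could look like. The helper name css_escape_non_ascii is hypothetical (it is not part of DOMDocument or libxml), and it assumes the mbstring extension is available for mb_ord:

```php
<?php
// Hypothetical helper (not part of DOMDocument): rewrite every non-ASCII
// character as a CSS hex escape so the stylesheet text is pure ASCII.
// Assumes the mbstring extension (mb_ord) is available.
function css_escape_non_ascii(string $css): string
{
    $escaped = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        static function (array $m): string {
            // The trailing space terminates the hex escape sequence.
            return sprintf('\\%X ', mb_ord($m[0], 'UTF-8'));
        },
        $css
    );

    // preg_replace_callback returns null on invalid UTF-8; leave input as-is.
    return $escaped ?? $css;
}

$css = 'span::before{ content: "' . "\u{26A1}\u{FE0F}" . '"; }';
echo css_escape_non_ascii($css), "\n";
// prints: span::before{ content: "\26A1 \FE0F "; }
```

This only sidesteps the problem for stylesheets. It does not explain the serialization difference, and it would not be appropriate for script contents.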

I tried setting $dom->substituteEntities = false but without any effect.

The same HTML entity conversion is also happening inside of script tags, which causes similar problems in browsers.
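A possible workaround, sketched under the assumption that serializing the root element keeps behaving as shown above: serialize only $dom->documentElement and prepend the doctype by hand. The hardcoded doctype string is my assumption and only fits documents that use the plain HTML5 doctype.

```php
<?php
// Sketch of a workaround: serialize only the root element (which, per the
// output shown earlier, leaves the emoji alone) and re-add the doctype.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$dom = new DOMDocument();
$dom->loadHTML('<!DOCTYPE html><html><head><meta charset=utf-8><style>span::before{ content: "⚡️"; }</style></head><body><span></span></body></html>');

// Assumption: the document uses the plain HTML5 doctype, so it is hardcoded here.
$html = "<!DOCTYPE html>\n" . $dom->saveHTML($dom->documentElement);

echo $html, "\n";
```

I have not verified this across PHP or libxml versions, so it is worth checking the output in the online shell before relying on it.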

Test via online PHP shell: https://3v4l.org/jMfDd

Weston Ruter
  • One piece to the puzzle is that libxml doesn't seem to recognize ``. If I replace it with `` then the entity encoding is not performed. However, this still doesn't explain why the behavior differs between `$dom->saveHTML()` and `$dom->saveHTML( $dom->documentElement )`. – Weston Ruter Aug 02 '18 at 18:45
  • If you provide a node the result is a serialized fragment, not a whole document. That might be a reason for the different behavior. – ThW Aug 04 '18 at 17:16
  • I have [an answer](https://stackoverflow.com/a/59940487/1255289) for what's happening here. I can only guess at _why_ it's happening though. – miken32 Jan 28 '20 at 00:17
