6

How to convert a Unicode string to HTML entities? (HEX not decimal)

For example, convert Français to Fran&#xE7;ais.

mrdaliri
  • 1
    What do you need this for? It *should* never be necessary.... – Pekka Nov 07 '12 at 23:47
  • 4
    It depends on which unicode encoding in specific. [`mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');`](http://stackoverflow.com/a/11310258/367456) for example works for UTF-8 unicode strings in PHP. If you *need* hex encodings the linked answer shows you how to capture all those (from utf-8 strings) you only need to run your hex encoding. – hakre Nov 07 '12 at 23:56
  • @hakre: the string is `UTF-8`. `mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');` converts to decimal, but I want `hex` code. – mrdaliri Nov 08 '12 at 00:04
  • 1
    Your question is not very precise. I think if I take it right, the output is `Français` and not `Français`. – hakre Nov 08 '12 at 00:15
  • possible duplicate of [Get hexcode of html entities](http://stackoverflow.com/questions/7482977/get-hexcode-of-html-entities) – hakre Nov 08 '12 at 00:17
  • 2
    @Pekka웃 - I've just found a vendor API in 2015 that requires plain US-ASCII XML requests to process a Unicode-related feature. *sigh* – Álvaro González Apr 24 '15 at 07:48
  • @ÁlvaroG.Vicario argh!!! – Pekka Apr 24 '15 at 07:56

5 Answers

11

For the missing hex-encoding in the related question:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
    $entity = vsprintf('&#x%X;', unpack('N', $binary));
    return $entity;
}, $input);

This is similar to @Baba's answer using UTF-32BE and then unpack and vsprintf for the formatting needs.

If you prefer iconv over mb_convert_encoding, it's similar:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $binary = iconv('UTF-8', 'UTF-32BE', $utf8);
    $entity = vsprintf('&#x%X;', unpack('N', $binary));
    return $entity;
}, $input);

I find this string manipulation a bit clearer than the one in Get hexcode of html entities.
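As a rough sketch, the same callback can be wrapped in a reusable function. The name to_hex_entities is hypothetical, and the version below assumes PHP 7.2 or later so that the built-in mb_ord can replace the UTF-32BE conversion and unpack step; it is an alternative under those assumptions, not the code from the answer.

```php
<?php
// Hypothetical helper name; assumes PHP 7.2+ (built-in mb_ord) and the mbstring extension.
function to_hex_entities(string $input): string
{
    $result = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function (array $match): string {
        // mb_ord() returns the code point of the matched character as an integer.
        return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
    }, $input);

    // preg_replace_callback() returns null on failure (for example invalid UTF-8);
    // in that case this sketch hands the input back unchanged.
    return $result ?? $input;
}

echo to_hex_entities('Français'), "\n"; // Fran&#xE7;ais
```

ASCII characters never match the character class, so existing markup in the input passes through untouched.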

hakre
  • Splendid! I use this to code back CKEditors output converting my html entities to unicode symbols. – Daniel Mar 26 '16 at 12:45
  • This helped me display emojis on an ISO-8859-1 website. First I convert to hex using this approach, then I can save it in the db and display it in both the website, and the webview in an app. Very nice. – Jette Aug 26 '17 at 21:03
8

Your string looks like UCS-4 encoding; you can try:

$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
    $char = current($m);
    $utf = iconv('UTF-8', 'UCS-4', $char);
    return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);

Output

string 'Fran&#xE7;ais' (length=13)
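The ltrim step is needed because UCS-4 always uses four bytes per character, so bin2hex yields eight hex digits with leading zeros. A small sketch of the intermediate values for a single character follows; it uses mb_convert_encoding with an explicit 'UCS-4BE' instead of iconv, which is an assumption made here for portability rather than part of the answer.

```php
<?php
// Sketch only: intermediate values for one character.
// Assumes the mbstring extension; 'UCS-4BE' makes the byte order explicit.
$char  = 'ç';
$ucs4  = mb_convert_encoding($char, 'UCS-4BE', 'UTF-8'); // four raw bytes
$hex   = strtoupper(bin2hex($ucs4));                      // "000000E7"
$short = ltrim($hex, '0');                                // "E7"

echo $hex, ' -> &#x', $short, ";\n"; // 000000E7 -> &#xE7;
```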
Baba
4

Firstly, when I faced this problem recently, I solved it by making sure my code files, DB connection, and DB tables were all UTF-8. Then simply echoing the text works. If you must escape the output from the DB, use htmlspecialchars() and not htmlentities(), so that the UTF-8 symbols are left alone rather than escaped.

I would like to document an alternative solution because it solved a similar problem for me. I was using PHP's utf8_encode() to escape 'special' characters.

I wanted to convert them into HTML entities for display. I wrote this code because I wanted to avoid iconv and similar functions as far as possible, since not all environments necessarily have them (do correct me if that is not so!).

$foo = 'This is my test string \u03b50';
echo unicode2html($foo);

function unicode2html($string) {
    // Turn each literal \uXXXX escape (four hex digits, either case) into a hex entity.
    return preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $string);
}

Hope this helps somebody in need :-)
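One limitation worth noting: characters outside the Basic Multilingual Plane appear in \uXXXX notation as two escapes (a UTF-16 surrogate pair), and converting each half separately gives invalid entities. The sketch below is a hypothetical variant (the function name is made up for illustration) that joins a pair into one code point before formatting; lone surrogates are not handled.

```php
<?php
// Hypothetical variant of unicode2html() that also joins UTF-16 surrogate pairs.
function unicode_escapes_to_entities(string $string): string
{
    // First alternative: high surrogate followed by low surrogate.
    // Second alternative: any other four-digit escape.
    $pattern = '/\\\\u(d[89ab][0-9a-f]{2})\\\\u(d[c-f][0-9a-f]{2})|\\\\u([0-9a-f]{4})/i';

    $result = preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[3])) {
            // Plain escape such as \u03b5.
            $codePoint = hexdec($m[3]);
        } else {
            // Combine the surrogate pair into one supplementary code point.
            $codePoint = 0x10000 + ((hexdec($m[1]) - 0xD800) << 10) + (hexdec($m[2]) - 0xDC00);
        }
        return sprintf('&#x%X;', $codePoint);
    }, $string);

    // Fall back to the input if the regex engine reports an error.
    return $result ?? $string;
}

echo unicode_escapes_to_entities('epsilon \u03b5, smile \ud83d\ude00'), "\n";
// epsilon &#x3B5;, smile &#x1F600;
```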

Angad
0

See "How to get the character from unicode code point in PHP?" for some code that allows you to do the following:

Example use:

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));

Output:

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
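The helpers above come from the linked answer and are not the functions that ship with PHP. As a hedged comparison, PHP 7.2 and later provide built-in mb_chr and mb_ord, whose second argument is the string encoding rather than a code-point format. The sketch below assumes PHP 7.2+, the mbstring extension, and a source file saved as UTF-8.

```php
<?php
// Built-in functions only; these are not the helpers from the linked answer.
// Assumes PHP 7.2+ with mbstring and a UTF-8 source file.
var_dump(mb_chr(0x010F, 'UTF-8'));       // string(2) "ď"
var_dump(dechex(mb_ord('ď', 'UTF-8')));  // string(3) "10f"

// Decoding hex entities back to UTF-8 works with the standard library.
var_dump(html_entity_decode('tch&#xFC;&#xDF;', ENT_QUOTES | ENT_HTML5, 'UTF-8')); // string(7) "tchüß"
```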
John Slegers
0

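You can also use mb_encode_numericentity, which has been available since PHP 4.0.6 (see the PHP documentation). Note that the hexadecimal flag used below (the fourth argument) was added in PHP 5.4.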

function unicode2html($value) {
    return mb_encode_numericentity($value, [
    //  start codepoint
    //  |       end codepoint
    //  |       |       offset
    //  |       |       |       mask
        0x0000, 0x001F, 0x0000, 0xFFFF,
        0x0021, 0x002C, 0x0000, 0xFFFF,
        0x002E, 0x002F, 0x0000, 0xFFFF,
        0x003C, 0x003C, 0x0000, 0xFFFF,
        0x003E, 0x003E, 0x0000, 0xFFFF,
        0x0060, 0x0060, 0x0000, 0xFFFF,
        0x0080, 0xFFFF, 0x0000, 0xFFFF
    ], 'UTF-8', true);
}

In this way it is also possible to indicate which ranges of characters to convert into hexadecimal entities and which ones to preserve as characters.

Usage example:

$input = array(
    '"Meno più, PIÙ o meno"',
    '\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
    '<script>alert("XSS");</script>',
    '"`'
);

$output = array();
foreach ($input as $str) {
    $output[] = unicode2html($str);
}

Result:

$output = array(
    '&#x22;Meno pi&#xF9;&#x2C; PI&#xD9; o meno&#x22;',
    '&#x27;&#xC0;&#xCC;&#xD9;&#xD2;L&#xC8; PERCH&#xC9; perch&#xE9; &#xE8; sempre cos&#xEC; non si s&#xE0;&#x27;',
    '&#x3C;script&#x3E;alert&#x28;&#x22;XSS&#x22;&#x29;;&#x3C;&#x2F;script&#x3E;',
    '&#x22;&#x60;'
);
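The conversion map is a flat list of four-integer groups: start code point, end code point, offset, and mask. If only non-ASCII characters should be converted, as in the original question, a single group is enough. The sketch below is an assumption-laden variant (hypothetical function name, mbstring and PHP 5.4+ assumed); it extends the range to U+10FFFF so that characters beyond the Basic Multilingual Plane, such as emoji, are covered as well.

```php
<?php
// Hypothetical helper; assumes the mbstring extension and PHP 5.4+ for the hex flag.
function non_ascii_to_hex_entities($value)
{
    // One group: start, end, offset, mask. Covers U+0080 through U+10FFFF.
    $map = [0x80, 0x10FFFF, 0x0, 0x1FFFFF];

    return mb_encode_numericentity($value, $map, 'UTF-8', true);
}

echo non_ascii_to_hex_entities('<b>Français</b>'), "\n";
```

With this map, ASCII characters (including markup such as the b tags in the example) fall outside the range and are left as they are.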
Marco Sacchi