
I have to parse text in which non-ASCII characters are encoded in the notation \X--\, where -- is the character's Unicode code point written as two hexadecimal digits. For example:

vis\XED\vel numa das imagens pr\XE9\vias \XE0\ administra\XE7\\XE3\o

Should be converted to

visível numa das imagens prévias à administração

I could do this like a Neanderthal: looking for a "\X", confirming there's a "\" two characters after it, replacing the whole thing with the respective character, and repeating until no further matches are found. However, there's surely a better way to do this.

Then, I tried using regular expressions, something I don't understand nearly well enough. On RegExr I ended up with the regular expression '/\X\w{2}\/', that matched what I needed. But when I tried using it with preg_replace_callback(), specifically with the string "/\\X\w{2}\\/" as the regex, I get an "Illegal / unsupported escape sequence" error. I tried a few other regexes I found online, both on this site and elsewhere, to no avail.

Finally, I'm also not quite sure what the best way is to replace the Unicode number with the appropriate character.

So, my question is two-fold:

• What's the ideal way to find the escaped characters?

• How can I get a UTF-8 character from its Unicode number?

sakinobashi
  • One question - shouldn't your string be `administra\XE7\\XE3\o`, not `administra\XE7\XE3\o`? – El_Vanja Feb 04 '21 at 17:33
  • @El_Vanja Indeed! Thanks for pointing out my mistake. It's fixed now. – sakinobashi Feb 04 '21 at 17:36
  • As for the REGEX expression, you should double-escape backslashes to make it work. So what is a double backslash in a REGEX editor needs to become a triple one in PHP. – El_Vanja Feb 04 '21 at 17:42
  • For the second part see [this question](https://stackoverflow.com/questions/1805802/php-convert-unicode-codepoint-to-utf-8). – El_Vanja Feb 04 '21 at 17:59
  • @El_Vanja Thank you for your help! Without it it would've taken me ages to figure this out. – sakinobashi Feb 04 '21 at 19:05

1 Answer


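First of all, as mentioned in "Right way to escape backslash [ \ ] in PHP regex?", four backslashes are needed in the PHP string literal to match one literal backslash: PHP's string parser reduces each pair to a single backslash, and the regex engine then reads the remaining pair as one escaped backslash. The regex, therefore, becomes "/\\\\X\\w{2}\\\\/".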
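To make the two layers of escaping visible, here is a minimal sketch that prints the pattern as the regex engine receives it and then tries it on one escaped word. The variable names are only illustrative, and the pattern uses a hex-digit class instead of \w as a slightly stricter assumption of mine.

```php
<?php
// Four backslashes in the PHP literal become two in the string,
// which PCRE then reads as one escaped (literal) backslash.
$pattern = '/\\\\X[0-9A-Fa-f]{2}\\\\/';

echo $pattern, PHP_EOL; // shows the pattern exactly as PCRE receives it

// Nowdoc keeps the sample text literal: every backslash below is a real character.
$input = <<<'TXT'
pr\XE9\vias
TXT;

$found = preg_match($pattern, $input, $m); // 1 if the pattern matched, 0 otherwise
echo $found, PHP_EOL;
echo $m[0], PHP_EOL; // the matched escape sequence
```

Printing the pattern first is a quick way to check how many backslashes survived PHP's own string parsing before blaming the regex.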

As for the decoding, the easiest way I found was to convert the escaped characters to the HTML entity format and use the html_entity_decode() function. The code, therefore, ended up as follows:

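function unescapeText(string $str): string
{
    return preg_replace_callback(
        // Matches \X, two word characters, then a closing backslash.
        "/\\\\X\\w{2}\\\\/",
        // Characters 2-3 of the match are the hex digits; decode them as a numeric HTML entity.
        fn($m) => html_entity_decode('&#x'.substr($m[0], 2, 2).';', ENT_NOQUOTES, 'UTF-8'),
        $str
    );
}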
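For comparison, the code point can also be turned into a character directly, without going through HTML entities. The sketch below is an alternative, not the approach used above: it assumes the mbstring extension is available (mb_chr() exists from PHP 7.2), the function name unescapeWithMbChr is hypothetical, and the pattern captures the two hex digits in a group instead of slicing the match with substr().

```php
<?php
// Hypothetical helper; assumes the mbstring extension is loaded.
function unescapeWithMbChr(string $str): string
{
    return preg_replace_callback(
        '/\\\\X([0-9A-Fa-f]{2})\\\\/',   // group 1 holds the two hex digits
        // hexdec() gives the code point; mb_chr() encodes it as UTF-8.
        fn(array $m): string => mb_chr(hexdec($m[1]), 'UTF-8'),
        $str
    ) ?? $str;                           // fall back to the input if PCRE reports an error
}

// Nowdoc keeps the backslashes in the sample text literal.
$escaped = <<<'TXT'
vis\XED\vel numa das imagens pr\XE9\vias \XE0\ administra\XE7\\XE3\o
TXT;

echo unescapeWithMbChr($escaped), PHP_EOL;
```

With the sample sentence from the question, this should print the expected Portuguese text. Restricting the pattern to hex digits means sequences such as \Xzz\ are left untouched rather than passed to the decoder.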

Lastly, a word of advice: I had some trouble at first with how the test string itself was written. In double quotes, PHP treated sequences such as \XE7 as hexadecimal escapes and turned them into raw bytes; in single quotes, each double backslash collapsed to one (\XE7\\XE3\ would, therefore, become \XE7\XE3\). That caused all sorts of issues. Using Nowdoc syntax finally made the text be interpreted literally, as I had intended.
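The difference between the quoting styles can be checked with a small sketch like the one below. The variable names are illustrative, and the double-quoted result is dumped as hex without a stated expectation, so you can inspect what your PHP version actually produces.

```php
<?php
// Single quotes: "\\" collapses to one backslash, other backslashes are kept.
$single = 'a\XE7\\XE3\o';

// Nowdoc: nothing is interpreted, so both backslashes between the escapes survive.
$nowdoc = <<<'TXT'
a\XE7\\XE3\o
TXT;

// Double quotes: escape sequences are processed by PHP before the string exists.
$double = "a\XE7\\XE3\o";

var_dump(strlen($single));   // int(11)
var_dump(strlen($nowdoc));   // int(12)
echo bin2hex($double), PHP_EOL; // inspect the raw bytes PHP produced
```

The one-character difference between the first two lengths is the backslash lost between \XE7\ and \XE3\, which is enough to stop the second escape from matching.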

sakinobashi