
I'm trying to replace escaped Unicode characters with the actual characters:

string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))

The expected output is ä, but the actual output is Ã¤.

Toast
  • Those don't look like escaped Unicode characters. It's more like someone took a Unicode string, encoded it as UTF-8, then treated it as a Unicode string again and encoded *that*. – melpomene Sep 22 '18 at 20:37
  • Can you suggest a way of reversing this process? – Toast Sep 22 '18 at 20:38
  • 1
    Sorry, I don't know Python. `string.encode("ascii").decode("unicode-escape").encode("latin-1").decode("utf-8")` seems to do something, but that's just guesswork. You should probably wait until someone shows up who knows what they're doing. – melpomene Sep 22 '18 at 20:39
  • That worked! If you want to post it as an answer, I'll accept it. Thank you! – Toast Sep 22 '18 at 20:40
  • It looks a little bit like an XY-problem. [The previous question](https://stackoverflow.com/questions/52457095/convert-unicode-escape-to-hebrew-text) in the [unicode] tag shows exactly the same kind of broken text. Could you maybe share where you got this broken text in the first place? – Andrey Tyukin Sep 22 '18 at 21:47
  • 1
    @AndreyTyukin I found the text inside a Facebook data takeout archive. https://www.facebook.com/help/1701730696756992 – Toast Sep 22 '18 at 23:34
  • 2
    Then you are already the second person today with the same encoding problem in the facebook JSON data. That's strange... Ah! Then it seems that your question is actually an XY-problem-wise duplicate of this: [Facebook JSON badly encoded](https://stackoverflow.com/questions/50008296/facebook-json-badly-encoded). Martijn Pieters also confirms that it looks like mojibake. – Andrey Tyukin Sep 22 '18 at 23:43
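melpomene's diagnosis can be reproduced directly. The following sketch (my own illustration, not from the original post) shows how a string like `"\u00c3\u00a4"` plausibly arises from `'ä'`: encode as UTF-8, misread the bytes as Latin-1, then serialize the result with ASCII-only `\uXXXX` escapes (as `json.dumps` does by default):

```python
import json

original = "ä"

# Step 1: encode as UTF-8 -> the two bytes 0xC3 0xA4
utf8_bytes = original.encode("utf-8")

# Step 2: misinterpret those bytes as Latin-1 characters -> 'Ã¤' (mojibake)
mojibake = utf8_bytes.decode("latin-1")

# Step 3: serialize with ASCII-only \uXXXX escapes, as JSON does by default
escaped = json.dumps(mojibake)
print(escaped)  # "\u00c3\u00a4"
```

This matches the escapes in the question, which supports the mojibake theory.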

1 Answer


I'm not sure what @melpomene's criteria for "knowing what they are doing" are, but the following solution has worked previously, for example for decoding broken Hebrew text:

("\\u00c3\\u00a4"
  .encode('latin-1')
  .decode('unicode_escape')
  .encode('latin-1')
  .decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • The string, which contains only the ASCII characters '\', 'u', '0', '0', 'c', etc., is converted to bytes using a simple 8-bit encoding (it doesn't really matter which one, as long as it maps ASCII characters to themselves; 'latin-1' does).
  • A decoder then interprets the '\u00c3' escape as the Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of the intended text this is nonsense, but this code point maps back to the right byte (0xC3) when encoded with ISO-8859-1/'latin-1', so...
  • encode it again with 'latin-1', recovering the original UTF-8 byte sequence
  • decode it "properly" this time, as UTF-8
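The steps above can be traced one at a time; the intermediate values below are what each stage produces (variable names are my own, for illustration):

```python
broken = "\\u00c3\\u00a4"   # the literal characters \, u, 0, 0, c, 3, ...

step1 = broken.encode("latin-1")        # ASCII-safe bytes: b"\\u00c3\\u00a4"
step2 = step1.decode("unicode_escape")  # escapes interpreted -> 'Ã¤' (U+00C3 U+00A4)
step3 = step2.encode("latin-1")         # code points -> bytes b"\xc3\xa4"
fixed = step3.decode("utf-8")           # proper UTF-8 decode -> 'ä'
print(fixed)  # ä
```

Note that the first `.encode('latin-1')` could just as well be `.encode('ascii')`, since the escaped string contains only ASCII characters; the second one is essential, because it must map U+00C3 and U+00A4 back to the single bytes 0xC3 and 0xA4.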

Again, same remark as in the linked post: before investing too much energy in repairing the broken text, you might want to fix the part of the code that produces such strangely encoded output in the first place.

Andrey Tyukin