I'm trying to replace escaped Unicode characters with the actual characters:
string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))
The expected output is ä, the actual output is Ã¤.
I'm not sure what @melpomene's criteria for "knowing what they are doing" are, but the following solution has worked previously, for example for decoding broken Hebrew text:
("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
Outputs:
'ä'
This works as follows:

- .encode('latin-1'): the characters '\', 'u', '0', '0', 'c', etc. are converted to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats ASCII characters properly).
- .decode('unicode_escape'): the escape sequence '\u00c3' becomes the Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code this is nonsense, but that code point has exactly the right byte representation when encoded once more with ISO-8859-1/'latin-1', so...
- .encode('latin-1'): ...this step recovers the raw UTF-8 bytes (0xC3 0xA4).
- .decode('utf-8'): those bytes are finally interpreted as UTF-8, yielding the intended 'ä'.

Again, the same remark as in the linked post applies: before investing too much energy in repairing the broken text, you might want to fix the part of the code that is producing such strangely encoded strings in the first place.
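For clarity, here is the same pipeline wrapped in a small helper, with the intermediate value of each step shown in comments (the name fix_escaped_mojibake is mine, not something from the question):

```python
def fix_escaped_mojibake(s: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as Latin-1 and then
    written out as literal \\uXXXX escape sequences."""
    step1 = s.encode('latin-1')             # b'\\u00c3\\u00a4' (escape text as bytes)
    step2 = step1.decode('unicode_escape')  # 'Ã¤' (the mojibake characters)
    step3 = step2.encode('latin-1')         # b'\xc3\xa4' (the original UTF-8 bytes)
    return step3.decode('utf-8')            # 'ä'

print(fix_escaped_mojibake("\\u00c3\\u00a4"))  # ä
```

Note that this only works when every mojibake character falls in the Latin-1 range; text that was mangled through a different 8-bit encoding would need that encoding in place of 'latin-1'.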