How are 64-bit Unicode characters encoded?

Question

I'm trying to find a way to learn the encoding for 64-bit characters (mostly Chinese) that I encounter. For example, the encoding for '好' ("hǎo", good) is 597d. But entering `echo 好|od -t x1` in Linux Mint gives a result of: `0000000 e5 a5 bd 0a 0000004`.

What is the rule for translating "e5 a5 bd 0a" to "597d"?

In my opinion, it's explained pretty well on [Wikipedia](https://en.wikipedia.org/wiki/UTF-8). Is there something specific you don't understand? Note: the `0A` byte in the end is the trailing newline character added by `echo`, so the character 好 takes only 3 bytes in UTF-8 encoding. — lenz, Feb 17 '20 at 21:58
Second note: 597d is the *code point* of 好, as defined by Unicode. It's an abstract number, not yet an encoding in a strict sense. Encodings like UTF-8 define how to express this number in a sequence of bytes. In this case it is `E5 A5 BD` (best explained at the bit level). — lenz, Feb 17 '20 at 22:15
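To make the bit-level relationship concrete, here is a minimal Python sketch of decoding a single 3-byte UTF-8 sequence. The function name `decode_utf8_3byte` is made up for illustration, and it handles only the 3-byte pattern `1110xxxx 10xxxxxx 10xxxxxx`, which is what 好 uses; other sequence lengths follow the same idea with different masks.

```python
def decode_utf8_3byte(b0, b1, b2):
    """Decode one 3-byte UTF-8 sequence into a Unicode code point.

    Hypothetical helper for illustration; covers only the
    1110xxxx 10xxxxxx 10xxxxxx layout.
    """
    # The lead byte must start with the bits 1110.
    if b0 & 0xF0 != 0xE0:
        raise ValueError("not a 3-byte UTF-8 lead byte")
    # Each continuation byte must start with the bits 10.
    if b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("invalid UTF-8 continuation byte")
    # Strip the marker bits and concatenate the 4 + 6 + 6 payload bits.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_utf8_3byte(0xE5, 0xA5, 0xBD)
print(format(code_point, "04x"))  # 597d
print(chr(code_point))            # 好
```

Worked by hand: `E5` contributes `0101`, `A5` contributes `100101`, and `BD` contributes `111101`. Joined together, `0101 100101 111101` regroups as `0101 1001 0111 1101`, which is 597D.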
You want to convert UTF-8 bytes to a code point, whereas the SO question [Manually converting unicode codepoints into UTF-8 and UTF-16](https://stackoverflow.com/q/6240055/2985643) asks how to do the opposite: convert a code point to UTF-8 bytes. So to do what you want, simply reverse the steps detailed in [this excellent answer](https://stackoverflow.com/a/6240184/2985643) which details how to convert code point 4E3E to UTF-8 bytes. I'm voting to close your question as a duplicate of that one (even though the questions are not identical), but please push back if that doesn't help you. — skomisa, Feb 18 '20 at 04:48
Does this answer your question? [Manually converting unicode codepoints into UTF-8 and UTF-16](https://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16) — skomisa, Feb 18 '20 at 04:49
Thanks to those who responded to the question. I used that information, plus a routine at https://stackoverflow.com/questions/17206804/hex-string-variable-to-hex-value-conversion-in-python/29433520#2943352, to process the UTF-8 as a string and turn it into a code point. — Curious George, Feb 20 '20 at 00:10
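The routine referred to in that comment is not shown in the thread. As one possible approach (not necessarily the one the commenter used), the hex bytes printed by `od` can be handed to Python's built-in UTF-8 decoder. The function name `od_hex_to_code_points` below is an assumption made for this sketch.

```python
def od_hex_to_code_points(hex_bytes):
    """Turn a string of hex byte values, such as the data columns printed by
    `od -t x1`, into a list of Unicode code points written in hex.

    Hypothetical helper for illustration only.
    """
    raw = bytes.fromhex(hex_bytes)       # spaces between byte pairs are skipped
    text = raw.decode("utf-8")           # let Python undo the UTF-8 encoding
    text = text.rstrip("\n")             # drop the newline that echo appended
    return [format(ord(ch), "04x") for ch in text]


print(od_hex_to_code_points("e5 a5 bd 0a"))  # ['597d']
```

Only the byte values should be passed in; the offset columns that `od` prints (`0000000`, `0000004`) are not part of the data.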
