UTF-8 can represent each character by one byte or more. Let's suppose that I have the following byte sequence:

48 65

How can I know whether this is one character represented by 48 followed by another character represented by 65, or ONE character represented by the TWO bytes 48 65?

Remy Lebeau
CrazySynthax
  • Possible duplicate of [Detect UTF-8 encoding (How does MS IDE do it)?](https://stackoverflow.com/questions/11479143/detect-utf-8-encoding-how-does-ms-ide-do-it) –  Aug 02 '17 at 15:43
  • Because the [most significant bits in the first byte of a codepoint tell a UTF-8 decoder how many bytes make up a codepoint](https://en.wikipedia.org/wiki/UTF-8). – Phylogenesis Aug 02 '17 at 15:43
  • Also, you should be careful with your terminology when it comes to Unicode. What you're talking about here is individual 'code points'. What you probably consider to be a character (or [grapheme cluster](http://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html#unicode.introduction_to_unicode.grapheme_clusters)) can be made up of an arbitrary number of individual code points. For instance, the character `é` can be encoded as `U+00E9` (LATIN SMALL LETTER E WITH ACUTE), or as `U+0065` (LATIN SMALL LETTER E) followed by `U+0301` (COMBINING ACUTE ACCENT). – Phylogenesis Aug 03 '17 at 08:12

1 Answer

UTF-8 was designed to be unambiguous. Neither 0x48 nor 0x65, nor any other byte below 0x80, is ever part of a multi-byte sequence. The bytes 48 65 are therefore always two separate code points.
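As a quick check of that claim, here is a minimal Python sketch using the two bytes from the question. The variable names are only illustrative.

```python
# The two bytes from the question, written as hexadecimal values.
data = bytes([0x48, 0x65])

# A byte below 0x80 has its top bit clear, so it always stands alone
# as a single one-byte (ASCII) code point.
all_single = all(b < 0x80 for b in data)

text = data.decode("utf-8")
print(all_single, text, len(text))  # True He 2
```

The decoder yields two code points, `H` and `e`, not one.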

The most significant bits of the first byte of a UTF-8 encoded code point will tell you how many bytes are used for it. This should be clear from the UTF-8 Bit Distribution Table:

Scalar Value                First Byte  Second Byte Third Byte  Fourth Byte
00000000 0xxxxxxx           0xxxxxxx            
00000yyy yyxxxxxx           110yyyyy    10xxxxxx        
zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy    10xxxxxx    
000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz    10yyyyyy    10xxxxxx
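The first-byte column of that table can be turned into a small lookup. This is only a sketch: `sequence_length` is a hypothetical helper name, and it checks the lead-byte bit pattern only. It does not reject bytes such as 0xC0, 0xC1, or 0xF5 and above, which a strict decoder would refuse.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the sequence starting with first_byte occupies."""
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110yyyyy
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110zzzz
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:  # 11110uuu
        return 4
    # 10xxxxxx is a continuation byte and cannot start a sequence.
    raise ValueError(f"0x{first_byte:02X} is not a valid lead byte")


print(sequence_length(0x48))  # 1
print(sequence_length(0xE2))  # 3
```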

So, the worst-case scenario is that you jump into the middle of a string and see a byte whose two most significant bits are 10 (anything from 0x80 through 0xBF), which marks it as a continuation byte. In that case, you would have to backtrack at most 3 bytes to find the start of the sequence.
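The backtracking step might look like the following sketch. It assumes the input is well-formed UTF-8; `is_continuation` and `sequence_start` are hypothetical helper names, not part of any library.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (0x80 through 0xBF)."""
    return byte & 0b1100_0000 == 0b1000_0000


def sequence_start(data: bytes, index: int) -> int:
    """Return the index of the lead byte of the sequence containing data[index].

    Assumes well-formed UTF-8, so at most 3 steps back are ever needed.
    """
    start = index
    while start > 0 and index - start < 3 and is_continuation(data[start]):
        start -= 1
    return start


data = "a\u20acb".encode("utf-8")  # bytes 61 E2 82 AC 62
print(sequence_start(data, 3))     # 1
```

Landing on the last byte of the three-byte sequence (index 3) walks back two positions to the lead byte 0xE2 at index 1.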

user3942918