How do Java's 16-bit chars support Unicode?

Question

Java's char is 16 bits, yet Unicode has far more characters. How does Java deal with that?

score 13 · Answer 1 · answered Dec 21 '09 at 17:59

http://en.wikipedia.org/wiki/UTF-16

In computing, UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding form maps each character to a sequence of 16-bit words. Characters are known as code points and the 16-bit words are known as code units. For characters in the Basic Multilingual Plane (BMP) the resulting encoding is a single 16-bit word. For characters in the other planes, the encoding will result in a pair of 16-bit words, together called a surrogate pair. All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.
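The surrogate-pair mapping described in the quote can be sketched with a little arithmetic. This is a minimal illustration, not how the JDK implements it: the class name `SurrogatePairMath` and its methods are hypothetical, and the constants are the ones defined by UTF-16 (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```java
// Hypothetical helper class, named for illustration only.
class SurrogatePairMath {
    // Encode a supplementary code point (U+10000..U+10FFFF) as two UTF-16 code units.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Combine a high and a low surrogate back into the code point.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1D50A);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D835 DD0A
        System.out.printf("%X%n", decode(pair[0], pair[1]));            // 1D50A
    }
}
```

In real code the built-in `Character.toChars` and `Character.toCodePoint` do the same job and should be preferred.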

And see my previous answer in SO for how to correctly iterate over all of the characters in a Java String. http://stackoverflow.com/questions/1527856/how-can-i-iterate-through-the-unicode-codepoints-of-a-java-string/1527891#1527891 — Jonathan Feinberg, Dec 21 '09 at 18:02

score 10 · Accepted Answer · answered Dec 21 '09 at 18:13

Java Strings are UTF-16 (big endian), so a Unicode code point can occupy one or two chars. Under this encoding, Java can represent the code point U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G) using the chars 0xD835 0xDD0A (String literal "\uD835\uDD0A"). The Character class provides methods for converting to and from code points.

// Unicode code point to char array
char[] math_fraktur_cap_g = Character.toChars(0x1D50A);
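As a rough sketch of the round trip, the snippet below goes from an `int` code point to chars and back using only standard `Character` and `String` methods. The class name `CodePointRoundTrip` and its two helper methods are made up for this example.

```java
// Hypothetical demo class; only the Character/String calls are real JDK API.
class CodePointRoundTrip {
    // Build a String from a single code point (one or two chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Read the code point that starts at index 0 of the string.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        int codePoint = 0x1D50A;                 // an int can hold any code point
        String s = fromCodePoint(codePoint);

        System.out.println(s.length());                                  // 2
        System.out.println(s.equals("\uD835\uDD0A"));                    // true
        System.out.println(Character.isHighSurrogate(s.charAt(0)));      // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));       // true
        System.out.println(firstCodePoint(s) == codePoint);              // true
        System.out.println(Character.charCount(codePoint));              // 2
    }
}
```

A single `char` cannot hold U+1D50A, which is why the code point APIs take and return `int`, while the `char[]` holds the UTF-16 encoded form.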

McDowell

Why not use an `int` instead of the `char[]` `math_fraktur_cap_g` to read the surrogate pairs of non-BMP characters, as mentioned [here](https://stackoverflow.com/a/13112474/3317808)? – overexchange Nov 09 '17 at 06:54

score 3 · Answer 3 · answered Dec 21 '09 at 18:01

Java uses UTF-16 for strings, which means characters are variable width. Most of them fit in a single 16-bit char, but those outside the Basic Multilingual Plane occupy 32 bits (two chars). The scheme is similar in spirit to UTF-8, which also uses a variable number of code units per character.
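One practical consequence is that `String.length()` counts chars, not characters. The sketch below, with a hypothetical class name `CodePointWalk`, walks a string one code point at a time using the standard `codePointAt` and `Character.charCount` methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, named for illustration only.
class CodePointWalk {
    // Collect every code point in the string, stepping over surrogate pairs.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "A\uD835\uDD0Az";        // 'A', U+1D50A, 'z'
        System.out.println(s.length());                       // 4
        System.out.println(s.codePointCount(0, s.length()));  // 3
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp);  // U+0041, U+1D50A, U+007A
        }
    }
}
```

On Java 8 and later, `s.codePoints()` gives the same sequence as an `IntStream`.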

el.pescado - нет войне
