I've read that Windows CE uses the "UTF-16 version of UNICODE" (I'm a newbie with encodings).

What happens when a string contains a character that requires more than 2 bytes, like Chinese characters? Does it take 3? If I have a string containing Chinese characters, will accessing the N-th pair of bytes not necessarily access the N-th visible symbol?

Also, what about performance? If I understand correctly, encodings that use a variable number of bytes per visible symbol require the string to be scanned from the beginning to access the N-th visible symbol, right? If so, is that also true for UTF-16?

Thank you.

Virus721
  • See 1) [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html), 2) [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/), and 3) [UTF-16](https://en.wikipedia.org/wiki/UTF-16). – Remy Lebeau Mar 02 '15 at 02:36

1 Answer

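What happens when a string contains a character that requires more than 2 bytes, like Chinese characters? Does it take 3?

No, four. Note, however, that most commonly used Chinese characters lie inside the Basic Multilingual Plane (below U+10000) and fit in a single 16-bit code unit, i.e. two bytes. Only characters outside that range need four bytes.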

Wikipedia: UTF-16:

In UTF-16, code points greater or equal to 2^16 are encoded using two 16-bit code units.
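
As a rough illustration of the quoted rule, here is a minimal Python sketch. The helper name `utf16_code_units` and the two sample characters are my own choices for illustration; the only library call used is the standard `str.encode` with the `utf-16-le` codec.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of `text` encoded as UTF-16 (little-endian, no BOM)."""
    raw = text.encode("utf-16-le")
    # Every UTF-16 code unit is exactly two bytes.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# Sample characters chosen for illustration only.
bmp_char = "\u4e2d"            # U+4E2D, a common Chinese character below 2^16
supplementary = "\U00020000"   # U+20000, a CJK character at or above 2^16

bmp_units = utf16_code_units(bmp_char)         # one code unit  -> 2 bytes
supp_units = utf16_code_units(supplementary)   # two code units -> 4 bytes

print([hex(u) for u in bmp_units])
print([hex(u) for u in supp_units])
```

The second character comes out as a surrogate pair: a high surrogate in the range 0xD800..0xDBFF followed by a low surrogate in the range 0xDC00..0xDFFF.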


If I understand well, encodings that have a variable number of bytes per visible symbol require the string to be scanned from the beginning to access the N-th visible symbol right?

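Yes, and this also holds for UTF-16: because of surrogate pairs it is a variable-width encoding, so the N-th 16-bit unit is not guaranteed to be the N-th character. See, for example, the question "Why use multibyte string functions in PHP?".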
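
To make the scanning cost concrete, here is a hedged Python sketch of looking up the N-th code point in a sequence of UTF-16 code units. The function name `nth_code_point` and the sample data are hypothetical, and this is not how any particular Windows CE API does it; it only shows why the lookup is linear rather than constant time.

```python
def nth_code_point(units, n):
    """Return the n-th (0-based) code point in a list of UTF-16 code units.

    The list has to be walked from the start, because each code point
    occupies either one or two code units.
    """
    i = 0       # position in code units
    index = 0   # position in code points
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine high and low surrogate into one code point.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            code_point = unit
            width = 1
        if index == n:
            return code_point
        index += 1
        i += width
    raise IndexError("code point index out of range")


# Sample data for illustration: "A", U+20000 (as a surrogate pair), "B".
sample = [0x0041, 0xD840, 0xDC00, 0x0042]

third = nth_code_point(sample, 2)   # the code point for "B"
naive = sample[2]                   # a lone low surrogate, not a character
```

Note that even a code point is not always one "visible symbol": combining marks can make a single displayed character span several code points, so this sketch only addresses the surrogate-pair part of the problem.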

CodeCaster