Can I get a single canonical UTF-8 string from a Unicode string?

Question

I have a twelve-year-old Windows program. As may be obvious to the knowledgeable, it was designed for ASCII characters, not Unicode. Most of it has been converted, but there's one spot that still needs to be changed over. There is a serious constraint on it though: the exact same ~~ASCII~~ byte sequence MUST be created by different encoders, some of which will be operating on non-Windows systems.

I'm trying to determine whether UTF-8 will do the trick or not. I've heard in passing that different UTF-8 sequences can come up with the same Unicode string, which would be a problem here.

So the question is: given a Unicode string, can I expect a single canonical UTF-8 sequence to be generated by any standards-conforming implementation of a converter? Or are there multiple possibilities?

*"the exact same ASCII sequence MUST be created"* -> did you mean the exact same *byte* sequence? It doesn't really make sense to call the result of a UTF-8 encoding an "ASCII sequence". — Wim Coenen, Nov 12 '10 at 15:34
"As may be obvious to the knowledgeable, it was designed for ASCII characters, not Unicode" – that's not obvious at all, given that the first Windows system based on Unicode was released in 1993. — Philipp, Nov 16 '10 at 18:02
The first NT-based version of Windows I saw was Windows 2000, before that it might as well not have existed. :-) — Head Geek, Nov 16 '10 at 22:41

score 4 · Accepted Answer · edited Nov 13 '10 at 13:59

Any given Unicode string will have only one representation in UTF-8.

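I think the confusion here is that Unicode offers multiple ways to get the same visual output for some characters. For example, "é" can be the single code point U+00E9 or the pair U+0065 U+0301 (an "e" followed by a combining acute accent). Not to mention that Unicode has several characters that have no visual representation.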

But this has nothing to do with UTF-8; it's a property of Unicode itself. Encoding a given Unicode string as UTF-8 is a purely mechanical process, and it's perfectly reversible.
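To illustrate the distinction, here is a minimal Python sketch. It assumes that "same string" should mean "same text after Unicode normalization", and it picks NFC as the normalization form. That choice, and the helper name `canonical_utf8`, are assumptions for illustration and not something the question specifies.

```python
import unicodedata

# Two different code point sequences that both render as "é".
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Different code points mean different UTF-8 bytes,
# even though the text looks identical on screen.
precomposed_bytes = precomposed.encode("utf-8")   # b'\xc3\xa9'
decomposed_bytes = decomposed.encode("utf-8")     # b'e\xcc\x81'


def canonical_utf8(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC first, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# After normalization both spellings produce the same byte sequence.
same_after_nfc = canonical_utf8(precomposed) == canonical_utf8(decomposed)
```

If every encoder agrees on one normalization form before encoding, visually identical input should yield identical bytes. Without that step, the bytes are still deterministic, but only per code point sequence.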

The conversion rules are here: http://en.wikipedia.org/wiki/UTF-8
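As a rough sketch of how mechanical those rules are, the following Python function encodes one code point by hand, following the bit layout described in RFC 3629. The function names are made up for this example; in practice the built-in `str.encode("utf-8")` does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (hand-written sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates and out-of-range values are not valid scalar values.
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def encode_utf8(text: str) -> bytes:
    """Encode a whole string by encoding each code point in order."""
    return b"".join(encode_code_point(ord(ch)) for ch in text)
```

Each code point falls into exactly one of the four ranges, so there is only one valid output per code point. That is why a conforming encoder has no choices to make.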

edited Nov 13 '10 at 13:59

tchrist

76,727
28
123
176

answered Nov 12 '10 at 15:35

John Knoeller

32,385
4
57
92

Thanks. You're right, after perusing RFC3629 (linked from the Wikipedia entry), I could see that. That may still cause trouble for our program, but it should be manageable. – Head Geek Nov 12 '10 at 17:53

score 3 · Answer 2 · answered Nov 13 '10 at 09:29

As John already said, there is only one standards-conforming UTF-8 representation.

But the tricky point is "standards-conforming". Older encoders are often unable to properly convert UTF-16 because of surrogate pairs. Java is one notable case: its "modified UTF-8" (used by DataOutputStream.writeUTF and JNI, for example) produces two 3-byte sequences, one per surrogate, instead of one 4-byte sequence. MySQL had problems until recently, and I am not sure about the current status.

Now, you will only have problems with code points that need surrogates, meaning those above U+FFFF. If your application survived without Unicode for a long time, it means you never needed to move such "esoteric" characters :-)
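The difference can be sketched in Python. The function `surrogate_wise_encoding` below is a hypothetical imitation of an encoder that treats each UTF-16 code unit as if it were a code point. It is not taken from any real library and exists only to show what the non-conforming output looks like.

```python
import struct


def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units of the text (surrogates stay separate)."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


def surrogate_wise_encoding(text: str) -> bytes:
    """Imitate a non-conforming encoder: encode every UTF-16 code unit
    on its own, so each surrogate becomes a separate 3-byte sequence."""
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass")
        for unit in utf16_code_units(text)
    )


def needs_surrogates(text: str) -> bool:
    """True if the text contains any code point above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


emoji = "\U0001F600"                       # a code point above U+FFFF
conforming = emoji.encode("utf-8")         # one 4-byte sequence
broken = surrogate_wise_encoding(emoji)    # two 3-byte sequences
```

For text made up only of code points at or below U+FFFF, both functions return identical bytes, so a check like `needs_surrogates` could be used to flag the inputs where encoders might disagree.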

But it is good to get things right from the get go. Try using standards-conforming encoders and you will be fine.

Thanks for the advice. I'll write my own converter when I get that far, just to be certain that it works right (or that if it doesn't, I can fix it). — Head Geek, Nov 15 '10 at 14:33
