I find in the new C++ Standard
2.11 Identifiers [lex.name]
identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined character
with the additional text
An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. [...]
I can not quite comprehend what this means. From the old std I am used to that a "universal character name" is written \u89ab for example. But using those in an identifier...? Really?
Is the new standard more open w.r.t to Unicode? And I do not refer to the new Literal Types "uHello \u89ab thing"u32, I think I understood those. But:
- Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?
- Can I write an identifier with
\u1234in itmyfu\u1234ntion(for whatever purpose) Or can i use the "character names" that unicode defines like in the ICU, i.e.
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;or even in an identifier in the source itself? That would be a treat... cough...
I think the answer to all thise questions is no but I can not map this reliably to the wording in the standard... :-)
Edit: I found "2.2 Phases of translation [lex.phases]", Phase 1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...] if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
By reading this I now think, that a compiler may choose to accept UTF-8, UTF-16 or any codepage it wishes (by meta information or user configuration). In Phase 1 it translates this into an ASCII form ("basic source character set") in which then the Unicode-characters are replaced by its \uNNNN notation (or the compiler can choose to continue to work in its Unicode-representation, but than has to make sure it handles the other \uNNNN the same way.
What do you think?