
I would like to understand how regular std::string and std::map operations deal with Unicode code units should they be present in the string.

Sample code:

    #include <iostream>
    #include "sys/types.h"

    using namespace std;

    int main()
    {

        std::basic_string<u_int16_t> ustr1(std::basic_string<u_int16_t>((u_int16_t*)"ยฤขฃ", 4));
        std::basic_string<u_int16_t> ustr2(std::basic_string<u_int16_t>((u_int16_t*)"abcd", 4));

        for (int i = 0; i < ustr1.length(); i++)
            cout << "Char: " << ustr1[i] << endl;

        for (int i = 0; i < ustr2.length(); i++)
            cout << "Char: " << ustr2[i] << endl;

        if (ustr1 == ustr2)
            cout << "Strings are equal" << endl;

        cout << "string length: " << ustr1.length() << "\t" << ustr2.length() << endl;
        return 0;
    }

The strings contain Thai characters and ASCII characters, and the intent behind using basic_string<u_int16_t> is to store characters that do not fit in a single byte. The code was run on a Linux box whose locale is en_US.UTF-8. The output is:

    $ ./a.out
    Char: 47328
    Char: 57506
    Char: 42168
    Char: 47328
    Char: 25185
    Char: 25699
    Char: 17152
    Char: 24936
    string length: 4        4

A few questions:

  1. Do the character values in the output correspond to en_US.UTF-8 code points? If not, what are they?

  2. Would the std::string operators like ==, !=, < etc. be able to work with Unicode code points? If so, would it be a mere comparison of the code points at corresponding positions? Would std::map work along similar lines?

  3. Would changing the locale to UTF-16 result in the strings getting stored as UTF-16 code points?

Thanks!

Maddy

1 Answer


> I would like to understand how regular std::string and std::map operations deal with Unicode code units should they be present in the string.

They don't.

std::string is a sequence of chars or bytes. It is not a "high-level" string taking any encoding into account. You must do that yourself, e.g. by using a library dedicated to that purpose such as ICU.
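
As a rough illustration of what "sequence of bytes" means in practice, the sketch below stores the question's Thai text as UTF-8 in a std::string. The helper names (`thai_sample`, `count_utf8_code_points`, `demo_byte_semantics`) are made up for this example, and the counter assumes well-formed UTF-8 input; it is not a validator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumption: `s` holds well-formed UTF-8.
std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// The question's Thai text (U+0E22 U+0E24 U+0E02 U+0E03) written as explicit
// UTF-8 bytes, so the example does not depend on the source file's encoding.
std::string thai_sample() {
    return "\xE0\xB8\xA2\xE0\xB8\xA4\xE0\xB8\x82\xE0\xB8\x83";
}

// The asserts document what std::string itself knows (bytes) versus what
// has to be computed on top of it (code points).
void demo_byte_semantics() {
    const std::string thai = thai_sample();
    assert(thai.size() == 12);                  // bytes, not characters
    assert(count_utf8_code_points(thai) == 4);  // needs encoding knowledge
    assert(thai[0] == '\xE0');                  // indexing yields one byte
}
```

`size()` reports 12 because each of the four Thai characters takes three bytes in UTF-8. Anything character-aware has to be layered on top, which is the job a library such as ICU does properly.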

Switching from std::string (i.e. std::basic_string<char>) to std::basic_string<u_int16_t> doesn't change that; it just means you have a sequence of "wide" characters instead.
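
To make that concrete for the question's output: the printed numbers look like pairs of UTF-8 bytes read as a single 16-bit value on a little-endian machine, rather than code points of any kind. The sketch below reproduces them without the pointer cast. `pack_pairs_little_endian` and `demo_reinterpreted_bytes` are hypothetical helpers, and little-endian byte order is an assumption about the machine that was used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Packs consecutive byte pairs into 16-bit values, low byte first. This is
// what reading the bytes through a 16-bit pointer yields on a little-endian
// machine. A trailing odd byte is ignored.
std::vector<std::uint16_t> pack_pairs_little_endian(const std::string& bytes) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(bytes[i]);
        const auto hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}

void demo_reinterpreted_bytes() {
    // 'a' = 0x61, 'b' = 0x62 -> 0x6261 = 25185; 'c','d' -> 0x6463 = 25699.
    const std::vector<std::uint16_t> ascii = {25185, 25699};
    assert(pack_pairs_little_endian("abcd") == ascii);

    // First eight UTF-8 bytes of the Thai literal (E0 B8 A2 E0 B8 A4 E0 B8).
    const std::vector<std::uint16_t> thai = {47328, 57506, 42168, 47328};
    assert(pack_pairs_little_endian("\xE0\xB8\xA2\xE0\xB8\xA4\xE0\xB8") == thai);
}
```

These match the first two values of the ASCII group and all four values of the Thai group in the output. The `"abcd"` literal holds only five bytes including the terminator, so asking for four 16-bit units reads past its end; the last two values of that group presumably come from whatever bytes happened to follow it in memory. Genuine UTF-16 for the Thai text would be the code units 0x0E22, 0x0E24, 0x0E02, 0x0E03, which requires an actual conversion, not a cast.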

And std::map has nothing to do with this at all.
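
In other words, std::map only calls the key type's `operator<` (through `std::less`), so with string keys the ordering is whatever the code-unit-wise comparison gives. A small sketch, assuming `std::u16string` (C++11) as a stand-in for `basic_string<u_int16_t>`; the constant and function names are illustrative only.

```cpp
#include <cassert>
#include <map>
#include <string>

// U+24B62 needs a surrogate pair in UTF-16: code units 0xD852 0xDF62.
const std::u16string kSupplementary = u"\U00024B62";
// U+FF5E fits in a single code unit, 0xFF5E.
const std::u16string kFullwidthTilde = u"\uFF5E";

std::map<std::u16string, int> make_map() {
    return {{kFullwidthTilde, 1}, {kSupplementary, 2}};
}

void demo_code_unit_ordering() {
    // One character, but two elements as far as the string is concerned.
    assert(kSupplementary.size() == 2);
    assert(kSupplementary[0] == 0xD852 && kSupplementary[1] == 0xDF62);

    // Equality is an element-by-element comparison of code units.
    const std::u16string same = {char16_t(0xD852), char16_t(0xDF62)};
    assert(kSupplementary == same);

    // Ordering is by code unit too: 0xD852 < 0xFF5E, so the supplementary
    // character sorts first even though its code point is larger.
    assert(kSupplementary < kFullwidthTilde);
    assert(make_map().begin()->first == kSupplementary);
}
```

Code point U+24B62 is larger than U+FF5E, yet its key comes first in the map, because the leading surrogate 0xD852 is smaller than 0xFF5E. Equality is unaffected: two strings compare equal exactly when their code units match, with no normalization or other Unicode-aware processing.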

Lightness Races in Orbit
  • Thanks for the clarification. If UTF-16 encoded strings (containing non-ASCII characters) are stored in a `std::basic_string<u_int16_t>`, how do the string operations like `==`, `!=`, `<` fare? – Maddy Apr 21 '16 at 04:49
  • I meant, how do the string operations *on* these strings fare? – Maddy Apr 21 '16 at 05:02
  • @Maddy: It's not clear what you're asking. What do you mean how do they "fare"? They fare excellently. They perform exactly the operation they are designed and specified to perform; that is, operations on a sequence of `char`/`u_int16_t`, with no regard for encoding whatsoever. Which is what I said in my answer. But I don't understand why you think `==` would ever fail to perform its job of checking for equality? – Lightness Races in Orbit Apr 21 '16 at 10:22
  • Consider the case where a `std::basic_string<u_int16_t>` stores a UTF-16 encoded string. What is essentially stored in memory are the code units corresponding to the characters, correct? If so, for the `==` operation, the values (code units, in this case) at corresponding positions are compared and deemed equal or unequal. Is the understanding correct so far? – Maddy Apr 22 '16 at 04:22
  • For instance, the UTF-16LE encoding of `𤭢` (U+24B62) is the bytes `52 D8 62 DF` (a surrogate pair). – Maddy Apr 22 '16 at 04:32
  • @Maddy: Yes? I'm not understanding what else it could possibly mean. `==` performs an equality comparison. It checks whether things are equal. – Lightness Races in Orbit Apr 22 '16 at 08:52
  • Thanks for the clarification! The discussion, though elementary, was useful :) – Maddy Apr 22 '16 at 14:40