Unable to work with utf8 character in c++

Question

#include <iostream>
#include <string>
#include <stdio.h>


using namespace std;
int main(){
    string str = "∑カ[キ…クケコ°サシÀスセÏÔÎソタ]—チツテトÃナニヌÊネノЖИѠѬѰѪᐂᑧᐫᐑᕓᕩᘷᙈᏍsᏜᎹ᳐盘的";
    cout << "--> String: " << str << endl;
    cout<<"--> Size str1: "<<str.size()<<endl;   

    for(unsigned ii=0; ii<=str.size();++ii)
    { 
        cout <<"--> ii: "<<ii<< " --> Character: "<< str[ii] <<endl;
    }
}

I'm using the ConEmu console with chcp 65001 (UTF-8), and everything works fine when displaying the whole string str. But when I try to display each individual character of str, the output is wrong. Can anybody tell me how to work with individual characters?

Thanks

Try to use `wcout` and `wstring`, and add prefix `L` before string literal? — con ko, Apr 01 '20 at 16:20
@JohnDing Please absolutely don’t. `std::wstring` should be relegated to history and never touched except to interface with legacy API. — Konrad Rudolph, Apr 01 '20 at 16:24
@KonradRudolph ahh, I just wanted to give a quick fix in the comment section. I know using `std::u16string` and `std::u32string` would be better. However, it's hard to find suitable IO operations for them. As for `std::u8string`, I don't think C++20 is an option. — con ko, Apr 01 '20 at 16:28
Also, `for(unsigned ii=0; ii<=str.size();++ii)` out of bounds. — Ashwani, Apr 01 '20 at 16:29
@KonradRudolph Of course, your suggestion is absolutely correct. — con ko, Apr 01 '20 at 16:29
@JohnDing that won't help! On Windows it's UTF-16 so you can't print characters outside the BMP if you print the individual code units — phuclv, Apr 02 '20 at 10:32

score 0 · Answer 1 · answered Apr 01 '20 at 16:25

Can anybody tell me how to work with individual characters?

By following the Unicode specification.

An individual char object in a UTF-8 encoded C++ string holds one code unit, not a whole Unicode character. A character outside ASCII spans several consecutive code units, so printing each code unit on its own, with other output interleaved between them, breaks the encoding.

There is no standard C++ function to iterate over Unicode characters.

rustyx · Answer 2 · 2020-04-02T10:13:20.247

UTF-8 uses between 1 and 4 bytes to encode a single character.

So you can decode it by reading as many bytes as needed based on the value of the first byte:

0xxxxxxx - 1 byte
110xxxxx - 2 bytes
1110xxxx - 3 bytes
11110xxx - 4 bytes

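(notice the gaps between these patterns: a byte of the form 10xxxxxx is a continuation byte and can only follow a lead byte, and a byte of the form 11111xxx never appears in valid UTF-8)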

For example like this:

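#include <iomanip>
#include <iostream>
#include <string>

using namespace std;
int main() {
    string str = "∑カ[キ…クケコ°サシÀスセÏÔÎソタ]—チツテトÃナニヌÊネノЖИѠѬѰѪᐂᑧᐫᐑᕓᕩᘷᙈᏍsᏜᎹ᳐盘的";
    cout << "--> String: " << str << endl;
    cout << "--> Size str1: " << str.size() << endl;

    string buf;               // bytes of the code point currently being collected
    int i = 0, count = 0;     // i: code point index, count: continuation bytes still expected
    for (unsigned char c : str)
    {
        if (count == 0) {
            // Lead byte: start a new sequence and work out how many bytes follow.
            // (No validation: the input is assumed to be well-formed UTF-8.)
            buf = c;
            if (c >= 0xF0)
                count = 3;
            else if (c >= 0xE0)
                count = 2;
            else if (c >= 0xC0)
                count = 1;
        } else {
            // Continuation byte: append it to the current sequence.
            buf += c;
            --count;
        }
        if (count > 0)
            continue;   // sequence not complete yet
        // Note: std::hex below is sticky, so from the second iteration on
        // the index i is printed in hexadecimal as well.
        cout << "--> ii: " << i++ << " --> Character: " << buf;
        cout << "  UTF-8 bytes:";
        for (unsigned char b : buf) {
            cout << " " << uppercase << hex << setfill('0') << setw(2) << (int)b;
        }
        cout << endl;
    }
}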

Output:

--> String: ∑カ[キ…クケコ°サシÀスセÏÔÎソタ]—チツテトÃナニヌÊネノЖИѠѬѰѪᐂᑧᐫᐑᕓᕩᘷᙈᏍsᏜᎹ᳐盘的
--> Size str1: 140
--> ii: 0 --> Character: ∑  UTF-8 bytes: E2 88 91
--> ii: 1 --> Character: カ  UTF-8 bytes: E3 82 AB
--> ii: 2 --> Character: [  UTF-8 bytes: 5B
--> ii: 3 --> Character: キ  UTF-8 bytes: E3 82 AD
--> ii: 4 --> Character: …  UTF-8 bytes: E2 80 A6
--> ii: 5 --> Character: ク  UTF-8 bytes: E3 82 AF
--> ii: 6 --> Character: ケ  UTF-8 bytes: E3 82 B1
--> ii: 7 --> Character: コ  UTF-8 bytes: E3 82 B3
--> ii: 8 --> Character: °  UTF-8 bytes: C2 B0
--> ii: 9 --> Character: サ  UTF-8 bytes: E3 82 B5
--> ii: A --> Character: シ  UTF-8 bytes: E3 82 B7
--> ii: B --> Character: À  UTF-8 bytes: C3 80
--> ii: C --> Character: ス  UTF-8 bytes: E3 82 B9
--> ii: D --> Character: セ  UTF-8 bytes: E3 82 BB
--> ii: E --> Character: Ï  UTF-8 bytes: C3 8F
--> ii: F --> Character: Ô  UTF-8 bytes: C3 94
--> ii: 10 --> Character: Î  UTF-8 bytes: C3 8E
--> ii: 11 --> Character: ソ  UTF-8 bytes: E3 82 BD
--> ii: 12 --> Character: タ  UTF-8 bytes: E3 82 BF
--> ii: 13 --> Character: ]  UTF-8 bytes: 5D
--> ii: 14 --> Character: —  UTF-8 bytes: E2 80 94
--> ii: 15 --> Character: チ  UTF-8 bytes: E3 83 81
--> ii: 16 --> Character: ツ  UTF-8 bytes: E3 83 84
--> ii: 17 --> Character: テ  UTF-8 bytes: E3 83 86
--> ii: 18 --> Character: ト  UTF-8 bytes: E3 83 88
--> ii: 19 --> Character: Ã  UTF-8 bytes: C3 83
--> ii: 1A --> Character: ナ  UTF-8 bytes: E3 83 8A
--> ii: 1B --> Character: ニ  UTF-8 bytes: E3 83 8B
--> ii: 1C --> Character: ヌ  UTF-8 bytes: E3 83 8C
--> ii: 1D --> Character: Ê  UTF-8 bytes: C3 8A
--> ii: 1E --> Character: ネ  UTF-8 bytes: E3 83 8D
--> ii: 1F --> Character: ノ  UTF-8 bytes: E3 83 8E
--> ii: 20 --> Character: Ж  UTF-8 bytes: D0 96
--> ii: 21 --> Character: И  UTF-8 bytes: D0 98
--> ii: 22 --> Character: Ѡ  UTF-8 bytes: D1 A0
--> ii: 23 --> Character: Ѭ  UTF-8 bytes: D1 AC
--> ii: 24 --> Character: Ѱ  UTF-8 bytes: D1 B0
--> ii: 25 --> Character: Ѫ  UTF-8 bytes: D1 AA
--> ii: 26 --> Character: ᐂ  UTF-8 bytes: E1 90 82
--> ii: 27 --> Character: ᑧ  UTF-8 bytes: E1 91 A7
--> ii: 28 --> Character: ᐫ  UTF-8 bytes: E1 90 AB
--> ii: 29 --> Character: ᐑ  UTF-8 bytes: E1 90 91
--> ii: 2A --> Character: ᕓ  UTF-8 bytes: E1 95 93
--> ii: 2B --> Character: ᕩ  UTF-8 bytes: E1 95 A9
--> ii: 2C --> Character: ᘷ  UTF-8 bytes: E1 98 B7
--> ii: 2D --> Character: ᙈ  UTF-8 bytes: E1 99 88
--> ii: 2E --> Character: Ꮝ  UTF-8 bytes: E1 8F 8D
--> ii: 2F --> Character: s  UTF-8 bytes: 73
--> ii: 30 --> Character: Ꮬ  UTF-8 bytes: E1 8F 9C
--> ii: 31 --> Character: Ꮉ  UTF-8 bytes: E1 8E B9
--> ii: 32 --> Character: ᳐  UTF-8 bytes: E1 B3 90
--> ii: 33 --> Character: 盘  UTF-8 bytes: E7 9B 98
--> ii: 34 --> Character: 的  UTF-8 bytes: E7 9A 84

As you can see, each Unicode code point in this string is encoded using 1, 2 or 3 bytes (note that a char is always exactly 1 byte in C++, so a multi-byte code point spans several char objects).

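This can be inconvenient if you want to work with individual Unicode symbols as a unit. In this case you can convert the string to a wstring and work with the wide-character type (wchar_t) instead of char. Be aware that wchar_t is 16 bits wide on Windows, so code points outside the Basic Multilingual Plane still occupy two wchar_t units there.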

See the following link to the question about how to convert a string to a wstring.

`UTF-8 uses between 1 and 4 bytes to encode a single character.` More accurately, 1 to 4 bytes to encode a single code point. A character is not necessarily a single code point, but may consist of multiple due to combining characters. This example logic would break in presence of those. — eerorika, Apr 01 '20 at 16:46
By character I mean Unicode character (a.k.a. code point), not the C++ `char`. — rustyx, Apr 01 '20 at 16:48
by character, I mean grapheme cluster i.e. a single symbol that you see on screen. — eerorika, Apr 01 '20 at 16:48
It's working for display, but do you think it will work by using the find method ? like finding a character in the str ? — Gilles06, Apr 01 '20 at 17:16
Yes, you can search for a substring consisting of bytes of a single Unicode code point. This works because UTF-8 is self-synchronizing i.e. it's impossible to find a code point at the middle of another code point. What you cannot do is select Unicode characters randomly by offset, you can only do that sequentially since the position of a UTF-8 character depends on the size of all previous characters in the string. — rustyx, Apr 01 '20 at 18:22
Thks. But I still do not understand why, when using "shuffle (str.begin(), str.end(), default_random_engine(seed));" and displaying the str afterwards, I get a wrong display. Is the byte value affected by the shuffle? — Gilles06, Apr 02 '20 at 09:45
A UTF-8 encoded symbol consists of multiple bytes. shuffle breaks that because it moves individual bytes around. See my updated answer for more details. — rustyx, Apr 02 '20 at 10:13
Thks rustyx. Does that mean c++ is not able to manipulate Unicode easily ? Python 3.4 is doing it — Gilles06, Apr 03 '20 at 11:11
rustyx: If I understood well your solution, I have to isolate and work with substring at bytes level of each character ? If so, it's very complicated. — Gilles06, Apr 03 '20 at 13:40
Yes, C++ is much more low-level than Python and requires full understanding of how strings are represented in memory. But as I said, consider trying out `wstring`, it can simplify working with Unicode. — rustyx, Apr 04 '20 at 12:12
