
When reading input from std::cin on Windows, the input apparently always arrives in Windows-1252 encoding (the default for the host machine in my case), regardless of any configuration I apply; those settings apparently affect only the output. Is there a proper way to capture input on Windows in UTF-8 encoding?

For instance, let's check out this program:

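#include <iostream>
#include <locale>
#include <string>

int main(int argc, char* argv[])
{
    // Ask both streams to use a Spanish UTF-8 locale.
    std::cin.imbue(std::locale("es_ES.UTF-8"));
    std::cout.imbue(std::locale("es_ES.UTF-8"));

    std::cout << "ñeñeñe> ";
    std::string in;
    std::getline(std::cin, in);  // read one line of user input
    std::cout << in;             // write the captured line back

    return 0;
}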

I compiled it with Visual Studio 2022 on a Windows machine with a Spanish locale. The source code is saved as UTF-8. When I run the resulting program (in a Windows PowerShell session, after running chcp 65001 to set the console code page to UTF-8), I see the following:

PS C:\> .\test_program.exe
ñeñeñe> ñeñeñe
 e e e

The first "ñeñeñe" is correct: the "ñ" character is displayed correctly on the console. So far, so good. The user input is also echoed correctly as it is typed: another good point. But when the program writes the captured string back to the output, each "ñ" character is replaced by a blank space.

When debugging this program, I see that the variable "in" has captured the input in an encoding that is not UTF-8: it uses only one byte for "ñ", whereas in UTF-8 that character takes two. My conclusion is that the input is not affected by the chcp command. Am I doing something wrong?
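A quick way to confirm which encoding actually reached the program, without a debugger, is to print the captured bytes in hexadecimal. The following is only a minimal sketch; the helper name hex_bytes is hypothetical and not part of any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render each byte of `s` as two lowercase hex digits,
// separated by single spaces, so the raw encoding can be inspected.
std::string hex_bytes(const std::string& s)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if (i != 0) {
            out += ' ';
        }
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

Printing hex_bytes(in) right after the std::getline call shows what was received: in UTF-8 each "ñ" is the two bytes c3 b1, while in Windows-1252 it is the single byte f1. Any other single byte in that position also means the input was not delivered as UTF-8.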

UPDATE

Somebody asked me to show what happens when switching to wcout/wcin:

std::wcout << u"ñeñeñe> ";
std::wstring in;
std::getline(std::wcin, in);
std::wcout << in;

Behaviour:

PS C:\> .\test.exe
0,000,7FF,6D1,B76,E30ñeñeñe
 e e e

Another try (writing the string as L"ñeñeñe"):

ñeñeñe> ñeñeñe
 e e e

Leaving the literal without any prefix:

std::wcout << "ñeñeñe> ";

Result is:

eee>
Raul Luna
  • Have you tried `wcin`? – Thomas Weller Mar 14 '22 at 09:54
  • Yes, I've tried, and got some... let's say flowery results. With wcin and wcout only, the input string is written back as rubbish, because it is encoded internally as (maybe) UTF-16 and sent back as UTF-16 when the expected string should be UTF-8. I can update the question accordingly if you want – Raul Luna Mar 14 '22 at 09:57
  • Yes, I think that's helpful to tell a) that you know about it and b) it does not solve the problem. – Thomas Weller Mar 14 '22 at 10:01
  • It seems it's really that bad: https://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/#utf8_mode_input – Thomas Weller Mar 14 '22 at 10:13
  • Definitely that's the problem. – Raul Luna Mar 14 '22 at 10:27
  • I wonder if there is a workaround for this, such as using Windows Forms or something similar that emulates a console. The problem is that I want to create a simple console application that accepts and returns UTF-8, and it seems impossible – Raul Luna Mar 14 '22 at 10:28
  • It is not impossible but you might have to do it yourself with the CRT. – Anders Mar 14 '22 at 11:50
  • @Anders, I really doubt that nobody has ever bumped into this. How is it done in Java, for instance? AFAIK, Java is compiled in C and internally its strings are UTF-8, so some kind of conversion is made behind the scenes – Raul Luna Mar 14 '22 at 13:12
  • @RaulLuna: it uses a different method. Consoles read the keys and translate them into characters to send to the program. Many programs do it the hard way: read the scan code, interpret the keyboard, and work out which character was typed. – Giacomo Catenazzi Mar 14 '22 at 13:37
  • Microsoft recently implemented a pty API. It was accompanied by a series of blog posts, which were interesting: they said that many programs just place a console off the visible desktop (with an active keyboard) and grab images of that console, because there was no other way to get a reliable interface. Ugly. – Giacomo Catenazzi Mar 14 '22 at 13:41
  • I meant to say without the CRT. Meaning, `ReadFile` on `GetStdHandle`. You of course have to convert the input bytes yourself if they are not already UTF-8. – Anders Mar 14 '22 at 13:52
  • You can use `_setmode` on both stdout and stdin to read wide strings that you could convert to UTF-8. See if [this answer](https://stackoverflow.com/a/65816756/235698) helps you. – Mark Tolonen Mar 14 '22 at 15:42
  • Yes, @MarkTolonen, I've tried it that way, but the results are not satisfactory (see my own answer). The funny part is that, because the program is stored (or compiled, I don't know) as UTF-8, the string gets converted in funny ways, resulting in rubbish most of the time – Raul Luna Mar 15 '22 at 15:27
  • The best solution I've found so far comes from the example program on this page: https://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/#utf8_mode_input. It allows me to 1) write the string literals in Unicode and 2) input text in Unicode and get the output in Unicode as well – Raul Luna Mar 15 '22 at 17:07

1 Answer


This is the closest thing to a solution I've found so far:

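#include <fcntl.h>   // _O_WTEXT
#include <io.h>      // _setmode
#include <stdio.h>   // _fileno, stdin, stdout
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    // Put both console streams in wide-character (UTF-16) text mode.
    _setmode(_fileno(stdout), _O_WTEXT);
    _setmode(_fileno(stdin), _O_WTEXT);

    std::wcout << L"ñeñeñe";
    std::wstring in;
    std::getline(std::wcin, in);  // read one line as wide characters
    std::wcout << in;             // write it back through the wide stream

    return 0;
}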

The solution described here goes in the right direction. One caveat: both stdin and stdout must be put in the same mode, because the console echo rewrites the input. The remaining problem is having to write the string literals with \uXXXX codes... I am still working out how to avoid that, or whether to use #defines to make the text literals clearer.
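If the goal is still to end up with UTF-8 in a std::string, as in the question, the wide string read from std::wcin can be converted afterwards. The following is a minimal sketch of that conversion, not the method used by the page linked in the comments. The helper name wide_to_utf8 is hypothetical, and it assumes the wide string holds UTF-16 where wchar_t is 16 bits (as on Windows) or UTF-32 where it is 32 bits.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode a wide string as UTF-8.
// Assumption: UTF-16 input when wchar_t is 16 bits, UTF-32 when it is 32 bits.
std::string wide_to_utf8(const std::wstring& w)
{
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        char32_t cp = static_cast<char32_t>(w[i]);

        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()) {
            const char32_t low = static_cast<char32_t>(w[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }

        // Unpaired surrogates and out-of-range values become U+FFFD.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD;
        }

        if (cp < 0x80) {            // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes, e.g. U+00F1 (ñ)
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Calling wide_to_utf8(in) on the std::wstring read above would give a UTF-8 std::string for internal processing or for writing to a file. On Windows, the Win32 function WideCharToMultiByte with CP_UTF8 performs the same conversion; the hand-written version is used here only to keep the sketch portable.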

Raul Luna
  • Send wide strings to `wcout` (`L"你好"`). You don't have to use escape codes if you save the source in UTF-8 and set the compiler source charset to UTF-8 if it is not the default: `/utf-8` on the MSVC compiler. – Mark Tolonen Mar 16 '22 at 01:32
  • Ok, thanks. Will do. – Raul Luna Mar 16 '22 at 08:19
  • Hi Raul Luna, glad to know you've found the solution to resolve this issue! Please consider answering it and accepting it as an answer to change its status to Answered. It will also help others to solve a similar issue. See [can I answer my own question..](https://stackoverflow.com/help/self-answer), Just a reminder :) – Yujian Yao - MSFT Mar 17 '22 at 05:15