Convert ISO-8859-1 strings to UTF-8 in C/C++

Question

You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 encoding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.

I found one commercial product, but it's beyond my budget at this time.

There's nothing simple about it. You could use the open source ICU library. — Hans Passant, Oct 30 '10 at 17:23
If you have to do it, then the simplest code is to pre-generate a table of the 128 (or so) UTF-8 characters corresponding to the 8859-1 characters with the top bit set. The other 128 8859-1 characters are unmodified. That way, your code doesn't have to understand Unicode at all. Also, beware the difference between ISO-8859-1 and Windows CP-1252. The latter has some extra characters in it where 8859-1 has gaps (unused code points). Unless you're supposed to be validating that your input really is ISO-8859-1, there's no point not accepting CP-1252, because you *will* see it mislabelled. — Steve Jessop, Oct 30 '10 at 17:30
@Steve: since UTF-8 is variable length (in this case, 1 or 2 bytes per character), a lookup table is not so easy to use. See my answer which should be just as fast and a lot simpler. — R.. GitHub STOP HELPING ICE, Oct 30 '10 at 17:54
@R.: well, "easy" is a relative term. `stpcpy` helps, provided you're the kind of programmer who's good with buffer sizes. — Steve Jessop, Oct 30 '10 at 18:48
`stpcpy` (even if it is standard or headed towards being standard now..?) is a helluvalot of overhead for 1- and 2-byte copies. You'd be better off just always copying 2 bytes (by hand) and including some code to skip the second pointer advance if the byte copied was 0 (which can almost surely be branchless). — R.. GitHub STOP HELPING ICE, Oct 31 '10 at 16:48

score 40 · Accepted Answer · answered Oct 30 '10 at 17:53

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

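unsigned char *in, *out;  /* in: NUL-terminated ISO-8859-1 input; out: output buffer (see the size note below) */
while (*in)
    if (*in<128) *out++=*in++;  /* ASCII: copy unchanged */
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;  /* 0x80-0xFF: lead byte 0xC2 or 0xC3, then one continuation byte */
*out = 0;  /* terminate the output string */

For safety you need to ensure that the output buffer is at least twice as large as the input buffer, plus one byte for the terminator, or else include a size limit and check it in the loop condition.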
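The size-limit variant mentioned above could look roughly like this. It is only a sketch: the function name `latin1_to_utf8`, its signature, and the `(size_t)-1` error convention are my own assumptions, not part of the original loop.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are illustrative).
 * Converts NUL-terminated ISO-8859-1 text in `in` to UTF-8 in `out`,
 * writing at most `outsize` bytes including the terminator.
 * Returns the number of bytes written (excluding the terminator),
 * or (size_t)-1 if the output buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return (size_t)-1;

    for (; *in; in++) {
        size_t need = (*in < 0x80) ? 1 : 2;

        /* Always keep one byte free for the terminator. */
        if (outsize - n <= need) {
            out[n] = 0;
            return (size_t)-1;
        }
        if (*in < 0x80) {
            out[n++] = *in;                     /* ASCII: copy unchanged */
        } else {
            out[n++] = 0xc2 + (*in > 0xbf);     /* lead byte: 0xC2 or 0xC3 */
            out[n++] = (*in & 0x3f) + 0x80;     /* continuation byte */
        }
    }
    out[n] = 0;
    return n;
}
```

With a buffer of `2 * strlen(in) + 1` bytes the error path can never trigger; with a smaller buffer the function stops cleanly on a character boundary instead of overrunning.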

answered Oct 30 '10 at 17:53

R.. GitHub STOP HELPING ICE


Wow. This is very helpful! I wasn't looking forward to yet-another table lookup algorithm. Now for ANSEL-to-UTF-8... – gordonwd Oct 30 '10 at 18:31
This certainly answers the question. But as I said in a comment above, people *will* send you CP-1252 mislabelled as ISO-8859-1. Web servers are the example that I've tripped over that persuaded me of the problem, but also text editors that claim to be saving as "Latin-1" when they aren't. That "if your source encoding will always be ISO-8859-1" is a pretty big "if", and it might be hard to track down and eliminate the miscreant responsible. – Steve Jessop Oct 30 '10 at 18:46
@Steve: You could add an `else if (*in<192) goto error;` case to error-out on encountering any ISO-8859-1 control codes (which are probably misencoded Windows-1252 characters, and not useful characters anyway). – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:36
@gordon: I'm not familiar with ANSEL, but you should be aware that ISO-8859-1 is the **only** legacy encoding that's this easy to convert to UTF-8. Everything else will require lookup tables. As Steve said, my "If.." is a **big** if. – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:37
This is quite poorly written code from a maintainability standpoint. Use more braces. – syb0rg Feb 04 '14 at 00:18
How would I simplify this to do only 1 character? I'm trying to understand what this code does, and the simplification will help me to understand. – user230910 Sep 29 '15 at 07:56
@MaximEgorushkin Not trying to defend the code, but it does have `,`, which acts as a sequence point. – user694733 Sep 29 '15 at 08:18
@user694733 You are right, there is a sequence point at built-in `,` operator. – Maxim Egorushkin Sep 29 '15 at 08:25
@R.. < 192 "not useful"? 163, the £ sign, is useful for us folks in the UK. I think you meant 160 rather than 192. – Nick Jun 16 '19 at 15:59
@Nick: Yep, I meant 0xA0 and just converted to decimal in my head incorrectly. Comment is way too old to edit though. – R.. GitHub STOP HELPING ICE Jun 16 '19 at 20:06
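
For the one-character question above, a minimal sketch might help. The function names are illustrative and not from the answer, and the second helper uses the corrected 0xA0 threshold to flag the 0x80-0x9F range discussed in the comments.

```c
/* Hypothetical helper: encodes one ISO-8859-1 byte `c` as UTF-8 into `out`
 * (room for 2 bytes) and returns the number of bytes written: 1 or 2. */
int latin1_char_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        out[0] = c;                 /* ASCII: one byte, unchanged */
        return 1;
    }
    out[0] = 0xc0 | (c >> 6);       /* top 2 bits: 0xC2 for 0x80-0xBF, 0xC3 for 0xC0-0xFF */
    out[1] = 0x80 | (c & 0x3f);     /* low 6 bits in a continuation byte */
    return 2;
}

/* Returns nonzero for 0x80-0x9F. In real ISO-8859-1 these are C1 control
 * codes, so seeing them often means the input is actually Windows-1252. */
int latin1_is_c1_control(unsigned char c)
{
    return c >= 0x80 && c < 0xa0;
}
```

Worked by hand for the pound sign: 0xA3 is binary 10100011. The top two bits (10) give the lead byte 0xC0 | 0x02 = 0xC2, and the low six bits (100011) give 0x80 | 0x23 = 0xA3, so the UTF-8 form is C2 A3. The `0xc2+(*in>0xbf)` expression in the answer computes the same lead byte, because the top two bits of any byte at or above 0x80 can only be 10 or 11.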

score 16 · Answer 2 · answered Oct 05 '16 at 21:37

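For C++ I use this:

#include <cstdint>
#include <string>

// Converts an ISO-8859-1 string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    strOut.reserve(str.size());  // at least one output byte per input byte
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        std::uint8_t ch = *it;
        if (ch < 0x80) {
            // ASCII: copy unchanged
            strOut.push_back(ch);
        }
        else {
            // 0x80-0xFF: lead byte 0xC2 or 0xC3, then one continuation byte
            strOut.push_back(0xc0 | (ch >> 6));
            strOut.push_back(0x80 | (ch & 0x3f));
        }
    }
    return strOut;
}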

answered Oct 05 '16 at 21:37

Lord Raiden

Can you please share the Latin7 version? – Ronalds Mazītis May 27 '21 at 16:58

score 5 · Answer 3 · edited Dec 11 '20 at 14:38

You can use the boost::locale library:

http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html

The code would look like this:

#include <boost/locale.hpp>
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string,"Latin1");

edited Dec 11 '20 at 14:38

jpo38

answered May 31 '17 at 12:09

Spacemoose

score 3 · Answer 4 · edited Jan 24 '18 at 12:36

The C++03 standard does not provide functions to directly convert between specific charsets.

Depending on your OS, you can use iconv() on Linux, or MultiByteToWideChar() and WideCharToMultiByte() on Windows (converting via UTF-16). ICU is an open-source library that provides extensive support for string conversion.
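
As a rough illustration of the iconv() route, here is a sketch assuming a POSIX system whose iconv recognises the encoding names "ISO-8859-1" and "UTF-8" (glibc does; other implementations may need different names or linking with `-liconv`). The wrapper name `latin1_to_utf8_iconv` and its return convention are my own.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper around POSIX iconv (name is illustrative).
 * Converts `inlen` bytes of ISO-8859-1 from `in` into UTF-8 in `out`,
 * whose capacity `outsize` includes the terminator.
 * Returns 0 on success, -1 on failure (including a too-small buffer). */
int latin1_to_utf8_iconv(const char *in, size_t inlen, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv takes char **, but does not modify the input bytes */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return -1;
    outleft = outsize - 1;    /* reserve room for the terminator */

    cd = iconv_open("UTF-8", "ISO-8859-1");   /* arguments are (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}
```

The same wrapper would handle other source encodings by changing the second argument of iconv_open(), which is the main reason to prefer a library over a hand-written loop.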

edited Jan 24 '18 at 12:36

Cheers and hth. - Alf

answered Oct 30 '10 at 17:29

cytrinox

"The C++ standard does not provide functions to directly convert between charsets" – Cheers and hth. - Alf Jan 24 '18 at 12:34

score 2 · Answer 5 · answered Oct 31 '10 at 00:44

The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

It would not be difficult to parse that table directly and form a lookup table from it at compile time.
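
A hand-written version of such a lookup table might look like the sketch below. The values for 0x80-0x9F are transcribed by hand rather than parsed from the Unicode table, so check them against the published mapping before relying on them. The names `cp1252_high` and `cp1252_char_to_utf8` are illustrative, and returning 0 for the five undefined bytes is my own choice.

```c
#include <stddef.h>

/* Unicode code points for Windows-1252 bytes 0x80-0x9F. A 0 marks the five
 * bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D). */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: encodes one Windows-1252 byte as UTF-8 into `out`
 * (room for 3 bytes). Returns the number of bytes written, or 0 if the
 * byte is undefined in Windows-1252. */
size_t cp1252_char_to_utf8(unsigned char c, unsigned char out[3])
{
    unsigned cp = c;   /* outside 0x80-0x9F the byte value is the code point */

    if (c >= 0x80 && c < 0xA0) {
        cp = cp1252_high[c - 0x80];
        if (cp == 0)
            return 0;
    }
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Note that the output can now be up to three bytes per input byte, so an output buffer sized for ISO-8859-1 (twice the input) is no longer enough.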

score 0 · Answer 6 · answered Oct 30 '10 at 17:39

ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.

The C++ aspects -- integrating that with iostreams -- are much harder.

I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.
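
For reference, the general encoding algorithm for an arbitrary code point can be sketched as below. This is my own illustration, not code from the answer: the name `utf8_encode` and the return-0-on-invalid convention are assumptions. For ISO-8859-1 input only the first two branches are ever reached.

```c
#include <stddef.h>

/* Hypothetical helper: encodes Unicode code point `cp` as UTF-8 into `out`
 * (room for 4 bytes). Returns the number of bytes written, or 0 if `cp` is
 * a surrogate (U+D800-U+DFFF) or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 7 bits: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 21 bits: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

Taking the code point as an unsigned type sidesteps the signed `char` pitfall: a Latin-1 byte should be cast to `unsigned char` before being passed in.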

Cheers & hth.,

answered Oct 30 '10 at 17:39

Cheers and hth. - Alf

The algorithm is not entirely trivial, especially when novice to intermediate C coders often mistakenly use `char *` where `unsigned char *` is needed. More significant nontrivialities are in the definition of UTF-8, specifically that you need to reject surrogate codepoints and out-of-range values. Thankfully those won't come up in an encoder that only needs to handle ISO-8859-1 input, but if you write such a limited encoder it's likely someone will end up misusing it for a larger input range later without adding any checks. – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:40
@MichałLeon: Unicode is not an encoding. There are a number of different encodings of Unicode, including UTF-8 and UTF-16. The first 256 code points of Unicode are the same as Latin 1 (a.k.a. ISO-8859-1). Note: emphasis doesn't make you less at odds with trivial fact. Next time, instead of shouting and downvoting, consider simply checking facts, or just ask about anything you don't understand. – Cheers and hth. - Alf Jan 23 '18 at 17:23
@Martin: The block of Unicode code points 128 through 255 is called the ["Latin-1 supplement" of Unicode](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)), because it's the same as Latin-1. Unicode is a direct extension of Latin-1. Your comments are absurd nonsense, the kind of techno-babble that can influence non-technical people and indicates trolling. I presume you're trolling. – Cheers and hth. - Alf Jan 24 '18 at 10:59
@MichałLeon: OK, sorry. I should maybe have guessed: I have for many years helped a student with extremely bad eye-sight, and she routinely fails to see what's right there. Latin-1 is specified in the OP's posting, in my answer, in all my comments, and in the other answers except one. – Cheers and hth. - Alf Jan 24 '18 at 13:50

score 0 · Answer 7 · answered May 27 '21 at 17:37

Why would you need -1 and not -7? According to my test with SQL, you can't even store special characters in -1. So what exactly are you trying to convert?

answered May 27 '21 at 17:37

Ronalds Mazītis
