24

You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 encoding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but I need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.

I found one commercial product, but it's beyond my budget at this time.

Machavity
gordonwd
  • 4
    There's nothing simple about it. You could use the open source ICU library. – Hans Passant Oct 30 '10 at 17:23
  • 3
    If you have to do it, then the simplest code is to pre-generate a table of the 128 (or so) UTF-8 characters corresponding to the 8859-1 characters with the top bit set. The other 128 8859-1 characters are unmodified. That way, your code doesn't have to understand Unicode at all. Also, beware the difference between ISO-8859-1 and Windows CP-1252. The latter has some extra characters in it where 8859-1 has gaps (unused code points). Unless you're supposed to be validating that your input really is ISO-8859-1, there's no point not accepting CP-1252, because you *will* see it mislabelled. – Steve Jessop Oct 30 '10 at 17:30
  • @Steve: since UTF-8 is variable length (in this case, 1 or 2 bytes per character), a lookup table is not so easy to use. See my answer which should be just as fast and a lot simpler. – R.. GitHub STOP HELPING ICE Oct 30 '10 at 17:54
  • @R.: well, "easy" is a relative term. `stpcpy` helps, provided you're the kind of programmer who's good with buffer sizes. – Steve Jessop Oct 30 '10 at 18:48
  • `stpcpy` (even if it is standard or headed towards being standard now..?) is a helluvalot of overhead for 1- and 2-byte copies. You'd be better off just always copying 2 bytes (by hand) and including some code to skip the second pointer advance if the byte copied was 0 (which can almost surely be branchless). – R.. GitHub STOP HELPING ICE Oct 31 '10 at 16:48

7 Answers

40

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

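/* in: NUL-terminated ISO-8859-1; out: room for 2*strlen(in)+1 bytes */
void latin1_to_utf8(const unsigned char *in, unsigned char *out)
{
    while (*in)
        if (*in<128) *out++=*in++;  /* ASCII is copied unchanged */
        else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;  /* lead byte, then continuation byte */
    *out=0;  /* terminate the output string */
}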

For safety you need to ensure that the output buffer is twice as large as the input buffer, or else include a size limit and check it in the loop condition.
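As a rough sketch of the second option (a size limit checked inside the loop), something like the following should work. The function name `latin1_to_utf8_bounded` and its return convention are my own assumptions for illustration, not part of the answer above.

```cpp
#include <cstddef>

// Hypothetical helper (name and signature are assumptions): converts a
// NUL-terminated ISO-8859-1 string to UTF-8, writing at most out_size bytes
// including the terminating NUL. Returns false if the output would not fit;
// the output is still NUL-terminated in that case.
bool latin1_to_utf8_bounded(const unsigned char *in,
                            unsigned char *out, std::size_t out_size)
{
    if (out_size == 0) return false;
    std::size_t left = out_size - 1;            // reserve room for the NUL
    while (*in) {
        std::size_t need = (*in < 0x80) ? 1 : 2;
        if (need > left) { *out = 0; return false; }
        if (*in < 0x80) {
            *out++ = *in++;                     // ASCII: one byte, unchanged
        } else {
            *out++ = 0xc2 + (*in > 0xbf);       // lead byte: 0xC2 or 0xC3
            *out++ = (*in++ & 0x3f) + 0x80;     // continuation byte
        }
        left -= need;
    }
    *out = 0;
    return true;
}
```

Checking the space before each character means a two-byte sequence is never cut in half when the buffer runs out.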

R.. GitHub STOP HELPING ICE
  • 3
    Wow. This is very helpful! I wasn't looking forward to yet-another table lookup algorithm. Now for ANSEL-to-UTF-8... – gordonwd Oct 30 '10 at 18:31
  • 9
    This certainly answers the question. But as I said in a comment above, people *will* send you CP-1252 mislabelled as ISO-8859-1. Web servers are the example that I've tripped over that persuaded me of the problem, but also text editors that claim to be saving as "Latin-1" when they aren't. That "if your source encoding will always be ISO-8859-1" is a pretty big "if", and it might be hard to track down and eliminate the miscreant responsible. – Steve Jessop Oct 30 '10 at 18:46
  • 1
    @Steve: You could add an `else if (*in<192) goto error;` case to error-out on encountering any ISO-8859-1 control codes (which are probably misencoded Windows-1252 characters, and not useful characters anyway). – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:36
  • 2
    @gordon: I'm not familiar with ANSEL, but you should be aware that ISO-8859-1 is the **only** legacy encoding that's this easy to convert to UTF-8. Everything else will require lookup tables. As Steve said, my "If.." is a **big** if. – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:37
  • 7
    This is quite poorly written code from a maintainability standpoint. Use more braces. – syb0rg Feb 04 '14 at 00:18
  • How would I simplify this to do only 1 character? I'm trying to understand what this code does, and the simplification would help me understand it. – user230910 Sep 29 '15 at 07:56
  • @MaximEgorushkin Not trying to defend the code, but it does have `,`, which acts as a sequence point. – user694733 Sep 29 '15 at 08:18
  • @user694733 You are right, there is a sequence point at built-in `,` operator. – Maxim Egorushkin Sep 29 '15 at 08:25
  • @R.. < 192 "not useful"? 163, the £ sign, is useful for us folks in the UK. I think you meant 160 rather than 192. – Nick Jun 16 '19 at 15:59
  • 2
    @Nick: Yep, I meant 0xA0 and just converted to decimal in my head incorrectly. Comment is way too old to edit though. – R.. GitHub STOP HELPING ICE Jun 16 '19 at 20:06
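
For anyone else wondering about the single-character case raised in these comments, here is a hedged sketch. The name `latin1_char_to_utf8` is hypothetical, and rejecting 0x80..0x9F follows the corrected 0xA0 threshold discussed above; drop that check if you want the same behaviour as the loop in the answer.

```cpp
// Hypothetical helper (not from the answer): encodes one ISO-8859-1 byte as
// UTF-8 into out[0..1]. Returns the number of bytes written (1 or 2), or 0
// if the byte is a C1 control code (0x80..0x9F), which in practice is often
// a mislabelled Windows-1252 character.
int latin1_char_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {               // ASCII: same single byte in UTF-8
        out[0] = c;
        return 1;
    }
    if (c < 0xA0) return 0;       // assumption: treat C1 controls as an error
    // Two-byte form 110xxxxx 10yyyyyy: the top two bits of c go into the
    // lead byte (giving 0xC2 or 0xC3), the low six bits into the second byte.
    out[0] = 0xC0 | (c >> 6);
    out[1] = 0x80 | (c & 0x3F);
    return 2;
}
```

With this, the pound sign (0xA3) becomes the two bytes 0xC2 0xA3, while a byte such as 0x93 is reported as an error instead of being passed through.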
16

For C++, I use this:

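#include <cstdint>
#include <string>

// Converts an ISO-8859-1 string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        uint8_t ch = *it;
        if (ch < 0x80) {
            strOut.push_back(ch);                  // ASCII: copied unchanged
        }
        else {
            strOut.push_back(0xc0 | ch >> 6);      // lead byte: 0xC2 or 0xC3
            strOut.push_back(0x80 | (ch & 0x3f));  // continuation byte
        }
    }
    return strOut;
}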
Lord Raiden
5

You can use the boost::locale library:

http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html

The code would look like this:

#include <boost/locale.hpp>
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string,"Latin1");
jpo38
Spacemoose
3

The C++03 standard does not provide functions to directly convert between specific charsets.

Depending on your OS, you can use iconv() on Linux or MultiByteToWideChar() & Co. on Windows. The open-source ICU library also provides extensive support for string conversion.
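
For the iconv() route, a minimal sketch might look like this. The wrapper name `latin1_to_utf8_iconv` is made up for illustration, and it assumes a POSIX system where iconv() takes `char **` for the input buffer (as glibc does) and knows the encoding names "ISO-8859-1" and "UTF-8"; on some platforms you may also need to link with `-liconv`.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper (name is an assumption) around POSIX iconv().
std::string latin1_to_utf8_iconv(const std::string &in)
{
    // Note the argument order: target encoding first, source second.
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    // Worst case for ISO-8859-1 input is two output bytes per input byte.
    std::string out(in.size() * 2, '\0');
    char *src = const_cast<char *>(in.data());   // iconv does not modify the input
    char *dst = &out[0];
    size_t src_left = in.size();
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1) throw std::runtime_error("iconv failed");

    out.resize(out.size() - dst_left);           // trim the unused tail
    return out;
}
```

The advantage over a hand-written loop is that changing the source encoding (for example to "WINDOWS-1252", if your iconv supports that name) only means changing one string.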

Cheers and hth. - Alf
cytrinox
2

The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

It would not be difficult to parse that table directly and form a lookup table from it at compile time.
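
As an illustration of what such a lookup table could look like, here is a sketch. The function name `cp1252_to_codepoint` is hypothetical and the table was transcribed by hand, so check it against the Unicode mapping file before relying on it. The five bytes CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through as the matching C1 control code points, which is an assumption on my part; you may prefer to reject them.

```cpp
#include <cstdint>

// Hypothetical helper: maps one Windows-1252 byte to a Unicode code point.
// Only 0x80..0x9F differ from ISO-8859-1; every other byte maps to itself.
std::uint32_t cp1252_to_codepoint(unsigned char c)
{
    static const std::uint16_t high[32] = {
        // 0x80..0x87
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        // 0x88..0x8F
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        // 0x90..0x97
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        // 0x98..0x9F
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    if (c >= 0x80 && c <= 0x9F) return high[c - 0x80];
    return c;   // ASCII and 0xA0..0xFF are identical to ISO-8859-1
}
```

Note that most of these code points are above U+07FF, so the UTF-8 encoder that follows this step has to handle three-byte sequences, unlike the pure ISO-8859-1 case.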

RBerteig
0

ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.

The C++ aspects -- integrating that with iostreams -- are much harder.

I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.
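
For reference, the general encoding algorithm can be sketched roughly as follows. The name `append_utf8` is just for illustration. It covers the full code point range and rejects surrogates and out-of-range values, although ISO-8859-1 input only ever reaches the first two branches.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point cp to out.
// Returns false (and appends nothing) for surrogates and values > 0x10FFFF.
bool append_utf8(std::uint32_t cp, std::string &out)
{
    if (cp < 0x80) {                                   // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {                         // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {                       // 4 bytes
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        return false;                                  // beyond Unicode's range
    }
    return true;
}
```

A string-to-string converter for ISO-8859-1 then just calls this once per input byte, cast to `unsigned char` first so that bytes above 0x7F do not turn negative.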

Cheers & hth.,

Cheers and hth. - Alf
  • The algorithm is not entirely trivial, especially when novice to intermediate C coders often mistakenly use `char *` where `unsigned char *` is needed. More significant nontrivialities are in the definition of UTF-8, specifically that you need to reject surrogate codepoints and out-of-range values. Thankfully those won't come up in an encoder that only needs to handle ISO-8859-1 input, but if you write such a limited encoder it's likely someone will end up misusing it for a larger input range later without adding any checks. – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:40
  • @MichałLeon: Unicode is not an encoding. There are a number of different encodings of Unicode, including UTF-8 and UTF-16. The first 256 code points of Unicode are the same as Latin 1 (a.k.a. ISO-8859-1). Note: emphasis doesn't make you less at odds with trivial fact. Next time, instead of shouting and downvoting, consider simply checking facts, or just ask about anything you don't understand. – Cheers and hth. - Alf Jan 23 '18 at 17:23
  • @Martin: The block of Unicode code points 128 through 255 is called the ["Latin-1 supplement" of Unicode](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)), because it's the same as Latin-1. Unicode is a direct extension of Latin-1. Your comments are absurd nonsense, the kind of techno-babble that can influence non-technical people and indicates trolling. I presume you're trolling. – Cheers and hth. - Alf Jan 24 '18 at 10:59
  • @MichałLeon: OK, sorry. I should maybe have guessed: I have for many years helped a student with extremely bad eye-sight, and she routinely fails to see what's right there. Latin-1 is specified in the OP's posting, in my answer, in all my comments, and in the other answers except one. – Cheers and hth. - Alf Jan 24 '18 at 13:50
0

Why would you need -1 and not -7? According to my test with SQL, you can't even store special characters in -1. So what exactly are you trying to convert?

Ronalds Mazītis