How to convert a string to UTF8 in Ruby

Question

I'm writing a crawler which uses Hpricot. It downloads a list of strings from a webpage, and then I try to write them to a file. Something is wrong with the encoding:

"\xC3" from ASCII-8BIT to UTF-8

I have items which are rendered on a webpage and printed this way:

DÃ©veloppement

str.encoding returns UTF-8, so force_encoding('UTF-8') doesn't help. How can I convert this to readable UTF-8?

Hpricot is no longer maintained; consider using Nokogiri. Also, you should probably mention what the encoding of the original web page is. – Andrew Marshall, Jun 10 '13 at 12:06

score 69 · Answer 1 · answered Jun 10 '13 at 12:24

Your string seems to have been encoded the wrong way round:

"DÃ©veloppement".encode("iso-8859-1").force_encoding("utf-8")
#=> "Développement"
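
To see why this works, it can help to reproduce the damage first. The following is only a sketch, assuming the page was UTF-8 and its bytes were read as ISO-8859-1 somewhere along the way; `garble` and `repair` are illustrative names, not part of the answer.

```ruby
# Reproduce the mojibake: take correct UTF-8 text, relabel its bytes as
# ISO-8859-1, and transcode that (wrong) interpretation to UTF-8.
def garble(text)
  text.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Undo it: transcode back to ISO-8859-1 to recover the original bytes,
# then relabel those bytes as the UTF-8 they always were.
def repair(text)
  text.encode('ISO-8859-1').force_encoding('UTF-8')
end

original = 'Développement'
garbled  = garble(original)

puts original.bytesize  # 14, because "é" is the two bytes C3 A9 in UTF-8
puts garbled            # DÃ©veloppement
puts repair(garbled)    # Développement
```

The repair only succeeds when every character of the garbled string exists in ISO-8859-1; otherwise `encode` raises `Encoding::UndefinedConversionError`.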

answered Jun 10 '13 at 12:24

Stefan

It works well in most cases, but sometimes it doesn't: `U+201C from UTF-8 to ISO-8859-1 in CIDEM / ACC1Ã“` `U+20AC from UTF-8 to ISO-8859-1 in Citiâ€™s Sustainable Development Investments`. Also, some names are converted incorrectly, and I can't seed them into a database because of an `incomplete multibyte character` error. – ciembor Jun 11 '13 at 12:36
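
The characters named in these errors (U+201C, U+20AC) do not exist in ISO-8859-1 but do exist in Windows-1252, which suggests the bytes were misread as Windows-1252 instead. A hedged sketch of the same round trip with that encoding follows; `repair_cp1252` is a hypothetical helper, and the guess about the encoding is an assumption that may not hold for every string.

```ruby
# Try to undo mojibake, assuming UTF-8 bytes were misread as Windows-1252.
# Returns the input unchanged when the round trip is not possible or does
# not yield valid UTF-8.
def repair_cp1252(text)
  candidate = text.encode('Windows-1252').force_encoding('UTF-8')
  candidate.valid_encoding? ? candidate : text
rescue Encoding::UndefinedConversionError
  text
end

puts repair_cp1252('Citiâ€™s Sustainable Development Investments')
puts repair_cp1252('CIDEM / ACC1Ã“')
puts repair_cp1252('already fine: é')
```

Falling back to the input keeps strings that were never garbled intact, but it cannot tell a correct repair from a coincidental one, so the results still need a look.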

Sorry, this was not meant as a fix. You should fix the problem by setting/detecting the correct encoding when reading the strings into your app. – Stefan Jun 11 '13 at 12:46
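
One way to follow this advice is to declare the source encoding where the bytes enter the program, so Ruby transcodes them once and correctly. This is a minimal sketch, assuming the data is ISO-8859-1 (the real charset would have to come from the HTTP Content-Type header or the page itself); `read_as_utf8` is a hypothetical helper and the temp file only stands in for downloaded data.

```ruby
require 'tempfile'

# Read a file whose bytes are in source_encoding and return UTF-8 text.
# The "external:internal" form asks Ruby to decode with the first encoding
# and hand back strings in the second.
def read_as_utf8(path, source_encoding)
  File.read(path, encoding: "#{source_encoding}:UTF-8")
end

Tempfile.create('page') do |file|
  file.binmode
  file.write("D\xE9veloppement".b)  # "Développement" as ISO-8859-1 bytes
  file.flush

  text = read_as_utf8(file.path, 'ISO-8859-1')
  puts text.encoding  # UTF-8
  puts text           # Développement
end
```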

There is also the option of using `Encoding::UTF_8` instead of using more memory for the `"utf-8"` string literal (or any other encoding string). – Todd Feb 17 '18 at 19:22

score 58 · Answer 2 · edited Feb 27 '17 at 23:49

It seems your string is labeled as UTF-8 but actually contains something else, probably ISO-8859-1.

Define (force) the correct encoding first, then convert it to UTF-8.

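In your example the damage has already happened (the UTF-8 bytes were read as ISO-8859-1 and transcoded), so reverse it: encode back to ISO-8859-1, then relabel the bytes as UTF-8.

puts "DÃ©veloppement".encode('iso-8859-1').force_encoding('utf-8') #-> Développement

If you still have the raw bytes, force the encoding they are really in and then encode: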

puts "\xC3".force_encoding('iso-8859-1').encode('utf-8') #-> Ã

If the Ã makes no sense, then try another encoding.
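
"Try another encoding" can be done side by side. A small sketch, assuming you have the raw downloaded bytes; `utf8_candidates` is a hypothetical helper and the list of encodings is only an example. Nothing here detects the encoding, so a human still has to judge which result looks right.

```ruby
# Label raw bytes with each candidate encoding in turn and transcode the
# result to UTF-8, replacing anything that cannot be converted.
def utf8_candidates(raw, encodings)
  encodings.map do |name|
    converted = raw.dup.force_encoding(name)
                   .encode('UTF-8', invalid: :replace, undef: :replace)
    [name, converted]
  end.to_h
end

raw = "D\xE9veloppement".b  # bytes as they might arrive from a download
utf8_candidates(raw, %w[ISO-8859-1 Windows-1252 ISO-8859-5]).each do |name, text|
  puts "#{name}: #{text}"
end
```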

edited Feb 27 '17 at 23:49

the Tin Man

answered Jun 10 '13 at 14:33

knut

Works for PDFs created with the Wicked PDF gem – Lucas Andrade Jan 22 '19 at 17:23

score 5 · Answer 3 · edited May 23 '17 at 12:03

"ruby 1.9: invalid byte sequence in UTF-8" described another good approach with less code:

file_contents.encode!('UTF-16', 'UTF-8')
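
Note that this line on its own leaves the string in UTF-16 and still raises if it meets invalid bytes. A sketch of the fuller round trip, with replacement options and a conversion back to UTF-8, plus the `String#scrub` shortcut available since Ruby 2.1; the method names are hypothetical, and dropping bad bytes (rather than marking them) is an assumption about what you want.

```ruby
# Drop invalid byte sequences by transcoding through UTF-16 and back.
# On older Ruby versions a UTF-8 to UTF-8 encode was a no-op, which is
# why the detour through UTF-16 is used.
def strip_invalid_via_utf16(text)
  text.encode('UTF-16', 'UTF-8', invalid: :replace, replace: '')
      .encode('UTF-8', 'UTF-16')
end

# Ruby 2.1+: String#scrub replaces invalid sequences without the detour.
def strip_invalid_via_scrub(text)
  text.scrub('')
end

broken = "abc\xC3def"        # \xC3 starts a two-byte sequence that never completes
puts broken.valid_encoding?  # false
puts strip_invalid_via_utf16(broken)
puts strip_invalid_via_scrub(broken)
```

Both helpers only remove bytes that are not valid UTF-8. They do not repair text like DÃ©veloppement, which is valid UTF-8 that was decoded with the wrong encoding.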

edited May 23 '17 at 12:03

Community

answered Jan 08 '15 at 13:43

kaleb4eg
