
I'm using urllib2 to open a Russian website and extract text from it. However, instead of coming out as "Беллона", the text comes out as "Áåëëîíà". What's the easiest way to get around this?

maxko87

2 Answers


Figure out which encoding the webpage uses (for a Russian page, probably UTF-8, windows-1251, or ISO 8859-5), and convert your text to unicode like this:

ustring = unicode(read_string, encoding=...)
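To illustrate why the encoding matters, here is a minimal sketch in Python 3 syntax (the line above is Python 2). It assumes the page is served as windows-1251 (`cp1251`), which is what the garbled sample in the question is consistent with; the helper name `repair_mojibake` is hypothetical and only for illustration.

```python
# Assumption: the page bytes are windows-1251 (cp1251) and were
# mistakenly interpreted as Latin-1, which produces the garbled text.
raw = "Беллона".encode("cp1251")   # bytes as the server would send them

garbled = raw.decode("latin-1")    # wrong codec: yields "Áåëëîíà"
fixed = raw.decode("cp1251")       # right codec: yields "Беллона"


def repair_mojibake(text, wrong="latin-1", right="cp1251"):
    """Hypothetical helper: undo a decode done with the wrong codec.

    Re-encodes the text with the codec that was wrongly applied, then
    decodes the recovered bytes with the codec that should have been used.
    """
    return text.encode(wrong).decode(right)


print(garbled)
print(fixed)
print(repair_mojibake(garbled))
```

Decoding the raw bytes with the correct codec in the first place is preferable; the repair helper only helps when the wrongly decoded text is all that is left.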

If you need to determine the encoding of a webpage dynamically, see this SO answer.
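One common way to pick the encoding dynamically is to read the `charset` parameter of the HTTP `Content-Type` header. The sketch below is in Python 3 syntax and is not taken from the linked answer; the function names `charset_from_content_type` and `decode_body` are hypothetical, and falling back to UTF-8 is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumption) when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the charset from the header value."""
    charset = charset_from_content_type(content_type, default)
    # errors="replace" keeps going on stray bytes instead of raising.
    return raw.decode(charset, errors="replace")


# Example with bytes built locally, standing in for a real response body.
body = "Беллона".encode("cp1251")
print(decode_body(body, "text/html; charset=windows-1251"))
```

In Python 3, `urllib.request.urlopen(url).headers.get_content_charset()` returns the same header parameter directly. Some pages declare their encoding only in an HTML `<meta>` tag, which this sketch does not inspect, and an unrecognized charset name will raise `LookupError`.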

alexis

Try this:

import urllib2
doc = urllib2.urlopen('http://yandex.ru').read()
# Decode the raw bytes; use the page's actual encoding if it is not UTF-8.
doc = doc.decode('utf-8')

That's all ;)

Denis