Cyrillic text extraction in Python/Django

Question

I'm using urllib2 to open a Russian website and extract text from it. However, instead of "Беллона", the text comes out as "Áåëëîíà". What's the easiest way to fix this?

score 2 · Answer 1 · edited May 23 '17 at 12:29

Figure out which encoding the web page uses (probably UTF-8 or ISO 8859-5), and convert your text to Unicode like this:

ustring = unicode(read_string, encoding=...)

If you need to determine the encoding of a webpage dynamically, see this SO answer.
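One common way to determine the encoding dynamically is to read the charset the server declares in its Content-Type header. The linked answer is not reproduced here, so this is only a minimal sketch of that general idea, written for Python 3 (`urllib.request` instead of `urllib2`). The names `charset_from_content_type` and `fetch_text` are made up for illustration, and falling back to UTF-8 is an assumption.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None.
    return msg.get_content_charset() or default


def fetch_text(url, default="utf-8"):
    """Download `url` and decode the body with the declared charset (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # raw bytes, not yet text
        content_type = response.headers.get("Content-Type", "")
    encoding = charset_from_content_type(content_type, default)
    # errors="replace" keeps going if the declared charset is slightly wrong.
    return raw.decode(encoding, errors="replace")
```

Servers do not always declare a charset, and the declaration can be wrong, so it may still be worth checking the page's own `<meta>` tag or trying a likely encoding by hand.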

answered Mar 11 '12 at 10:08

alexis

Thanks! 'windows-1251' was the encoding that ended up working. – maxko87 Mar 11 '12 at 19:08
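For anyone wondering why windows-1251 was the right choice: the garbled string in the question is what you get when windows-1251 bytes are read as Latin-1. The following Python 3 sketch reproduces the problem and reverses it. The variable names are only illustrative, and the assumption that the wrong decoding was Latin-1 is inferred from the symptom, not stated in the question.

```python
original = "Беллона"

# The bytes the server would send for a windows-1251 page.
raw = original.encode("windows-1251")

# Reading those bytes as Latin-1 produces the mojibake from the question.
garbled = raw.decode("latin-1")

# Decoding with the right codec recovers the text.
fixed = raw.decode("windows-1251")

# An already-garbled string can be repaired by reversing the wrong step.
repaired = garbled.encode("latin-1").decode("windows-1251")

print(garbled)
print(fixed)
print(repaired)
```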

score 1 · Answer 2 · answered Mar 11 '12 at 10:55

Try this:

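import urllib2
doc = urllib2.urlopen('http://yandex.ru').read()  # read() returns the raw bytes
doc = doc.decode('utf-8')  # replace 'utf-8' with the page's actual encoding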

That's all ;)
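A hard-coded 'utf-8' raises UnicodeDecodeError when the page is in another encoding, as with the windows-1251 page mentioned in the comment on the other answer. One possible workaround, sketched here for Python 3, is to try a short list of candidates in order. The helper name `decode_with_fallback` and the candidate list are assumptions for illustration. UTF-8 goes first because it is strict and rejects most non-UTF-8 input, whereas single-byte codecs accept almost anything.

```python
def decode_with_fallback(raw, candidates=("utf-8", "windows-1251")):
    """Return (text, encoding) for the first candidate that decodes `raw` cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Example with bytes as a windows-1251 page would deliver them.
text, used = decode_with_fallback("Беллона".encode("windows-1251"))
print(text, used)
```

This only tells you that the bytes are valid in the chosen encoding, not that the result is meaningful, so keep the list short and ordered by likelihood.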

Denis
