Is there a way to convert the text in a file to Unicode in Python?

Question

I'm writing a scraper for a Brazilian page and writing the result to a file. The problem is that the result contains characters that are not supported in ASCII, which gave me this error:

File "testUnicode.py", line 6 SyntaxError: Non-ASCII character '\xc3' in file testUnicode.py on line 6, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

So I found an answer here that resolves that error:

file.write(news.encode('utf8'))

It worked in the sense that the error went away, but the text still comes out garbled, like this:

Em tom de desabafo, peemedebista diz que, no 1Âº mandato, foi um 'vice decorativo' CoalizÃ£o diz que usarÃ¡ sua maioria na Assembleia para libertar antichavistas Segundo autoridades, casal acusado das mortes estava 'radicalizado havia algum tempo' Entre as mulheres, Ãndice vai a 52%; maioria da populaÃ§Ã£o aprova movimentos feministas Manifestantes bloqueiam ruas contra a reorganizaÃ§Ã£o das escolas; houve discussÃ£o com motoristas Animalzinho Ã© menor que um grÃ£o de gergelim

Is there a way to solve this problem?

You need to know what encoding the original text is in. – BrenBarn, Dec 08 '15 at 04:40
I don't think it is `utf-8`. Use the proper encoding. – vks, Dec 08 '15 at 04:41

score 0 · Answer 1 · answered Dec 08 '15 at 05:01

The original error:

File "testUnicode.py", line 6 
  SyntaxError: Non-ASCII character '\xc3' in file testUnicode.py on line 6, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

is raised because your source file contains non-ASCII (UTF-8 encoded) characters but declares no encoding. Declare one on the first or second line of the file with:

# -*- coding: utf-8 -*-

The second problem occurs because whatever is reading your text interprets it as latin1 instead of utf8, e.g.

c = u'\u00e3'  # Code point for LATIN SMALL LETTER A WITH TILDE

c.encode('utf8')  # UTF-8 encoding produces 2 bytes
# result: '\xc3\xa3'

# Those bytes, read as latin1
print c.encode('utf8').decode('latin1')
# prints: Ã£

# E.g. \xc3 => Ã
#      \xa3 => £

So your file IS written as utf8, but it is being read as latin1.
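To make the mismatch concrete, here is a minimal sketch of the round trip. It assumes the viewer really is decoding as latin1, as diagnosed above; the file name `news.txt` and the sample string are hypothetical, chosen only for illustration. It uses `io.open`, which accepts an explicit `encoding` argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# u"Coalizão", written with an escape so this source file stays pure ASCII.
text = u'Coaliz\u00e3o'

# Hypothetical output file in a temporary directory (illustration only).
path = os.path.join(tempfile.mkdtemp(), 'news.txt')

# Write the text with an explicit encoding instead of calling .encode() by hand.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading the UTF-8 bytes as latin-1 reproduces the garbling from the question.
with io.open(path, 'r', encoding='latin-1') as f:
    garbled = f.read()

# Reading with the matching encoding returns the original text.
with io.open(path, 'r', encoding='utf-8') as f:
    correct = f.read()

# Text garbled this way can be recovered by undoing the wrong decode.
repaired = garbled.encode('latin-1').decode('utf-8')
```

If the file reads back correctly with `encoding='utf-8'`, the file itself is fine, and the thing to fix is the encoding setting of the editor, terminal, or browser used to view it.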
