
I'm writing code that scrapes a Brazilian page and writes the result to a file. The problem is that the result contains characters that ASCII cannot represent, and I got this error:

File "testUnicode.py", line 6 SyntaxError: Non-ASCII character '\xc3' in file testUnicode.py on line 6, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

So I found an answer here that solves that error:

file.write(news.encode('utf8'))

It worked in the sense that the error went away, but the text I get is still garbled, like this:

Em tom de desabafo, peemedebista diz que, no 1º mandato, foi um 'vice decorativo' Coalizão diz que usará sua maioria na Assembleia para libertar antichavistas Segundo autoridades, casal acusado das mortes estava 'radicalizado havia algum tempo' Entre as mulheres, índice vai a 52%; maioria da população aprova movimentos feministas Manifestantes bloqueiam ruas contra a reorganização das escolas; houve discussão com motoristas Animalzinho é menor que um grão de gergelim

Is there a way to solve this problem?

AJ Ze

1 Answer


The original error:

File "testUnicode.py", line 6 
  SyntaxError: Non-ASCII character '\xc3' in file testUnicode.py on line 6, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

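is raised because your source file contains non-ASCII characters (here, UTF-8 encoded) but declares no encoding, and Python 2 assumes ASCII by default. Declare the encoding on the first or second line of the file: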

# -*- coding: utf-8 -*-
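As a minimal sketch of where the declaration goes, the snippet below puts it on the first line and then uses a non-ASCII literal. The variable names are made up for illustration, and the sample text is one of the headlines from the question. The `u"..."` prefix keeps the literal a unicode string on Python 2 and is also accepted on Python 3.3+.

```python
# -*- coding: utf-8 -*-
# The declaration above has to be on the first or second line of the file,
# otherwise Python 2 rejects the non-ASCII literal below.

# Hypothetical sample: one of the headlines from the question.
headline = u"Coalizão diz que usará sua maioria"

# Encoding turns the unicode text into UTF-8 bytes, ready to be written out.
encoded = headline.encode("utf-8")

# Each of the two accented letters takes 2 bytes in UTF-8,
# so the byte string is longer than the text.
extra_bytes = len(encoded) - len(headline)
```

With the declaration in place the SyntaxError goes away; what the accented characters look like afterwards depends only on how the output file is decoded.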

The second problem arises because whatever program reads your text is decoding it as latin1 instead of utf8, e.g.

c = u'\u00e3'  # Code point for LATIN SMALL LETTER A WITH TILDE (ã)

c.encode('utf8')  # UTF-8 encoding produces 2 bytes
# => '\xc3\xa3'

# Those bytes, read as latin1
print c.encode('utf8').decode('latin1')
# => Ã£

# I.e. \xc3 => Ã
#      \xa3 => £

So your file IS written as utf8, but it is being read as latin1. Open it in a viewer or editor set to UTF-8 and the text will display correctly.
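The mismatch can be reproduced end to end with a rough sketch like the one below. It assumes `io.open`, which takes an `encoding` argument on both Python 2 and Python 3. The file name and temporary directory are hypothetical, and the sample word is taken from the question's output.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"Coalizão"  # sample word from the question's output

# Hypothetical output file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "news.txt")

# Write: io.open encodes the unicode text to UTF-8 on the way out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read with the matching encoding: the text round-trips intact.
with io.open(path, "r", encoding="utf-8") as f:
    good = f.read()

# Read with the wrong encoding: every UTF-8 byte becomes one latin1 character.
with io.open(path, "r", encoding="latin-1") as f:
    garbled = f.read()

# Text garbled this way can be recovered by reversing the wrong decode.
repaired = garbled.encode("latin-1").decode("utf-8")
```

Here `good` equals the original text, `garbled` is the `Ã£` form seen in the question, and `repaired` restores the original. The bytes on disk are correct the whole time, so the fix belongs in whatever reads the file.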

memoselyk