
Whenever I try to read UTF-8 encoded text files using open(file_name, encoding='utf-8'), I always get an error saying the ASCII codec can't decode some characters (e.g. when using for line in f: print(line)).

Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'
>>> import sys
>>> sys.getfilesystemencoding()
'ascii'
>>>

and the locale command prints:

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=en_HK.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
– jm33_m0

4 Answers


I had a similar problem. For me, the environment variable LANG was initially not set (you can check this by running env):

$ python3 -c 'import locale; print(locale.getdefaultlocale())'
(None, None)
$ python3 -c 'import locale; print(locale.getpreferredencoding())'
ANSI_X3.4-1968

The available locales for me were (on a fresh Ubuntu 18.04 Docker image):

$ locale -a
C
C.UTF-8
POSIX

So I picked the UTF-8 one:

$ export LANG="C.UTF-8"

And then things work:

$ python3 -c 'import locale; print(locale.getdefaultlocale())'
('en_US', 'UTF-8')
$ python3 -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
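
If this is for a Docker image, you can make the setting persistent by declaring it in the Dockerfile; a minimal sketch, assuming a Debian/Ubuntu base image that ships the C.UTF-8 locale:

ENV LANG=C.UTF-8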

If you pick a locale that is not available, such as

export LANG="en_US.UTF-8"

it will not work:

$ python3 -c 'import locale; print(locale.getdefaultlocale())'
('en_US', 'UTF-8')
$ python3 -c 'import locale; print(locale.getpreferredencoding())'
ANSI_X3.4-1968

and this is why locale is giving the error messages:

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
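
Alternatively, if you really want en_US.UTF-8, you can try generating it first. A sketch for Debian/Ubuntu-based systems, assuming the locales package is installed (commands differ on other distributions):

$ sudo locale-gen en_US.UTF-8
$ sudo update-locale LANG=en_US.UTF-8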
– RasmusWL

I solved it by running the following:

apt install locales-all
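
This installs the precompiled locale data on Debian/Ubuntu (you will likely need root or sudo). Afterwards you can check that UTF-8 locales are available:

locale -a | grep -i utf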
– matson kepson

By default, Python tries to honor the Unix locale system, including the LC_ALL, LC_CTYPE, and LANG environment variables. In theory, standards are good, but in my experience these variables only cause problems. They're sometimes set to ridiculous values, like non-UTF-8 character sets, for no good reason. Then Python throws errors when print()ing non-ASCII text.

You can fix this by finding out what these environment variables are set to, and why, and change them to something Unicode-capable. But system configuration can be a can of worms.

Python 3.7 and later offer these two quick fixes:

  • Set PYTHONUTF8=1 in the environment when running this script (see the example after this list).

  • If you can't do that, then early in your script, force stdout to be UTF-8 by doing

    import sys
    
    sys.stdout.reconfigure(encoding='utf-8')
    
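For illustration, the first option might look like this in a Unix-like shell (my_script.py is just a placeholder for your own script; PYTHONUTF8 enables Python's UTF-8 mode, available since 3.7):

PYTHONUTF8=1 python3 my_script.py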
– Jason Orendorff

I think you are misreading the error message. Be careful to distinguish UnicodeDecodeError from UnicodeEncodeError.

You say that Python complains that “ascii codec can't decode some characters”. However, there is no such error message, as far as I know. Compare the following two cases:

>>> b = 'é'.encode('utf8')
>>> b.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> 'é'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

It's either “can't decode byte” or “can't encode character”, but it's never “can't decode character”.

This might seem pedantic, but in this line,

for line in f: print(line)

you have both decoding (before the colon) and encoding (inside the print() call). So you need to be sure which step is causing the trouble. One possibility would be to write this in two lines.
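
For instance (same behaviour, but then the traceback's line number tells you whether decoding or encoding is failing):

for line in f:
    print(line)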

However, if f is opened with encoding='utf-8', as you write, then I'm pretty sure the problem is caused by the print expression. print() writes to sys.stdout by default. Since this stream is already open when Python is started, its encoding is already set as well – depending on your environment. Since in your locale LC_ALL is not set, the ASCII default (“ANSI X3.4-1968”) is used (this might answer your question in the title).
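
You can verify which encoding sys.stdout is using from within Python; in a misconfigured locale it will report something like ANSI_X3.4-1968 rather than UTF-8:

import sys
print(sys.stdout.encoding)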

If you can't or don't want to change the locale, here's what you can do to send UTF-8 text to STDOUT from within Python:

  • use the underlying binary stream:

    import sys

    for line in f:
        sys.stdout.buffer.write(line.encode('utf-8'))
    
  • re-encode sys.stdout (actually: replace sys.stdout with a re-encoded version):

    import codecs
    import sys

    sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
    

In any case, it's still possible that your terminal is unable to properly display UTF-8 text, either because it's incapable of that or because it's not configured to do so. In that case, you'll probably see question marks or mojibake. But that's a different story, outside of Python's control...

– lenz
  • @skyking When I tested this, I was able to change the return value of `locale.getpreferredencoding()` through setting/unsetting `LC_ALL`. However, the documentation of this function indicates that it's a platform-dependent heuristic guess – I'm not that surprised if things work differently in another environment. The main point is: if you want control over the encoding used, declare it explicitly. – lenz Oct 24 '17 at 14:17
  • @skyking It can also be set via LC_CTYPE or LANG. – Conrad Meyer Oct 30 '17 at 23:15