318

Here is my code,

for line in open('u.item'):
# Read each line

Whenever I run this code it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to solve this by adding an extra parameter to open(). The code looks like:

for line in open('u.item', encoding='utf-8'):
# Read each line

But again it gives the same error. What should I do then?

SujitS

17 Answers

572

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was ISO-8859-1, so replacing open('u.item', encoding='utf-8') with open('u.item', encoding='ISO-8859-1') solves the problem.

SujitS
  • Explicit is better than implicit (PEP 20). – 0 _ Jul 01 '16 at 05:46
  • The trick is that ISO-8859-1 (Latin-1) is an 8-bit character set, so every byte has a valid value. The result may be garbage, but nothing is rejected - useful if you just want to ignore the problem. – Kjeld Flarup Apr 12 '18 at 08:53
  • I had the same issue: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 32: invalid continuation byte. I used Python 3.6.5 to install the AWS CLI, and aws --version failed with this error. I had to edit /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/configparser.py and change the code to def read(self, filenames, encoding="ISO-8859-1"): – Евгений Коптюбенко Sep 27 '18 at 14:18
  • Is there an automatic way of detecting the encoding? – OrangeSherbet Jan 29 '19 at 23:20
  • @OrangeSherbet I implemented detection using chardet. Here's the one-liner (after import chardet): chardet.detect(open(in_file, 'rb').read())['encoding']. Check out this answer for details: https://stackoverflow.com/a/3323810/615422 – VertigoRay Mar 20 '19 at 13:34
  • How do you get the encoding of a file? – JohnAndrews Aug 05 '19 at 15:22
  • Note that 'ISO-8859-1' will *always* work even if it's not the right encoding, because each of the 256 byte values maps to a Unicode character. I believe it's the only encoding which does this. – Mark Ransom May 04 '20 at 17:03
  • I like @VertigoRay's suggestion of chardet in a script, but for something really quick to diagnose what's going on, a simple file command helped me: file list.log reports "ISO-8859 text", while file playlist.txt reports "UTF-8 Unicode text, with CRLF, LF line terminators". – Billy Oct 12 '20 at 18:49
  • @VertigoRay's answer should be the accepted one, IMHO - answers without encoding detection cannot reliably solve the question. – Fred Zimmerman Sep 10 '21 at 05:02
  • @OrangeSherbet there's no sure way unless you can find out from whoever produced the file. But it's possible to guess based on the file contents, and some guessing methods are better than others. By coincidence I chanced on a new way to do it in Python the other day: charset-normalizer (https://pypi.org/project/charset-normalizer/). – Mark Ransom Dec 05 '21 at 17:26
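Following up on the detection discussion in these comments, a minimal stdlib-only sketch (no chardet needed) is to try candidate encodings in order. Per Mark Ransom's observation, ISO-8859-1 never fails, so it must come last as the catch-all:

```python
CANDIDATES = ["utf-8", "iso-8859-1"]  # iso-8859-1 never fails, so keep it last

def sniff_encoding(raw, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes `raw` without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(sniff_encoding("café".encode("utf-8")))       # utf-8
print(sniff_encoding("café".encode("iso-8859-1")))  # iso-8859-1
```

A successful decode only proves the bytes are *valid* in that encoding, not that it is the *right* one; for real detection use chardet or charset-normalizer as suggested above.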
75

The following also worked for me. ISO 8859-1 is going to save a lot of trouble, mainly when working with speech recognition APIs.

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
mkrieger1
  • You may be correct that the OP is reading ISO 8859-1, as can be deduced from the 0xe9 (é) in the error message, but you should explain why your solution works. The reference to speech recognition APIs does not help. – RolfBly Oct 26 '17 at 20:26
37

Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.

In Windows-1252 encoding, for example, the 0xe9 would be the character é.
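To see this concretely: the same byte decodes fine under a single-byte code page, but UTF-8 rejects it because 0xe9 would have to start a multi-byte sequence (the sample bytes below are just an illustration):

```python
data = b"r\xe9sum\xe9"  # "résumé" encoded as Windows-1252 / Latin-1

print(data.decode("windows-1252"))  # résumé
print(data.decode("iso-8859-1"))    # résumé (same character for this byte)

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # ... can't decode byte 0xe9 ... invalid continuation byte
```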

Mark Ransom
  • So, how can I find out what encoding it is? I am using Linux. – SujitS Oct 31 '13 at 11:35
  • There is no way to do that that always works, but see the answer to this question: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – RemcoGerlich Oct 31 '13 at 12:37
27

Try this to read using Pandas (m_cols here is your list of column names):

pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
Shashank
19

This works:

open('filename', encoding='latin-1')

Or:

open('filename', encoding="ISO-8859-1")
Ayesha Siddiqa
  • Depends on what you mean by "works". If you mean avoids exceptions that's true, because it's the only encoding that doesn't have invalid bytes or sequences. Doesn't mean you'll get the proper characters though. – Mark Ransom Mar 09 '21 at 15:51
15

If you are using Python 2, the following will be the solution:

import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something

The encoding parameter doesn't work with Python 2's built-in open(); passing it there raises the following error:

TypeError: 'encoding' is an invalid keyword argument for this function

Jeril
13

You could resolve the problem with:

for line in open(your_file_path, 'rb'):

'rb' opens the file in binary mode: each line comes back as a bytes object and no decoding is attempted, so no UnicodeDecodeError can occur. You must decode the bytes yourself whenever you need text.
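A self-contained sketch of this approach (the file name and contents below are made up for illustration, standing in for u.item):

```python
import os
import tempfile

# Create a small stand-in for u.item containing a Latin-1 encoded byte (0xe9 = é).
path = os.path.join(tempfile.mkdtemp(), "sample.item")
with open(path, "wb") as f:
    f.write(b"1|Toy Story (1995)|caf\xe9\n")

lines = []
with open(path, "rb") as f:
    for raw in f:                 # raw is bytes; nothing is decoded automatically
        lines.append(raw.decode("iso-8859-1").rstrip("\n"))  # decode explicitly

print(lines)  # ['1|Toy Story (1995)|café']
```

Note that binary mode only postpones the problem: you still have to know (or guess) the correct encoding the moment you turn the bytes into text.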

Ozcar Nguyen
6

You can try this way:

open('u.item', encoding='utf8', errors='ignore')
Farid Chowdhury
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review – MartenCatcher May 24 '20 at 06:04
  • @MartenCatcher yeah, but it helps future visitors to the question. Although more explanation would make the answer much better, I believe it serves a better purpose as an answer than as a comment. – Silidrone Nov 28 '20 at 18:14
  • What is the intent? Ignoring errors? What are the consequences? – Peter Mortensen Jan 30 '21 at 16:51
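To answer Peter Mortensen's question about consequences: errors='ignore' silently drops undecodable bytes, while errors='replace' substitutes U+FFFD for them, so either way data is lost rather than recovered:

```python
data = b"caf\xe9"  # Latin-1 bytes for "café"; not valid UTF-8

print(data.decode("utf-8", errors="ignore"))   # 'caf'  - the é is silently gone
print(data.decode("utf-8", errors="replace"))  # 'caf\ufffd' - é became �
print(data.decode("iso-8859-1"))               # 'café' - right encoding, no loss
```

Use the errors parameter only when losing the odd character is acceptable; finding the correct encoding is the real fix.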
6

Based on another Stack Overflow question and previous answers in this post, I would like to add some help in finding the right encoding.

If your script runs on a Linux OS, you can get the encoding with the file command:

file --mime-encoding <filename>

Here is a Python script to do that for you:

import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using the file command
    """

    # find the full path of the file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None

    file_cmd = which_run.stdout.decode().strip()

    # run the file command to get the MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None

    # output is "<filename>: <encoding>"; return the encoding name only
    return file_run.stdout.decode().rsplit(maxsplit=1)[-1]

# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
Alain Cherpin
3

I was using a dataset downloaded from Kaggle; while reading it, Pandas threw this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte

This is how I fixed it:

import pandas as pd

pd.read_csv('top50.csv', encoding='ISO-8859-1')

Vineet Singh
2

This is an example for converting a CSV file in Python 3:

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'), delimiter=',',quotechar='"')
except IOError:
    pass
Peter Mortensen
2

Sometimes open(filepath), where filepath is actually not a file, gives the same error, so first make sure the file you're trying to open exists:

import os
assert os.path.isfile(filepath)
xtluo
  • How would opening a file that doesn't exist generate a `UnicodeDecodeError`? And in Python it's customary to use [the EAFP principle](https://stackoverflow.com/q/11360858/5987) over the LBYL that you're endorsing here. – Mark Ransom Oct 17 '21 at 03:18
2

Open your file with Notepad++ and use the "Encoding" menu to identify the encoding, or to convert it from ANSI to UTF-8 or the ISO 8859-1 code page.

JGaber
1

So that this page turns up faster in Google searches for a similar question (about an error with UTF-8), I leave my solution here for others.

I had a problem opening a .csv file, with this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte

I opened the file with Notepad and counted to position 150: it was a Cyrillic character. I re-saved the file with the 'Save as...' command using UTF-8 encoding, and my program started to work.

Eric Aya
  • Please note that questions and answers on SO must be in English only - even if the problem you encountered may bite mainly programmers using cyrillic alphabet. – Thierry Lathuille Aug 03 '21 at 10:01
  • @ThierryLathuille, is it a real problem? Could you please give me a link/reference to the community rule on that issue? – Nikita Axenov Aug 03 '21 at 10:13
  • This is considered a real problem - and is probably what caused your answer to get downvoted. Non-English content is not allowed on SO (see for example https://meta.stackoverflow.com/questions/297673/how-do-i-deal-with-non-english-content ), and the rule is really strictly respected. For questions in Russian, you have https://ru.stackoverflow.com/ , though ;) – Thierry Lathuille Aug 03 '21 at 10:20
  • @ThierryLathuille This applies to the English content, not problems with non-English symbols. And this doesn't necessarily have to be about other languages, it could be a different UTF-8 character (for example, a checkmark). – Anonymous Aug 03 '21 at 19:52
0

Replace the encoding with encoding='ISO-8859-1':

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)

0

Use this if you are directly loading data from GitHub or Kaggle:

df = pd.read_csv(file, encoding='ISO-8859-1')

0

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte

The above error occurs because the file is not UTF-8 encoded.

Solution: use encoding='latin-1'.

Reference: https://pandas.pydata.org/docs/search.html?q=encoding

Kalluri