318

Here is my code,

for line in open('u.item'):
# Read each line

Whenever I run this code it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to solve this by adding an extra parameter to open(). The code looks like:

for line in open('u.item', encoding='utf-8'):
# Read each line

But again it gives the same error. What should I do then?

SujitS

17 Answers

572

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was ISO-8859-1, so replacing open('u.item', encoding='utf-8') with open('u.item', encoding='ISO-8859-1') solves the problem.

SujitS
  • Explicit is better than implicit (PEP 20). – 0 _ Jul 01 '16 at 05:46
  • The trick is that ISO-8859-1 (Latin-1) is an 8-bit character set, so every byte has a valid value. The result may be garbage, but nothing is rejected - useful if you just want to ignore the problem. – Kjeld Flarup Apr 12 '18 at 08:53
  • I had the same issue: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 32: invalid continuation byte. I used Python 3.6.5 to install the AWS CLI, and aws --version failed with this error. I had to edit /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/configparser.py and change the code to def read(self, filenames, encoding="ISO-8859-1"): – Евгений Коптюбенко Sep 27 '18 at 14:18
  • Is there an automatic way of detecting the encoding? – OrangeSherbet Jan 29 '19 at 23:20
  • @OrangeSherbet I implemented detection using chardet. Here's the one-liner (after import chardet): chardet.detect(open(in_file, 'rb').read())['encoding']. Check out this answer for details: https://stackoverflow.com/a/3323810/615422 – VertigoRay Mar 20 '19 at 13:34
  • How do you get the encoding of a file? – JohnAndrews Aug 05 '19 at 15:22
  • Note that 'ISO-8859-1' will *always* work even if it's not the right encoding, because each of the 256 byte values maps to a Unicode character. I believe it's the only encoding which does this. – Mark Ransom May 04 '20 at 17:03
  • I like @VertigoRay's suggestion of chardet in a script, but for something really quick to diagnose what's going on, a simple file command helped me: file list.log reports "ISO-8859 text", while file playlist.txt reports "UTF-8 Unicode text, with CRLF, LF line terminators". – Billy Oct 12 '20 at 18:49
  • @VertigoRay's answer should be the accepted one, IMHO - answers without encoding detection cannot reliably solve the question. – Fred Zimmerman Sep 10 '21 at 05:02
  • @OrangeSherbet there's no sure way unless you can find out from whoever produced the file. But it's possible to guess based on the file contents, and some guessing methods are better than others. By coincidence I chanced on a new way to do it in Python the other day: charset-normalizer (https://pypi.org/project/charset-normalizer/). – Mark Ransom Dec 05 '21 at 17:26
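Following up on the detection discussion in these comments, a minimal stdlib-only sketch (no chardet needed) is to try candidate encodings in order. Per Mark Ransom's observation, ISO-8859-1 never fails, so it must come last as the catch-all:

```python
CANDIDATES = ["utf-8", "iso-8859-1"]  # iso-8859-1 never fails, so keep it last

def sniff_encoding(raw, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes `raw` without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(sniff_encoding("café".encode("utf-8")))       # utf-8
print(sniff_encoding("café".encode("iso-8859-1")))  # iso-8859-1
```

A successful decode only proves the bytes are *valid* in that encoding, not that it is the *right* one; for real detection use chardet or charset-normalizer as suggested above.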
75

The following also worked for me. ISO 8859-1 is going to save a lot of trouble, mainly when working with speech recognition APIs.

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
mkrieger1
  • You may be correct that the OP is reading ISO 8859-1, as can be deduced from the 0xe9 (é) in the error message, but you should explain why your solution works. The reference to speech recognition APIs does not help. – RolfBly Oct 26 '17 at 20:26
37

Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.

In Windows-1252 encoding, for example, the 0xe9 would be the character é.
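To see this concretely: the same byte decodes fine under a single-byte code page, but UTF-8 rejects it because 0xe9 would have to start a multi-byte sequence (the sample bytes below are just an illustration):

```python
data = b"r\xe9sum\xe9"  # "résumé" encoded as Windows-1252 / Latin-1

print(data.decode("windows-1252"))  # résumé
print(data.decode("iso-8859-1"))    # résumé (same character for this byte)

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # ... can't decode byte 0xe9 ... invalid continuation byte
```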

Mark Ransom
  • So, how can I find out what encoding it is? I am using Linux. – SujitS Oct 31 '13 at 11:35
  • There is no way to do that that always works, but see the answer to this question: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – RemcoGerlich Oct 31 '13 at 12:37
27

Try this to read using Pandas (m_cols here is your list of column names):

pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
Shashank
19

This works:

open('filename', encoding='latin-1')

Or:

open('filename', encoding="ISO-8859-1")
Ayesha Siddiqa
  • Depends on what you mean by "works". If you mean avoids exceptions that's true, because it's the only encoding that doesn't have invalid bytes or sequences. Doesn't mean you'll get the proper characters though. – Mark Ransom Mar 09 '21 at 15:51
15

If you are using Python 2, the following will be the solution:

import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something

The encoding parameter doesn't work with Python 2's built-in open(); passing it there raises the following error:

TypeError: 'encoding' is an invalid keyword argument for this function

Jeril
13

You could resolve the problem with:

for line in open(your_file_path, 'rb'):

'rb' opens the file in binary mode: each line comes back as a bytes object and no decoding is attempted, so no UnicodeDecodeError can occur. You must decode the bytes yourself whenever you need text.
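A self-contained sketch of this approach (the file name and contents below are made up for illustration, standing in for u.item):

```python
import os
import tempfile

# Create a small stand-in for u.item containing a Latin-1 encoded byte (0xe9 = é).
path = os.path.join(tempfile.mkdtemp(), "sample.item")
with open(path, "wb") as f:
    f.write(b"1|Toy Story (1995)|caf\xe9\n")

lines = []
with open(path, "rb") as f:
    for raw in f:                 # raw is bytes; nothing is decoded automatically
        lines.append(raw.decode("iso-8859-1").rstrip("\n"))  # decode explicitly

print(lines)  # ['1|Toy Story (1995)|café']
```

Note that binary mode only postpones the problem: you still have to know (or guess) the correct encoding the moment you turn the bytes into text.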

Ozcar Nguyen
6

You can try this way:

open('u.item', encoding='utf8', errors='ignore')
Farid Chowdhury
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review – MartenCatcher May 24 '20 at 06:04
  • @MartenCatcher yeah, but it helps future visitors to the question. Although more explanation would make the answer much better, I believe it serves a better purpose as an answer than as a comment. – Silidrone Nov 28 '20 at 18:14
  • What is the intent? Ignoring errors? What are the consequences? – Peter Mortensen Jan 30 '21 at 16:51
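To answer Peter Mortensen's question about consequences: errors='ignore' silently drops undecodable bytes, while errors='replace' substitutes U+FFFD for them, so either way data is lost rather than recovered:

```python
data = b"caf\xe9"  # Latin-1 bytes for "café"; not valid UTF-8

print(data.decode("utf-8", errors="ignore"))   # 'caf'  - the é is silently gone
print(data.decode("utf-8", errors="replace"))  # 'caf\ufffd' - é became �
print(data.decode("iso-8859-1"))               # 'café' - right encoding, no loss
```

Use the errors parameter only when losing the odd character is acceptable; finding the correct encoding is the real fix.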
6

Based on another Stack Overflow question and previous answers in this post, I would like to add some help in finding the right encoding.

If your script runs on a Linux OS, you can get the encoding with the file command:

file --mime-encoding <filename>

Here is a Python script to do that for you:

import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using the file command
    """

    # find the full path of the file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None

    file_cmd = which_run.stdout.decode().strip()

    # run the file command to get the MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None

    # output is "<filename>: <encoding>"; return the encoding name only
    return file_run.stdout.decode().rsplit(maxsplit=1)[-1]

# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
Alain Cherpin
3

I was using a dataset downloaded from Kaggle; while reading it, Pandas threw this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte

This is how I fixed it:

import pandas as pd

pd.read_csv('top50.csv', encoding='ISO-8859-1')

Vineet Singh
2

This is an example for converting a CSV file in Python 3:

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'), delimiter=',',quotechar='"')
except IOError:
    pass
Peter Mortensen
2

Sometimes open(filepath), where filepath is actually not a file, gives the same error, so first make sure the file you're trying to open exists:

import os
assert os.path.isfile(filepath)
xtluo
  • How would opening a file that doesn't exist generate a `UnicodeDecodeError`? And in Python it's customary to use [the EAFP principle](https://stackoverflow.com/q/11360858/5987) over the LBYL that you're endorsing here. – Mark Ransom Oct 17 '21 at 03:18
2

Open your file with Notepad++ and use the "Encoding" menu to identify the encoding, or to convert it from ANSI to UTF-8 or the ISO 8859-1 code page.

JGaber
1

So that this page turns up faster in Google searches for a similar question (about an error with UTF-8), I leave my solution here for others.

I had a problem opening a .csv file, with this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte

I opened the file with Notepad and counted to position 150: it was a Cyrillic character. I re-saved the file with the 'Save as...' command using UTF-8 encoding, and my program started to work.

Eric Aya
  • Please note that questions and answers on SO must be in English only - even if the problem you encountered may bite mainly programmers using cyrillic alphabet. – Thierry Lathuille Aug 03 '21 at 10:01
  • @ThierryLathuille, is it a real problem? Could you please give me a link/reference to the community rule on that issue? – Nikita Axenov Aug 03 '21 at 10:13
  • This is considered a real problem - and is probably what caused your answer to get downvoted. Non-English content is not allowed on SO (see for example https://meta.stackoverflow.com/questions/297673/how-do-i-deal-with-non-english-content ), and the rule is really strictly respected. For questions in Russian, you have https://ru.stackoverflow.com/ , though ;) – Thierry Lathuille Aug 03 '21 at 10:20
  • @ThierryLathuille This applies to the English content, not problems with non-English symbols. And this doesn't necessarily have to be about other languages, it could be a different UTF-8 character (for example, a checkmark). – Anonymous Aug 03 '21 at 19:52
0

Replace the encoding with encoding='ISO-8859-1':

for line in open('u.item', encoding='ISO-8859-1'):
    print(line)

0

Use this if you are directly loading data from GitHub or Kaggle:

df = pd.read_csv(file, encoding='ISO-8859-1')

0

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte

The above error occurs because the file is not UTF-8 encoded.

Solution: use encoding='latin-1'.

Reference: https://pandas.pydata.org/docs/search.html?q=encoding

Kalluri