367

I have a socket server that is supposed to receive valid UTF-8 text from clients.

The problem is that some clients (mainly hackers) are sending all kinds of malformed data over it.

I can easily distinguish genuine clients, but I am logging all the data sent to files so I can analyze it later.

Sometimes I get characters like œ that cause a UnicodeDecodeError.

I need to be able to make the string valid UTF-8, with or without those characters.


Update:

In my particular case the socket service was an MTA (Mail Transfer Agent), and thus I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <john.doe@example.com>
...

I was logging all of this in JSON.

Then some folks out there without good intentions decided to send all kinds of junk.

That is why, for my specific case, it is perfectly OK to strip the non-ASCII characters.
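
For illustration, a minimal sketch of that sanitizing step in Python 3 syntax (the variable names and the sample junk bytes are hypothetical):

import json

# hypothetical raw bytes received from a misbehaving client
raw_bytes = b'MAIL FROM: <john.doe@example.com>\xc0\x9c'

# decode as ASCII and drop anything outside it, so json.dumps cannot fail
clean = raw_bytes.decode('ascii', errors='ignore')
log_entry = json.dumps({'command': clean})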

transilvlad
  • Does the string come out of a file or a socket? Could you please post code examples of how the string is encoded and decoded before it is sent through the socket/file handler? – devsnd Sep 17 '12 at 23:05
  • Did I write or didn't I write that the string comes over the socket? I simply read the string from the socket and want to put it in a dictionary and then JSON it to send it along. The JSON function failed due to those characters. – transilvlad Sep 18 '12 at 09:05
  • Can you please post sample data showing the problem? – Shubham Sharma Sep 14 '17 at 11:51

10 Answers

401

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: the `ignore` variant will strip out the characters in question, returning the string without them; `replace` substitutes U+FFFD for each invalid sequence.

For me this is the ideal case, since I'm using it as protection against non-ASCII input, which is not allowed by my application.
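
On Python 3 there is no `unicode` built-in (see the comments below); the equivalent is to call `.decode()` on the raw bytes. A minimal sketch, assuming `data` holds the bytes read from the socket:

# Python 3: substitute U+FFFD for each invalid sequence...
text = data.decode('utf-8', errors='replace')

# ...or drop invalid sequences entirely
text = data.decode('utf-8', errors='ignore')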

Alternatively: use the open function from the codecs module to read in the file:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    data = fdata.read()  # undecodable bytes are silently dropped
Max Ghenis
transilvlad
  • Yes, though this is usually bad practice/dangerous, because you'll just lose characters. Better to determine or detect the encoding of the input string and decode it to unicode first, then encode as UTF-8, for example: `str.decode('cp1252').encode('utf-8')` – Ben Hoyt Sep 17 '12 at 23:15
  • In some cases yes, you are right, it might cause problems. In my case I don't care about them, as they seem to be extra characters originating from the bad formatting and programming of the clients connecting to my socket server. – transilvlad Sep 18 '12 at 09:24
  • This one actually helps if the content of the string is actually invalid; in my case `'\xc0msterdam'`, which turns into `u'\ufffdmsterdam'` with replace. – PvdL Jan 04 '16 at 21:44
  • If you ended up here because you are having problems reading a file, opening the file in binary mode might help: `open(file_name, "rb")`, and then apply Ben's approach from the comments above. – kristian Nov 11 '16 at 17:18
  • The same option applies more widely, e.g. to `something.decode()`. – Alexander Stohr Mar 17 '20 at 15:31
  • "For me this is ideal case since I'm using it as protection against non-ASCII input which is not allowed by my application." That still allows input that is valid UTF-8 but not valid ASCII. – Sören Mar 14 '22 at 15:31
  • How can I import `unicode`? – alper Mar 16 '22 at 10:23
130

Changing the engine from C to Python did the trick for me.

Engine is C:

pd.read_csv(gdp_path, sep='\t', engine='c')

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

Engine is Python:

pd.read_csv(gdp_path, sep='\t', engine='python')

No errors for me.
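
Switching engines sidesteps the decoding problem rather than fixing it, and as a comment below notes it can be costly on large files. An alternative sketch is to name the encoding explicitly; byte 0x92 is a curly apostrophe in Windows-1252, so `cp1252` is a plausible guess for this particular file:

import pandas as pd

# explicit encoding avoids the decode error without changing engines
df = pd.read_csv(gdp_path, sep='\t', encoding='cp1252')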

Doğuş
  • That's actually a good solution. I don't know why it was downvoted. – ℕʘʘḆḽḘ Feb 15 '18 at 18:34
  • This may not be a good idea if you have a huge `csv` file; it could lead to an out-of-memory error or an automatic restart of your notebook's kernel. You should set the `encoding` in this case. – LucasBr Apr 06 '19 at 13:51
  • Excellent answer, thank you; this worked for me. I had a '?' inside a diamond-shaped replacement character that was causing the issue. To the naked eye it looked like '"', i.e. an inch mark. I did two things to figure it out: a) `df = pd.read_csv('test.csv', nrows=10000)` worked fine without the engine option, so I incremented `nrows` to find which row had the error; b) `df = pd.read_csv('test.csv', engine='python')` worked, and I printed the bad record with `df.iloc[36145]`. – Jagannath Banerjee Sep 26 '19 at 12:46
  • This worked for me too... Not sure what is happening 'under the hood', or whether this is actually a nice/good/proper solution in all cases, but it did the trick for me ;) – Chrisvdberge Jan 06 '20 at 07:34
  • Although it worked for me, I find it *so* unintuitive. How in the world would I figure it out without someone pointing it out? I am curious to know where it comes from... – Green Feb 11 '20 at 10:59
76

This type of issue crops up for me now that I've moved to Python 3. I had no idea Python 2 was simply steamrolling over any issues with file encoding.

I found this nice explanation of the differences and how to find a solution after none of the above worked for me.

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

In short, to make Python 3 behave as similarly as possible to Python 2, use:

with open(filename, encoding="latin-1") as datafile:
    data = datafile.read()  # work on data here

However, read the article; there is no one-size-fits-all solution.
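
The reason latin-1 never raises is that it maps all 256 byte values directly to code points, so decoding cannot fail; it can only produce wrong characters for input that was not really Latin-1. A quick check:

all_bytes = bytes(range(256))
text = all_bytes.decode('latin-1')            # never raises
assert text.encode('latin-1') == all_bytes    # and round-trips losslessly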

James McCormac
36
>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
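
That snippet is Python 2; in Python 3 the same decode is done on a bytes literal:

>>> b'\x9c'.decode('cp1252')
'œ'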
Ignacio Vazquez-Abrams
  • I'm confused: how did you choose cp1252? It worked for me, but why? I don't know, and now I'm lost :/ Could you elaborate? Thanks a lot! :) – Cyril N. Aug 22 '13 at 13:34
  • Could you present an option that works for all characters? Is there a way to detect the characters that need to be decoded so more generic code can be implemented? I see many people are looking at this, and I bet for some of them discarding is not the desired option like it is for me. – transilvlad Sep 16 '13 at 14:19
  • As you can see this question has quite the popularity. Think you could expand your answer with a more generic solution? – transilvlad Nov 26 '13 at 15:41
  • There is no more generic solution to "Guess the encoding" roulette. – Puppy Feb 02 '15 at 10:23
  • Found it using a combination of web search, luck and intuition: [cp1252](https://en.wikipedia.org/wiki/Windows-1252) was `used by default in the legacy components of Microsoft Windows in English and some other Western languages`. – bolov Nov 28 '15 at 21:58
  • When I try to convert the result into hex I am getting the following error: `Cannot convert str to hex string` – alper Mar 16 '22 at 10:53
36

First, use get_encoding_type to detect the file's encoding:

from chardet import detect  # third-party package: pip install chardet

# detect the encoding of a file from its raw bytes
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

Second, open the file with that encoding:

open(current_file, 'r', encoding=get_encoding_type(current_file), errors='ignore')
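
One caveat: chardet's detect can return None for the encoding when the data is empty or ambiguous, so a fallback (here utf-8, an arbitrary choice) is worth adding:

# fall back to UTF-8 if detection fails
enc = get_encoding_type(current_file) or 'utf-8'
with open(current_file, 'r', encoding=enc, errors='ignore') as f:
    data = f.read()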
Ivan Lee
29

I had the same problem with UnicodeDecodeError and I solved it with this line. I don't know if it is the best way, but it worked for me.

str = str.decode('unicode_escape').encode('utf-8')
maiky_forrester
18

This solution works nicely with Latin American accents, such as 'ñ'.

I solved this problem just by adding:

df = pd.read_csv(fileName,encoding='latin1')
Talha Rasool
  • Worked for me too, but I wonder what's going to happen to the Chinese-, Greek- and Russian-named media on my drive. To be continued... – Sridhar Sarnobat Dec 13 '21 at 05:11
6

I resolved this problem using this code:

df = pd.read_csv(path, engine='python')
Sathiamoorthy
4

Just in case someone has the same problem: I'm using Vim with YouCompleteMe, which failed to start ycmd with this error message. What I did was run `export LC_CTYPE="en_US.UTF-8"`, and the problem was gone.

http8086
3

What can you do if you need to make a change to a file, but don’t know the file’s encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()
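
Because surrogateescape maps each undecodable byte to a lone surrogate code point, writing the data back with the same handler restores the original bytes exactly, which makes a read-modify-write of the ASCII parts safe:

with open(fname, 'w', encoding="ascii", errors="surrogateescape") as f:
    f.write(data)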
Krisztián Balla