
I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:

Traceback (most recent call last):
   File "SCRIPT LOCATION", line NUMBER, in <module>
     text = file.read()
   File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
Aran-Fey
Eden Crow
  • For the same error, this solution helped me: [solution of charmap error](https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c) – Shubham Sharma Sep 14 '17 at 11:58
  • See [Processing Text Files in Python 3](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html) to understand why you get this error. – Andreas Haferburg Apr 24 '18 at 14:33
  • For Python > 3.6, set the interpreter option (argument) to include `-Xutf8` (that should fix it). – Arthur MacMillan Nov 07 '21 at 11:22

11 Answers


The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.

You specify the encoding when you open the file:

file = open(filename, encoding="utf8")
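As an illustration of why the default Windows codec fails on byte 0x90 while UTF-8 succeeds, here is a minimal sketch (the Cyrillic letter is just a stand-in for whatever non-Latin text the file contains):

```python
# The Cyrillic letter "А" (U+0410) encodes to the two UTF-8 bytes D0 90.
raw = "А".encode("utf-8")
assert raw == b"\xd0\x90"

# Decoding with the right codec works...
print(raw.decode("utf-8"))            # А

# ...but cp1252 (the Windows default shown in the traceback) has no
# character assigned to byte 0x90, hence the UnicodeDecodeError.
try:
    raw.decode("cp1252")
except UnicodeDecodeError as e:
    print(e.reason)                   # character maps to <undefined>

# The fix from this answer: tell open() which codec to use.
# with open(filename, encoding="utf8") as f:
#     text = f.read()
```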
fat
Lennart Regebro
  • Cool, I had that problem with some Python 2.7 code that I tried to run in Python 3.4. Latin-1 worked for me! – 1vand1ng0 Apr 14 '15 at 08:56
  • if you're using Python 2.7, and getting the same error, try the `io` module: `io.open(filename,encoding="utf8")` – christopherlovell Jun 03 '15 at 14:02
  • +1 for specifying the encoding on read. p.s. is it supposed to be encoding="utf8" or is it encoding="utf-8" ? – Davos Feb 03 '16 at 23:03
  • @1vand1ng0: of course Latin-1 works; it'll work for any file regardless of what the actual encoding of the file is. That's because all 256 possible byte values in a file have a Latin-1 codepoint to map to, but that doesn't mean you get legible results! If you don't know the encoding, even opening the file in binary mode instead might be better than assuming Latin-1. – Martijn Pieters Mar 06 '17 at 14:10
  • I get the OP error even though the encoding is already specified correctly as UTF-8 (as shown above) in open(). Any ideas? – enahel Nov 15 '17 at 07:11
  • thanks! My program worked well under Python 3 on Ubuntu, but on Windows it was giving errors, so I supposed I'd have to specify the encoding wherever I'm reading or writing non-Latin text. Funny, I'd assumed Python 3 is Unicode by default. – Nikhil VJ Feb 10 '18 at 14:18
  • It is unicode by default, but unicode is not an encoding. https://regebro.wordpress.com/2011/03/23/unconfusing-unicode-what-is-unicode/ – Lennart Regebro Feb 16 '18 at 16:16
  • after I used this that error got resolved but then I got this error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position – Mona Jalal Apr 01 '18 at 18:58
  • Then it's not UTF8. – Lennart Regebro Apr 05 '18 at 15:44
  • `filename = "C:\Report.txt" with open(filename,encoding ="utf8") as my_file: text = my_file.read() print(text)` even after using this I am getting the same error. I have also tried with other encoding but all in vain. In this code I am also using `from geotext import GeoText`. Please suggest a solution. – Salah Jun 04 '18 at 14:37
  • @Salah, since all else has failed you can try the bottom-most answer from Declan Nnadozie. It may not provide fully legible results but depending on your application this may still be acceptable. – JDM Feb 06 '19 at 13:26
  • @MartijnPieters thanks! your comment was more general and is worth an answer :) – vrintle May 31 '19 at 11:24
  • @MartijnPieters thank you for your answer it helped a lot, Also using `errors='ignore'` works fine too. – Ashwaq Aug 22 '19 at 08:46
  • The suggested encoding string should have a dash and therefore it should be: open(csv_file, encoding='utf-8') (as tested on Python3) – rob_7cc Jan 04 '21 at 16:34
  • Helped so much, thank you! Are all txt files in utf-8 encoding by default? – Password-Classified Jul 01 '21 at 18:12
  • @Password-Classified No, there is no default for txt files. On Windows it's usually some Windows-only encoding by default. – Lennart Regebro Jul 07 '21 at 18:42

If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), which silently drops any characters that cannot be decoded. (docs)
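A quick sketch of what `errors="ignore"` does at the byte level (0x90 is the undefined-in-cp1252 byte from the question; the surrounding bytes are arbitrary):

```python
raw = b"abc\x90def"

# With errors="ignore", the undecodable byte is silently dropped:
print(raw.decode("cp1252", errors="ignore"))   # abcdef
```

Note the data loss: the output is one character shorter than the input, which is exactly the warning raised in the comments below.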

Ben
Declan Nnadozie
  • Many thanks - I will give this a try. There are some invalid characters in parts of files I do not care about. – Stephen Nutt Sep 24 '18 at 15:08
  • Warning: This will result in data loss when unknown characters are encountered (which may be fine depending on your situation). – Hans Goldman Feb 28 '19 at 00:46
  • The suggested encoding string should have a dash and therefore it should be: open(csv_file, encoding='utf-8') (as tested on Python3) – rob_7cc Jan 04 '21 at 16:35
  • Thanks ignoring the errors worked for me – Ayan Apr 17 '22 at 10:32

Alternatively, if you don't need to decode the file's contents, such as when uploading the file to a website, use:

open(filename, 'rb')

where r = reading, b = binary
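A minimal sketch of binary mode (using a throwaway temp file): since `'rb'` returns raw bytes, no decoding ever happens, so even the problematic byte 0x90 cannot raise a UnicodeDecodeError.

```python
import os
import tempfile

# Create a throwaway file containing the troublesome byte 0x90.
path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(b"abc\x90def")

# 'r' = reading, 'b' = binary: read() returns bytes, not str.
with open(path, "rb") as f:
    data = f.read()

print(type(data))   # <class 'bytes'>
print(data)         # b'abc\x90def'
os.remove(path)
```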

MendelG
Kyle Parisi
  • Perhaps emphasize that the `b` will produce `bytes` instead of `str` data. Like you note, this is suitable if you don't need to process the bytes in any way. – tripleee May 10 '22 at 07:23

As an extension to @LennartRegebro's answer:

If you can't tell what encoding your file uses and the solution above does not work (i.e. it's not UTF-8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
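A brute-force offline alternative to the online tools (a sketch; the cp1252-encoded sample string stands in for bytes you would read from the file in `'rb'` mode): try a handful of common codecs and see which ones decode without error.

```python
# Bytes of unknown encoding (here secretly cp1252, for the demo).
raw = "naïve café".encode("cp1252")

candidates = ["utf-8", "cp1252", "latin-1", "cp437"]

for enc in candidates:
    try:
        text = raw.decode(enc)
        print(f"{enc}: OK -> {text!r}")
    except UnicodeDecodeError:
        print(f"{enc}: failed")
```

Caveat from the comments above: single-byte codecs like latin-1 and cp437 accept any byte sequence, so "decodes without error" only narrows the field; you still have to check whether the result is legible.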

EDIT (copied from a comment):

The quite popular text editor Sublime Text has a command to display the encoding, if it has been set:

  1. Go to View -> Show Console (or Ctrl+`).

  2. Type `view.encoding()` into the field at the bottom and hope for the best (I was unable to get anything but `Undefined`, but maybe you will have better luck...)

Stevoisiak
Matas Vaitkevicius
  • Some text editors will provide this information as well. I know that with vim you can get this via `:set fileencoding` ([from this link](http://superuser.com/questions/28779/how-do-i-find-the-encoding-of-the-current-buffer-in-vim)) – PaxRomana99 Dec 17 '16 at 15:20
  • Sublime Text, also -- open up the console and type `view.encoding()`. – JimmidyJoo Jul 12 '17 at 20:27
  • alternatively, you can open your file with notepad. 'Save As' and you shall see a drop-down with the encoding used – don_Gunner94 Mar 05 '20 at 12:11

TLDR: Try: file = open(filename, encoding='cp437')

Why? When one uses:

file = open(filename)
text = file.read()

Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it with that codec. If the file contains bytes with no character defined in this codepage (like 0x90), we get UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file contains mixed encodings.

If such characters are unneeded, one may decide to replace them with the Unicode replacement character (U+FFFD), with:

file = open(filename, errors='replace')

Another workaround is to use:

file = open(filename, errors='ignore')

The offending characters are then simply dropped, but other errors will be masked too.

A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):

file = open(filename, encoding='cp437')

Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
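A quick sketch of the difference: cp437 assigns a character to every one of the 256 byte values, while cp1252 leaves several (including 0x90) undefined.

```python
every_byte = bytes(range(256))

# cp437 defines all 256 codes, so decoding can never fail...
text = every_byte.decode("cp437")
print(len(text))              # 256 -- every byte got a character

# ...whereas cp1252 has holes (e.g. 0x81, 0x8D, 0x8F, 0x90, 0x9D),
# so the same bytes raise the error from the question.
try:
    every_byte.decode("cp1252")
except UnicodeDecodeError as e:
    print(e.reason)           # character maps to <undefined>
```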

Olivia Stork
rha
  • Probably you should emphasize even more that randomly guessing at the encoding is likely to produce garbage. You have to _know_ the encoding of the data. – tripleee May 10 '22 at 07:21

Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:

open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')

Godspeed

E.Zolduoarrati

For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me solve it.

Open the file in Notepad++. In the bottom right it will tell you the current file encoding. In the top menu, next to "View", locate "Encoding". Under "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case, the encoding "Windows-1252" was found under "Western European".

Antoni
  • Only the viewing encoding is changed in this way. In order to effectively change the file's encoding, change preferences in Notepad++ and create a new document, as shown here: https://superuser.com/questions/1184299/is-there-a-way-to-force-notepad-encoding-to-windows-1252. – hanna Aug 06 '20 at 10:36

Before you apply the suggested solution, you can check which character corresponds to the offending byte in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090),

and then consider removing it from the file.

hanna
  • I have a web page at https://tripleee.github.io/8bit/#90 where you can look up the character's value in the various 8-bit encodings supported by Python. With enough data points, you can often infer a suitable encoding (though some of them are quite similar, and so establishing _exactly_ which encoding the original writer used will often involve some guesswork, too). – tripleee May 10 '22 at 07:24

For me, encoding with utf16 worked:

file = open('filename.csv', encoding="utf16")
gabi939

In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).

Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
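For reference, both forms look like this on the command line (`your_script.py` is a placeholder):

```shell
# Interpreter option (Python 3.7+): forces UTF-8 mode regardless of locale.
python -Xutf8 your_script.py

# Equivalent via the environment variable (POSIX shell syntax shown;
# on Windows cmd use:  set PYTHONUTF8=1  first, then run python).
PYTHONUTF8=1 python your_script.py
```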


For me, changing the MySQL character encoding to match my code helped to sort out the solution:

photo = open('pic3.png', encoding="latin1")

SuperStormer
Piyush raj