37

I am trying to read a dataset into a DataFrame called df1, but it does not work:

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")

df1.head()

The code above produces a long traceback; this is the most relevant part:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
Henry Ecker
Tuyen

4 Answers

63

The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:

b'Korea, Dem. People\x92s Rep.'

Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’:

df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
                  sep=";", encoding='cp1252')
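As a quick sanity check on that diagnosis, here is a minimal sketch that decodes the sample bytes shown above both ways. It only uses the bytes literal from this answer, not the full file, so it says nothing about other rows:

```python
# The one non-ASCII fragment quoted above, as raw bytes.
raw = b"Korea, Dem. People\x92s Rep."

# UTF-8 rejects a lone 0x92 because it is a continuation byte, not a start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    bad_pos = exc.start          # index of the offending byte
    bad_byte = raw[exc.start]    # its integer value

# cp1252 maps 0x92 to U+2019, RIGHT SINGLE QUOTATION MARK.
decoded = raw.decode("cp1252")

print(utf8_ok, hex(bad_byte), bad_pos)
print(decoded)
```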

Demo:

>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
...                   sep=";", encoding='cp1252')
>>> df1.head()
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..

Note, however, that Pandas seems to take the HTTP headers at face value and produces mojibake when you load the data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is decoded correctly, but loading from the URL produces re-coded data:

>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'
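The round trip can be reproduced without any network access. This sketch assumes the text was encoded as UTF-8 somewhere along the way and then decoded as cp1252, which is what the `.encode('cp1252').decode('utf8')` repair above implies:

```python
original = "Korea, Dem. People\u2019s Rep."

# Encoding as UTF-8 turns the quote into three bytes (E2 80 99);
# decoding those bytes as cp1252 yields three separate characters.
mojibake = original.encode("utf-8").decode("cp1252")

# Reversing the two steps recovers the intended text.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(repr(mojibake))
print(repr(repaired))
```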

This is a known bug in Pandas. You can work around it by using urllib.request to open the URL and passing the response to pd.read_csv() instead:

>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
...     df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
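The same idea, taking control of the decoding yourself instead of leaving it to the loader, can be sketched with the standard library alone. Here `fake_response` is a hypothetical in-memory stand-in for the HTTP response, and its two rows are made-up sample data in the file's `;`-separated layout, not the real file contents:

```python
import csv
import io

# Hypothetical stand-in for the binary response body (cp1252-encoded bytes).
fake_response = io.BytesIO(b" ;2000\nKorea, Dem. People\x92s Rep.;..\n")

# Wrap the binary stream so every read is decoded as cp1252.
text_stream = io.TextIOWrapper(fake_response, encoding="cp1252", newline="")

# Parse the decoded text; the delimiter matches sep=";" above.
rows = list(csv.reader(text_stream, delimiter=";"))

print(rows)
```

A text stream decoded this way should also be acceptable to pd.read_csv(), since it takes file-like objects; in that case no encoding argument is needed because the stream already yields str.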
Martijn Pieters
  • 4
    Hi Martijn, how did you know its encoding is cp1252? – Tuyen Sep 01 '17 at 12:31
  • 4
    @Tuyen: experience. – Martijn Pieters Sep 01 '17 at 12:32
  • You can also say `encoding='latin1'`. Works for me. – RajeshM Mar 10 '21 at 17:16
  • @RajeshM that could be because latin1 “works” on *any* file. **That doesn’t mean that the decoded text will be readable or without issues.** Windows-1252 and Latin-1 are closely related but *not the same*. If you get weird characters in your result you picked the wrong codec. – Martijn Pieters Mar 11 '21 at 08:34
  • This codec will avoid the error (as will ISO 8859-1), but both have an issue with "don't" and similar words: the apostrophe turns into a root symbol. Source: a CSV created from US-based Excel 2010. Also, should it be mentioned that cp1252 is listed as "Western European"? – DISC-O May 23 '22 at 18:49
  • @DISC-O: Not sure what you mean by "root symbol"; the [SQUARE ROOT `√` symbol](https://www.fileformat.info/info/unicode/char/221a/index.htm) [is not part of cp1252](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) so presumably you have something else? Do you know the (hex or decimal) value of the specific byte? If the original dataset used a Windows 125x series codepage, it should be hex 92, decimal 146, for a right single quotation mark, `’`. – Martijn Pieters May 27 '22 at 12:03
  • @DISC-O: "US based Excel 2010" doesn't tell me anything, unfortunately; the [default should be 1252, apparently](https://stackoverflow.com/questions/508558/what-charset-does-microsoft-excel-use-when-saving-files). "Western Europe" is *one* of the names that Microsoft uses for the codec; the misnomer "ANSI Latin 1" is also used, historically. I'm not sure what mentioning it will add to the answer? – Martijn Pieters May 27 '22 at 12:20
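The difference between the two codecs discussed in these comments is easy to see on a short byte string. A minimal sketch, using a made-up sample rather than the dataset itself:

```python
import unicodedata

raw = b"People\x92s"

as_cp1252 = raw.decode("cp1252")   # 0x92 becomes U+2019, a printable quote
as_latin1 = raw.decode("latin1")   # 0x92 becomes U+0092, an invisible control code

# Latin-1 assigns a code point to every byte value, so decoding never fails.
latin1_all = bytes(range(256)).decode("latin1")

# Python's cp1252 codec leaves a few byte values undefined, e.g. 0x81.
try:
    b"\x81".decode("cp1252")
    cp1252_rejects_0x81 = False
except UnicodeDecodeError:
    cp1252_rejects_0x81 = True

print(repr(as_cp1252), repr(as_latin1))
print(unicodedata.category(as_latin1[6]), cp1252_rejects_0x81)
```

So "no error" from latin1 only means every byte was mapped to something, not that the result is the intended text.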
6

I got this UnicodeDecodeError when a CSV created on macOS was parsed on a Windows machine. To get rid of the error, try passing the argument encoding='mac_roman' to pandas' read_csv method:

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()

Output:

                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..
navule
  • Like Latin-1 / ISO-8859-1, the [Mac OS Roman character set](https://en.wikipedia.org/wiki/Mac_OS_Roman) maps all 256 possible byte values to a single character, so it will **never** result in a decode error and will work on any file. That doesn't mean that you used the correct codec; you may still get weird characters in your dataset as a result. – Martijn Pieters May 27 '22 at 12:09
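One way to act on that warning is to decode a suspicious fragment with several candidate codecs and compare the results by eye. The helper below, `preview_decodings`, is a hypothetical name introduced for this sketch, and the sample is the byte string quoted in the accepted answer:

```python
def preview_decodings(raw, encodings):
    """Decode raw bytes with each candidate codec, recording failures as text."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError as exc:
            results[name] = f"<failed: {exc.reason}>"
    return results


sample = b"Korea, Dem. People\x92s Rep."
previews = preview_decodings(sample, ["utf-8", "cp1252", "latin1", "mac_roman"])

for name, text in previews.items():
    print(f"{name:10} {text!r}")
```

For this sample, mac_roman decodes without complaint but turns the apostrophe into an accented letter, which suggests it is not the codec the file was written with.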
0

This problem occurs because of some unexpected characters in your file. For example, your file is being read as UTF-8 but contains some characters encoded in Windows-1250. You should remove or replace these characters to solve the problem.
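If removing or replacing really is what you want, Python's decoder can do it through its error handlers. A minimal sketch on the byte string from the accepted answer (an assumption: your problem bytes look like this one):

```python
raw = b"Korea, Dem. People\x92s Rep."

# errors="replace" substitutes U+FFFD (the replacement character) for bad bytes.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops bad bytes.
ignored = raw.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

Both handlers discard the original character, so decoding with the correct codec is usually the better fix. Recent pandas versions (1.3 and later, as far as I know) expose the same handlers through the encoding_errors argument of read_csv.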

AM80
-1

This works

df = pd.read_csv(inputfile, engine = 'python')

  • Please read "[answer]" and "[Explaining entirely code-based answers](https://meta.stackoverflow.com/q/392712/128421)". It helps more if you supply an explanation why this is the preferred solution and explain how it works. We want to educate, not just provide code. – the Tin Man Mar 20 '22 at 23:07