29

I'm reading a basic csv file where the columns are separated by commas with these column names:

userid, username, body

However, the body column is a string which may contain commas. Obviously this causes a problem and pandas throws out an error:

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8

Is there a way to tell pandas to ignore commas in a specific column or a way to go around this problem?

David
  • 1,344
  • 3
  • 14
  • 24

3 Answers3

32

Imagine we're reading your dataframe called comma.csv:

userid, username, body
01, n1, 'string1, string2'

One thing you can do is to specify the delimiter of the strings in the column with:

df = pd.read_csv('comma.csv', quotechar="'")

In this case strings delimited by ' are considered as total, no matter commas inside them.

Fabio Lamanna
  • 18,348
  • 20
  • 86
  • 120
  • 3
    I don't think his string has quotes, because if it did then pandas would recognize it without `quotechar` – Leb Sep 23 '15 at 15:44
  • Like Leb said, I don't have quotes around the body – David Sep 23 '15 at 15:45
  • 1
    @David Ok got it. Your dataframe has been provided "as is", or you can do some preprocessing on it? – Fabio Lamanna Sep 23 '15 at 15:48
  • @Fabio: dataframe is an output of a hive SQL query. I then replace the default tab delimiter with ",". I tried to avoid replacing the tab delimiter and keeping it as is, trying to do `pd.read_csv(...., sep='\t')` but it turns out that some `body` fields have tabs in them as well, so it's all just a mess. – David Sep 23 '15 at 15:52
  • 2
    @David have a look at [this](http://stackoverflow.com/q/14550441/2699288) question, hope that can help you. – Fabio Lamanna Sep 23 '15 at 15:58
  • 1
    @David Why no import straight from the SQL, maybe something like this http://stackoverflow.com/questions/12047193/how-to-convert-sql-query-result-to-pandas-data-structure – Leb Sep 23 '15 at 15:59
  • Thank you to both of you for your feedback. I figured it out with @Fabio's suggested question. – David Sep 24 '15 at 13:17
  • 2
    I tried this example, but does not work for me. Python 3, pandas 0.18.0 reads `comma.csv` but `01` become the index `1`, and `username` become only `'string1` :( – ragesz May 06 '16 at 13:50
11

Add usecols and lineterminator to your read_csv() function, which, n is the len of your columns.

In my case:

n = 5 #define yours
df = pd.read_csv(file,
                 usecols=range(n),
                 lineterminator='\n',
                 header=None)
Ilyas
  • 1,528
  • 14
  • 9
1

Does this help?

import csv
with open("csv_with_commas.csv", newline='', encoding = 'utf8') as f:
    csvread = csv.reader(f)
    batch_data = list(csvread)
    print(batch_data)

Reference:

[1] https://stackoverflow.com/a/40477760/6907424

[2] To combat "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 157: character maps to undefined": https://stackoverflow.com/a/9233174/6907424

hafiz031
  • 1,730
  • 3
  • 17
  • 38