0

I'm trying to load a CSV into a pandas DataFrame. The CSV has columns such as text, display_name and user_status, and these fields can contain commas (,) and quotemarks (" or ') in their values. The CSV's delimiter is also the comma, and since all the fields are strings, every field is wrapped in quotemarks (see the example below). When I import the file, every line with these extra commas is skipped (Skipping line 10: Expected 10 fields in line 10, saw 12). How can I deal with this?

CSV header:

text,created_at,verified,followers_count,friends_count,statuses_count,user.created_at,screen_name,name,link

Example row:

'text_string, it can contain commas and "quotemarks"','2020-04-08 03:00:47','False','278','631','13869','2018-07-03 20:18:49','screen_name','name','tweet_link'
Jan
olenscki
  • How are you importing your data? Can you show us your code? – trotta Apr 22 '20 at 14:04
  • You should use `'` as quote parameter for the parser (default is probably `"`). – Danny_ds Apr 22 '20 at 14:06
  • @trotta yes: `df = pd.read_csv(file_path, sep=",", header=0, error_bad_lines=False, encoding="utf-8", engine='python')`. I'm using `error_bad_lines=False` because the code crashes without it. – olenscki Apr 22 '20 at 14:09

3 Answers

4

You can use the following code:

df = pd.read_csv('data.csv', quotechar="'")

Taken from this question: "pandas read csv with extra commas in column".
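A minimal sketch of how `quotechar="'"` behaves on a row shaped like the one in the question. The shortened three-column sample and the use of `io.StringIO` instead of a file are assumptions made for illustration, and `dtype=str` is added only so that values such as `'False'` stay strings.

```python
import io

import pandas as pd

# Hypothetical sample: an unquoted header and one row whose fields are
# wrapped in apostrophes, modelled on the example in the question.
raw = (
    "text,created_at,verified\n"
    "'text_string, it can contain commas and \"quotemarks\"',"
    "'2020-04-08 03:00:47','False'\n"
)

# quotechar="'" makes the parser treat everything between two apostrophes
# as one field, so the comma and the double quotes inside it are kept.
df = pd.read_csv(io.StringIO(raw), quotechar="'", dtype=str)

print(df.shape)
print(df.loc[0, "text"])
```

This only helps while the values contain no unescaped apostrophe of their own; a field such as `'it's here'` may still be split in the wrong place.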

hatef alipoor
3

Try setting the quotechar in pandas.read_csv:

df = pd.read_csv("tweets.csv", quotechar="'")

Nathan Mathews
1

If you can, open your CSV in a text editor and use a regular expression to change the separator commas to something else, e.g. a semicolon. Search for commas sandwiched between quotemarks:

find: ','
replace: ';'

Save the file and specify the semicolon as the separator character:

foo = pd.read_csv('commas.csv', sep=';')
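
The same find-and-replace could also be done in Python instead of a text editor. This is a rough sketch under a few assumptions: the sample data is a hypothetical three-column excerpt, the header is converted separately because it has no quotemarks in the question's file, and `quotechar="'"` is added (the line above does not pass it) so the surrounding apostrophes are stripped from the values.

```python
import io

import pandas as pd

# Hypothetical sample shaped like the file in the question.
raw = (
    "text,created_at,verified\n"
    "'text_string, it can contain commas and \"quotemarks\"',"
    "'2020-04-08 03:00:47','False'\n"
)

header, _, body = raw.partition("\n")

# The header is unquoted, so its plain commas are replaced directly.
header = header.replace(",", ";")

# In the data rows, only a comma sandwiched between apostrophes (',')
# is a separator; commas inside a value are left alone.
body = body.replace("','", "';'")

converted = header + "\n" + body
foo = pd.read_csv(io.StringIO(converted), sep=";", quotechar="'", dtype=str)

print(foo.loc[0, "text"])
```

The replacement would misfire on any value that itself contains the sequence `','`, so it is worth checking the column count afterwards.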
Hans Roelofsen