
I have a large CSV file (3.5 GB) and I want to read it using pandas.

This is my code:

import pandas as pd
tp = pd.read_csv('train_2011_2012_2013.csv', sep=';', iterator=True,
                 chunksize=20000000, low_memory=False)
df = pd.concat(tp, ignore_index=True)

I get this error:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:8771)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:23325)()

CParserError: Error tokenizing data. C error: out of memory

My machine has 8 GB of RAM.

ettanany
Amal Kostali Targhi
  • What about just `pd.read_csv('train_2011_2012_2013.csv', sep=';')`? – Zeugma Dec 23 '16 at 14:35
  • In addition to any other suggestions, you should also specify `dtypes`. – 3novak Dec 23 '16 at 14:49
  • @Boud my computer doesn't support it – Amal Kostali Targhi Dec 23 '16 at 21:42
  • ℕʘʘḆḽḘ's answer below uses even more memory, because you load each chunk and append it to mylist, creating a second copy of the data. You should read in a chunk, process it, store the result, then continue reading the next chunk (a sketch of this pattern follows below). Also, setting dtype for the columns will reduce memory. – marneezy May 23 '17 at 18:55
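A minimal sketch of that read-process-store pattern, assuming you only need a small per-group aggregate rather than the raw rows; the column names and dtypes here are hypothetical, chosen purely for illustration:

import pandas as pd

# Hypothetical columns and narrow dtypes; adjust to the real file
dtypes = {'DATE': 'object', 'ASS_ID': 'int32', 'CSPL_CALLS': 'int32'}

results = []
for chunk in pd.read_csv('train_2011_2012_2013.csv', sep=';',
                         usecols=list(dtypes), dtype=dtypes,
                         chunksize=100000):
    # Reduce each chunk to a small per-chunk aggregate
    results.append(chunk.groupby('ASS_ID')['CSPL_CALLS'].sum())

# Combine the small aggregates instead of the raw rows
total = pd.concat(results).groupby(level=0).sum()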

4 Answers


Try this:

import pandas as pd

mylist = []

# Read the file in chunks of 20,000 rows instead of all at once
for chunk in pd.read_csv('train_2011_2012_2013.csv', sep=';', chunksize=20000):
    mylist.append(chunk)

# Reassemble the chunks into a single DataFrame, then free the list
big_data = pd.concat(mylist, axis=0)
del mylist
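Note that pd.concat(mylist, axis=0) still materializes the full DataFrame, and while it runs the chunks in mylist coexist with the result, so peak memory is roughly double the data size. Chunking only bounds how much the parser holds at once; if the combined data does not fit in RAM, this will still raise MemoryError, as the first comment below reports.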
ℕʘʘḆḽḘ
  • Thanks for your help, but there is an error in big_data = pd.concat(mylist, axis=0): it fails with MemoryError at out = np.empty(out_shape, dtype=dtype). – Amal Kostali Targhi Dec 23 '16 at 16:53
  • Loaded a 3 GB CSV successfully! Thanks! – hzitoun Jul 21 '18 at 22:16
  • Just came across this. Perfect! – Kokokoko Oct 11 '19 at 11:56
  • I am reading two big CSV files one after another, and this is not working. Any suggestions, please? My CSV is 980 MB. – Hasnu zama Jun 10 '20 at 10:39

You may try setting error_bad_lines=False when reading the CSV file, i.e.

import pandas as pd

# Skip lines the parser cannot tokenize instead of raising an error
df = pd.read_csv('my_big_file.csv', error_bad_lines=False)
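Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; the replacement is the on_bad_lines parameter. Also keep in mind that skipping malformed rows silently drops data and will not by itself cure an out-of-memory error:

import pandas as pd

# pandas >= 1.3: on_bad_lines='skip' replaces error_bad_lines=False
df = pd.read_csv('my_big_file.csv', on_bad_lines='skip')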
Dutse I

This error can also be caused by chunksize=20000000 being too large; decreasing it fixed the issue in my case. ℕʘʘḆḽḘ's solution also uses a much smaller chunksize, which might be what did the trick there.
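For illustration, the question's call with a much smaller chunk size (100,000 here is an arbitrary value; tune it to your RAM):

import pandas as pd

# 100,000 rows per chunk instead of 20,000,000
tp = pd.read_csv('train_2011_2012_2013.csv', sep=';', chunksize=100000)
df = pd.concat(tp, ignore_index=True)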

Justas
  • If it is already answered in ℕʘʘḆḽḘ's solution, then just leave a comment; there is no need to post it as an answer. – Mark Melgo Mar 05 '19 at 08:13
  • I wanted to do that but didn't have enough reputation. I just wanted to leave this info for future reference; I hadn't found it when googling for this error. – Justas Mar 05 '19 at 10:58

You may try adding the parameter engine='python'. It loads the data more slowly, but it helped in my situation.
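For example, with the file from the question:

import pandas as pd

# The Python engine is slower than the default C engine,
# but it is more tolerant of some problematic files
df = pd.read_csv('train_2011_2012_2013.csv', sep=';', engine='python')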

Thinker