Writing large Pandas Dataframes to CSV file in chunks

Question

How do I write out a large data files to a CSV file in chunks?

I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of the data files are of interest to me.

I want to make things easier by making copies of these files with only the columns of interest so I have smaller files to work with for post-processing. So I plan to read the file into a dataframe, then write to csv file.

I've been looking into reading large data files in chunks into a dataframe. However, I haven't been able to find anything on how to write out the data to a csv file in chunks.

Here is what I'm trying now, but this doesn't append the csv file:

with open(os.path.join(folder, filename), 'r') as src:
    df = pd.read_csv(src, sep='\t',skiprows=(0,1,2),header=(0), chunksize=1000)
    for chunk in df:
        chunk.to_csv(os.path.join(folder, new_folder,
                                  "new_file_" + filename), 
                                  columns = [['TIME','STUFF']])

score 41 · Accepted Answer · edited Apr 24 '20 at 10:18

41

Solution:

header = True
for chunk in chunks:

    chunk.to_csv(os.path.join(folder, new_folder, "new_file_" + filename),
        header=header, cols=[['TIME','STUFF']], mode='a')

    header = False

Notes:

The mode='a' tells pandas to append.
We only write a column header on the first chunk.

edited Apr 24 '20 at 10:18

smci

29,564
18
109
144

answered Jul 22 '16 at 16:27

Scratch'N'Purr

8,710
2
30
47

1

I've noticed that when I append using mode='a', the column labels are written after every chunk. How do I make sure column labels only appear at the beginning of the file? – Korean_Of_the_Mountain Jul 22 '16 at 20:00
10

You can pass header=None to all but the first chunk – DG2 Jun 01 '17 at 18:46
4

You could do `for i, chunk in chunks:`, and then `header=(i==0)` – naught101 Dec 21 '20 at 02:31

score 18 · Answer 2 · answered Jul 22 '16 at 16:27

18

Check out the chunksize argument in the to_csv method. Here are the docs.

Writing to file would look like:

df.to_csv("path/to/save/file.csv", chunksize=1000, cols=['TIME','STUFF'])

answered Jul 22 '16 at 16:27

Alex

17,062
7
54
78

Hmm I got the following error using your proposed method: AttributeError: 'TextFileReader' object has no attribute 'to_csv' Your answer is still assuming I'm reading into "df" in chunks? – Korean_Of_the_Mountain Jul 22 '16 at 16:46
2

This is for a complete DataFrame. – Alex Jul 22 '16 at 17:07
this is not helpful when streaming a giant dataframe from one file to another, in that case mode='a' is better. – denfromufa Nov 09 '17 at 20:21
@denfromufa Is that for sure? chunksize could mean writing in batches, could it not? And then it would have to be done in `append` mode anyway. Or am I missing something? I do not know the technical details, though, just a guess. Has anybody more insight into this, is this here the same as the accepted answer with its loop? – questionto42standswithUkraine Jan 31 '22 at 16:44
I can assure that this worked on a 50 MB file on 700000 rows with chunksize 5000 many times faster than a normal csv writer that loops over batches. I have not checked the loop over dataframes in `append` mode as in the accepted answer, but this answer cannot be bad at least. Brought down the Cloud Function time down to 62s from >9min timeout limit before (I do not even know how long it would have taken for writing all data, but much longer, obviously). – questionto42standswithUkraine Feb 01 '22 at 14:16

Alexander · Answer 3 · 2016-07-22T20:18:04.377

1

Why don't you only read the columns of interest and then save it?

file_in = os.path.join(folder, filename)
file_out = os.path.join(folder, new_folder, 'new_file' + filename)

df = pd.read_csv(file_in, sep='\t', skiprows=(0, 1, 2), header=0, names=['TIME', 'STUFF'])
df.to_csv(file_out)

edited Jul 22 '16 at 20:18

answered Jul 22 '16 at 17:11

Alexander

96,739
27
183
184

Just in case I came across files that were so big that I would have to read in as chunks. I don't believe your code would allow me to do that, correct? – Korean_Of_the_Mountain Jul 22 '16 at 19:59
Correct, but it is still much more efficient. If that were the case, you would still need to chunk or else use the csv module. – Alexander Jul 22 '16 at 20:03

Writing large Pandas Dataframes to CSV file in chunks

3 Answers3

Linked

Related