What is the best/easiest way to split a very large DataFrame (50 GB) into multiple outputs (horizontally)?

I thought about doing something like:

stepsize = int(1e8)
# len(df) gives the row count; df.size counts cells (rows * columns)
for id, i in enumerate(range(0, len(df), stepsize)):
    # .iloc slicing is end-exclusive, so no -1 adjustment for the last row is needed
    # (.ix is deprecated; its label slicing was end-inclusive)
    df.iloc[i:i + stepsize].to_csv('/data/bs_' + str(id) + '.csv.out')

But I bet there is a smarter solution out there?

As noted by jakevdp, HDF5 is a better way to store huge amounts of numerical data; however, it doesn't meet my business requirements.
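For reference, the row-slicing attempt above can be wrapped in a small helper using `iloc` (the function name `split_to_csv` and the `prefix` parameter are placeholders of my own, not from the question; the path prefix mirrors the `/data/bs_` pattern used above):

```python
import pandas as pd


def split_to_csv(df: pd.DataFrame, stepsize: int, prefix: str) -> int:
    """Write df to CSV files of at most `stepsize` rows each; return the file count."""
    n_chunks = 0
    for id, start in enumerate(range(0, len(df), stepsize)):
        # iloc slicing is end-exclusive, so the last chunk simply ends at len(df)
        df.iloc[start:start + stepsize].to_csv(f'{prefix}{id}.csv')
        n_chunks += 1
    return n_chunks
```

This keeps only one chunk's worth of I/O in flight at a time, but the 50 GB frame itself must still fit in memory.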

Trenton McKinney
PlagTag

2 Answers

Make sure to include `id` in the filename; without it, every chunk is written to the same path and each iteration overwrites the previous one.

import numpy as np

for id, df_i in enumerate(np.array_split(df, number_of_chunks)):
    df_i.to_csv('/data/bs_{id}.csv'.format(id=id))
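As a side note, `np.array_split` (unlike `np.split`) accepts a chunk count that does not divide the row count evenly; a quick check with a toy frame (the sizes here are illustrative, not from the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)})

# 10 rows into 3 chunks: the leftover row goes to the earliest chunk
chunks = np.array_split(df, 3)
print([len(c) for c in chunks])  # → [4, 3, 3]
```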
Gautam Shahi

This answer brought me to a satisfying solution using:

for idx, chunk in enumerate(np.array_split(df, number_of_chunks)):
    chunk.to_csv(f'/data/bs_{idx}.csv')
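One small refinement worth considering (my addition, not part of the answer above): zero-padding the index in the f-string keeps the output files in numeric order when listed lexicographically:

```python
# {idx:04d} pads to four digits, so 'bs_0009.csv' sorts before 'bs_0010.csv';
# with plain {idx}, 'bs_10.csv' would sort before 'bs_9.csv'
names = [f'bs_{idx:04d}.csv' for idx in range(12)]
print(names[9], names[10])  # → bs_0009.csv bs_0010.csv
assert sorted(names) == names
```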
Trenton McKinney
PlagTag