25

Is there a fast way to serialize a DataFrame?

I have a grid system which can run pandas analysis in parallel. In the end, I want to collect all the results (as a DataFrame) from each grid job and aggregate them into a giant DataFrame.

How can I save a DataFrame in a binary format that can be loaded rapidly?
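For context, the collection step I have in mind looks roughly like this (just a sketch; the file paths are placeholders for whatever each grid job writes out):

import glob
import pandas as pd

# Each grid job dumps its result to its own file; load them all and stack them.
pieces = [pd.read_pickle(path) for path in glob.glob("results/job_*.pkl")]
big_df = pd.concat(pieces, ignore_index=True)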

Andy Hayden
James Bond
  • See related question http://stackoverflow.com/questions/12772498/serialize-pandas-python-dataframe-to-binary-format – Mihai8 Jun 06 '13 at 20:44
  • Nice [blog post](http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) with timings/discussions of the different I/O options – lunguini Feb 25 '19 at 20:36

3 Answers

26

The easiest way is just to use to_pickle (to save it as a pickle); see pickling in the docs API page:

df.to_pickle(file_name)
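To load it back later, read_pickle is the counterpart in the pandas top-level namespace:

df = pd.read_pickle(file_name)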

Another option is to use HDF5 (built on PyTables). It is slightly more work to get started but much richer for querying.
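A minimal sketch of the HDF5 route (requires PyTables; the file and key names here are just examples):

import pandas as pd

# 'table' format is slower to write but supports on-disk queries.
df.to_hdf("store.h5", key="results", format="table")

# Read back, optionally selecting only the rows you need.
subset = pd.read_hdf("store.h5", key="results", where="index > 100")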

Felix D.
Andy Hayden
5

DataFrame.to_msgpack is experimental and not without some issues (e.g., with Unicode), but it is much faster than pickling. It serialized a DataFrame with 5 million rows that was taking 2-3 GB of memory in about 2 seconds, and the resulting file was about 750 MB. Loading is somewhat slower, but still much faster than unpickling.
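For reference, the calls look like this (note that to_msgpack/read_msgpack were later deprecated and removed in pandas 1.0, so this only applies to older versions; the file name is arbitrary):

import pandas as pd

# Only available in older pandas releases (removed in 1.0).
df.to_msgpack("frame.msg")
df2 = pd.read_msgpack("frame.msg")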

Felix D.
Sergey Orshanskiy
1

Have you timed the available IO functions? Binary is not automatically faster, and HDF5 should be quite fast to my knowledge.
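A rough way to compare them yourself (just a sketch; the data and file names are made up, and to_hdf needs PyTables installed):

import time
import numpy as np
import pandas as pd

# A throwaway frame of random floats to benchmark with.
df = pd.DataFrame(np.random.randn(1000000, 4), columns=list("abcd"))

for name, write in [("pickle", lambda: df.to_pickle("df.pkl")),
                    ("hdf5", lambda: df.to_hdf("df.h5", key="df"))]:
    start = time.time()
    write()
    print(name, time.time() - start, "seconds")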

Achim