25

Is there a fast way to serialize a DataFrame?

I have a grid system which can run pandas analysis in parallel. In the end, I want to collect all the results (as a DataFrame) from each grid job and aggregate them into a giant DataFrame.

How can I save a DataFrame in a binary format that can be loaded rapidly?
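For context, the collection step I have in mind looks roughly like this (just a sketch; the file paths are placeholders for whatever each grid job writes out):

import glob
import pandas as pd

# Each grid job dumps its result to its own file; load them all and stack them.
pieces = [pd.read_pickle(path) for path in glob.glob("results/job_*.pkl")]
big_df = pd.concat(pieces, ignore_index=True)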

Andy Hayden
James Bond
  • See related question http://stackoverflow.com/questions/12772498/serialize-pandas-python-dataframe-to-binary-format – Mihai8 Jun 06 '13 at 20:44
  • Nice [blog post](http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) with timings/discussions of the different I/O options – lunguini Feb 25 '19 at 20:36

3 Answers

26

The easiest way is just to use to_pickle (to save it as a pickle); see pickling in the docs API page:

df.to_pickle(file_name)
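To load it back later, read_pickle is the counterpart in the pandas top-level namespace:

df = pd.read_pickle(file_name)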

Another option is to use HDF5 (built on PyTables). It is slightly more work to get started but much richer for querying.
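A minimal sketch of the HDF5 route (requires PyTables; the file and key names here are just examples):

import pandas as pd

# 'table' format is slower to write but supports on-disk queries.
df.to_hdf("store.h5", key="results", format="table")

# Read back, optionally selecting only the rows you need.
subset = pd.read_hdf("store.h5", key="results", where="index > 100")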

Felix D.
Andy Hayden
5

DataFrame.to_msgpack is experimental and not without some issues (e.g., with Unicode), but it is much faster than pickling. It serialized a DataFrame with 5 million rows that was taking 2-3 GB of memory in about 2 seconds, and the resulting file was about 750 MB. Loading is somewhat slower, but still much faster than unpickling.
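For reference, the calls look like this (note that to_msgpack/read_msgpack were later deprecated and removed in pandas 1.0, so this only applies to older versions; the file name is arbitrary):

import pandas as pd

# Only available in older pandas releases (removed in 1.0).
df.to_msgpack("frame.msg")
df2 = pd.read_msgpack("frame.msg")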

Felix D.
Sergey Orshanskiy
1

Have you timed the available IO functions? Binary is not automatically faster, and HDF5 should be quite fast to my knowledge.
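A rough way to compare them yourself (just a sketch; the data and file names are made up, and to_hdf needs PyTables installed):

import time
import numpy as np
import pandas as pd

# A throwaway frame of random floats to benchmark with.
df = pd.DataFrame(np.random.randn(1000000, 4), columns=list("abcd"))

for name, write in [("pickle", lambda: df.to_pickle("df.pkl")),
                    ("hdf5", lambda: df.to_hdf("df.h5", key="df"))]:
    start = time.time()
    write()
    print(name, time.time() - start, "seconds")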

Achim