
I want to save a pandas DataFrame to a file so that I can read it back from that file later. My requirements:

  • the file format should be decently portable (good library support on Windows/Linux in major languages)

  • the DataFrame I read should be absolutely identical to the one I saved

According to this post, read_csv and to_csv may work if I pass the index_col=0 argument, but the data types are lost. Automatic type inference isn't guaranteed to give me the same types even for simple types, let alone for Python objects such as lists, which are never inferred.
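To make the type loss concrete, here is a small sketch of a CSV round trip. The column names and values are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical sample data: strings that look like integers, and Python lists.
df = pd.DataFrame({
    "code": ["001", "002"],
    "items": [[1, 2], [3]],
})

# Round-trip through CSV using an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)

# Type inference turns "001" into the integer 1, and the lists come back as text.
code_changed = restored["code"].tolist() != df["code"].tolist()
items_are_lists = isinstance(restored["items"].iloc[0], list)
print(code_changed, items_are_lists)
```

Here the leading zeros are dropped because the column is inferred as integers, and each list is read back as its string representation.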

Is there some simple solution that just works for sure, without having to worry about many edge cases?

The only solution I can think of is to use to_csv / read_csv and save the type information somewhere else. Still, I'm afraid there might be more hidden problems (such as duplicate column names).
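A minimal sketch of that idea, assuming the type information is stored as JSON next to the CSV. The helper names `save_csv_with_dtypes` and `load_csv_with_dtypes` are hypothetical, not pandas functions. It only covers simple column types: duplicate column names would collide as dictionary keys, datetime columns would need `parse_dates` rather than `dtype`, and Python objects such as lists would still come back as strings.

```python
import io
import json

import pandas as pd


def save_csv_with_dtypes(df, csv_buf, meta_buf):
    """Write the frame as CSV and its column dtypes as a JSON mapping."""
    df.to_csv(csv_buf)
    json.dump({col: str(dtype) for col, dtype in df.dtypes.items()}, meta_buf)


def load_csv_with_dtypes(csv_buf, meta_buf):
    """Read the CSV back, forcing each column to its recorded dtype."""
    dtypes = json.load(meta_buf)
    return pd.read_csv(csv_buf, index_col=0, dtype=dtypes)


# Usage with in-memory buffers; real code would use two files on disk.
df = pd.DataFrame({"code": ["001", "002"], "n": [1, 2], "x": [1.0, 2.5]})
csv_buf, meta_buf = io.StringIO(), io.StringIO()
save_csv_with_dtypes(df, csv_buf, meta_buf)
csv_buf.seek(0)
meta_buf.seek(0)
restored = load_csv_with_dtypes(csv_buf, meta_buf)
```

With the recorded dtypes applied, the `code` column keeps its leading zeros instead of being inferred as integers.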

max
  • @tzaman I guess it's related, but that question is focused on speed, and the top/accepted answer is completely inappropriate in my case since I'm looking for portability. (pickle files can't be read outside of python, not easily). – max Aug 12 '16 at 22:09
  • That same answer also mentions `hdf5`. Does that not satisfy? – piRSquared Aug 12 '16 at 22:21
  • @piRSquared Yup, just checked and it works. (Apart from same-name columns, which are not allowed, but that's OK.) I didn't see any guarantee in the docs that HDF5 read/write are invertible, but I guess it just happens to be. – max Aug 12 '16 at 23:14
  • I use it regularly. It's very fast and portable. Only thing I can't verify is strong support from other languages. But I do see on wikipedia that it is supported widely. – piRSquared Aug 12 '16 at 23:15
  • @piRSquared yes, definitely perfect solution. – max Aug 12 '16 at 23:17
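For reference, the HDF5 route discussed in the comments might look like the sketch below. It assumes the optional PyTables package (`tables`) is installed, which pandas needs for HDF5 support. The helper name `roundtrip_hdf` and the default key `"df"` are made up for illustration.

```python
import pandas as pd


def roundtrip_hdf(df, path, key="df"):
    """Write df to an HDF5 file at path and read it back.

    Requires the optional PyTables dependency; the key names the
    dataset inside the file.
    """
    df.to_hdf(path, key=key, mode="w")
    return pd.read_hdf(path, key=key)
```

Calling `roundtrip_hdf(df, "frame.h5")` and comparing the result with `pandas.testing.assert_frame_equal` is one way to check the round trip on your own data, since, as noted above, the docs do not promise that it is lossless.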

1 Answer

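pd.DataFrame.to_pickle / pd.read_pickle preserve the columns' data types. Let's check it out:

import pandas as pd

# df_in is an existing DataFrame
df_in.to_pickle('input_5')
df_out = pd.read_pickle('input_5')  # read from the same path that was written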
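A self-contained version of that check, as a sketch: the sample frame is made up, and a temporary directory is used so nothing is left on disk. Note that pickle files are Python-specific, so this does not meet the portability requirement from the question.

```python
import os
import tempfile

import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical sample data, including a column of Python lists.
df_in = pd.DataFrame({
    "code": ["001", "002"],
    "n": [1, 2],
    "items": [[1, 2], [3]],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input_5")
    df_in.to_pickle(path)
    df_out = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, index or columns differ.
assert_frame_equal(df_in, df_out)
```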
ragesz