
I'm trying to merge very large CSV files using pandas and I keep running out of memory. Is there a more memory-efficient way to do this?

I've tried using Dask DataFrames instead of plain pandas, but I still ran into the same problem.

import dask.dataframe as dd

temp_df = dd.merge(gcs_df, dias_fio_df, how='inner', on=['charttime', 'subject_id'])

I get a MemoryError.

Ana B

1 Answer


Did you try running the merges separately? Can you do the first merge without a memory error? You can try splitting the cell into two cells and deleting the first two dataframes (gcs_e_df and gcs_m_df) from memory with del before doing the second merge.
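For example, a minimal sketch of that pattern: the CSV file names here are placeholders, and I'm assuming plain pandas DataFrames for brevity, but the same del idea applies if you stay with dask.

import gc
import pandas as pd

# First merge in its own cell/step (file names are placeholders).
gcs_e_df = pd.read_csv("gcs_eye.csv")
gcs_m_df = pd.read_csv("gcs_motor.csv")
gcs_df = pd.merge(gcs_e_df, gcs_m_df, how="inner", on=["charttime", "subject_id"])

# Drop the inputs before the next merge so their memory can be reclaimed.
del gcs_e_df, gcs_m_df
gc.collect()

# Second merge runs only after the intermediate inputs are gone.
dias_fio_df = pd.read_csv("dias_fio.csv")
temp_df = pd.merge(gcs_df, dias_fio_df, how="inner", on=["charttime", "subject_id"])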

Another thing that might be creating the issue is duplicate ['charttime', 'subject_id'] pairs in your tables: each duplicated key matches every corresponding row on the other side, so the merged dataframe can grow far beyond either input and blow up your memory.
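You can check the key columns before merging, for example (a sketch assuming pandas DataFrames and the column names from the question):

# Count rows that share the same merge key within each table;
# duplicates multiply the number of matched rows in the join.
key = ["charttime", "subject_id"]
print(gcs_df.duplicated(subset=key).sum())
print(dias_fio_df.duplicated(subset=key).sum())

# If the duplicates are not meaningful, drop them before merging.
gcs_df = gcs_df.drop_duplicates(subset=key)
dias_fio_df = dias_fio_df.drop_duplicates(subset=key)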

astiegler
  • Can you explain more about how to use del (maybe include an example)? – Ana B Jun 24 '19 at 14:45
  • See the bottom answer of this question for an example [https://stackoverflow.com/questions/32247643/how-to-delete-multiple-pandas-python-dataframes-from-memory-to-save-ram]. You just have to use the del operator on the dataframes and call the garbage collector – astiegler Jun 24 '19 at 16:19
  • Thanks for the example! So I use del on the previous dataframes I created to clear up space so the merge now works. However when I try to write this newly created/merged dataframe to a csv file, I get a memory error. Any tips here? – Ana B Jun 24 '19 at 18:32
  • I think you can try to reduce the chunksize for the to_csv – astiegler Jun 25 '19 at 13:39
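A minimal sketch of that last suggestion, assuming the merged result is a pandas DataFrame named temp_df and that output.csv is a placeholder path:

# Write the merged dataframe in smaller batches of rows so pandas
# does not build the entire CSV output in memory at once.
temp_df.to_csv("output.csv", index=False, chunksize=100_000)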