
I have a big Pandas DataFrame loaded into memory, and I'm trying to use memory more efficiently.

For this purpose, I won't need the full data frame after I subset only the rows I am interested in:

DF = pd.read_csv("Test.csv")
DF = DF[DF['A'] == 'Y']

I already tried the solution above, but I'm not sure it is the most memory-efficient approach. Is it? Please advise.

MaxU - stop genocide of UA
Felix

1 Answer


You can try the following trick (if you can read the whole CSV file into memory):

DF = pd.read_csv("Test.csv").query("A == 'Y'")

Alternatively, you can read your data in chunks using the chunksize parameter of read_csv().
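A minimal sketch of the chunked approach (the tiny sample CSV and chunksize value are just for the demo; in practice you'd use a chunk size in the tens or hundreds of thousands of rows):

```python
import pandas as pd

# Create a small sample CSV so the example is self-contained
# ("Test.csv" and column 'A' follow the question).
pd.DataFrame({'A': ['Y', 'N', 'Y', 'N'],
              'B': [1, 2, 3, 4]}).to_csv("Test.csv", index=False)

# Read the file in chunks and keep only the matching rows of each
# chunk, so the full file is never held in memory at once.
chunks = pd.read_csv("Test.csv", chunksize=2)
DF = pd.concat(chunk[chunk['A'] == 'Y'] for chunk in chunks)
```

This way peak memory is roughly one chunk plus the accumulated filtered rows, instead of the whole file.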

But I would strongly recommend saving your data in HDF5 Table format (you may also want to compress it); then you could read your data conditionally, using the where parameter of the read_hdf() function.

For example:

df = pd.read_hdf('/path/to/my_storage.h5', 'my_data', where="A == 'Y'")
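A sketch of the full round trip, assuming the PyTables package is installed (the file name, key, and sample data are placeholders):

```python
import pandas as pd

# Sample data standing in for the original CSV contents.
df = pd.DataFrame({'A': ['Y', 'N', 'Y'], 'B': [1, 2, 3]})

# Save in table format with 'A' as a queryable data column;
# complevel/complib enable on-disk compression.
df.to_hdf('my_storage.h5', key='my_data', format='table',
          data_columns=['A'], complevel=9, complib='blosc')

# Read back only the rows matching the condition - the filtering
# happens inside the store, not after loading everything.
subset = pd.read_hdf('my_storage.h5', 'my_data', where="A == 'Y'")
```

Note that where only works with format='table', and only columns listed in data_columns (or the index) can appear in the condition.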

Here you can find some examples and a comparison of different storage options.
