
I have a big Pandas DataFrame loaded into memory, and I'm trying to use memory more efficiently.

For this purpose, I won't need the full data frame after I subset only the rows I am interested in:

DF = pd.read_csv("Test.csv")
DF = DF[DF['A'] == 'Y']

I already tried the solution above, but I'm not sure it is the most memory-efficient approach. Is it? Please advise.

MaxU - stop genocide of UA
Felix

1 Answer


You can try the following trick (if you can read the whole CSV file into memory):

DF = pd.read_csv("Test.csv").query("A == 'Y'")

Alternatively, you can read your data in chunks using the chunksize parameter of read_csv().
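A minimal sketch of the chunked approach (the tiny sample CSV and chunksize value are just for the demo; in practice you'd use a chunk size in the tens or hundreds of thousands of rows):

```python
import pandas as pd

# Create a small sample CSV so the example is self-contained
# ("Test.csv" and column 'A' follow the question).
pd.DataFrame({'A': ['Y', 'N', 'Y', 'N'],
              'B': [1, 2, 3, 4]}).to_csv("Test.csv", index=False)

# Read the file in chunks and keep only the matching rows of each
# chunk, so the full file is never held in memory at once.
chunks = pd.read_csv("Test.csv", chunksize=2)
DF = pd.concat(chunk[chunk['A'] == 'Y'] for chunk in chunks)
```

This way peak memory is roughly one chunk plus the accumulated filtered rows, instead of the whole file.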

But I would strongly recommend saving your data in HDF5 Table format (you may also want to compress it); then you could read your data conditionally, using the where parameter of the read_hdf() function.

For example:

df = pd.read_hdf('/path/to/my_storage.h5', 'my_data', where="A == 'Y'")
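A sketch of the full round trip, assuming the PyTables package is installed (the file name, key, and sample data are placeholders):

```python
import pandas as pd

# Sample data standing in for the original CSV contents.
df = pd.DataFrame({'A': ['Y', 'N', 'Y'], 'B': [1, 2, 3]})

# Save in table format with 'A' as a queryable data column;
# complevel/complib enable on-disk compression.
df.to_hdf('my_storage.h5', key='my_data', format='table',
          data_columns=['A'], complevel=9, complib='blosc')

# Read back only the rows matching the condition - the filtering
# happens inside the store, not after loading everything.
subset = pd.read_hdf('my_storage.h5', 'my_data', where="A == 'Y'")
```

Note that where only works with format='table', and only columns listed in data_columns (or the index) can appear in the condition.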

Here you can find some examples and a comparison of different storage options.
