
I have an 80 GB HDF5 (`.h5`) file, and I want to read, say, a random set of 1000 columns, assuming I do not know the column names. How would I achieve this?

justanewb
  • `df.sample(n=1000,axis=1)`? – Anurag Dabas Jul 13 '21 at 03:17
  • @AnuragDabas That would require loading the entire dataframe first. – justanewb Jul 13 '21 at 03:19
  • I think this can help, so have a look at [read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame](https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame/22259008#22259008) – Anurag Dabas Jul 13 '21 at 03:26

1 Answer


You should first know the number of columns in your file. Let's assume 10000 here.
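If you don't know the column count (or the store's key) up front, you can list the keys and read a single row; a minimal sketch, where `'file.h5'` and the key `'df'` are placeholder names:

```python
import pandas as pd

# List the objects stored in the file without loading any data.
with pd.HDFStore('file.h5', mode='r') as store:
    print(store.keys())          # e.g. ['/df']

# Read only the first row to discover the column labels without
# loading the full 80 GB frame.
header = pd.read_hdf('file.h5', key='df', start=0, stop=1)
print(len(header.columns))       # number of columns
```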

You can then use a combination of `numpy.random` and the `columns` option of `pandas.read_hdf` (this assumes the columns are labeled `0` through `9999`; otherwise, sample from the actual labels discovered above):

```python
import numpy as np
import pandas as pd

# Sample 1000 distinct column labels out of 10000, sorted for readability.
cols = sorted(np.random.choice(10000, size=1000, replace=False))
df = pd.read_hdf('file.h5', key='df', columns=cols)
```
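Note that column selection only works for stores written in table format; the default fixed format must be read in its entirety, which is what the `TypeError` in the comments below is saying. A small self-contained sketch (toy sizes, placeholder names) of writing with `format='table'` and then selecting a random subset of columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame(np.random.rand(5, 100))
df.to_hdf('file.h5', key='df', format='table')   # 'table', not the default 'fixed'

# Select 10 random columns out of 100 without loading the rest.
cols = sorted(np.random.choice(100, size=10, replace=False))
sample = pd.read_hdf('file.h5', key='df', columns=cols)
print(sample.shape)  # (5, 10)
```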
mozway
  • The number of columns is 12,246; I found this out by trying to load the whole file and looking at the memory error. However, when I tried your solution I got the following error: `TypeError: cannot pass a column specification when reading a Fixed format store. this store must be selected in its entirety` – justanewb Jul 14 '21 at 03:56
  • Can you provide a dummy example? – mozway Jul 16 '21 at 12:10