
I have an 80 GB HDF5 (`.h5`) file, and I want to read, say, a random set of 1000 columns, assuming I do not know the column names. How would I achieve this?

justanewb
  • `df.sample(n=1000,axis=1)`? – Anurag Dabas Jul 13 '21 at 03:17
  • @AnuragDabas That would require loading the entire dataframe first. – justanewb Jul 13 '21 at 03:19
  • I think this can help, so have a look at [read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame](https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame/22259008#22259008) – Anurag Dabas Jul 13 '21 at 03:26

1 Answer


You should first know the number of columns in your file. Let's assume 10000 here.
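If you don't know the column count (or the store's key) up front, you can list the keys and read a single row; a minimal sketch, where `'file.h5'` and the key `'df'` are placeholder names:

```python
import pandas as pd

# List the objects stored in the file without loading any data.
with pd.HDFStore('file.h5', mode='r') as store:
    print(store.keys())          # e.g. ['/df']

# Read only the first row to discover the column labels without
# loading the full 80 GB frame.
header = pd.read_hdf('file.h5', key='df', start=0, stop=1)
print(len(header.columns))       # number of columns
```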

You can then use a combination of `numpy.random` and the `columns` option of `pandas.read_hdf` (this assumes the columns are labeled `0` through `9999`; otherwise, sample from the actual labels discovered above):

```python
import numpy as np
import pandas as pd

# Sample 1000 distinct column labels out of 10000, sorted for readability.
cols = sorted(np.random.choice(10000, size=1000, replace=False))
df = pd.read_hdf('file.h5', key='df', columns=cols)
```
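Note that column selection only works for stores written in table format; the default fixed format must be read in its entirety, which is what the `TypeError` in the comments below is saying. A small self-contained sketch (toy sizes, placeholder names) of writing with `format='table'` and then selecting a random subset of columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame(np.random.rand(5, 100))
df.to_hdf('file.h5', key='df', format='table')   # 'table', not the default 'fixed'

# Select 10 random columns out of 100 without loading the rest.
cols = sorted(np.random.choice(100, size=10, replace=False))
sample = pd.read_hdf('file.h5', key='df', columns=cols)
print(sample.shape)  # (5, 10)
```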
mozway
  • The number of columns is 12,246; I found this out by trying to load the whole file and looking at the memory error. However, when I tried your solution I got the following error: `TypeError: cannot pass a column specification when reading a Fixed format store. this store must be selected in its entirety` – justanewb Jul 14 '21 at 03:56
  • Can you provide a dummy example? – mozway Jul 16 '21 at 12:10