
I have a 200 million record file, which is being read with pandas read_csv using a chunksize of 10000. Each chunk's DataFrame is converted into a list, and that list is passed to a function.

import sys
import pandas as pd

file_name = str(sys.argv[2])
df = pd.read_csv(file_name, na_filter=False, chunksize=10000)
for data in df:
    d = data.values.tolist()
    load_data(d)

Is there any way the load_data function calls can be run in parallel, so that more than 10000 records can be passed to the function at the same time?

I tried the solutions mentioned in the questions below:

  1. Python iterating over a list in parallel?
  2. How to run functions in parallel?

But these don't work for me, because I need to convert each DataFrame into a list before calling the function.

Any help will be highly appreciated.


1 Answer


Yes, dask is very good at this.

Try

import dask.dataframe as dd

dx = dd.read_csv(file_name, na_filter=False)

# apply my_function row-wise; dask's DataFrame.apply only supports axis=1
ans_delayed = dx.apply(my_function, axis=1, meta='{the return type}')

ans = ans_delayed.compute()
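
For example, if my_function returned a single number per row, the meta argument could be given as a (name, dtype) tuple. A minimal sketch; the column name 'result' and the int64 dtype are assumptions about what my_function returns:

import dask.dataframe as dd

dx = dd.read_csv(file_name, na_filter=False)

# hypothetical meta: the result is a Series named 'result' holding int64 values
ans = dx.apply(my_function, axis=1, meta=('result', 'int64')).compute()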

If you really need the data as a list, you could try

import dask.bag as db
import pandas as pd

generator = pd.read_csv(file_name, na_filter=False, chunksize=10000)

# each chunk becomes one bag partition, so load_data runs on the chunks in parallel
ans = db.from_sequence(generator).map(
    lambda df: load_data(df.values.tolist())
).compute()
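
Since load_data presumably does CPU-bound work on each chunk, it can also help to be explicit about the scheduler and worker count when computing the bag. A minimal sketch, assuming load_data and its results can be pickled; num_workers=4 is just an example value:

import dask.bag as db
import pandas as pd

generator = pd.read_csv(file_name, na_filter=False, chunksize=10000)

bag = db.from_sequence(generator).map(lambda df: load_data(df.values.tolist()))

# request the process-based scheduler explicitly; num_workers=4 is only an example
ans = bag.compute(scheduler='processes', num_workers=4)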