5

I have a large file, almost 20 GB, with more than 20 million lines, and each line is a separate serialized JSON object.

Reading the file line by line in a regular loop and performing manipulations on each line's data takes a lot of time.

Is there a state-of-the-art approach or best practice for reading large files in parallel, in smaller chunks, in order to make processing faster?

I'm using Python 3.6.X

Nodirbek Shamsiev
  • No. Python cannot do real parallel threading. Read https://stackoverflow.com/questions/18114285/what-are-the-differences-between-the-threading-and-multiprocessing-modules. An alternative you may try is to divide your file into multiple files and run them with multiprocessing. – MT-FreeHK Jun 01 '18 at 04:31
  • Can you please provide more information? – ak_slick Jun 01 '18 at 05:01
  • The task is simple: just read the file, convert each line to JSON, and perform ML-related operations on it. – Nodirbek Shamsiev Jun 01 '18 at 05:15

2 Answers

3

Unfortunately, no. Reading in files and operating on the lines read (such as JSON parsing or computation) is a CPU-bound operation, so there are no clever asyncio tactics to speed it up. In theory you could utilize multiprocessing and multiple cores to read and process in parallel, but having multiple threads read the same file is bound to cause major problems. Because your file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.

Your best bet would be to head this problem off at the pass by partitioning the data (if possible) into multiple files, which could then open up safer doors to parallelism with multiple cores. Sorry there isn't a better answer AFAIK.
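
A minimal sketch of what that could look like, assuming the data has already been split into partition files (e.g. with `split -l 1000000 big.jsonl part_`); the partition names and `process_record()` are hypothetical stand-ins for your real files and per-record work:

```python
import json
from multiprocessing import Pool

def process_record(record):
    # Hypothetical stand-in for the real per-record computation.
    return len(record)

def process_partition(path):
    """Parse one partition file line by line and collect its results."""
    results = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            results.append(process_record(json.loads(line)))
    return results

if __name__ == "__main__":
    # Partition files created beforehand, e.g. with: split -l 1000000 big.jsonl part_
    partitions = ["part_aa", "part_ab", "part_ac"]   # placeholder names
    with Pool() as pool:
        per_file_results = pool.map(process_partition, partitions)
```

Each worker process opens its own file, so there is no contention on a shared file handle, and the per-file result lists can be merged at the end however your workload requires.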

BowlingHawk95
1

There are several possibilities, but first profile your code to find the bottlenecks. Maybe your processing does some slow things which could be sped up - which would be vastly preferable to multiprocessing. If that does not help, you could try:

  1. Use another file format. Reading serialized JSON from text is not the fastest operation in the world, so you could store your data in a binary format (for example HDF5), which could speed up processing. A one-time conversion sketch follows this list.

  2. Implement multiple worker processes which each read a portion of the file (worker 1 reads lines 0 to 1 million, worker 2 reads lines 1 million to 2 million, etc.). You can orchestrate that with joblib or Celery, depending on your needs. Integrating the results is the challenge; there you have to see what your needs are (map-reduce style?). Because Python has no real threading, this is more difficult than in other languages, so maybe you could switch languages for that part. A sketch of byte-range workers also follows this list.
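
For option 1, a minimal conversion sketch, assuming pandas with PyTables is installed; the file names "big.jsonl" and "big.h5" are placeholders:

```python
import pandas as pd

# Stream the JSON-lines file in chunks and append everything to one
# HDF5 table; later reads avoid re-parsing JSON text.
reader = pd.read_json("big.jsonl", lines=True, chunksize=1_000_000)
for chunk in reader:
    chunk.to_hdf("big.h5", key="data", format="table", append=True)
```

For option 2, a rough sketch of byte-range workers using only the standard library's multiprocessing module (the file name and `handle_line()` are placeholders, and joblib or Celery could replace the Pool):

```python
import json
import os
from multiprocessing import Pool

FILENAME = "big.jsonl"              # placeholder name for the 20 GB file
NUM_WORKERS = os.cpu_count() or 4

def handle_line(record):
    # Hypothetical stand-in for the real ML / processing step.
    return len(record)

def process_range(byte_range):
    """Process one byte range; each line is handled by exactly one worker."""
    start, end = byte_range
    results = []
    with open(FILENAME, "rb") as f:
        f.seek(start)
        if start != 0:
            f.readline()            # skip the partial line; the previous worker owns it
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            results.append(handle_line(json.loads(line.decode("utf-8"))))
    return results

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    step = size // NUM_WORKERS
    ranges = [(i * step, size if i == NUM_WORKERS - 1 else (i + 1) * step)
              for i in range(NUM_WORKERS)]
    with Pool(NUM_WORKERS) as pool:
        per_worker = pool.map(process_range, ranges)   # the "map" step
    # Combining per_worker (the "reduce" step) depends on your needs.
```
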

Christian Sauer