
I am trying to read a file using multiple threads. I want to divide the file into chunks so that each thread can act on its own chunk, which eliminates the need for a lock since the data is not shared among threads. How could I possibly do this slicing in memory using Python? To explain further -

I would need to read the file beforehand to count the number of lines so that I can decide the chunk size (say, chunk size = total number of lines / number of threads). In this case, as soon as the main process reads the first chunk, I would want the threads to start processing the lines in that chunk simultaneously.

Could someone provide an example?

psbits
  • If you want to gain a significant speed improvement on this in Python, you will probably need to use multiprocessing rather than multithreading. – Warren Dew Sep 18 '15 at 23:53
  • Could you please give some reasons why multiprocessing would be faster in this case? Is it because of thread context switch overhead? – psbits Sep 19 '15 at 00:00
  • If you are reading the file once, why not make life easy for yourself by passing off the chunks to the different threads instead of making them read the file again? – e4c5 Sep 19 '15 at 00:19
  • The interesting point is that you would then have to spawn that many threads. If a file has 1000 lines and you fix the chunk size at 20 lines, you would have to launch 50 threads. I want to fix the number of threads at, say, 5 and let those 5 threads act on those 1000 lines. And I am not making them read the file again. The main process would store the lines in memory (say in an array) and let each thread take a slice of that array. Once one of the threads exits, I can spawn another thread and let it take another chunk of data. Some sort of a thread pool, perhaps (a minimal sketch of this approach follows these comments). – psbits Sep 19 '15 at 00:25
  • Possible duplicate of [How efficient is threading in Python?](http://stackoverflow.com/questions/5128072/how-efficient-is-threading-in-python) – Warren Dew Sep 19 '15 at 00:25
  • The issue is that, due to the global interpreter lock, Python, or at least CPython, cannot actually have multiple threads access memory simultaneously. See the possible duplicate for details. Multiprocessing removes that constraint, since different processes don't share the same memory space, so their memory space locks don't interfere. – Warren Dew Sep 19 '15 at 00:27
  • Thanks Warren. That's a perfect explanation. But I am just curious: if I go ahead with the approach I mentioned in my previous comment, how can I go about implementing it? As you can see, I am interested in making the threads process mutually independent sets of data in a file. That means there is no shared section among the threads. – psbits Sep 19 '15 at 00:27
  • So let the chunk size be number-of-lines / number-of-processes, suitably rounded and possibly fewer lines for the highest-numbered process. The processes receive discrete slices of the file, and can churn away in parallel. This works! I've done it several times. – BrianO Sep 19 '15 at 00:44
  • Hey Brian. Could you possibly provide some code snippets to do that? Were there any issues you had to take care of specifically? I am a little new to this world of multiprocessing with Python. Thanks! – psbits Sep 19 '15 at 00:49
  • @psbits: It'd be hard for me to tease out the relevant parts from my own code, and most instances are being run by some other party. However, here is a very good article which contains example code and benchmarks: [Python - parallelizing CPU-bound tasks with multiprocessing (Eli Bendersky, 2012)](http://eli.thegreenplace.net/2012/01/16/python-parallelizing-cpu-bound-tasks-with-multiprocessing/). That should fill in the blanks for you. – BrianO Sep 19 '15 at 17:17
  • Thanks Brian. That was really helpful. Implemented it. – psbits Sep 21 '15 at 22:08
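The approach discussed in the comments above (chunk size = number of lines / number of workers, each worker taking one disjoint slice) might look roughly like the following minimal sketch. The file name input.txt, the worker count of 5, and the process_chunk function are all hypothetical placeholders, and multiprocessing is used rather than threads, per the comments above.

import multiprocessing

def process_chunk(lines):
    # Hypothetical placeholder: do whatever per-line work you need
    # and return one result for the whole chunk.
    return len(lines)

if __name__ == "__main__":
    num_workers = 5                      # assumed fixed worker count
    with open("input.txt") as f:         # assumed input file
        lines = f.readlines()
    # Round up so the last chunk picks up any leftover lines.
    chunk_size = -(-len(lines) // num_workers) or 1
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with multiprocessing.Pool(num_workers) as pool:
        results = pool.map(process_chunk, chunks)
    print(results)

Each worker receives a disjoint slice of the lines, so no locking is needed; the trade-off is that the main process reads the whole file into memory first.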

1 Answer


There is no way to count the lines in a file without reading it (you could mmap it to allow the virtual memory subsystem to page out data under memory pressure, but you still have to read the whole file in to find the newlines). If chunks are defined as lines, you're stuck; the file must be read in one way or another to do it.
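As a rough illustration of that point, here is a minimal sketch of counting lines through an mmap; the count_lines name and binary-read mode are assumptions for this example, mapping an empty file raises ValueError (ignored here), and the whole file is still scanned once:

import mmap

def count_lines(path):
    # The mapping lets the virtual memory subsystem page data in and out
    # under memory pressure, but every byte is still read once to find
    # the newlines.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(1 for _ in iter(mm.readline, b""))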

If chunks can be fixed size blocks of bytes (which may begin and end in the middle of a line), it's easier, but you need to clarify.
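If that byte-block route is acceptable, one way it might look is sketched below; the find_chunk_boundaries helper is hypothetical, and each interior boundary is nudged forward to the next newline so no line is split between two chunks (which slightly relaxes the fixed-size property):

import os

def find_chunk_boundaries(path, num_chunks):
    # Hypothetical helper: split the file into roughly equal byte ranges,
    # then move each interior boundary to the start of the next line.
    size = os.path.getsize(path)
    step = size // num_chunks or 1
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            f.seek(i * step)
            f.readline()  # skip the rest of the current (possibly partial) line
            boundaries.append(min(f.tell(), size))
    boundaries.append(size)
    # Each (start, end) pair can be handed to a worker, which seeks to
    # start and processes end - start bytes.
    return list(zip(boundaries, boundaries[1:]))

Workers can then open the file independently, seek() to their own start offset, and read only their own range, so nothing needs to be shared or locked.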

Alternatively, if neighboring lines aren't important to one another, then instead of chunking you can distribute the lines round-robin or use a producer/consumer approach (where workers pull new data as it becomes available, rather than having it handed out by fiat), so the work is naturally distributed evenly.

multiprocessing.Pool (or multiprocessing.dummy.Pool if you must use threads instead of processes) makes this easy. For example:

import multiprocessing

def somefunctionthatprocessesaline(line):
    ...  # do stuff with line
    return result_of_processing

with multiprocessing.Pool() as pool, open(filename) as f:
    results = pool.map(somefunctionthatprocessesaline, f)
# ... do stuff with results ...

will create a pool of worker processes matching the number of cores you have available, and have the main process feed queues that each worker pulls lines from for processing, returning the results in a list for the main process to use. If you want to process the results from the workers as they become available (instead of waiting for all results to appear in a list like Pool.map does), you can use Pool.imap or Pool.imap_unordered (depending on whether the results of processing each line should be handled in the same order the lines appear) like so:

with multiprocessing.Pool() as pool, open(filename) as f:
    for result in pool.imap_unordered(somefunctionthatprocessesaline, f):
        ...  # do stuff with one result
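As a concrete (purely hypothetical) stand-in for somefunctionthatprocessesaline, the sketch below counts words per line and sums the counts; the file name input.txt and the chunksize value are assumptions, not part of the answer above:

import multiprocessing

def count_words(line):
    # Hypothetical per-line worker: return the number of words on the line.
    return len(line.split())

if __name__ == "__main__":
    with multiprocessing.Pool() as pool, open("input.txt") as f:
        # A larger chunksize batches lines per task, cutting inter-process overhead.
        total = sum(pool.imap_unordered(count_words, f, chunksize=100))
    print("total words:", total)

Because the per-line results are simply summed, the out-of-order delivery of imap_unordered doesn't matter here.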
ShadowRanger