I have recently found myself in the following situation:

I would like to optimize (speed up) a given Python script by means of a multiprocessing strategy. The script should handle one of the following two situations:

  1. receive a FileHandler (in reading mode), read data from it for processing, and then return the results to the main process;

  2. receive a FileHandler (in writing mode) and use it to write the results to disk.

Since both cases above involve a FileHandler, my question can be generalized to the following issue:

Since a FileHandler is, per se, a single object belonging to the main process, how can it be used directly under a multiprocessing approach? When one passes an iterable to a pool of processes, each process can independently operate on each item of the iterable. A simple example of this case is presented below:

import multiprocessing

def foo(x):
    return x**2


if "__main__" = __name__:
    with multiprocessing.Pool() as pool:
    
        ListOfItems = range(6)

        Results = pool.map(foo, ListOfItems)

    print(Results)

# >> [0, 1, 4, 9, 16, 25]

Nevertheless, when the FileHandler is passed to the pool, it is my understanding that the main process first has to read the data and then pass that data to the pool; only then can the processes from the pool operate on each of the provided (independent) entries of the FileHandler. A second code snippet representing this idea is presented below:

import multiprocessing

def foo2(line):
    # ... do stuff with line ...
    result_of_processing = line.strip()  # placeholder operation
    return result_of_processing

if "__main__" = __name__:
    with multiprocessing.Pool() as pool:

        with open(filename) as f:
            Results2 = pool.map(foo2, f)

Finally, there is the case that considers a FileHandler in writing mode. What would be the best approach for passing the same FileHandler to a pool of concurrent processes? Since only one process can access the disk at a time (for writing data, in this case), even if one uses a Semaphore or a Lock to control disk access, the shared-object-between-concurrent-processes situation remains, and the shared object here would be the FileHandler. With this in mind, how should one structure one's script so as to manage that shared object? Should one try to implement a Queue with the FileHandler on it? Or should one use a multiprocessing.Manager for this same purpose, so that the Manager carries the FileHandler to each of the processes of the Pool?
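
To make the Queue idea more concrete, the structure I can imagine is sketched below: a single dedicated process owns the (writing) FileHandler and drains a queue, while the pool workers only put their results on that queue. This is just a minimal sketch with a plain text file; init_worker, process_line, writer_process, input.txt and results.txt are placeholder names of mine:

import multiprocessing

def init_worker(q):
    # expose the queue as a global inside each pool worker
    global result_queue
    result_queue = q

def process_line(line):
    result_queue.put(line.upper())  # placeholder processing step

def writer_process(q, path):
    # the only process that ever holds the writing FileHandler;
    # results are written in whatever order they arrive
    with open(path, "w") as f:
        while True:
            item = q.get()
            if item is None:  # sentinel: no more results will arrive
                break
            f.write(item)

if __name__ == "__main__":
    q = multiprocessing.Queue()
    writer = multiprocessing.Process(target=writer_process, args=(q, "results.txt"))
    writer.start()

    with multiprocessing.Pool(initializer=init_worker, initargs=(q,)) as pool:
        with open("input.txt") as f:
            pool.map(process_line, f)
        pool.close()
        pool.join()  # let the workers flush their queued items first

    q.put(None)  # tell the writer process to finish
    writer.join()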

I would like to point out that this question arose from several different references, some of which are listed below:

  1. opening file using multiprocessing
  2. Virtual files handling
  3. shared objects by means of Queues
  4. Shared pandas objects between processes

Finally, I would like to propose two code snippets (codes A and B) to serve as a guide for this whole discussion. Code A reads geometric data by means of the Fiona package, and code B writes geometric data to disk. Both codes (A and B) are meant to be developed under a multiprocessing approach.

Code A

import multiprocessing
import fiona

def foo3(item):
    fid, feature = item  # f.items() yields (id, feature) pairs
    # ... do stuff with the feature ...
    result_of_processing = fid  # placeholder result
    return result_of_processing

if "__main__" = __name__:
    with multiprocessing.Pool() as pool:

        with fiona.open('data.shp') as f:
            bbox = (-5.0, 55.0, 0.0, 60.0)
            Results2 = pool.map(foo3, f.items(bbox=bbox))

Code B

from collections import OrderedDict
import multiprocessing
import fiona

def foo4(feature, writer):
    writer.write(feature)
    return None

if "__main__" = __name__:
    with multiprocessing.Pool() as pool:
    schema_props = OrderedDict([("name", "str")])

        with fiona.open('data.shp', 'w',

            driver="ESRI Shapefile", schema={"geometry": "Polygon", "properties": schema_props}) as FileHandler:

            N  = 30 # number of features to be stored in the file
            Features = []
            for i in range(1, N+1):
                Geom = (-5*i, 5, 0.0, 6*i)
                Feat  = {"geometry": {"type": "Polygon", "coordinates": Geom},
"properties": OrderedDict([("name", "{0}".format(i)])}
                Features.append( [
    Feat, FileHandler])
            
        Results2 = pool.starmap(foo4, Features)
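
For comparison, the inverted structure I can also imagine is sketched below: the FileHandler is never shared at all, the pool workers only build the features, and the main process, as the sole owner of the FileHandler, writes them as they arrive. Again, this is just a sketch; the build_feature helper, out.shp and the triangle geometry are placeholders of mine:

from collections import OrderedDict
import multiprocessing
import fiona

def build_feature(i):
    # workers build features but never touch the FileHandler
    ring = [(-5.0 * i, 5.0), (0.0, 6.0 * i), (-5.0 * i, 6.0 * i), (-5.0 * i, 5.0)]
    return {"geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": OrderedDict([("name", str(i))])}

if __name__ == "__main__":
    schema = {"geometry": "Polygon", "properties": OrderedDict([("name", "str")])}
    with multiprocessing.Pool() as pool:
        with fiona.open("out.shp", "w", driver="ESRI Shapefile",
                        schema=schema) as FileHandler:
            # only the main process writes; the workers just return the features
            for Feat in pool.imap(build_feature, range(1, 31)):
                FileHandler.write(Feat)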
