4

I have a Python script that has to call a certain app 3 times. These calls should run in parallel, since they take hours to complete and aren't dependent on each other. But the script should halt until all of them are complete and then do some cleanup work.

Here is some code:

import subprocess

# do some stuff

for work in worklist:   # these should run in parallel
    output = open('test.txt', 'w')
    subprocess.call(work, stdout=output, stderr=output)
    output.close()

# wait for subprocesses to finish

# cleanup

So I basically want to run this command in parallel while capturing its output to a file. Once all instances are done, I want to continue the script.

  • related: [Python: running subprocess in parallel](http://stackoverflow.com/q/9743838/4279), [Python threading multiple bash subprocesses](http://stackoverflow.com/q/14533458/4279), [Python: running subprocess in parallel](http://stackoverflow.com/q/16450788/4279). – jfs May 23 '14 at 22:11

2 Answers

9

subprocess.call() is blocking. That means each call must wait for the child process to finish before continuing.

What you want instead is to pass your arguments to the subprocess.Popen constructor. That way, each child process is started without blocking.

Later on, you can join these child processes together by calling Popen.communicate() or Popen.wait().

import io
import subprocess

child_processes = []
for work, filename in worklist:
    with io.open(filename, mode='wb') as out:
        p = subprocess.Popen(work, stdout=out, stderr=out)
        child_processes.append(p)    # start this one, and immediately return to start another

# now you can join them together
for cp in child_processes:
    cp.wait()                        # this will block on each child process until it exits

P.S. Have you looked into Python's documentation on the subprocess module?

Santa
  • You run the risk that later processes will stall if their stdout/stderr pipes fill before the for loop gets around to calling communicate(). One easy solution is to pipe stdout/stderr to temporary files. – tdelaney May 22 '14 at 23:31
  • Can I directly link stdout and stderr to a file handle, i.e. stdout=filehandle? Also, shouldn't it be out, err = cp.communicate()? –  May 22 '14 at 23:40
  • They can indeed be set to existing file handles. – Santa May 22 '14 at 23:45
  • if I route them directly to a text file how would I wait for the processes to finish? –  May 22 '14 at 23:47
  • @prgmjunkie, yes. `stdout=open('my-process-out.txt', 'w')` works. – tdelaney May 22 '14 at 23:48
  • @prgmjunkie You can call `p.wait()` to join them without any pipe communication. – Santa May 22 '14 at 23:54
  • I have a question then, though: how would I go about using the same handle for stderr? And won't I be unable to close the file this way? Or will it auto-close once it finishes? –  May 23 '14 at 00:03
  • @prgmjunkie I updated my answer to use an open file handle (the `with` statement will auto-close it). – Santa May 23 '14 at 00:10
  • Thank you very much for your help. The only problem is that this will send the output from all commands into a single text file. My worklist is actually a list of dicts with the work command and the corresponding filename as key-value pairs. –  May 23 '14 at 00:13
  • @prgmjunkie You can easily modify my sample code to do just that. Just open a new file handle for each iteration of the loop (see the sketch after these comments). – Santa May 23 '14 at 00:19
  • Won't there be a clash between file handles, since they have the same name? –  May 23 '14 at 00:21
  • Then don't give them the same names. – Santa May 23 '14 at 00:22
  • @prgmjunkie There. Did that for you, too. – Santa May 23 '14 at 00:27
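
Pulling the comment thread together, here is a minimal, self-contained sketch of the per-file variant. The worklist contents, the 'cmd'/'outfile' keys, and the long_running_app command are placeholders, since the question only says the real worklist is a list of dicts mapping each command to its output filename.

import subprocess

# Hypothetical worklist: each dict maps a command to its own output file,
# mirroring the list-of-dicts structure described in the comments above.
worklist = [
    {'cmd': ['long_running_app', '--job', '1'], 'outfile': 'job1.txt'},
    {'cmd': ['long_running_app', '--job', '2'], 'outfile': 'job2.txt'},
    {'cmd': ['long_running_app', '--job', '3'], 'outfile': 'job3.txt'},
]

child_processes = []
for work in worklist:
    # Each child writes to its own file, so the outputs never clash.
    # Closing our handle right away is fine: the child keeps its own
    # copy of the file descriptor and can continue writing to it.
    with open(work['outfile'], 'wb') as out:
        p = subprocess.Popen(work['cmd'], stdout=out, stderr=out)
        child_processes.append(p)

# block until every child has exited, then continue with cleanup
for p in child_processes:
    p.wait()

# cleanup goes here

Because stdout and stderr go straight to files rather than pipes, there is also no risk of the pipe buffers filling up while the loop is still starting the other children.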
2

I like to use GNU Parallel (http://www.gnu.org/software/parallel/) in situations like this (requires *nix). It provides a quick way to get parallelism and has many options, including reorganizing the output at the end so that each process's output flows together in order rather than being interleaved. You can also specify how many commands to run at once, either as a specific number or matching the number of cores you have, and it will queue up the rest of the commands.

Just use subprocess.check_output with shell=True to call out to parallel using your command string. If you've got a variable you want to interpolate, say a list of SQL tables you want to run your command against, parallel is good at handling that as well -- you can pipe in the contents of a text file with the arguments.

If the commands are all totally different (as opposed to being variations on the same command), put the complete commands in the text file that you pipe into parallel.

You also don't need to do anything special to wait for them to finish, as the check_output call will block until the parallel command has finished.
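
As a rough sketch of that approach: the file name commands.txt, the -j 4 job limit, and the --keep-order flag below are just example choices, assuming GNU parallel is installed and each line of the file holds one complete shell command.

import subprocess

# parallel reads one complete command per line from stdin and runs up to
# 4 of them at a time; --keep-order groups each command's output so it is
# not interleaved. check_output blocks until every command has finished.
with open('commands.txt', 'rb') as commands:
    output = subprocess.check_output(
        'parallel --keep-order -j 4',
        shell=True,
        stdin=commands,
    )

# all commands are done here, so cleanup can follow
print(output.decode())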

khampson
  • `shell=True` is unsafe in almost any context. – jbarlow Dec 17 '15 at 22:21
  • There are [potential issues](https://docs.python.org/2/library/subprocess.html#frequently-used-arguments), but there are certainly cases where it is fine, e.g. when the input is *not* coming from arbitrary sources on the external web, etc. – khampson Dec 17 '15 at 22:40