
This is a follow-up question to @Omry Yadan's Dec/9/2018 answer at How do subprocess.Popen pipes work in Python?. I need to create a three-program pipeline and collect the stderr and return code of all three programs. My current solution (below), based on that Dec/9/2018 answer, hangs, while the equivalent command pasted into the shell finishes quickly.

The amount of data being piped from stdout to stdin is in the megabyte range. The final stdout and the three stderrs are expected to be much smaller.

#!/usr/bin/env python3
from subprocess import Popen, PIPE

cmd1 = ["gzip", "-dc", "some_file.gz"]
cmd2 = ["filtering_program", "some", "arguments"]
cmd3 = ["collection_program", "some", "arguments"]

p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE)
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE, stderr=PIPE)
p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE, stderr=PIPE)
(outcome_stdout, outcome_stderr) = p3.communicate()
p1.wait()
p2.wait()

1 Answer


Note the following documented issues. Also note that the pipe buffer either grows to (macOS), or is limited to (modern Linux), 65536 bytes by default.

Popen.wait

Note This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

Popen.stderr

Warning Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process.
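
In practice this means the parent process has to drain every pipe it holds concurrently, not one after another. Here is a minimal sketch of that pattern for your three-stage pipeline (untested; it reuses the placeholder commands from your question):

#!/usr/bin/env python3
from subprocess import Popen, PIPE
from threading import Thread

cmd1 = ["gzip", "-dc", "some_file.gz"]
cmd2 = ["filtering_program", "some", "arguments"]
cmd3 = ["collection_program", "some", "arguments"]

p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE)
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE, stderr=PIPE)
p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE, stderr=PIPE)

# Close the parent's copies of the inter-process pipes so that a
# downstream stage exiting early delivers SIGPIPE upstream (this
# mirrors the pipeline example in the subprocess docs).
p1.stdout.close()
p2.stdout.close()

# Drain the upstream stderrs in background threads. A sequential
# p1.stderr.read() after communicate() can deadlock: if p1 fills its
# stderr pipe buffer it blocks, the pipeline stalls, and p3 never exits.
stderrs = {}

def drain(name, stream):
    stderrs[name] = stream.read()

threads = [Thread(target=drain, args=("p1", p1.stderr)),
           Thread(target=drain, args=("p2", p2.stderr))]
for t in threads:
    t.start()

# communicate() drains p3's stdout and stderr and waits for p3 to exit.
outcome_stdout, outcome_stderr = p3.communicate()

for t in threads:
    t.join()

# All pipes are drained at this point, so wait() cannot block on I/O.
rc1, rc2, rc3 = p1.wait(), p2.wait(), p3.returncode

This only restructures your own code; whether threads, asyncio, or a series of run() calls is the better fit depends on the surrounding program.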

gdahlm
  • Thanks. I did see that note about potential deadlock, but it is not clear to me HOW to use Popen.communicate() to avoid deadlock. Should I be calling p1.communicate() and p2.communicate()? And in what order relative to the p*.wait() calls? – Rotwurg Nwossle Jul 10 '21 at 18:47
  • Are you aware of any better reference for subprocess than https://docs.python.org/3/library/subprocess.html? What I am trying to do seems like a natural use of the package and is trivial to run in the shell, so it ought to have a simple Python solution. Yet I can't figure out from those docs how to accomplish it. – Rotwurg Nwossle Jul 10 '21 at 18:52
  • It depends on what you are trying to accomplish, since your preferred style of development and your reasons for adding concurrency will change the answer. I would be tempted to just use run() in series first, unless you have a reason to add complexity. The issues with deadlocks and concurrency, especially when using the lower-level Popen(), are far more complex than I can answer in a few hundred chars. Here is a video that will help you understand the pitfalls: https://youtu.be/Bv25Dwe84g0 Personally, I would use _async for loops_, because deadlocks are hard and the task is not truly parallel. – gdahlm Jul 10 '21 at 19:40
  • Thanks for that link. Indeed, I do plan to add concurrency, but I was hoping to get this working without it, since I expected a non-concurrent version would be easier to debug. I am using run() for other commands in my script, but it deadlocks as soon as I try piping two commands together; adding shell=True didn't help much. I believe the issue is the amount of data going from one process to the other. I hadn't considered using a series of single-command run()s (sketched below) — I presume the piped data would then be held completely in Python objects. That may be the 'best' solution. Thanks! – Rotwurg Nwossle Jul 10 '21 at 21:44
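
For reference, the "series of single-command run()s" idea from the last comment could look like the sketch below (again untested, Python 3.7+ for capture_output, same placeholder commands). Each intermediate stream is held entirely in a Python bytes object, which the question says is only megabytes, and every stage's stderr and return code is captured:

#!/usr/bin/env python3
from subprocess import run

cmd1 = ["gzip", "-dc", "some_file.gz"]
cmd2 = ["filtering_program", "some", "arguments"]
cmd3 = ["collection_program", "some", "arguments"]

# Run the stages sequentially; each stage's captured stdout becomes the
# next stage's stdin. No pipes stay open across stages, so nothing can
# deadlock on a full pipe buffer.
r1 = run(cmd1, capture_output=True)
r2 = run(cmd2, input=r1.stdout, capture_output=True)
r3 = run(cmd3, input=r2.stdout, capture_output=True)

# Every stage's return code and stderr is now a plain attribute.
for r in (r1, r2, r3):
    print(r.args[0], r.returncode, len(r.stderr))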