
Even though I try to keep them as simple as possible, the pipelines for some of my data science projects get rather complex. At some point it becomes necessary to document a pipeline so that someone can return to the project, easily understand the various scripts, data sources, and outputs, and then update or modify it.

My question is: how do people document complex pipelines? For pipelines of medium complexity, I have taken to drawing out the components of the pipeline with:

https://docs.google.com/drawings

Any suggestions for (or links to) best practices for documentation would be most appreciated.

captain_ahab

1 Answer


My 2p, since this question has resurfaced to the top after a few years with no answer...

In my opinion, it pays off to invest in a workflow manager. I'm very happy with snakemake; it has been a game-changer for me after I spent quite some time hacking together README files, bash scripts, and diagrams to keep track of complex pipelines.

Snakemake and similar tools are not easy to learn and, depending on the complexity of your workflow, not easy to read either. However, they help make the pipeline reproducible, and because you code the input/output dependencies of each step, you don't need to rely on separately documenting what each step takes as input and produces as output. Besides, snakemake also has options for reporting and for printing the graph of job dependencies. I think that manually drawing the graph is a lot of work, error-prone, and difficult to keep in sync with the pipeline.
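To illustrate why coding the input/output dependencies can replace a hand-drawn diagram, here is a minimal sketch in plain Python. It is not snakemake and does not use its API; it only mimics the idea. The step names and file names (`download`, `raw.csv`, and so on) are made up for the example, and it assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step declares the files it reads and writes.
# All step and file names here are invented for illustration.
STEPS = {
    "download": {"inputs": [], "outputs": ["raw.csv"]},
    "clean": {"inputs": ["raw.csv"], "outputs": ["clean.csv"]},
    "features": {"inputs": ["clean.csv"], "outputs": ["features.csv"]},
    "train": {"inputs": ["features.csv"], "outputs": ["model.pkl"]},
    "report": {"inputs": ["clean.csv", "model.pkl"], "outputs": ["report.html"]},
}


def build_graph(steps):
    """Map each step to the set of steps that produce its input files."""
    producer = {
        out: name for name, spec in steps.items() for out in spec["outputs"]
    }
    return {
        name: {producer[f] for f in spec["inputs"] if f in producer}
        for name, spec in steps.items()
    }


def execution_order(steps):
    """Return one valid order in which the steps can be run."""
    return list(TopologicalSorter(build_graph(steps)).static_order())


def to_dot(steps):
    """Render the dependency graph as Graphviz DOT text."""
    lines = ["digraph pipeline {"]
    for name, deps in sorted(build_graph(steps).items()):
        lines.append(f'    "{name}";')
        for dep in sorted(deps):
            lines.append(f'    "{dep}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(execution_order(STEPS))
    print(to_dot(STEPS))
```

The diagram is derived from the same declarations that drive execution, so it cannot drift out of sync with the pipeline. A workflow manager does this for you; with snakemake, for example, something like `snakemake --dag | dot -Tsvg > dag.svg` renders the job graph via Graphviz.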

dariober