Best practices for documenting a data-science pipeline

Question

Even though I try to keep it as simple as possible, the pipelines for some of my data science projects get rather complex. At some point it becomes necessary to document this pipeline so that someone can return to the project, easily understand the various scripts and data-sources/outputs, and then update/modify it.

My question is: how do people document complex pipelines? For pipelines of medium complexity, I have taken to drawing out the components of the pipeline with:

https://docs.google.com/drawings

Any suggestions for (or links to) best practices for documentation would be most appreciated.

I wonder if this question might be better suited for the Data Science Stack Exchange? — Silverfish, Jan 31 '15 at 02:03
You might want to investigate drawing tools which can be programmed for larger scale pipelines. I tend to use Mathematica, but I'd guess there would be a suitable Python library if that fits with your skill set. — image_doctor, Jan 31 '15 at 10:47
This is definitely an opinion-based question. But for what it is worth, I am finding that kedro provides a nice framework which has some self-documentation but also provides ready-to-use Sphinx boilerplate. — Galen, Aug 11 '23 at 03:11

dariober · Answer 1 · 2023-03-07T09:18:44.830

My 2p, since this question has resurfaced to the top after a few years with no answer...

In my opinion, it pays off to invest in a workflow manager. I'm very happy with snakemake and it's been a game-changer after having spent quite some time hacking together README files, bash scripts, and diagrams to keep track of complex pipelines.

Snakemake and similar tools are not easy to learn and, depending on the complexity of your workflow, not easy to read either. However, they guarantee reproducibility, and because you code the input/output dependencies, you don't need to separately document what each step takes as input and produces as output. Besides, snakemake also has options for reporting and for printing the graph of job dependencies. I think that manually drawing the graph is a lot of work, error-prone, and difficult to keep in sync with the pipeline.
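
To illustrate the idea of coding input/output dependencies instead of documenting them by hand, here is a minimal sketch in plain Python. It is not snakemake's API or syntax; the step names, file names, and the helper functions (`step_dependencies`, `execution_order`, `to_dot`) are all hypothetical. It only shows how a dependency graph can be derived from declared inputs and outputs rather than drawn manually.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step declares the files it reads and writes.
STEPS = {
    "clean": {"inputs": ["raw.csv"], "outputs": ["clean.csv"]},
    "features": {"inputs": ["clean.csv"], "outputs": ["features.csv"]},
    "train": {"inputs": ["features.csv"], "outputs": ["model.pkl"]},
    "report": {"inputs": ["model.pkl", "clean.csv"], "outputs": ["report.html"]},
}


def step_dependencies(steps):
    """Map each step to the set of steps that produce one of its inputs."""
    producer = {
        out: name for name, spec in steps.items() for out in spec["outputs"]
    }
    # Inputs with no producer (e.g. raw.csv) are external data sources.
    return {
        name: {producer[f] for f in spec["inputs"] if f in producer}
        for name, spec in steps.items()
    }


def execution_order(steps):
    """Return the steps in an order that respects the dependencies."""
    return list(TopologicalSorter(step_dependencies(steps)).static_order())


def to_dot(steps):
    """Render the step graph as Graphviz DOT text."""
    lines = ["digraph pipeline {"]
    for name, deps in sorted(step_dependencies(steps).items()):
        for dep in sorted(deps):
            lines.append(f'    "{dep}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(execution_order(STEPS))
    print(to_dot(STEPS))
```

The DOT text can be rendered with Graphviz, so the diagram is regenerated from the declarations and cannot drift out of sync with them. A workflow manager does this and much more (running the jobs, checking which outputs are out of date), which is why I'd rather use one than maintain such code myself.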

Thank you for sharing snakemake, I didn't know about it (+1). I have been having success with Kedro. — Galen, Dec 09 '23 at 05:28
