using python to write bioinformatics pipelines tutorial

Question

I was wondering if there is a tutorial or a small code snippet to understand how to write bioinformatics pipeline using python, for example

use a aligner (say hisat)
get the output and process it using samtools

I was able to use subprocess from python2.7 for this purpose using samtools but i am not able to link both the processes.i.e given path(which I can use argparse) for directory with fastq files output would be processed bam.

sample code for samtools sam to bam :

import subprocess
subprocess.run(['samtools','view', '-bS',
                '../some.sam',
                '>',
                '../some.bam'])

If you ever need more elaborate pipelines in python, then I would strongly encourage you to look into snakemake, which is pretty commonly used for this purpose. — Devon Ryan, Jul 06 '17 at 06:53
I agree that snakemake is worth learning. If you want to use subprocess, I found the following tutorial just yesterday, which looked quite clear: http://crashcourse.housegordon.org/python-subprocess.html — bli, Jul 06 '17 at 11:58
@DevonRyan I looked snakemake it is available only in python3 is that correct? — novicebioinforesearcher, Jul 06 '17 at 12:28
@novicebioinforesearcher Note that if you need to program in python2, you can still do it in external scripts and call them from snakemake like you would do for any other program. — bli, Jul 06 '17 at 15:02
@bli how can I do that, sorry I am still learning stuff here could you share an example — novicebioinforesearcher, Jul 06 '17 at 15:31
@novicebioinforesearcher These comments are not adapted to give examples. What I mean is that you will still need python3 for snakemake, but if there are things you can only do in python2, you can write whatever computation you want to do in python2 in the form of an independent program (presumably accepting command-line arguments). Then, you will call this program from within snakemake (or whatever workflow management tool you happen to use, or a python script using subprocess) exactly like you would do for samtools, hisat or any other program. — bli, Jul 06 '17 at 16:41
I'm curious as to why you are averse to writing your pipeline in bash? @novicebioinforesearcher — quantik, Jul 06 '17 at 18:28
@quantik because I am learning to write codes in python and also because (i may be wrong here) in case you have a greater number of samples bash would not be a good way to work(foreg 40 samples to be processed) i can use os.walk etc in python to submit jobs correct me if i am wrong — novicebioinforesearcher, Jul 06 '17 at 18:44

score 13 · Answer 1 · answered Jul 06 '17 at 09:12

Taking a different tack from other answers, there's lots of tools for pipelines in Python. Note: there was a time when people would use "pipeline" to refer to a shell script. I'm talking about something more sophisticated that helps you decompose an analysis into parts and runs it robustly.

Snakemake is my favourite. It's (nearly) pure Python and can generate reports.
Nextflow is growing in popularity and is pretty straightforward
Ruffus used to be reasonably popular, seemed fine when I used it
Bpipe is for bioinformatics
Airflow is a more industrial "big" solution but in wide use
See this big list of pipeline systems

I'm sure you can find something there.

Bioathlete · Answer 2 · 2017-07-06T02:35:12.113

BioPython has some good tools for processing reads and alignments. http://biopython.org/DIST/docs/tutorial/Tutorial.html

There is a python library wrapping samtools so many of the samtools calls can be used directly as python objects and calls https://pysam.readthedocs.io/en/latest/

I would use subprocess to call the aligner and specify the output to a bam file that you have named and then read that bam file with pysam to do the analysis that you are interested in.

If you provide more specifics about the analysis or aligners I can help you more but you are in the right track just need to connect the two calls together.

example:

import subprocess
import pysam
import os 

for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, ".fastq"):

        subprocess.run([`hisat`, `-f`, filename `-o`, `some.bam`])

        #get depth
        depth = pysam.depth('some.bam')

edited code snippet based on OP comments about executing on multiple fastq files — Bioathlete, Jul 06 '17 at 02:36

gringer · Answer 3 · 2017-07-06T02:51:18.897

4

If you're just doing alignment and conversion to a sorted BAM file, there's no need to run it through python. A simple pipe on the unix command line works just as well (and probably runs faster):

hisat2 -x genome.index -1 reads_R1.fq.gz -2 reads_R2.fq.gz | 
  samtools sort > reads_vs_genome.bam

edited Jul 06 '17 at 02:51

answered Jul 06 '17 at 00:25

gringer

14,012
5
23
79

thank you @gringer, I am familiar with unix pipe. I am trying to learn to create a pipeline where when I give path of a directory which has fastqs I intend to submit .sh files to cluster and get the desired results . for example python my_pipeline.py -dir /pathtofastq would submit jobs to cluster here i will use bunch of argparse and subprocess and predefinec bsub commands. – novicebioinforesearcher Jul 06 '17 at 01:38
1

Look into python's os.path.walk. This will allow you to take in the dir and then iterate over all the fastq files in the directory. http://www.pythonforbeginners.com/fnmatch/os-walk-and-fnmatch-in-python – Bioathlete Jul 06 '17 at 02:31

Iakov Davydov · Accepted Answer · 2017-07-06T11:16:07.240

I agree that using a specialized tool is probably a good idea.

Nevertheless if you want to stick with Python, I suggest using plumbum instead of subprocess. It has a very nice syntax for this kind of problems.

from plumbum import local

# load a command
samtools = local['samtools']

# you can then easily redirect
(samtools['view', '-bS', '../some.sam'] > '../some.bam')()

# it is also easy to pipe 
hisat2 = local['hisat2']
chain = hisat2['-x', 'genome.index', '-1', 'reads_R1.fq.gz', '-2', 'reads_R2.fq.gz'] | samtools['sort'] > 'reads_vs_genome.bam'

chain()

score 0 · Answer 5 · answered Jul 06 '17 at 09:00

0

Have a look at luigi for writing pipelines in python. You can found BioLuigi if you like :-P

answered Jul 06 '17 at 09:00

Dan

612
3
12

could you please share link for BioLuigi I could not find it online – novicebioinforesearcher Jul 06 '17 at 12:27
2

You need to start it ;-) – Dan Jul 07 '17 at 08:46
haha good one!! – novicebioinforesearcher Jul 07 '17 at 13:58

using python to write bioinformatics pipelines tutorial

5 Answers5