I have a pipeline for generating a BigWig file from a BAM file:
BAM -> BedGraph -> BigWig
Which uses bedtools genomecov for the BAM -> BedGraph part and bedGraphToBigWig for the BedGraph -> BigWig part.
The use of bedGraphToBigWig to create the BigWig file requires a BedGraph file to reside on disk in uncompressed form as it performs seeks. This is problematic for large genomes and variable coverage BAM files when there are more step changes/lines in the BedGraph file. My BedGraph files are in the order of 50 Gbytes in size and all that IO for 10-20 BAM files seems unnecessary.
Are there any tools capable generate BigWig without having to use an uncompressed BedGraph file on disk? I'd like this conversion to hapen as quickly as possible.
I have tried the following tools, but they still create/use a BedGraph intermediary file:
deepTools
Some Benchmarks, Ignoring IO
Here are some timings I get for creating a BigWig file from a BAM file using 3 different pipelines. All files reside on a tmpfs i.e. in memory.
BEDTools and Kent Utils
This is the approach taken by most.
time $(bedtools genomecov -bg -ibam test.bam -split -scale 1.0 > test.bedgraph \
&& bedGraphToBigWig test.bedgraph test.fasta.chrom.sizes kent.bw \
&& rm test.bedgraph)
real 1m20.015s
user 0m56.608s
sys 0m27.271s
SAMtools and Kent Utils
Replacing bedtools genomecov with samtools depth and a custom awk script (depth2bedgraph.awk) to output bedgraph format has a significant performance improvement:
time $(samtools depth -Q 1 --reference test.fasta test.bam \
| mawk -f depth2bedgraph.awk \
> test.bedgraph \
&& bedGraphToBigWig test.bedgraph test.fasta.chrom.sizes kent.bw \
&& rm test.bedgraph)
real 0m28.765s
user 0m44.999s
sys 0m1.166s
Although it is has less features, we used mawk here as it's faster than gawk (we don't need those extra features here).
Parallelising with xargs
If you want to have a BigWig file per chromosome, you can easily parallelise this across chromosome/reference sequences. We can use xargs to run 5 parallel BAM->BedGraph->BigWig pipelines, each using the tmpfs mounted /dev/shm for the intermediary BedGraph files.
cut -f1 test.fasta.chrom.sizes \
| xargs -I{} -P 5 bash -c 'mkdir /dev/shm/${1} \
&& samtools depth -Q 1 --reference test.fasta -r "${1}" test.bam \
| mawk -f scripts/depth2bedgraph.awk \
> "/dev/shm/${1}/test.bam.bedgraph" \
&& mkdir "./${1}" \
&& bedGraphToBigWig \
"/dev/shm/${1}/test.bam.bedgraph" \
test.fasta.chrom.sizes \
"./${1}/test.bam.bw" \
&& rm "/dev/shm/${1}/test.bam.bedgraph"' -- {}
deepTools
Let's see how deepTools performs.
time bamCoverage --numberOfProcessors max \
--minMappingQuality 1 \
--bam test.bam --binSize 1 --skipNonCoveredRegions \
--outFileName deeptools.bw
real 0m40.077s
user 3m56.032s
sys 0m9.276s
<(), FIFO buffers,/dev/stdin). – gringer Jun 07 '17 at 10:25