Suppose I have a BAM file indicating where reads in a library have mapped, and a bed file describing a set of genomic regions.
Is there a way to easily get the size distribution of the reads mapping on this set of regions?
Suppose I have a BAM file indicating where reads in a library have mapped, and a bed file describing a set of genomic regions.
Is there a way to easily get the size distribution of the reads mapping on this set of regions?
This can be done rather simply by combining this samtools based answer and bioawk:
samtools view -L regions.bed aligned_reads.bam \
| bioawk -c sam '{hist[length($seq)]++} END {for (l in hist) print l, hist[l]}' \
| sort -n -k1
The samtools command extracts the alignements in the region given with the -L option, in sam format. The -c sam option of bioawk makes the $seq variable (corresponding to the read sequence field) available in the awk command.
The final sort ensures the size classes are ordered numerically.
We obtain a two-column size count table.
Examples with timings:
$ time samtools view -L /Genomes/C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/protein_coding.bed results_Ribo-seq_test_1/bowtie2/mapped_C_elegans/WT_1/18-30_on_C_elegans_sorted.bam | bioawk -c sam '{histo[length($seq)]++} END {for (l in histo) print l, histo[l]}' | sort -n -k1
18 201
19 560
20 1017
21 2568
22 5554
23 11221
24 27637
25 38797
26 17042
27 9853
28 15181
29 5888
30 528
real 0m0.592s
user 0m0.816s
sys 0m0.036s
Since sam format is actually simple to parse, one can try to use mawk instead for slightly higher speed (the sequence being in the 10th field):
samtools view -L regions.bed aligned_reads.bam \
| mawk '{hist[length($10)]++} END {for (l in hist) print l"\t"hist[l]}' \
| sort -n -k1
(Contrary to bioawk, the default output field separator is not a tabulation.)
Example:
$ time samtools view -L /Genomes/C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/protein_coding.bed results_Ribo-seq_test_1/bowtie2/mapped_C_elegans/WT_1/18-30_on_C_elegans_sorted.bam | mawk '{histo[length($10)]++} END {for (l in histo) print l"\t"histo[l]}' | sort -n -k1
18 201
19 560
20 1017
21 2568
22 5554
23 11221
24 27637
25 38797
26 17042
27 9853
28 15181
29 5888
30 528
real 0m0.362s
user 0m0.500s
sys 0m0.004s
Via BEDOPS bedmap and bam2bed, along with the hist functionality in bashplotlib
$ bedmap --echo-map-size --multidelim '\n' intervals.bed <(bam2bed < reads.bam) | hist
77| o
73| o
69| o
65| o
61| o
57| o
53| o
49| o
45| o
41| o
37| o
33| o
29| o
25| o
21| o
17| oo
13| ooo
9| ooo
5| ooo
1| o o oooo
-----------
------------------------------
| Summary |
------------------------------
| observations: 118 |
| min value: 16.000000 |
| mean : 34.415254 |
| max value: 36.000000 |
------------------------------