How can I easily get the read size distribution of reads mapping on a certain set of regions?

Question

Suppose I have a BAM file indicating where reads in a library have mapped, and a bed file describing a set of genomic regions.

Is there a way to easily get the size distribution of the reads mapping on this set of regions?

bli · Accepted Answer · 2018-03-21T16:24:05.467

This can be done rather simply by combining this samtools based answer and bioawk:

samtools view -L regions.bed aligned_reads.bam \
    | bioawk -c sam '{hist[length($seq)]++} END {for (l in hist) print l, hist[l]}' \
    | sort -n -k1

The samtools command extracts the alignements in the region given with the -L option, in sam format. The -c sam option of bioawk makes the $seq variable (corresponding to the read sequence field) available in the awk command.

The final sort ensures the size classes are ordered numerically.

We obtain a two-column size count table.

Examples with timings:

$ time samtools view -L /Genomes/C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/protein_coding.bed results_Ribo-seq_test_1/bowtie2/mapped_C_elegans/WT_1/18-30_on_C_elegans_sorted.bam | bioawk -c sam '{histo[length($seq)]++} END {for (l in histo) print l, histo[l]}' | sort -n -k1
18  201
19  560
20  1017
21  2568
22  5554
23  11221
24  27637
25  38797
26  17042
27  9853
28  15181
29  5888
30  528

real    0m0.592s
user    0m0.816s
sys     0m0.036s

Since sam format is actually simple to parse, one can try to use mawk instead for slightly higher speed (the sequence being in the 10th field):

samtools view -L regions.bed aligned_reads.bam \
    | mawk '{hist[length($10)]++} END {for (l in hist) print l"\t"hist[l]}' \
    | sort -n -k1

(Contrary to bioawk, the default output field separator is not a tabulation.)

Example:

$ time samtools view -L /Genomes/C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/protein_coding.bed results_Ribo-seq_test_1/bowtie2/mapped_C_elegans/WT_1/18-30_on_C_elegans_sorted.bam | mawk '{histo[length($10)]++} END {for (l in histo) print l"\t"histo[l]}' | sort -n -k1
18  201
19  560
20  1017
21  2568
22  5554
23  11221
24  27637
25  38797
26  17042
27  9853
28  15181
29  5888
30  528

real    0m0.362s
user    0m0.500s
sys     0m0.004s

score 1 · Answer 2 · answered Mar 27 '18 at 22:23

Via BEDOPS bedmap and bam2bed, along with the hist functionality in bashplotlib

$ bedmap --echo-map-size --multidelim '\n' intervals.bed <(bam2bed < reads.bam) | hist

 77|           o
 73|           o
 69|           o
 65|           o
 61|           o
 57|           o
 53|           o
 49|           o
 45|           o
 41|           o
 37|           o
 33|           o
 29|           o
 25|           o
 21|           o
 17|          oo
 13|         ooo
  9|         ooo
  5|         ooo
  1| o    o oooo
    -----------

------------------------------
|          Summary           |
------------------------------
|     observations: 118      |
|    min value: 16.000000    |
|      mean : 34.415254      |
|    max value: 36.000000    |
------------------------------

How can I easily get the read size distribution of reads mapping on a certain set of regions?

2 Answers2