From a single record, I can get the reference sequence ID from the field rID, but how can I get the total number of different reference sequences stored in the whole SAM file? It can be simple as just looping through all records, but I need some other solution.
2 Answers
There are multiple ways. Here are two:
samtools idxstats ‹bamfile›samtools view -H ‹bamfile› | grep '^@SQ'
Both commands give you a list of the references in the BAM file. To get their number, simply append | wc -l. But note that idxstats also outputs a final entry, *, which doesn’t correspond to any reference but to all unmapped reads. So subtract that from the number.
However, beware that while this lists all references associated with a BAM file, not every reference necessarily has records associated with it. So there might be no reads mapping to a given reference. This is the third column in the idxstats output. So, to get only those references with at least one mapped read, we can filter for those:
samtools idxstats ‹bamfile› | awk '$4 != 0' | head -n-1 | wc -l
Oh, and these (and other) commands work on BAM files, not SAM. The idxstats command additionally needs the BAM index, since it’s actually reading information from that index file, not the BAM (this also means that it only works on sorted BAM files). Working on SAM would require looping through the whole file, which is quite inefficient. You should generally be working with BAM rather than SAM, so this is hopefully not a restriction.
To do the same on a SAM file, you need to iterate over the file:
awk '{print $3}' ‹samfile› | sort -u | wc -l
Or:
awk '{refs[$3] = 1} END {for (ref in refs) print ref}' ‹samfile› | wc -l
- 4,845
- 14
- 45
A SAM file stores alignments to a set of reference sequences. However, those reference sequences are not usually stored in the file. Instead, they will be in a separate sequence file that is usually in FASTA format. You can get the number of reference sequences stored in that file using the command
grep -c '^>' reference.fasta
- 2,266
- 11
- 28
rIDare you talking aboutRNAME(reference sequence name) field? That's in the header. – Devon Ryan May 14 '18 at 14:42