4

I have a long list of read IDs of interest to me in a file called read_names.txt. it is simply in the format:

m54197_180831_211346/4981510/ccs
m54197_180831_211346/6226723/ccs
m54197_180831_211346/6619607/ccs
...

etc where these are the actual read names from a fastq file. I am then trying to find those in a bam file for which the fastq has been mapped, and I can accomplish this like:

for ID in `cat read_names.txt`
do
samtools view inbam.bam | grep $ID >> read_locs.sam
done

However, this method is obscenely slow because it is rerunning samtools view for every ID iteration (several hours now for 600 read IDs), and I was hoping to do this for several read_names.txt files.

I tried sort of flipping the script a bit and running samtools view first but it only returned the first read ID present in the file and stopped:

samtools view inbam.bam | for ID in `cat read_names.txt`; do grep $ID >> read_locs.txt; done

Am I missing something or is this the only way to accomplish this task?

d_kennetz
  • 631
  • 5
  • 17

2 Answers2

9

It is still slow but grep has a -f option to take in a file

samtools view inbam.bam | grep -f read_names.txt > read_locs.txt

Bioathlete
  • 2,574
  • 12
  • 29
  • This took 30 seconds as opposed to several hours, so I'd say the overhead was definitely in samtools view. I feel silly because of how simple this was. Thank you! – d_kennetz Dec 06 '18 at 20:18
  • just as a warning this can take much longer if the read name file is big. – Bioathlete Dec 06 '18 at 20:24
  • I appreciate the heads up! Comparatively, I do not think it will ever take longer than the method I was using so it is still a great answer. (This is serving as a caveat to your warning). – d_kennetz Dec 06 '18 at 20:26
2

samtools can do this natively too using -N, --qname-file CLI option:

samtools view -N read_names.txt in.bam > read_locs.sam
❯ samtools --version
samtools 1.16.1
Using htslib 1.16
bricoletc
  • 161
  • 3