5

I am used to gzip/biopython solutions when dealing with sequencing data, but now I wish to switch to more elegant pysam. So I looked at the manual, but ran into quite bizarre troubles with the first couple of lines using my bam file

import pysam
samfile = pysam.AlignmentFile("3_Tms_1_mapped.bam", "rb")
for read in samfile.fetch('3_Tms_b3v08_scaf000159'):
    print(read)
samfile.close()

returns ValueError: fetching by region is not available for SAM files. Well, the file is bam. I tried to google the error but the only hits I found are the lines in the source code of pysam that check if the file is bam/cram or sam, so somehow pysam thinks that my bam is a sam. How can I convince it otherwise? I have also noticed that the manual is for python 2.7, that's maybe where the problem comes from...

Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59

2 Answers2

5

That isn't actually a bam file as John Marshall figured out. I am keeping the rest of my answer since it could be useful to someone else, but the issue here was that you had a compressed (bgzipped) sam file and not an actual bam file and that's why you were getting that error. When I sorted your file in preparation for indexing it, I converted to a bam which is why the rest of my answer worked.


You don't have the index file for the bam file you're using. I used this script on the file you linked to (changing the name so that it corresponds to the right file and a contig in that file):

#!/usr/bin/env python3

import pysam
samfile = pysam.AlignmentFile("3_Tce_1_mapped.bam", "rb")
for read in samfile.fetch('3_Tce_b3v08_scaf005149'):
    print(read)
samfile.close()

The directory I ran it in had:

$ ls 3*
3_Tce_1_mapped.bam

And I got the error you described:

$ foo.py
Traceback (most recent call last):
  File "/home/terdon/scripts/foo.py", line 5, in <module>
    for read in samfile.fetch('3_Tce_b3v08_scaf005149'):
  File "pysam/libcalignmentfile.pyx", line 1107, in pysam.libcalignmentfile.AlignmentFile.fetch
ValueError: fetching by region is not available for SAM files

However, indexing the bam file fixed it:

$ samtools sort 3_Tce_1_mapped.bam > 3_Tce_1_mapped.sorted.bam 
$ mv 3_Tce_1_mapped.sorted.bam 3_Tce_1_mapped.bam
$ samtools index 3_Tce_1_mapped.bam 
$ ls 3*
3_Tce_1_mapped.bam  3_Tce_1_mapped.bam.bai

$ foo.py | wc
227    2724   16725

So just sort and index your files before attempting to seek in them. Which makes sense since the index's job is primarily to allow seeking.

terdon
  • 10,071
  • 5
  • 22
  • 48
5

Your 3_Tms_1_mapped.bam file, despite its filename extension, is in fact a bgzipped SAM file. You can verify this using htsfile, which is a small utility packaged with HTSlib:

$ htsfile 3_Tms_1_mapped.bam 
3_Tms_1_mapped.bam: SAM version 1.3 BGZF-compressed sequence data

(For files that really are in BAM format, it reports BAM version 1 compressed sequence data.)

So the error message is accurate in this case.

John Marshall
  • 1,006
  • 8
  • 12
  • Hmm, does it mean that this link is wrong? http://biopython.org/DIST/docs/api/Bio.bgzf-module.html I sort of thought that it's the same thing... – Kamil S Jaron Apr 10 '19 at 21:52
  • 1
    No. (Was there a particular part of that you think might be wrong?) But there is a difference between BGZF-compressing the plain text format, and a BAM file (whose decompressed underlying stream is a bespoke binary format that is different from the plain SAM text). – John Marshall Apr 10 '19 at 21:58
  • I think I get it now. I got confused by the aims to this module, they write "In addition to being required for random access to and writing of BAM files", so I got an impression that I can simply use this module to write BAM files. This was actually really helpful answer, thanks! – Kamil S Jaron Apr 11 '19 at 07:05
  • 1
    @KamilSJaron you might want to consider accepting this answer instead of mine since this was the real issue. – terdon Apr 11 '19 at 08:10