4

I am working with short-read whole-genome sequences from the NCBI's SRA. I have aligned and sorted all of my short-read sequences and am attempting to index each sequence into .bai format using samtools index, but am running into a couple of errors.

I unpacked the original .sra files in the following manner:

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR6509138.sra

I then aligned each fastq paired-end read, removed duplicate reads, and converted each to bam format like so:

bwa mem xiphophorus_birchmanni_10x_12Sep2018_yDAA6.fasta SRR6509136_1.fastq SRR6509136_2.fastq | samblaster -e -r| samtools view -Sb - > blasted_SRR6509136.bam

Some of my files did not come as paired-end reads. For those files, I used the --ignoreUnmated flag on samblaster.

I then sorted each bam file like so:

samtools sort blasted_SRR6649368.bam -o sorted_SRR6649368.bam -n

I am attempting to obtain a bai file for each bam using the following command:

samtools index sorted_SRR6649368.bam sorted_SRR6649368.bam.bai

This is the error I run into for the unpaired reads:

[E::hts_idx_push] Chromosome blocks not continuous
[E::sam_index] Read 'HWI-ST387:164:D0CJWACXX:4:1101:1129:13264' with ref_name='ScdB1pO_646;HRSCAF=880', ref_length=33092979, flags=16, pos=23260108 cannot be indexed
samtools index: failed to create index for "sorted_SRR791885.bam": No such file or directory

This is the error I run into for the paired reads:

[E::hts_idx_push] Unsorted positions on sequence #79: 11159020 followed by 11158717
samtools index: failed to create index for "sorted_SRR6649368.bam"

Can anyone help me figure out why this is happening?

annabelperry
  • 199
  • 1
  • 9
  • Are there some instructions you are following that tell you to use these commands? Where has the idea that you need to use -n on the samtools sort command here come from? – John Marshall Jul 25 '20 at 10:07

1 Answers1

4
samtools sort blasted_SRR6649368.bam -o sorted_SRR6649368.bam -n

These error messages indicate that the reads are not sorted by coordinate — in particular, that the reads mapped to ScdB1pO_646;HRSCAF=880 are not all together, and that the reads at positions 11159020 < 11158717 are not sorted by position on their chromosome.

This is because samtools sort -n has been used to sort the reads by name instead. Remove -n to sort by position, which is what is needed to prepare a BAM file for indexing with samtools index.

John Marshall
  • 1,006
  • 8
  • 12
  • Thank you, it worked! As for why I used the -n flag, I am trying replicate the results of a journal article. The journal article did not provide explicit instructions for sorting/indexing the files, and this is my first time sorting/indexing files. I think "sort by reads" simply sounded like the correct thing to do, but it looks like I was incorrect – annabelperry Jul 27 '20 at 16:17