Is there a more elegant solution to the bwa-mem: paired reads have different names error?

Question

I'm currently trying to run bwa-mem on Influenza substrains using the following command:

~/bwa mem h5n1_1_cons.fa h5n1_1_read1.fq h5n1_1_read2.fq

h5n1_1_cons.fa is the consensus sequence for substrain h5n1_1, and the fq files are paired end reads 1 and 2. Running the program gives me this error:

[mem_sam_pe] paired reads have different names: "JQ714243-920-1", "JQ714243-920-2"

Visual inspection shows that the read names are identical until the last character. I have found a package (bbmap) with a utility, bbrename.sh, that renames the sequences, producing:

./bbrename.sh in=h5n1_1_read1.fq out=fixed prefix=h5n1_1

which yields the following output

-bash-4.2$ head fixh5n1_1_read1.fq
@h5n1_0
gaaaatgtctcaggagttgatgcatggtgtttaatctttggtttgcgctg
+
@4>C5?7FA?9@@@;1,-4B+E??4E9,DA>6CB)*2.)?7/.'65+75/
@h5n1_1
catcagttggattttgcctaaggatgtccaccatccttatcccaccaatt
+
7(9?)=C3AEF+@@F=C-E>EE57CD:;'?.*?7;>:3>2.;29..)=2'
@h5n1_2
catgcatttgttggaataatgttctcacaaatcaactgtattgaccgctg

This fixes the problem and allows BWA-MEM to work. However, header information is discarded. Is there another way to ensure the header names are the same while retaining all header information?

Please [edit] your question and show us a few examples of the read names in your original input files. The output of awk 'NR % 4 == 1{print $1}' h5n1_1_read1.fq | head and awk 'NR % 4 == 1{print $1}' h5n1_1_read2.fq | head for example. — terdon, Nov 29 '17 at 11:08
Since you mention bbmap already, why not use that for mapping your reads altogether? It is after all called bbmap — holmrenser, Nov 30 '17 at 07:47

Devon Ryan · Accepted Answer · 2017-11-29T12:22:52.360

7

Assuming the read name lines only contain things like @JQ714243-920-1 and nothing else, then sed -i '1~4 s/-[12]$//' file.fastq (N.B., example updated a bit for safety, thanks to @terdon) will remove the -1 and -2 suffixes.

However, you should try to figure out why the read names were munged into that form to begin with (at least Illumina machines won't produce that, I don't know about others).

edited Nov 29 '17 at 12:22

answered Nov 29 '17 at 11:15

Devon Ryan

19,602
2
29
60

You could also replace the the second - with a space. Most mapping tools only regard the part of the id before the first space as the read name. – Ian Sudbery Nov 30 '17 at 10:57

Is there a more elegant solution to the bwa-mem: paired reads have different names error?

1 Answers1