I have three sequencing libraries of single individual mapped to a reference using bwa-mem. I would like to merge the three unsorted .sam files I have so, I can call variants and heterozygosity estimates using atlas. Atlas requires one input mapping file (bam) with defined read groups because it estimates the error profiles of different libraries separately.
How can I merge multiple sam files? Preferably avoiding java (Picard tools).
I tried to figure out a solution using samtools 1.3. I sorted individual files using samtools sort, then I used samtools merge -r merged.bam s1.sort.sam s2.sort.sam s3.sort.sam to merge the sorted files. However, the read group didn't make it to the header (and the variant caller I use is complaining about it), also the read group is stupidly the file name.
I tried to define meaningful readgroup names using procedure described on BioStars, but I found that this will just change the header, it does not adjust the names of read groups defined by samtools merge (the file names).
Following this related thread on SeqAnswers, I tried to define the correct header with read groups corresponding to merged file names:
samtools -rh rg.txt merged.bam s1.sort.sam s2.sort.sam s3.sort.sam
where rg.txt is a file with header
@RG s1.sort
@RG s2.sort
@RG s3.sort
...
output of samtools view -H s1.sort
However, the header still had not the read group, I guess because sam header accepts only tagged items to be specified (something like @RG XY:s1.sort). So, I looked into the merged bam and I found that the tag of RG is Z:. So I tried to just rename header of the merged file using samtools reheader, but then samtools complain about the fact that a tag needs to be of length 2:
Malformed key:value pair at line 123: "@RG Z:s1.sort"
Segmentation fault (core dumped)
I have opened an issue to report this strange incompatibility of read groups generated by samtools merge with samtools reheader.
I would like solution to :
- create a bit more standardized read group names (
SM:Sample\tLB:libraryformat) - avoid pointless writing to disk like in sam -> sorted.sam -> merged.bam case (can be probably achieved through "pipes and tees", thanks @bli)
I also know that I can specify RG to bwa, so the sam files will have read groups defined in the first place. But I do not like the idea of remapping three libraries just to create correct formatting of read groups.
awk. I have to take a look! – Kamil S Jaron Jun 27 '17 at 08:47