I would like to perform a samtools mpileup from a single file that contains thousands of read groups with different SM tags. I could split the bam by read group using samtools split, and then perform the mpileup, but splitting into thousands of files and then performing an mpileup from thousands of files is crazy slow on my machine.
I could imagine an implementation using pysam, but I was hoping someone has a solution already.
Bcftools mpileup already has this feature, but the feature was never implemented for the original mpileup format, which is what I need.
An example
Contents of input.sam:
@HD VN:1.6 SO:unknown
@SQ SN:chr1 LN:1000000
@RG ID:0 SM:sample_0
@RG ID:1 SM:sample_1
r0 0 chr1 24 0 1M * 0 0 G I RG:Z:0
r1 0 chr1 24 0 1M * 0 0 G I RG:Z:1
When I run samtools mpileup I get this behavior:
> samtools mpileup input.sam
[mpileup] 2 samples in 1 input files
chr1 24 N 2 ^!G$^!G$ II
Whereas, what I actually want is this output:
[mpileup] 2 samples in 1 input files
chr1 24 N 1 ^!G$ I 1 ^!G$ I
For a long time I thought this was a bug in samtools mpileup, but it appears that it is a feature that was described/hinted at in the samtools documentation, but was never implemented: https://github.com/samtools/samtools/issues/599#issuecomment-604941884