Highly heterozygous reads mapping

Question

I have short (67bp) Hi-C reads from a highly heterozygous organism (~15% between-haplotype SNP divergence) and I have both reference haplotypes.

I wanted to try and compare different haplotyping softwares for Hi-C reads using these reads as benchmarking dataset. When mapping the reads separately to each haplotype, I get good mapping statistics. When I map the reads to a single haplotype with all the heterozygous SNPs masked (into N's), I get very poor mapping rates.

I would like to be able to map the reads when the real haplotypes are unknown (reference is a mix of haplotypes).

I use minimap2 to map the reads with the sr preset. I tried decreasing the mismatch penalty (-B) to 1 and increasing the --score-N value, but this had no effect.

As shown in the attached IGV screenshot, coverage drops to 0 when the SNP density increases. Is it feasible to map reads with such a high heterozygosity on a single (masked) reference ? Should I use another tool ?

Could you describe your reads a bit more? DNA/mRNA? SE/PE? 50bp/100bp/etc? If I were to make a guess, I'd say you're having issues with properly seeding your reads for alignment. Also, it might be helpful to know why you need to map masked reads when mapping separately works fine. — LucasBoatwright, Feb 13 '19 at 15:16
Thanks for the feedback ! I updated the question to add more background and infos. — cmdoret, Feb 19 '19 at 09:16
Just to confirm it is heterozygous and not highly repetitive? — M__, Feb 19 '19 at 10:10
With this levels of heterozygosity you can just assemble everything and get the haplotypes separated. I am not sure if this is an optimal benchmarking dataset. — Kamil S Jaron, Feb 20 '19 at 10:50
You're right, I'll be using another dataset for this, but I was still curious about the feasibility :) — cmdoret, Feb 21 '19 at 11:31

score 4 · Accepted Answer · answered Feb 19 '19 at 14:15

I believe you may be able to map your reads, but I don't know how to do that with minimap2.

I recommend running gsnap, which is more SNP tolerant and provides a number of parameters which are likely to help.

For instance, I believe that most aligners will treat 'N' characters as mismatches when you align. GSNAP has parameters to account for this.

--query-unk-mismatch=INT       Whether to count unknown (N) characters in the query as a mismatch
                               (0=no (default), 1=yes)
--genome-unk-mismatch=INT      Whether to count unknown (N) characters in the genome as a mismatch
                               (0=no, 1=yes (default))

It also has a mismatch parameter similar to the one you described for minimap2.

-m, --max-mismatches=FLOAT     Maximum number of mismatches allowed

Try running with the 'genome-unk-mismatch' parameter above (with the masked reference). That may be your best bet. There are also other parameters that may help, but this should be a good start.

Highly heterozygous reads mapping

1 Answers1