
I'd like to call diploid SNVs from long-read data only (~80 PacBio SMRT cells).

I have generated a draft reference genome for an individual from a highly heterozygous (~4%) species (Canu + HaploMerger2).

I can use this reference for some applications:

  • RNAseq mapping;
  • functional annotation;
  • duplicate detection;
  • synteny analysis

but not others:

  • analysis of compound heterozygotes;
  • allele-specific expression;
  • variant linkage

This is mainly why I need phasing. FYI, I tried Falcon + Falcon-Unzip too, but the daligner stage was taking too much time and too many resources, and I still need to find a way to run it in reasonable time (working on it).

Meanwhile, for phasing, it was suggested to use HapCUT2. From Canu's FAQ:

(1) "Avoid collapsing the genome" in Canu to get raw diploid assembly as complete & separate as possible (2) HaploMerger2 to get high-quality reference and alternative haploid assemblies (3) HapCUT2 or other phasing tools to get the high-quality haplotype assembly based on the reference haploid assembly.

However, HapCUT2 requires a diploid SNV file as input. That would be easy to produce if I had short-read data, but I haven't yet found a method that works with only the long reads I have.
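To make "diploid SNV file" concrete, here is a minimal sketch of the kind of filtering I have in mind once any caller has produced a VCF. It keeps only biallelic SNVs with a heterozygous genotype, which are the sites that carry phasing information. The function names are my own, and it assumes a single-sample, uncompressed VCF with a `GT` entry in the FORMAT column; it is an illustration, not part of HapCUT2.

```python
from typing import Iterable, Iterator


def is_het_snv(vcf_line: str) -> bool:
    """True for a biallelic SNV whose first sample has a heterozygous diploid genotype."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:  # need at least one sample column
        return False
    ref, alt, fmt, sample = fields[3], fields[4], fields[8], fields[9]
    # Single-base REF and ALT only; a multi-allelic ALT contains a comma, so it fails len == 1.
    if len(ref) != 1 or len(alt) != 1 or alt in {".", "*"}:
        return False
    keys = fmt.split(":")
    if "GT" not in keys:
        return False
    genotype = sample.split(":")[keys.index("GT")]
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def filter_het_snvs(lines: Iterable[str]) -> Iterator[str]:
    """Pass header lines through unchanged and keep only heterozygous SNV records."""
    for line in lines:
        if line.startswith("#") or is_het_snv(line):
            yield line
```

Usage would be something like `out.writelines(filter_het_snvs(open("calls.vcf")))`, where `calls.vcf` is a placeholder file name.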

Strategies I tried to get the SNVs:

  • marginPhase : method paper

    • but I ran into an issue when running it
  • Clairvoyante : method paper

    • but I currently don't have access to suitable GPU machines to build the model - and I can't use the models provided by the authors because my organism is too different from human

Any input would be greatly appreciated!

aechchiki
  • At 4% heterozygosity, falcon-unzip is probably the best hope. Trio binning would be better but you don't have the right data. – user172818 Nov 22 '18 at 16:20
  • @user172818 to your knowledge, is it possible to run Falcon-Unzip on non-Falcon output? From here it looks like it's a no. – aechchiki Nov 23 '18 at 10:00

2 Answers


You could try aligning your reads to the draft reference genome with, for example, minimap2, and then calling variants with freebayes. There appears to be a protocol for long reads: https://github.com/ekg/freebayes#re-genotyping-known-variants-and-calling-long-haplotypes

Disclaimer: I have not run freebayes myself, and it may be that this is a terrible idea. I would expect that calling SNVs (single-nucleotide variants) from PacBio reads is very challenging. However, with enough coverage, I think it could be possible.
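As a rough, untested sketch of that idea, the steps could be chained as below. The file names, the thread count and the helper names are placeholders of mine. The `map-pb` preset assumes raw PacBio CLR reads, and the freebayes options are plain diploid defaults that have not been tuned for long-read error profiles.

```python
import subprocess


def build_commands(ref: str, reads: str, prefix: str, threads: int = 8) -> dict:
    """Build the argv lists for align -> sort -> index -> call (nothing is executed here)."""
    bam = f"{prefix}.bam"
    vcf = f"{prefix}.vcf"
    return {
        # -a: SAM output; -x map-pb: preset for PacBio CLR reads
        "align": ["minimap2", "-a", "-x", "map-pb", "-t", str(threads), ref, reads],
        # trailing "-" makes samtools sort read the SAM stream from stdin
        "sort": ["samtools", "sort", "-@", str(threads), "-o", bam, "-"],
        "index": ["samtools", "index", bam],
        # diploid calling against the draft reference, VCF written to a file
        "call": ["freebayes", "-f", ref, "--ploidy", "2", "--vcf", vcf, bam],
    }


def run_pipeline(ref: str, reads: str, prefix: str, threads: int = 8) -> str:
    """Run the four steps; requires minimap2, samtools and freebayes on PATH."""
    cmds = build_commands(ref, reads, prefix, threads)
    align = subprocess.Popen(cmds["align"], stdout=subprocess.PIPE)
    subprocess.run(cmds["sort"], stdin=align.stdout, check=True)
    align.stdout.close()
    if align.wait() != 0:
        raise RuntimeError("minimap2 exited with a non-zero status")
    subprocess.run(cmds["index"], check=True)
    subprocess.run(cmds["call"], check=True)
    return f"{prefix}.vcf"


# Example call (placeholder paths):
# run_pipeline("draft_reference.fa", "pacbio_reads.fastq.gz", "sample")
```

The resulting VCF would still need filtering down to confident heterozygous SNVs before being handed to a phasing tool.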

winni2k

You may want to try the longshot tool (https://github.com/pjedge/longshot) developed for calling variants in diploid genomes from long read data.
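A minimal sketch of how it might be invoked, based on the basic usage in the project's README (`--bam`, `--ref`, `--out`). The wrapper function names and file paths are placeholders of mine. It assumes the reads are already aligned to the draft reference in a sorted, indexed BAM, and that the reference FASTA is indexed.

```python
import shutil
import subprocess


def longshot_command(bam: str, ref: str, out_vcf: str) -> list:
    """Build the argv for a basic longshot run (nothing is executed here)."""
    return ["longshot", "--bam", bam, "--ref", ref, "--out", out_vcf]


def run_longshot(bam: str, ref: str, out_vcf: str) -> str:
    """Run longshot if it is installed and return the path of the output VCF."""
    if shutil.which("longshot") is None:
        raise FileNotFoundError("longshot executable not found on PATH")
    subprocess.run(longshot_command(bam, ref, out_vcf), check=True)
    return out_vcf


# Example call (placeholder paths):
# run_longshot("sample.bam", "draft_reference.fa", "sample.longshot.vcf")
```

As far as I understand, the output VCF also carries phased genotypes, so it may cover part of the HapCUT2 step as well.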

user4162