We generated a genome assembly (diploid, chordate, highly heterozygous) using PacBio, and we want to see whether it contains lineage-specific duplications (paralogs, basically). The genome is not in Ensembl yet.

The only data we have at the moment are:

  • genome
  • transcript annotation
  • RNAseq

We found some methods from papers:

  • use Blast
  • detect segmental duplications in complete genomes with SDDdetector
  • detect putative recent segmental duplications in diploid homozygous organisms based on NGS data with DuplicationDetector
  • (I just came up with this, but should work) map back the reads to the assembly & analyze the read depth to detect duplicated portions
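
A minimal sketch of the read-depth idea in the last bullet, assuming per-base depth has already been exported as three whitespace-separated columns (contig, position, depth), which is the layout `samtools depth` writes. The window size and the 1.5x threshold are illustrative assumptions, not tuned values, and the function names are hypothetical. The reasoning is that when two copies of a region are collapsed into one locus in the assembly, reads from both copies pile up there, so windows well above the median depth are candidates.

```python
from collections import defaultdict
from statistics import median


def window_depths(depth_lines, window=1000):
    """Return mean depth per fixed-size window.

    depth_lines: iterable of 'contig<ws>pos<ws>depth' strings (1-based pos).
    Keys of the result are (contig, window_index) tuples.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue  # skip blank lines
        contig, pos, depth = line.split()[:3]
        key = (contig, (int(pos) - 1) // window)
        sums[key] += int(depth)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def flag_elevated_windows(means, min_ratio=1.5):
    """Return windows whose mean depth is >= min_ratio times the median window depth.

    min_ratio=1.5 is an assumed cut-off for illustration only.
    """
    if not means:
        return {}
    baseline = median(means.values())
    if baseline == 0:
        return {}
    return {
        key: mean / baseline
        for key, mean in means.items()
        if mean / baseline >= min_ratio
    }
```

To use it, the lines of a depth file can be passed straight in, for example `flag_elevated_windows(window_depths(open("depth.txt")))`, where `depth.txt` is a placeholder file name. Elevated depth can also come from repeats or mapping artefacts, so flagged windows are only a starting point.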

I'll gladly take advice.

aechchiki

1 Answer

This is really tough to do with highly heterozygous animals. What are your genome assembly stats? Specifically, what is your number of contigs, scaffolds, assembly size, and the N50?
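
For reference, a small sketch of how those stats can be computed from a list of contig (or scaffold) lengths. The function names are hypothetical, and the N50 definition used is the common one: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


def assembly_stats(lengths):
    """Return sequence count, total assembly size and N50 as a dict."""
    lengths = list(lengths)
    return {
        "count": len(lengths),
        "total_size": sum(lengths),
        "n50": n50(lengths),
    }
```

The lengths themselves can come from any FASTA index or parser; that part is left out here.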

If you have proximity ligation data, it will be easier to determine whether potential paralogs are truly from different regions of the genome or are just the same locus from the two haplotypes (homologous chromosomes) that ended up uncollapsed in your final assembly.

If I were you and had a good genome, I would start by annotating gene models using the RNAseq data, then BLAST every gene model against all the other gene models (an all-vs-all search). This would give you potential paralogs to start looking at more closely.
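
A rough sketch of the filtering step after such an all-vs-all search, assuming the hits were written in BLAST's default tabular layout (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name and the identity, length and e-value cut-offs are assumptions for illustration, not recommended values.

```python
def candidate_paralog_pairs(blast_lines, min_pident=70.0, min_length=100,
                            max_evalue=1e-10):
    """Return a set of unordered (gene_a, gene_b) pairs from tabular BLAST hits.

    Self-hits are dropped and reciprocal hits (a->b and b->a) collapse into
    a single pair. All thresholds are illustrative assumptions.
    """
    pairs = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # a gene always hits itself
        pident = float(fields[2])
        length = int(fields[3])
        evalue = float(fields[10])
        if pident >= min_pident and length >= min_length and evalue <= max_evalue:
            pairs.add(tuple(sorted((query, subject))))
    return pairs
```

Note that alternative isoforms of the same gene will also hit each other, so it helps to keep one representative sequence per gene before searching. In a highly heterozygous assembly, very high identity pairs may also be uncollapsed haplotypes rather than true paralogs.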

conchoecia
  • Thanks. About my PacBio genome: haploid size ~550 Mbp (canu + purge haplotigs), ~1000 contigs, ~1.6 Mbp N50, not yet scaffolded, but I'm not sure that's relevant. No proximity ligation data, unfortunately - but I was thinking of checking the reliability of paralogs by making sure they are also found in an alternative assembly we have (same species, but Illumina, so quite fragmented). I'll test what you suggest in the 3rd paragraph too, thanks for the input! – aechchiki Jul 09 '18 at 08:52
  • Those are great numbers - I'd make sure that you polish with pilon then racon to remove indels before mapping RNAseq reads and generating gene models. – conchoecia Jul 09 '18 at 17:45
  • also @conchoecia can you suggest an alternative to pilon? it is apparently "not currently tuned to the error model of raw PacBio reads, and their use may introduce false corrections" (https://github.com/broadinstitute/pilon/wiki/Methods-of-Operation) – aechchiki Jul 27 '18 at 11:27
  • Ah, for pilon I mean use Illumina data. You might use Arrow (pb reads) then pilon (Illumina reads). – conchoecia Aug 02 '18 at 00:26
  • oh ok! I don't have Illumina data yet. I'll try with Arrow (I tried with Racon already but only seemed to make the assembly worse -.- not sure why). – aechchiki Aug 03 '18 at 10:12