How to transfer GFF annotations to a genome with extensive duplications?

Question

Microbial genomes can contain extensive duplications. Often we'd like to transfer annotations from an annotated species to one that is newly sequenced.

Existing tools (e.g. RATT, LiftOver, Kraken) either make specific assumptions about how closely related the species are or fail to transfer when multiple matches are found in the new genome, especially if the sequences are highly similar.

Specifically, I have a synthetic biology application where genes can duplicate extensively. The copies are identical in sequence, but they can be duplicated many times and relocated (i.e., they are not just adjacent to each other). None of the above-mentioned tools can transfer annotation coordinates to genomes that contain multiple copies of a feature.

Are there any pre-existing tools or software that transfer annotations in this scenario? Ideas for ways to do this robustly?

Can you provide more detail? What's your input (e.g. raw reads, assemblies, open-reading frames)? What kind of sequencing? What is the exact output you want? I'm not sure I understand your point about the assumptions. Also, LiftOver and Kraken are completely different tools with different uses – Chris_Rands, May 31 '17 at 11:19
Would this even be theoretically possible? How can you assume the annotations are transferable if there are extensive duplications? It's probably better to look for homologs instead. — terdon, May 31 '17 at 11:22
@terdon do you mean orthologs? homologs = orthologs (non-duplicated) + paralogs (duplicated) — Chris_Rands, May 31 '17 at 11:25
@Chris_Rands No, I meant homologs. Precisely because we can't know whether they are ortho- or para- (I have a nice post on the difference between the two here, by the way), so all you can do first is find homologs and then try and figure out whether they are similar enough to carry any annotations over. — terdon, May 31 '17 at 11:54
@terdon I see well resolving orthologs/paralogs is not easy of course, but it can be done, depending on the exact data (I don't know what the OP's data looks like), for example some of my colleagues maintain orthodb http://www.orthodb.org/ — Chris_Rands, May 31 '17 at 12:03
Oh, of course it can be done! My point was that looking for regions of homology (of whatever type) seems like a better way of transferring annotations than attempting to translate genomic coordinates between genomes of different species. — terdon, May 31 '17 at 13:35
@Chris_Rands: Input would be assemblies, e.g. de novo from gDNA sequencing. Output would be a transfer of annotations (e.g. gff format) from a characterized species to the newly assembled genome (coordinate transfer). Both LiftOver and Kraken (this one, just to make sure we're on the same page: https://github.com/nedaz/kraken) do this. LiftOver is more appropriate for coordinate transfer between closely related sequences, e.g. different assemblies; Kraken uses genome alignment (MUMmer, Satsuma), so it is better for more divergent sequences. – scalefreegan, May 31 '17 at 13:43
@terdon: distinguishing between type/origin of homology would go beyond scope of what I would want to accomplish, but the difference is important as you point out. also right to say that transferring smaller homologous regions would be better, especially for diverged species. fyi about application: I have a syn bio application where genes can duplicate extensively. they are identical in sequence but duplicated many times and relocated (i.e. not adjacent). None of the above mentioned tools was able to transfer coordinates of annotations to genomes with multiple copies of annotation. — scalefreegan, May 31 '17 at 13:49
Yes, nor would I expect them to. That's what I was saying. The liftover tools simply map coordinates, they won't be able to deal with this sort of thing. I am afraid you will have to do it manually by getting a list of genes/proteins of interest, finding their homologs and transferring the annotations over (with the obvious caveats about whether or not you can assume the annotations are transferable). Won't be much fun, unfortunately. — terdon, May 31 '17 at 13:54
Apologies, I thought you meant kraken: http://ccb.jhu.edu/software/kraken/, who names these tools? Anyway, this is quite non-trivial to do properly. You'll need to do genome assembly, gene predictions and ortholog/paralog assignment; there are various pipelines (some reviewed here: https://www.ncbi.nlm.nih.gov/pubmed/27043882), but they'll take some time. Alternatively, for something more 'quick and dirty', @terdon's suggestion seems sensible – Chris_Rands, May 31 '17 at 13:55

BaCh · Answer 1 · 2017-05-31T17:43:32.240

5

There is one very simplistic approach I use which might work for what you are doing. It is similar to what terdon proposed.

Take a de-novo microbial genome annotation tool (I have my own, but you could use/modify prokka). Tools like these often first predict gene boundaries (with other tools like prodigal or glimmer) and then try to assign a function to found genes. This function assignment is often done with BLAST and other tools ... and that is where you can go in and modify to do what you need.

I use a "knowledge" protein database of genes I want to have very strictly annotated as a first line of annotation (e.g. in your case: the annotated genomes). For that I loop through very strict identity/similarity parameters which get gradually relaxed.

E.g.:
Loop 0: only transfer annotations at 100% DNA identity, same length.
Loop 1: only transfer annotations at 100% similarity, same length.
Loop 2: only transfer annotations at 99% similarity, length +/- 1%.
...
Loop n: only transfer annotations at 100-(n-1)% similarity, length +/- (n-1)%.

In each loop, obviously only annotate what has not been annotated in previous loops.
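The gradually relaxed loop can be sketched roughly as follows. This is a minimal illustration, not the code of any existing annotation tool: the `Hit` record, `tiered_transfer` and `max_relax` are hypothetical names, and the sketch assumes the comparison step (e.g. BLAST) has already produced one percent score and two lengths per gene/reference pair. It collapses the separate DNA-identity and similarity checks into a single score, so tier 0 here stands for "100%, same length" and tier k for "at least 100-k%, length within k%".

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One comparison between a predicted gene and a reference entry (hypothetical record)."""
    gene_id: str   # predicted gene in the new assembly
    ref_id: str    # annotated entry in the "knowledge" database
    score: float   # percent identity or similarity, 0-100
    gene_len: int  # length of the predicted gene
    ref_len: int   # length of the reference entry (assumed > 0)


def length_diff_pct(hit: Hit) -> float:
    """Length difference relative to the reference, in percent."""
    return 100.0 * abs(hit.gene_len - hit.ref_len) / hit.ref_len


def tiered_transfer(hits, max_relax=10):
    """Return {gene_id: (ref_id, tier)} using the strictest tier each gene passes.

    Tier k accepts hits with score >= 100 - k and length within +/- k percent.
    A gene annotated in an earlier (stricter) tier is never re-annotated.
    """
    assigned = {}
    # Within a tier, prefer the highest score, then the closest length.
    best_first = sorted(hits, key=lambda h: (-h.score, length_diff_pct(h)))
    for relax in range(max_relax + 1):
        for hit in best_first:
            if hit.gene_id in assigned:
                continue  # already annotated in a stricter tier
            if hit.score >= 100.0 - relax and length_diff_pct(hit) <= relax:
                assigned[hit.gene_id] = (hit.ref_id, relax)
    return assigned
```

Because the result is keyed by the predicted gene rather than by the reference entry, every identical copy in the new genome receives the annotation independently, which is the property needed when a gene has been duplicated many times. Genes left unassigned after the last tier would go through the normal annotation pipeline.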

After that, use the tool's "normal" annotation pipeline to annotate the rest.

edited May 31 '17 at 17:43

answered May 31 '17 at 15:34


Doesn't that require the target genome's genes to have been found first? Or can your tool also do de-novo gene prediction? (sounds like a very useful tool, by the way, kudos!) – terdon May 31 '17 at 16:33
Prokaryotic gene finding/prediction is a more or less solved problem; existing tools work reasonably well. See Prodigal (http://prodigal.ornl.gov/) and Glimmer (just to name two). – BaCh May 31 '17 at 17:27
Yes, I know, I was just surprised you didn't mention that in your answer. If I understand correctly, the first step would be for the OP to find the list of putative genes in their newly sequenced genome, right? – terdon May 31 '17 at 17:31
Correct. Prokka (http://www.vicbioinformatics.com/software.prokka.shtml) uses a whole battery of third-party tools (including prodigal) to annotate a genome de-novo, I started by modifying prokka before I wrote my own one, which uses some ideas from the prokka pipeline. – BaCh May 31 '17 at 17:37

score 3 · Answer 2 · edited Jun 18 '20 at 08:30

I think you will first have to identify the regions homologous to the ones defined in your GFF and then transfer the annotations. Of course, the assumption there is that the homolog will also have the same annotation, which is often not true. However, I don't see how you can do it any other way, since you cannot use genomic coordinates when the genomes are so different (and you would still be making the same assumption even if you could).

For a very simplistic approach (which might be enough if, as you say, your sequences are almost identical), you can do something like:

Collect the sequences of interest from your already annotated species.
Use a tool like genewise or exonerate to map these into the target genome. Both tools can return gff-formatted output and both can find multiple hits in the target genome. For what you want, I would suggest using a very high threshold of sequence similarity and query coverage (where the target sequence found covers all or most of the query sequence used).

Since these are microbial genomes and therefore splicing isn't a problem, you could do the same thing with even a simple BLASTn or tBLASTn if you start from protein sequences.
At this point, you should have a list of homologs (some of which will be orthologs and others paralogs) and you can transfer the annotations of the query sequence over to the target.
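The BLASTn variant of these steps could be sketched as below. This is an illustration under stated assumptions, not a tested pipeline: it assumes the hits are in BLAST's default 12-column tabular format (`-outfmt 6`), that query lengths and the annotations to transfer are supplied separately as dictionaries, and that minus-strand hits are reported with `sstart` greater than `send`. The function name, the `blastn_transfer` source tag and the `_copyN` ID scheme are made up for the example.

```python
from urllib.parse import quote


def blast_hits_to_gff(blast_lines, query_lengths, annotations,
                      min_identity=99.0, min_coverage=99.0):
    """Turn BLAST tabular hits into GFF3 feature lines, one per accepted copy.

    blast_lines   : iterable of default `-outfmt 6` lines (12 tab-separated columns)
    query_lengths : {query_id: length} of the annotated query sequences
    annotations   : {query_id: name/description to carry over}
    """
    gff_lines = []
    copies_seen = {}  # number of accepted hits so far, per query
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id, subject_id = fields[0], fields[1]
        identity = float(fields[2])
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        # Query coverage: fraction of the query sequence spanned by this hit.
        coverage = 100.0 * (abs(qend - qstart) + 1) / query_lengths[query_id]
        if identity < min_identity or coverage < min_coverage:
            continue
        strand = "+" if sstart <= send else "-"
        start, end = min(sstart, send), max(sstart, send)  # GFF3 needs start <= end
        copies_seen[query_id] = copies_seen.get(query_id, 0) + 1
        # Percent-encode characters that are reserved in GFF3 attributes (; = & ,).
        name = quote(annotations.get(query_id, query_id), safe=" ")
        attributes = (f"ID={query_id}_copy{copies_seen[query_id]};"
                      f"Name={name};Note=transferred by homology")
        gff_lines.append("\t".join([
            subject_id, "blastn_transfer", "gene",
            str(start), str(end), f"{identity:g}", strand, ".", attributes,
        ]))
    return gff_lines
```

Each hit that passes both thresholds becomes its own feature, so a query that occurs five times in the target yields five GFF lines. Partial hits fall below the coverage threshold and are dropped, which is the strict behaviour suggested above for near-identical sequences.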

Again, I stress that this is making a whopping huge assumption: that homologous sequences have the same function and can automatically be annotated as whatever you had in the query genome. This is going to be true for many cases but false for others, especially if you are looking at paralogs (genes whose duplication occurred after the speciation event and which are therefore likely to have diverged in function).

However, as I said before, this problem would be exactly the same even if you did manage to transfer annotations just by identifying the syntenic regions of the genomes¹, so there's not much difference there.

¹ As I said in the comments, I don't see how this could be possible. By definition, if you have extensive duplications, the genomic coordinates will be completely different and it is impossible to map from one genome into the other.
