10

Sometimes it is useful to align protein-coding nucleotide sequences codon by codon rather than nucleotide by nucleotide. For example, downstream codon-model analyses require intact codons.

A widely used approach is to align the protein sequences first and then impose that alignment onto the nucleotide sequences using PAL2NAL, CodonAlign, or something similar.

This is how transAlign or GUIDANCE (in codon mode) work.
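To make the protein-first approach concrete, here is a minimal sketch of threading one unaligned coding sequence onto its aligned protein sequence. It is not how PAL2NAL or the other tools are implemented; the function name `thread_codons` is hypothetical, and the sketch assumes the coding sequence is in frame, contains no frameshifts, and has exactly one codon per aligned residue.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map a gapped protein sequence back onto its ungapped coding sequence.

    Each residue consumes the next codon; each protein gap becomes a
    three-character codon gap. Assumes an in-frame CDS with no frameshifts.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if n_residues != len(codons):
        raise ValueError("protein and CDS lengths do not match")

    out = []
    next_codon = iter(codons)
    for aa in aligned_protein:
        # A gap column in the protein alignment spans a whole codon.
        out.append(gap * 3 if aa == gap else next(next_codon))
    return "".join(out)


print(thread_codons("M-KV", "ATGAAAGTT"))
```

Note that the nucleotides never influence where the gaps go: the codons are only copied into positions already fixed by the protein alignment.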

The problem is that this discards part of the information that could be used for the alignment. For example, if a slowly evolving low-complexity region is adjacent to a quickly evolving one, the amino-acid-induced alignment could be wrong, whereas incorporating the nucleotide sequence could make the alignment more accurate.
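As a small illustration of the information that translation discards, the sketch below compares two made-up six-nucleotide sequences that differ at most positions yet translate to the same peptide. The sequences and helper names are invented for the example; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding sequence (length divisible by 3)."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def count_differences(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


seq_a = "CTGAGC"  # hypothetical example sequence
seq_b = "TTATCA"  # hypothetical example sequence

print(count_differences(seq_a, seq_b))                        # nucleotide level
print(count_differences(translate(seq_a), translate(seq_b)))  # amino acid level
```

An aligner that only sees the translated sequences cannot use any of these synonymous differences as evidence for or against a particular placement of gaps.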

I'm aware of two programs that can do true codon alignment. First, PRANK has a dedicated codon model, but it is rather slow and is overkill for some problems. Second, the Sequence Manipulation Suite can perform codon alignments, but only for a pair of sequences; it is also JavaScript-based, so it is hard to run on a large number of sequences.

Can you recommend any software for multiple codon sequence alignment? Preferably available for offline use.

M__
Iakov Davydov
  • 2
    Why not translate the sequence and perform amino acid alignment? True, you “discard” the codon isoacceptor identity but that can be mapped back trivially if needed. – Konrad Rudolph Jul 05 '17 at 16:02
  • As I wrote in the question, this way you lose some information. E.g. you can produce an incorrect amino acid alignment for a low-complexity region where you would otherwise be able to reconstruct the MSA properly. In principle I agree that this doesn't happen very often. – Iakov Davydov Jul 05 '17 at 16:57
  • There's blastx and tblastx but I don't think they do true codon alignment. – GWW Jul 16 '17 at 17:25
  • @IakovDavydov how does a codon alignment 'lose some information' over a simple DNA-DNA alignment? It's easier to find homologous residues using proteins (instead of DNA), and using that information to align DNA triplets can only be beneficial. – PejoPhylo Mar 20 '19 at 20:20
  • @PejoPhylo I was not claiming that codon alignment loses some information. I was saying that when converting codons to amino acids some information is lost. – Iakov Davydov Mar 21 '19 at 09:00
  • @IakovDavydov I'm sorry, but then I don't understand what you originally asked in this question. Couldn't you use PAL2NAL for your (multiple) codon alignment? It supports offline use – PejoPhylo Mar 21 '19 at 10:07
  • @PejoPhylo I don't think I can make it much clearer here than in the original question. But briefly: with PAL2NAL you lose some information, which might be important for some alignments. Most of the time that's not important, but if your dN is comparable with dS and GC-content stays constant, that could make a difference. This paper provides some details: https://doi.org/10.1093/molbev/msr272 – Iakov Davydov Mar 21 '19 at 17:09
  • @IakovDavydov thanks, I'll give it a read. – PejoPhylo Mar 21 '19 at 17:13
  • @IakovDavydov I don't think that your original question is correct. If one is interested in the functional aspect (the protein), then any codon-based aligner will work, for example MEGA. By definition, the amino acids are being aligned. The question about the low-complexity, slowly evolving region is moot. If you are interested in the codon differences and drift, or evolution of the codons, then yes - but I think that using nt aligned by codon is done more from the perspective of bundling the nt into triplets. – Andor Kiss Sep 30 '22 at 14:27

3 Answers

4

I don't know of any transcript-to-transcript aligners that are able to do this, but LAST can align transcript queries to protein reference sequences using a specified frameshift cost. Here's the specific documentation for that option:

-F COST

Align DNA queries to protein reference sequences, using the specified frameshift cost. A value of 15 seems to be reasonable. (As a special case, -F0 means DNA-versus-protein alignment without frameshifts, which is faster.) The output looks like this:

a score=108
s prot 2  40 + 649 FLLQAVKLQDP-STPHQIVPSP-VSDLIATHTLCPRMKYQDD
s dna  8 117 + 999 FFLQ-IKLWDP\STPH*IVSSP/PSDLISAHTLCPRMKSQDN

The \ indicates a forward shift by one nucleotide, and the / indicates a reverse shift by one nucleotide. The * indicates a stop codon. The same alignment in tabular format looks like this:

108 prot 2 40 + 649 dna 8 117 + 999 4,1:0,6,0:1,10,0:-1,19

The "-1" indicates the reverse frameshift.
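For anyone post-processing this tabular output, here is a rough sketch of reading the final column of the line above. The helper names are hypothetical. The interpretation (plain numbers are gapless blocks counted in amino acids, `a:b` pairs are offsets in the protein and in the DNA, and each aligned amino acid covers three nucleotides) is inferred from the example and checked against its alignment sizes of 40 and 117, so treat it as an assumption rather than a format specification.

```python
def parse_blocks(blocks):
    """Split a LAST tabular blocks column into match and gap items."""
    items = []
    for field in blocks.split(","):
        if ":" in field:
            first, second = field.split(":")
            items.append(("gap", int(first), int(second)))
        else:
            items.append(("match", int(field)))
    return items


def aligned_spans(items):
    """Return (protein_span, dna_span) implied by the parsed blocks.

    Assumes match lengths are in amino acids and DNA offsets are in
    nucleotides, as inferred from the example above.
    """
    protein_span = 0
    dna_span = 0
    for item in items:
        if item[0] == "match":
            protein_span += item[1]
            dna_span += 3 * item[1]
        else:
            protein_span += item[1]
            dna_span += item[2]
    return protein_span, dna_span


def frameshifts(items):
    """DNA offsets that are not a multiple of 3, i.e. frameshifts."""
    return [item[2] for item in items if item[0] == "gap" and item[2] % 3 != 0]


items = parse_blocks("4,1:0,6,0:1,10,0:-1,19")
print(aligned_spans(items))
print(frameshifts(items))
```

The spans come out as the alignment sizes given in the same line (40 for prot, 117 for dna), which is a cheap consistency check to run on each record.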

I sent an email to the LAST mailing list about adding a frameshift penalty for transcript-to-transcript matching; I've been pleasantly surprised with the requested features that Martin Frith has added to LAST in the past. Unfortunately, in this case the problem is too difficult to sort out due to all the possible combinations that could happen, so it's unlikely to be implemented in LAST in the foreseeable future (unless someone else writes that code).

gringer
3

Try MACSE v2 (https://academic.oup.com/mbe/article/35/10/2582/5079334). It aligns multiple protein-coding nucleotide sequences based on their amino acid translation while allowing for the occurrence of frameshifts.

user90
2

Virulign does this for virus sequences; the publication is available here and the GitHub address here.

M__
  • Hi @kristoftheys welcome to the site, congratulations on Bioinfo paper. I've edited the post to include publication and github address. – M__ May 01 '20 at 17:26
  • @Michael many thanks for the completion. sorry for the inconvenience, next time better :) – kristof theys May 03 '20 at 08:26