I need to align many (several hundred) bacterial genomes, each of which is millions of base pairs long. Obviously, this is beyond normal alignment techniques, and it's unclear to me what the best practice is in such circumstances:

  • conventional alignment on a very powerful computer with lots of memory
  • break up the genome into smaller fragments and align them individually
  • some exotic different procedure

What possible avenues of attack are there?

(I've attempted to use MAFFT and Clustal with little success.)

agapow
  • A standard MSA tool like Clustal will assume that there are no rearrangements between sequences. Unless all of your sequences are completely colinear, you'll need a dedicated genome alignment method like MUMmer. – heathobrien Jun 14 '17 at 14:29
  • Adding some clarifications: there are no large scale rearrangements (otherwise I wouldn't try this). Most solutions are pairwise or to a reference, which isn't what I want to do here. – agapow Jun 14 '17 at 14:34
  • I wouldn't expect Clustal to work but what went wrong with MAFFT? I'm pretty sure that's been used for whole genome alignments on vertebrate data, so I would be surprised if it failed for bacteria. – terdon Jun 14 '17 at 14:36
  • There are some tips here to run mafft on a relatively small number of long sequences. – heathobrien Jun 14 '17 at 14:40
  • Also, could you explain why you are doing this? I ask on the off chance that this is an XY problem and whole genome alignments aren't actually the best way to achieve whatever your end goal is. – terdon Jun 14 '17 at 14:46
  • Answers to above questions: The "quick" Mafft alignment was visibly poor, while the higher quality one never completed. I'm trying to understand variation across a large set of bacteria gathered from a variety of sites and circumstances and see what may influence phenotype. – agapow Jun 14 '17 at 14:56
  • Please edit your question to provide clarifications. Comments are hard to read, easy to miss and can be deleted without warning. That said, if you are looking for variation and things that influence phenotype, why would you go for whole genome alignments? That doesn't seem likely to be of much help. You might want to post a new question explaining what you are trying to do and ask for help designing your approach. For example, I would most likely first identify the genes of your species (either using homology based approaches or de-novo gene prediction) and align those. – terdon Jun 14 '17 at 15:45
  • As for the quick mafft, yes, I'm not surprised it was bad, it isn't designed for WG alignments. The full one will of course take a long time. This is a big job. How long did you run it for and on what hardware? Multiple WG alignments aren't the sort of thing you can do on your laptop but, luckily, they are very rarely necessary. – terdon Jun 14 '17 at 15:47
  • Have you considered using LASTZ (http://www.bx.psu.edu/~rsharris/lastz/ )? It is meant to align chromosome sized entities? – GenoMax Jun 14 '17 at 23:48

1 Answer

Whole-genome alignment can be done with progressiveMauve, LAST or MUMmer. For bacteria I have used Mauve, since it also has a very nice visualisation engine. A very new tool is Minimap2, a very fast mapper that, besides read mapping, is supposed to be able to handle reference vs reference alignment. However, I do not know how its performance compares with the tools mentioned above.

If you only need a rough idea of which genome regions are shared, you can use bevel. Bevel is not really an aligner; it is more like a dot-plot, but it is super fast (even for mammalian-sized genomes).
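To make the dot-plot idea concrete, here is a minimal toy sketch in Python. It is not how bevel (or any of the tools above) is implemented; it only illustrates the general principle of marking positions where two sequences share an exact k-mer. The function names `kmer_dotplot` and `diagonal_counts` and the default `k=12` are my own assumptions, it looks at the forward strand only, and it is meant for small test sequences rather than hundreds of full genomes.

```python
from collections import defaultdict


def kmer_dotplot(seq_a, seq_b, k=12):
    """Return sorted (i, j) pairs where seq_a[i:i+k] == seq_b[j:j+k].

    Toy illustration only: exact matches, forward strand, no filtering
    of repetitive k-mers.
    """
    # Index every k-mer of the first sequence by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)

    # Look up every k-mer of the second sequence in that index.
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            hits.append((i, j))
    hits.sort()
    return hits


def diagonal_counts(hits):
    """Count hits per diagonal (i - j); a colinear shared region piles
    many hits onto the same diagonal."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[i - j] += 1
    return dict(counts)


if __name__ == "__main__":
    hits = kmer_dotplot("AAACCCGGGTTT", "CCCGGG", k=3)
    print(hits)
    print(diagonal_counts(hits))
```

Plotting the `(i, j)` pairs as points gives the dot-plot; long runs along one diagonal suggest colinear shared regions, while breaks or jumps between diagonals hint at indels or rearrangements.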

Kamil S Jaron