How to isolate genes from whole genomes for phylogenetic tree analysis?

Question

I have 446 whole Klebsiella Pneumoniae genomes I want to build a phylogenetic tree from. After reading about constructing phylogenetic trees it seems the only option for large numbers of genomes is to isolate a gene with low variability from generation to generation and use this gene to build a tree. For example Lars Jensen recommends using "16S rRNA [or] all ribosomal-protein-coding genes" https://www.biostars.org/p/1930/. What program isolates these genes of interest from the whole genome fasta files and can put them into a multiple alignment file? Or outputs them in a formate ready for a multiple alignment program such as Muave? The reason I say a multiple alignment file is because this it the type of file most phylogenetic tree programs take (I.E. clonalframe).

Hi Daniel, I would be cautious to build a phylogeny on 16s rRNA sequences as informative variability is essentially nil. I would suggest a good starting point being concatenated MLST genes, and work from there. You have the whole genome seems wrong to focus on the least variable gene. — AudileF, Jul 05 '17 at 14:47
Yes. Using 16S makes sense for distant species but not for closely related ones. it is extremely likely that there will be next to no variability at all. — terdon, Jul 05 '17 at 15:15
If you haven't yet annotated the genomes, do so with Prokka as suggested in some of the other answers. Once you've got that, you can easily pull gene features out of a genbank by name using biopython, once you've identified a few sequences you want to use as the basis for your sequence typing. — Joe Healey, Jul 05 '17 at 21:44

score 3 · Answer 1 · edited Jul 05 '17 at 20:39

3

There are a lot of ways to do this. I suggest using Prokka/Roary to produce a core genome alignment. There's a useful tutorial on the Roary website:

for file in *
do
    prokka --kingdom Bacteria --outdir "${file%%.*}"  --genus Listeria --locustag "${file%%.*}" "$file"
    mv "${file%%.*}"/PROKKA_07052017.gff GFF/"${file%%.*}".gff # use current date
done
roary -f Alignment -e -n -v GFF/*.gff

Alignment/core_gene_alignment.aln can be used as input for phylogenetic analyses

edited Jul 05 '17 at 20:39

terdon

10,071
5
22
48

answered Jul 05 '17 at 14:50

heathobrien

1,816
7
16

Roary is my go to but you have no say in what come out in the alignment. If you want to select your gene content use blast as I suggest. Concatenate the output and use an alignment software. – AudileF Jul 05 '17 at 15:02
@terdon I added a code snippet to do the analysis. It can be used to isolate genes (plural). If the OP just wants to build a phylogenetic tree, there's no need to isolate specific genes. With almost 500 genomes from a single species there's unlikely to me much resolution from analyses of a single gene or from a handful of MLST genes. – heathobrien Jul 05 '17 at 20:39

AudileF · Answer 2 · 2017-07-05T20:13:16.533

3

Extract desired gene sequences using standalone blast

Simply provide a reference database with your desired output. Set up your command and away you go. You can set the search up with a for loop for a batch of sequences. Command may look like

for f in *.fasta; do
   f=$(basename $f .fasta)
   blastn \
   -outfmt "6 sseqid qseq %" \
   -query $f.fasta \
   -subject reference.fna \
   > out/$f.fas
done

Watch the output as blast will spit out the gene as detected in + or - sense. If you want to only gather positive sense use -strandoption. The default output I have here is tab output which requires a few sed commands to make into fasta.

sed -i \
-e 's/\s*$//g' \
-e 's/^/>/g'  \
-e 's/\s\+/\n/g' \
*.fas

Online alignment servers are an easy way to align small datasets e.g. EBI

edited Jul 05 '17 at 20:13

answered Jul 05 '17 at 14:54

AudileF

955
8
25

1

It might be worth mentioning that this approach work for species with no splicing (so it should be fine for the OP), but isn't very useful for eukaryotic organisms. Also, why would you run multiple blasts in a loop like this? Why not give a multifasta input file to blast and run it once? – terdon Jul 05 '17 at 15:17
In my experience the blast out uses the gene names from the reference. and does not append the original genome name. I go about a few QC/ parsing bits and then combine the sequences. – AudileF Jul 05 '17 at 15:29
I'm not sure what my "desired gene sequences" are supposed to be. What is my query? – Daniel Harris Jul 05 '17 at 15:30
1

Your query is the genome sequences you have $f will insert the name for each in a loop. your desired gene sequences will be in your reference file. This could be the 16s or MLST gene sequences. Whatever you want to build a phylogeny of. – AudileF Jul 05 '17 at 15:34
My question is more how can I find those MLST gene sequences in the first place? – Daniel Harris Jul 05 '17 at 15:36
1

A quick google brought this up http://bigsdb.pasteur.fr/klebsiella/klebsiella.html . Have a look about there is usually a file full of genes and alleles. In my experience one (allele) set of the 7 genes should suffice. – AudileF Jul 05 '17 at 15:40
1

@DanielHarris take one MLST gene and use that as a query to find the rest. – terdon Jul 05 '17 at 15:48
1

@AudileF blast can take a multifasta file as a query and the results for each sequence in the input file will be shown in a separate Query= section. The names are indeed the names taken from the input file, but that shouldn't be a problem as long as they're all unique. – terdon Jul 05 '17 at 15:49
So use one of these MLST genes such as from this link http://bigsdb.pasteur.fr/perl/bigsdb/bigsdb.pl?db=pubmlst_klebsiella_seqdef_public&page=downloadAlleles as a query. Two questions now... Do I use the entire locus file, such as gapA has many >gapA_#'s or just one >gapA_1? Second, how does the output of blast then be used to build a phylogenetic tree? Blast output is far from multiple alignment format I think. – Daniel Harris Jul 05 '17 at 15:57
@terdon must admit I never know. Does this work in the case of contig assemblies? – AudileF Jul 05 '17 at 20:01
@DanielHarris 1) use one copy of each of the 7 genes. 2) you are correct the output can be quite jumbled. I usually use the -strandoption to gather the + and - sense sequences. Split them by taxa name. Order sequences in each file by name. remove fasta headers and space. Create new fasta header of file name. Then concatenate them all. For simplicity you could use a online msa server like on ebi. Long winded and confusing. Im not a master computational biologist terdon might have a better suggestion. – AudileF Jul 05 '17 at 20:08

How to isolate genes from whole genomes for phylogenetic tree analysis?

2 Answers2