How can longest isoforms (per gene) be extracted from a FASTA file?

Question

Is there a convenient way to extract the longest isoforms from a transcriptome fasta file? I had found some scripts on biostars but none are functional and I'm having difficulty getting them to work.

I'm aware that the longest isoforms aren't necessarily 'the best' but it will suit my purposes.

The fasta was generated via Augustus. Here is what the fasta file looks like currently (sequence shortened to save space)

>Doug_NoIndex_L005_R1_001_contig_2.g7.t1
atggggcataacatagagactggtgaacgtgctgaaattctacttcaaagtctacctgattcgtatgatcaactcatca
ttaatataaccaaaaacctagaaattctagccttcgatgatgttgcagctgcggttcttgaagaagaaagtcggcgcaagaacaaagaagatagaccg
>Doug_NoIndex_L005_R1_001_contig_2.g7.t2
atggggcataacatagagactggtgaacgtgctgaaattctacttcaaagtctacctgattcgtatgatcaactcatca

The format is as such:

Gene 1 isoform 1  
Gene 1 isoform 2  
Gene 2 isoform 1  
Gene 2 isoform 2

and so forth. There are several genes that have more than one pair of isoforms (up to 3 or 4). There are roughly 80,000 total transcripts, probably 25,000 genes. I would like to extract the single longest isoform for each gene.

It depends on the exact format of your fasta. Where did you get it? What do the fasta header lines look like? Can you paste a small example? — user172818, Jun 08 '17 at 21:38
Doug_NoIndex_L005_R1_001_contig_4.g13.t1

(sequence data here)

Doug_NoIndex_L005_R1_001_contig_4.g13.t2

(sequence data here) — ZincFingers, Jun 08 '17 at 21:38

score 6 · Accepted Answer · edited Jun 09 '17 at 08:06

While the solution from https://bioinformatics.stackexchange.com/users/96/daniel-standage should work (after adjusting for possible python3 incompatibility), the following is a shorter and less memory demanding method that uses biopython:

#!/usr/bin/env python
from Bio import SeqIO
import sys

lastGene = None
longest = (None, None)
for rec in SeqIO.parse(sys.argv[1], "fasta"):
    gene = ".".join(rec.id.split(".")[:-1])
    l = len(rec)
    if lastGene is not None:
        if gene == lastGene:
            if longest[0] < l:
                longest = (l, rec)
        else:
            lastGene = gene
            SeqIO.write(longest[1], sys.stdout, "fasta")
            longest = (l, rec)
    else:
        lastGene = gene
        longest = (l, rec)
SeqIO.write(longest[1], sys.stdout, "fasta")

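If you saved this as filter.py, then python filter.py original.fa > subset.fa would be the command to use. Note that the script assumes all isoforms of a gene are adjacent in the file, as in your example.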

score 3 · Answer 2 · answered Jun 09 '17 at 10:51

Here is a solution in R. Could get really slow with big files. Works for the example you posted.

library(Biostrings)

## read your fasta in as Biostrings object
fasta.s <- readDNAStringSet("sample.fa")

## get the read names (in your case it has the isoform info)
names.fasta <- names(fasta.s)

## extract only the relevant gene and isoform id (split name by the period symbol)
gene.iso <- sapply(names.fasta,function(j) cbind(unlist(strsplit(j,'\\.'))[2:3]))

## convert to good data.frame = transpose result from previous step and add relevant column names
gene.iso.df <- data.frame(t(gene.iso))
colnames(gene.iso.df) <- c('gene','isoform')

## and length of isoforms
gene.iso.df$width <- width(fasta.s)

## split data.frame into list with entry for each gene
gene.iso.df.split <- split(gene.iso.df,gene.iso.df$gene)

## optional to keep all the information but really you just need indices
##gene.iso.df.split.best <- lapply(gene.iso.df.split,function(x) x[which.max(x$width),])

## pull out the longest isoform ID for each gene (in case of a tie just take the first one)
## which.max() returns the index of the first maximum; order(x$width)[1] would pick the shortest
best.id <- sapply(gene.iso.df.split,function(x) row.names(x)[which.max(x$width)])

## subset your original reads with the subset
fasta.s.best <- fasta.s[best.id]

## export new fastafile containing longest isoform per gene
writeXStringSet(fasta.s.best, filepath='sample_best_isoform.fasta')

Sam Nicholls · Answer 3 · 2017-06-13T00:51:03.047

Late to the party here, but I like to try to avoid writing scripts when some command line magic will do. It's good practice to index your FASTA, so let's make use of the index.

samtools faidx <myfasta.fa>

A little bit of awk and sort can determine the largest isoform of each gene (and they don't even have to be sorted by name in the source FASTA).

awk -F'[\t.]' '{print $1,$2,$3,$4}' <myfasta.fa>.fai | sort -k4nr,4 | sort -uk1,2 | cut -f1-3 -d' '| tr ' ' '.' > selection.ls

The -F[\t.] specifies both the fai tab delimiters, and the full stops in your contig names (Doug_NoIndex_L005_R1_001_contig_2.g7.t1) as delimiters for awk.
For each line we print four columns. $1 to $3 are the three components of your contig name (after splitting on the .), and column $4 is the column in the index for sequence length.
We use sort to organise the lines by $4 (the sequence length). Reverse sort with -r so the longest isoforms appear first.
We then subsort by columns $1 and $2 (the contig and gene name (g1...)). -u then drops any repeat of column $1 and $2 pairs (that is, all lines after the first pair -- the longest length).
Finally, we reassemble $1 to $3 with cut and tr to give you a list of sequence names.
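
For readers who find the sort/cut/tr chain hard to follow, the same selection can be sketched in Python. This is only an illustration, not part of the pipeline above: the function name longest_per_gene is made up, and it assumes each .fai line starts with the sequence name and its length separated by a tab, with names of the form contig.gene.isoform as in the question.

```python
def longest_per_gene(fai_lines):
    """Return the name of the longest isoform for each gene.

    Assumes every .fai line begins with "<name>\t<length>" and that names
    look like "<contig>.<gene>.<isoform>" (as in the Augustus output).
    """
    best = {}  # gene ID -> (length, full sequence name)
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        gene = name.rsplit(".", 1)[0]  # drop the trailing ".tN"
        # Strict ">" keeps the first isoform seen when lengths tie.
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, name)
    return [name for _, name in best.values()]
```

Calling longest_per_gene(open("myfasta.fa.fai")) and writing one name per line would give a list playing the same role as selection.ls. Because only names and lengths are kept, memory use stays small even for large FASTA files.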

A loop with our old friend samtools faidx can put your chosen sequences out to a new FASTA.

while read contig;
    do samtools faidx <myfasta.fa> "$contig";
done < selection.ls > selection.fa

Don't forget to index it, they are useful!

samtools faidx selection.fa

Oh and if you wanted to remove the .tN part, just switch out that cut -f1-3 -d' ' for cut -f1-2 -d' ' — Sam Nicholls, Jun 13 '17 at 00:51

terdon · Answer 4 · 2017-06-10T07:35:23.737

An ex-coworker (Josep Avril) has written a couple of very useful little scripts that convert fasta to tbl (seqID<TAB>Sequence) and back again. These are extremely handy for this sort of thing (and are included at the end of this answer). Using them, you can convert your fasta to a one sequence per line format, keep the longest sequence with a simple awk script and convert back to fasta.

I am assuming that the part after the last . in the ID line should be removed (so that Doug_NoIndex_L005_R1_001_contig_2.g7.t1 and Doug_NoIndex_L005_R1_001_contig_2.g7.t2 both map to Doug_NoIndex_L005_R1_001_contig_2.g7). If that is a correct assumption, this should work for you:

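$ FastaToTbl file.fa | sed 's/\.[^.]*\t/\t/' |
    awk -F"\t" '{
                    # a[] holds the whole line (ID + sequence), so track sequence lengths separately
                    if(!($1 in a) || length($2)>len[$1]){
                        a[$1]=$0
                        len[$1]=length($2)
                    }
                }
                END{
                    for(i in a){
                        print a[i]
                    }
                }' | TblToFasta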
>Doug_NoIndex_L005_R1_001_contig_2.g7 
atggggcataacatagagactggtgaacgtgctgaaattctacttcaaagtctacctgat
tcgtatgatcaactcatcattaatataaccaaaaacctagaaattctagccttcgatgat
gttgcagctgcggttcttgaagaagaaagtcggcgcaagaacaaagaagatagaccg

One possible issue with that approach is that it needs to keep one sequence per ID in memory. If you have a huge fasta file and not much memory, that might be a problem. If that's the case, you can first sort the file to ensure that similar IDs always appear together and then print each line as soon as you find the next ID:

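FastaToTbl file.fa | sed 's/\.[^.]*\t/\t/' | LC_ALL=C sort -t$'\t' -k1,1 |
    awk -F"\t" '{
                    if(NR==1 || $1!=prev){
                        # new ID: print the best line of the previous ID, then start over
                        if(NR>1){
                            print best
                        }
                        prev=$1; best=$0; max=length($2)
                    }
                    else if(length($2)>max){
                        best=$0; max=length($2)
                    }
                }
                END{
                    if(NR>0){
                        print best
                    }
                }' | TblToFasta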

FastaToTbl

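#!/usr/bin/awk -f
{
    if (substr($1,1,1)==">") {
        # header line: start a new record (newline first, except for the very first one)
        if (NR>1)
            printf "\n%s\t", substr($0,2,length($0)-1)
        else
            printf "%s\t", substr($0,2,length($0)-1)
    }
    else {
        # sequence line: append to the current record
        printf "%s", $0
    }
}
END{printf "\n"}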

TblToFasta

#! /usr/bin/awk -f
{
  sequence=$NF

  ls = length(sequence)
  is = 1
  fld  = 1
  while (fld < NF)
  {
     if (fld == 1){printf ">"}
     printf "%s " , $fld

     if (fld == NF-1)
      {
        printf "\n"
      }
      fld = fld+1
  }
  while (is <= ls)
  {
    printf "%s\n", substr(sequence,is,60)
    is=is+60
  }
}

score 1 · Answer 5 · edited Jun 09 '17 at 08:07

This solution will work for your example and probably all Augustus-derived FASTA files, but mileage will vary beyond that.

#!/usr/bin/env python
from __future__ import print_function
import sys

def parse_fasta(data):
    """Stolen shamelessly from http://stackoverflow.com/a/7655072/459780."""
    name, seq = None, []
    for line in data:
        line = line.rstrip()
        if line.startswith('>'):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))

isoforms = dict()
for defline, sequence in parse_fasta(sys.stdin):
    geneid = '.'.join(defline[1:].split('.')[:-1])
    if geneid in isoforms:
        otherdefline, othersequence = isoforms[geneid]
        if len(sequence) > len(othersequence):
            isoforms[geneid] = (defline, sequence)
    else:
        isoforms[geneid] = (defline, sequence)

for defline, sequence in isoforms.values():
    print(defline, sequence, sep='\n')

Save as longest.py, and invoke on the command line like so.

python longest.py < input.fasta > output.fasta

The line geneid = '.'.join(defline[1:].split('.')[:-1]) is key. Let me break it down.

defline[1:]: ignore the first character of the defline (the > symbol)
defline[1:].split('.'): split the string on . symbol
defline[1:].split('.')[:-1]: ignore the last value after the split (the isoform name)
'.'.join(defline[1:].split('.')[:-1]): join the split values again

This value will be the same for all isoforms for the same gene, so we use it in the script to keep track of the longest sequence associated with each gene ID.
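
To make the breakdown concrete, here is a small sketch that applies the same steps to the two headers from the question. The variable names deflines and geneids are only for illustration.

```python
# Headers taken from the example in the question.
deflines = [
    ">Doug_NoIndex_L005_R1_001_contig_2.g7.t1",
    ">Doug_NoIndex_L005_R1_001_contig_2.g7.t2",
]

for defline in deflines:
    name = defline[1:]              # drop the leading ">"
    parts = name.split(".")         # contig name, gene ("g7"), isoform ("t1"/"t2")
    geneid = ".".join(parts[:-1])   # drop the isoform field and rejoin
    print(defline, "->", geneid)

# Both isoforms collapse to a single gene ID.
geneids = {".".join(d[1:].split(".")[:-1]) for d in deflines}
```

One thing to keep in mind: a header with no . at all would produce an empty gene ID with this expression, so it is worth checking that every defline in the file follows the contig.gene.isoform pattern.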

This could save a lot of memory and a bit of time if you write out and reset isoforms whenever geneid not in isoforms. OP says isoforms are always next to each other. — Devon Ryan, Jun 08 '17 at 22:10
Hi Daniel, I tried running this script and received the following error message:
File "longest.py", line 29 print(defline, sequence, sep='\n') ^ SyntaxError: invalid syntax — ZincFingers, Jun 08 '17 at 22:14
@ZincFingers: Change it to print(defline) and print(sequence) on separate lines. — Devon Ryan, Jun 08 '17 at 22:19
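
The streaming idea from the comments (write out each gene as soon as the next one starts) could look roughly like this in plain Python. It is a sketch under the assumption that all isoforms of a gene are adjacent in the file; the helper names read_fasta, gene_of and longest_isoforms are made up here and are not part of the script above.

```python
from itertools import groupby


def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    name, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line, []
        elif name is not None:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)


def gene_of(record):
    """Gene ID: the defline without '>' and without the final '.tN' field."""
    defline, _ = record
    return defline[1:].rsplit(".", 1)[0]


def longest_isoforms(lines):
    """Yield the longest record of each run of consecutive same-gene records."""
    for _, group in groupby(read_fasta(lines), key=gene_of):
        # max() returns the first maximal item, so ties keep the earlier isoform.
        yield max(group, key=lambda rec: len(rec[1]))
```

Looping over longest_isoforms(sys.stdin) and printing each defline and sequence would then only ever hold the isoforms of one gene in memory. If the isoforms are not adjacent, the same gene would be reported more than once, so the dictionary-based script is the safer choice in that case.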
