
Is there a convenient way to extract the longest isoforms from a transcriptome fasta file? I found some scripts on Biostars, but I haven't been able to get any of them to work.

I'm aware that the longest isoforms aren't necessarily 'the best' but it will suit my purposes.

The fasta was generated via Augustus. Here is what the fasta file currently looks like (sequences shortened to save space):

>Doug_NoIndex_L005_R1_001_contig_2.g7.t1
atggggcataacatagagactggtgaacgtgctgaaattctacttcaaagtctacctgattcgtatgatcaactcatca
ttaatataaccaaaaacctagaaattctagccttcgatgatgttgcagctgcggttcttgaagaagaaagtcggcgcaagaacaaagaagatagaccg
>Doug_NoIndex_L005_R1_001_contig_2.g7.t2
atggggcataacatagagactggtgaacgtgctgaaattctacttcaaagtctacctgattcgtatgatcaactcatca

The records are ordered as follows:

Gene 1 isoform 1  
Gene 1 isoform 2  
Gene 2 isoform 1  
Gene 2 isoform 2   

and so forth. Several genes have more than two isoforms (up to 3 or 4). There are roughly 80,000 transcripts in total, from probably 25,000 genes. I would like to extract the single longest isoform for each gene.

gringer
ZincFingers
  • It depends on the exact format of your fasta. Where did you get it? What do the fasta header lines look like? Can you paste a small example? – user172818 Jun 08 '17 at 21:38
  • Doug_NoIndex_L005_R1_001_contig_4.g13.t1

    (sequence data here)

    Doug_NoIndex_L005_R1_001_contig_4.g13.t2

    (sequence data here)

    – ZincFingers Jun 08 '17 at 21:38
  • 1. What exactly have you tried? 2. What does it mean that they are not functional? 3. Do you just want the single longest one or the top N longest? – Devon Ryan Jun 08 '17 at 21:38
  • 1. I've tried all of the solutions here: https://www.biostars.org/p/107759/ 2. Non-functional in that I couldn't get them to work. 3. I would like the single longest read for each set of isoforms. – ZincFingers Jun 08 '17 at 21:41
  • @ZincFingers: Please put all of this in your original post and include whether isoforms are ever separated from each other by unrelated genes. – Devon Ryan Jun 08 '17 at 21:44
  • You don't have the same sequence IDs there. Should we assume that the part after the last . in the ID line needs to be discarded? – terdon Jun 08 '17 at 21:58
  • I'm not sure what you mean. The ".t1" or ".t2" indicates which isoform the read is from. I would like to discard the shorter isoforms and retain the longest isoform. – ZincFingers Jun 08 '17 at 22:02
  • OK, so they do need to be removed. – terdon Jun 08 '17 at 22:03
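
Going by the comments above, one possible approach is to treat everything before the final "." of each header as the gene ID and keep the longest sequence seen for each ID. The following is only a minimal sketch under that assumption, not a tested solution for every Augustus output. The function names (`read_fasta`, `gene_id`, `longest_isoforms`, `write_fasta`, `extract`) are made up for illustration, and ties are resolved by keeping the isoform that appears first in the file.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Assumption: gene ID = first word of the header minus its trailing '.tN'."""
    name = header.partition(" ")[0]
    return name.rsplit(".", 1)[0]


def longest_isoforms(handle):
    """Return {gene ID: (header, sequence)} with the longest isoform per gene."""
    best = {}
    for header, seq in read_fasta(handle):
        gene = gene_id(header)
        # Strict '>' means the first isoform wins when lengths are equal.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


def write_fasta(records, out, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping sequences at `width`."""
    for header, seq in records:
        out.write(f">{header}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


def extract(in_path, out_path):
    """Hypothetical convenience wrapper: read in_path, write longest isoforms."""
    with open(in_path) as src, open(out_path, "w") as dst:
        write_fasta(longest_isoforms(src).values(), dst)
```

With a file laid out like the example in the question, calling `extract("transcripts.fa", "longest.fa")` (both file names are placeholders) would write one record per gene. Because the dictionary is keyed on the gene ID, this sketch does not require the isoforms of a gene to be adjacent in the file.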