Is there a convenient way to extract the longest isoform of each gene from a transcriptome FASTA file? I found some scripts on Biostars, but I couldn't get any of them to work.
I'm aware that the longest isoforms aren't necessarily 'the best', but they will suit my purposes.
The FASTA was generated with Augustus. Here is what the file currently looks like (sequences shortened to save space):
>Doug_NoIndex_L005_R1_001_contig_2.g7.t1
atggggcataacatagagactggtgaacgtgctgaaattctacttcaaagtctacctgattcgtatgatcaactcatca
ttaatataaccaaaaacctagaaattctagccttcgatgatgttgcagctgcggttcttgaagaagaaagtcggcgcaagaacaaagaagatagaccg
>Doug_NoIndex_L005_R1_001_contig_2.g7.t2
atggggcataacatagagactggtgaacgtgctgaaattctacttcaaagtctacctgattcgtatgatcaactcatca
The records are ordered like this:
Gene 1 isoform 1
Gene 1 isoform 2
Gene 2 isoform 1
Gene 2 isoform 2
and so forth. Several genes have more than two isoforms (up to 3 or 4). There are roughly 80,000 transcripts in total, from probably 25,000 genes. I would like to extract the single longest isoform for each gene.
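For illustration, here is a minimal Python sketch of one way this could be done with only the standard library. It assumes that every header follows the Augustus-style pattern shown above, where the gene ID is the first word of the header with the trailing `.t<number>` removed; that rule is inferred from the example records, not confirmed for the whole file. The function names (`read_fasta`, `gene_id`, `longest_isoforms`, `extract`) are made up for this example. When two isoforms tie in length, the one that appears first in the file is kept.

```python
import re


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Assumption: the isoform suffix is '.t<number>' at the end of the ID."""
    return re.sub(r"\.t\d+$", "", header.split()[0])


def longest_isoforms(records):
    """Return {gene: (header, sequence)}, keeping the longest sequence per gene."""
    best = {}
    for header, seq in records:
        gene = gene_id(header)
        # Strictly greater, so the first isoform wins a tie.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


def extract(in_path, out_path, width=80):
    """Write the longest isoform of each gene from in_path to out_path."""
    with open(in_path) as fasta:
        best = longest_isoforms(read_fasta(fasta))
    with open(out_path, "w") as out:
        for header, seq in best.values():
            out.write(f">{header}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
```

It could then be called as, for example, `extract("transcripts.fa", "longest.fa")`, where both file names are placeholders. The whole file is scanned once and only one sequence per gene is held in memory, which should be fine for roughly 80,000 transcripts.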
Doug_NoIndex_L005_R1_001_contig_4.g13.t1
(sequence data here)
(sequence data here)
– ZincFingers Jun 08 '17 at 21:38
I've tried all of the solutions here: https://www.biostars.org/p/107759/
They were non-functional in that I couldn't get them to work.
I would like the single longest read for each set of isoforms.
.in the ID line needs to be discarded? – terdon Jun 08 '17 at 21:58