
I am trying to read a FASTA file, manipulate it in Python (using BioPython), and then write it back.

The format of my sequences is like:

>k119_5 flag=0 multi=141.0706 len=473
AGGTTAGTCAGCACCGTTTCCGTGGTGCTGCCTTTCGCTTCAAAACCGACGGCGTCTATTACTGCATCCACGCCCCGGTGACCTGCCGTTTGTTCAATAATTGACTGTGCCGGATCGCTGTCTTCATCAAAATTAATCGGGATCGCGCCGTAGCGGTCGGCGGCGAAATGCAAGCGGTAGGGATGATGATCAACAACAAAAATCTGTTCCGCACCGAGCAACCGTGCACAGGCGATTGTCAACAATCCCACAGGACCAGCACCATAGACTGC
AACGCTTGAACCTTGTTGGATCTGCGCATTTTTTGCTGCCTGCCATGCCGTTGGCAGAATATCAGAAAGGAAAAGCGCTTTATCATCTGAAAGCAAAGGCGGTACTTTAAACGGCCCCACATTCCCTTTAGGGACGCGGACATATTCCGCCTGCCCACCAGGAACGCCGCCATACAGGTGACTATAACCAAACAATGCCGC

which is the result of an assembly with Megahit. When I read the file, only k119_5 is recognized as the id, whereas I want the whole first line as the id of the sequence. It is vital information that I want to keep!

Any ideas how to work with this?

1 Answer


The description field in the SeqRecord object has the information you are looking for:

>>> from Bio import SeqIO
>>> s = SeqIO.read('genome.fasta', 'fasta')  # single-sequence FASTA file
>>> s.id
'k119_5'
>>> s.description
'k119_5 flag=0 multi=141.0706 len=473'
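
If you want to look records up by the whole header line, one option is to key a dictionary on the description instead of the id. The following is only a sketch: the FASTA text is an in-memory, shortened copy of the record from the question (so len=473 no longer matches the sequence), and the parsing of the key=value fields assumes every header looks like the Megahit example above.

```python
from io import StringIO

from Bio import SeqIO

# Shortened copy of the record from the question, held in memory for illustration.
fasta_text = (
    ">k119_5 flag=0 multi=141.0706 len=473\n"
    "AGGTTAGTCAGCACCGTTTCCGTGGTGCTGCC\n"
)

# Key each record by its full header line rather than by record.id.
by_header = SeqIO.to_dict(
    SeqIO.parse(StringIO(fasta_text), "fasta"),
    key_function=lambda rec: rec.description,
)

record = by_header["k119_5 flag=0 multi=141.0706 len=473"]

# Assumption: everything after the id is a whitespace-separated key=value pair.
fields = dict(item.split("=", 1) for item in record.description.split()[1:])
print(record.id, fields)
```

With a real assembly you would pass the file name to SeqIO.parse instead of the StringIO handle.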

Edit: as an aside, if you write a SeqRecord object using SeqIO's write method, the FASTA header should contain the whole original header, as long as you leave the description field unchanged.
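
A minimal round trip along those lines might look like the sketch below. It reads from and writes to in-memory handles so it runs without any files; the second record (k119_6) and the shortened sequences are made up for illustration, and taking the reverse complement simply stands in for whatever manipulation you need.

```python
from io import StringIO

from Bio import SeqIO

# Illustrative input: the first header is from the question, the second is invented.
fasta_text = (
    ">k119_5 flag=0 multi=141.0706 len=473\n"
    "AGGTTAGTCAGCACCGTTTCC\n"
    ">k119_6 flag=1 multi=2.0000 len=10\n"
    "ACGTACGTAC\n"
)

# SeqIO.parse handles multi-record files (SeqIO.read accepts exactly one record).
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# Example manipulation: replace each sequence with its reverse complement.
# The id and description attributes are left untouched.
for rec in records:
    rec.seq = rec.seq.reverse_complement()

# Write the modified records back out in FASTA format.
out = StringIO()
SeqIO.write(records, out, "fasta")
written = out.getvalue()
print(written)
```

To work with real files, replace the two StringIO handles with the input and output file names.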

mgalardini