Changing the record id in a FASTA file using BioPython

Question

I have the following FASTA file, original.fasta:

>foo
GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA

I need to change the record id from foo to bar, so I wrote the following code:

from Bio import SeqIO

original_file = r"path\to\original.fasta"
corrected_file = r"path\to\corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original_file, 'fasta')
    for record in records:
        print(record.id)            # prints 'foo'
        if record.id == 'foo':
            record.id = 'bar'
        print(record.id)            # prints 'bar' as expected
        SeqIO.write(record, corrected, 'fasta')

We printed the record id before and after the change, and got the expected result. We can even double-check by reading in the corrected file again with BioPython and printing out the record id:

with open(corrected_file) as corrected:
    for record in SeqIO.parse(corrected, 'fasta'):
        print(record.id)                 # prints 'bar', as expected

However, if we open the corrected file in a text editor, we see that the record id is not bar but bar foo:

>bar foo
GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA

We can confirm that this is what is written to the file if we read the file using plain Python:

with open(corrected_file) as corrected:
    print(corrected.readlines()[0][1:])  # prints 'bar foo'

Is this a bug in BioPython? And if not, what did I do wrong and how do I change the record id in a FASTA file using BioPython?

Keep in mind that record.id is not equal to the full record header, as record.description is a second part of it. If you parse the headers and they contain two space separated entries, record.id will only print the first one. — Jenez, May 31 '17 at 13:58
In addition, it seems you open the original file unnecessarily since you never do anything with the original object. — Jenez, May 31 '17 at 14:04
I think @Jenez is completely correct; Also you asked the same question on Biostars and got the same answer https://www.biostars.org/p/156428/ — Chris_Rands, May 31 '17 at 14:16
@Chris_Rands yes, but we need content here so BioGeek was nice enough to ask again so we can have another answered post! Now the next person who comes here with a similar issue will find an answer. Yay! — terdon, May 31 '17 at 14:24
Don’t write paths as r"foo\bar", this only works on Windows. Write them as "foo/bar" (no raw string needed), this works everywhere (including Windows). — Konrad Rudolph, May 31 '17 at 14:35
@terdon Because I felt it didn't accurately answer the question fully. It was more of a nudge in the right direction. — Jenez, May 31 '17 at 14:44
Regarding file paths, I suppose it is even better to use os.path.join, which automatically chooses the correct separator. — bli, May 31 '17 at 15:55

score 11 · Accepted Answer · edited May 31 '17 at 19:39

If I use SeqIO.parse(filehandle, 'fasta') to parse a FASTA file, it yields SeqRecord objects whose id and name are the first word of the line beginning with > (everything before the first whitespace) and whose description is the complete line, not including the initial >. (This behaviour can be overruled by providing a custom title2ids function.)

So, if I have, for example, an original.fasta file like:

>Lorem|ipsum>dolor sit amet
GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA

Then I will get:

>>> from Bio import SeqIO
>>> path = r'C:\path\to\original.fasta'
>>> records = SeqIO.parse(open(path), 'fasta')
>>> record = next(records)
>>> record.id
'Lorem|ipsum>dolor'
>>> record.name
'Lorem|ipsum>dolor'
>>> record.description
'Lorem|ipsum>dolor sit amet'

So, what happens in my case when I do

    if record.id == 'foo':
        record.id = 'bar'

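is that record.id is successfully changed from foo to bar, but record.description is not changed and stays foo. Because the description no longer starts with the id, the FASTA writer puts the id in front of the description when it writes the header. That's why, when the FASTA file is written out, I see the described behaviour of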

>bar foo
GCTCACACATAGTTGATGCAGATGTTGAATTCACTATGAGGTGGGAGGATGTAGGGCCA
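
The header rule behind this output can be sketched in plain Python. This is a simplified model of the FASTA writer's behaviour as described above, not Biopython's actual source, and fasta_title is a hypothetical name used only for illustration:

```python
def fasta_title(record_id, description):
    """Simplified model of the header line (without '>') for one record."""
    if description and description.split(None, 1)[0] == record_id:
        # Description already starts with the id: write it as-is.
        return description
    if description:
        # Otherwise the id is put in front of the description.
        return record_id + " " + description
    return record_id


print(fasta_title("foo", "foo"))  # foo      (freshly parsed record)
print(fasta_title("bar", "foo"))  # bar foo  (only the id was changed)
print(fasta_title("bar", "bar"))  # bar      (id and description changed)
```

The second call corresponds to the situation in the question: the id is bar, the description is still foo, and both end up in the header.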

So, the solution to my problem is to change both the id AND the description:

from Bio import SeqIO

original_file = r"path\to\original.fasta"
corrected_file = r"path\to\corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original, 'fasta')   # parse the handle opened above
    for record in records:
        if record.id == 'foo':
            record.id = 'bar'
            record.description = 'bar'
        SeqIO.write(record, corrected, 'fasta')

with open(corrected_file) as corrected:
    records = SeqIO.parse(corrected, 'fasta')
    for record in records:
        print(record.id)            # prints bar
        print(record.name)          # prints bar
        print(record.description)   # prints bar
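
Setting record.description = 'bar' discards any text that followed the id in the original header. If that text should be kept, a small helper along these lines might do it. This is a minimal sketch: rename_record is a hypothetical helper, not part of Biopython, and it assumes only that the record has id and description attributes, as a SeqRecord does. A SimpleNamespace stands in for a parsed record so the example runs without Biopython installed:

```python
from types import SimpleNamespace


def rename_record(record, old_id, new_id):
    """Rename a record in place, keeping any header text after the id.

    Works on any object with `id` and `description` attributes, such as a
    Bio.SeqRecord.SeqRecord. Records with a different id are left untouched.
    """
    if record.id != old_id:
        return record
    record.id = new_id
    words = record.description.split(None, 1)
    if words and words[0] == old_id:
        # Swap the leading id and keep the rest of the header text.
        rest = words[1] if len(words) > 1 else ""
        record.description = (new_id + " " + rest).strip()
    return record


# Stand-in for a record parsed from the header ">foo sit amet".
demo = SimpleNamespace(id="foo", description="foo sit amet")
rename_record(demo, "foo", "bar")
print(demo.id)           # bar
print(demo.description)  # bar sit amet
```

In the loop above, the two assignments inside the if block could then be replaced by a call such as rename_record(record, 'foo', 'bar') before SeqIO.write.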
