How to append numbers only on duplicates sequence names?

Question

I have a reference database which contains hundreds of sequences in FASTA format. Some of these sequences have duplicate names, like so:

>1_uniqueGeneName
atgc
>1_anotherUniqueGeneName
atgc
>1_duplicateName
atgc
>1_duplicateName
atgc

Is it possible to run through a large file like this and change the names of only the duplicates?

>1_uniqueGeneName
atgc
>1_anotherUniqueGeneName
atgc
>1_duplicateName_1
atgc
>1_duplicateName_2
atgc

Do you want to only change the names of the duplicates, I mean of the second, third etc seqs with the same name or also of the 1st as you have done? I mean, do you really need duplicate_1, duplicate_2 or is duplicate, duplicate_1 enough? The latter is far simpler since the former will require you to read the file twice. — terdon, Jul 04 '17 at 11:57
Cool. I'd prefer the duplicate_1, duplicate_2 format, but if it is much more complex then I am fine with the duplicate, duplicate_1 format. — AudileF, Jul 04 '17 at 12:32
@AudileF: I'm having a hard time coming up with a good reason why I would want the first duplicate to be changed as well. Can you give us a bit more insight into the reason for this request? — winni2k, Jul 04 '17 at 16:08

Konrad Rudolph · Answer 1 · 2017-07-04T17:14:57.440

7

And here’s a solution using R (with Bioconductor):

fa = ShortRead::readFasta(infile)
ids = as.character(ShortRead::id(fa))
fa@id = Biostrings::BStringSet(make.unique(ids, sep = '_'))
ShortRead::writeFasta(fa, outfile)

edited Jul 04 '17 at 17:14

answered Jul 04 '17 at 13:10


Devon Ryan · Answer 2 · 2017-07-05T15:49:58.337

Sure, using biopython:

#!/usr/bin/env python
from Bio import SeqIO

records = set()
of = open("output.fa", "w")
for record in SeqIO.parse("foo.fa", "fasta"):
    ID = record.id
    num = 1
    while ID in records:
        ID = "{}_{}".format(record.id, num)
        num += 1
    records.add(ID)
    record.id = ID
    record.name = ID
    record.description = ID
    SeqIO.write(record, of, "fasta")
of.close()

Change output.fa and foo.fa to your own file names. One doesn't need to explicitly change .name and .description, but doing so prevents the original ID from still appearing in the header (after a space).

Explanation

records = set() : create a lookup table of all IDs written so far.
for record in SeqIO.parse("foo.fa", "fasta"): : open foo.fa as a fasta file and iterate over the entries in it. These are objects with an ID, name, description, and sequence attribute (the name and description are the same as the ID if not present).
ID = record.id : store the entry ID (e.g., 1_duplicateName).
while ID in records: : as long as the ID has already been seen, keep looping.
ID = "{}_{}".format(record.id, num) : append increasing numbers to the ID, such as 1_duplicateName_1 and 1_duplicateName_2. This continues until an ID is found that has not been seen.
records.add(ID) : add the unseen ID to the set.
record.id = ID : update the ID; .name and .description are updated the same way. If you don't do that, then you get output like >1_duplicateName_1 1_duplicateName. That's not really a problem, but it's excessive.
SeqIO.write(record, of, "fasta") : write the record to the output file in fasta format.

A benefit of biopython is the easy ability to change formats, so instead of fasta one could have substituted Genbank or another format if needed.
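
For readers without biopython installed, the same loop can be sketched with the standard library alone. This is only an illustration, not the code above: the function name dedupe_headers is made up here, and unlike biopython it treats the whole header line (not just the text before the first space) as the ID.

```python
def dedupe_headers(lines):
    """Yield FASTA lines, renaming any header that was already emitted.

    Hypothetical helper: the whole header line is used as the ID.
    """
    seen = set()  # every header written so far, including renamed ones
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            candidate = line
            num = 1
            # Keep appending _1, _2, ... until the header is unused.
            while candidate in seen:
                candidate = "{}_{}".format(line, num)
                num += 1
            seen.add(candidate)
            line = candidate
        yield line


example = [">1_duplicateName", "atgc", ">1_duplicateName", "atgc"]
for out_line in dedupe_headers(example):
    print(out_line)
```

Because every emitted header is added to the set, a renamed header cannot collide with one that was written earlier in the file.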

terdon · Accepted Answer · 2017-07-04T13:59:59.517

Sure, this little Perl snippet should do it:

$ perl -pe 's/$/_$seen{$_}/ if ++$seen{$_}>1 and /^>/; ' file.fa 
>1_uniqueGeneName
atgc
>1_anotherUniqueGeneName
atgc
>1_duplicateName
atgc
>1_duplicateName_2
atgc

Or, to make the changes in the original file, use -i:

perl -i.bak -pe 's/$/_$seen{$_}/ if ++$seen{$_}>1 and /^>/; ' file.fa

Note that the first occurrence of a duplicate name isn't changed, the second will become _2, the third _3 etc.

Explanation

perl -pe : print each input line after applying the script given by -e to it.
++$seen{$_}>1 : increment the current value stored in the hash %seen for this line ($_) by 1 and compare it to 1.
s/$/_$seen{$_}/ if ++$seen{$_}>1 and /^>/ : if the current line starts with a > and the value stored in the hash %seen for this line is greater than 1 (if this isn't the first time we see this line), replace the end of the line ($) with a _ and the current value in the hash
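
For comparison, the counting logic of the one-liner can be written out in Python. This is a rough sketch rather than a literal port: the name number_repeats is invented for the example, and it only counts header lines, whereas the Perl version counts every line and then ignores the non-header ones.

```python
from collections import defaultdict


def number_repeats(lines):
    """Append _N to the Nth occurrence (N > 1) of each FASTA header line."""
    seen = defaultdict(int)  # header line -> times seen so far
    out = []
    for line in lines:
        if line.startswith(">"):
            seen[line] += 1
            if seen[line] > 1:
                # The first occurrence is left alone; later ones get _2, _3, ...
                line = "{}_{}".format(line, seen[line])
        out.append(line)
    return out


print(number_repeats([">a", "atgc", ">a", "atgc", ">a", "atgc"]))
```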

Alternatively, here's the same idea in awk:

$ awk '(/^>/ && s[$0]++){$0=$0"_"s[$0]}1;' file.fa 
>1_uniqueGeneName
atgc
>1_anotherUniqueGeneName
atgc
>1_duplicateName
atgc
>1_duplicateName_2
atgc

To make the changes in the original file (assuming you are using GNU awk, which is the default on most Linux distributions), use -i inplace:

awk -iinplace '(/^>/ && s[$0]++){$0=$0"_"s[$0]}1;' file.fa

Explanation

In awk, the special variable $0 is the current line.

(/^>/ && s[$0]++) : if this line starts with a > and the value stored in the array s for this line was already greater than 0 before being incremented (that is, this header has been seen before). The post-increment adds 1 to the stored count either way.
$0=$0"_"s[$0] : make the current line be itself with a _ and the value from s appended.
1; : this is just shorthand for "print this line". If an expression evaluates to true, awk will print the current line. Since 1 is always true, this will print every line.

If you want all of the duplicates to be marked, you need to read the file twice, once to count the names and a second time to mark them:

$ awk '{
    if (NR==FNR){
        if(/^>/){
            s[$0]++
        }
        next;
    }
    if(/^>/){
        k[$0]++;
        if(s[$0]>1){
            $0=$0"_"k[$0]
        }
    }
    print
}' file.fa file.fa
>1_uniqueGeneName
atgc
>1_anotherUniqueGeneName
atgc
>1_duplicateName_1
atgc
>1_duplicateName_2
atgc
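
The same two-pass idea can be sketched in Python. The function name mark_all_duplicates is hypothetical, and the sketch assumes the file is small enough to hold in memory, so the "second read" is simply a second loop over the stored lines.

```python
from collections import Counter


def mark_all_duplicates(lines):
    """Number every copy of a repeated header: name_1, name_2, ..."""
    lines = [line.rstrip("\n") for line in lines]
    # Pass 1: count how often each header occurs in total.
    totals = Counter(line for line in lines if line.startswith(">"))
    # Pass 2: number each occurrence of the headers that repeat.
    running = Counter()
    out = []
    for line in lines:
        if line.startswith(">") and totals[line] > 1:
            running[line] += 1
            line = "{}_{}".format(line, running[line])
        out.append(line)
    return out


fasta = [">1_uniqueGeneName", "atgc", ">1_duplicateName", "atgc",
         ">1_duplicateName", "atgc"]
print("\n".join(mark_all_duplicates(fasta)))
```

For a file too large for memory, the input could be opened and iterated twice instead, as the awk version does.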

IMPORTANT: note that all of these approaches assume you don't already have sequence names ending with _N where N is a number. If your input file has 2 sequences called foo and one called foo_2, then you will end up with two foo_2:

$ cat test.fa
>foo_2
actg
>foo
actg
>foo
actg
$ perl -pe 's/$/_$seen{$_}/ if ++$seen{$_}>1 and /^>/; ' test.fa 
>foo_2
actg
>foo
actg
>foo_2
actg

If this can be an issue for you, use one of the more sophisticated approaches suggested by the other answers.
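
One way to find out whether this affects you is to check the renamed output for headers that still repeat. The snippet below is a small sketch of such a check; repeated_headers is a made-up name, and it compares whole header lines.

```python
from collections import Counter


def repeated_headers(lines):
    """Return, sorted, the header lines that occur more than once."""
    counts = Counter(
        line.rstrip("\n") for line in lines if line.startswith(">")
    )
    return sorted(header for header, n in counts.items() if n > 1)


# The problematic output from the test.fa example above.
renamed = [">foo_2", "actg", ">foo", "actg", ">foo_2", "actg"]
print(repeated_headers(renamed))
```

An empty list means every header in the output is unique.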

In the awk example I worry that duplicate IDs could be created if you have contigs like >foo and >foo_1 in the original file. foo_1 wouldn't be in s, but might already have been printed out if foo had already been encountered. — Devon Ryan, Jul 04 '17 at 13:37
@DevonRyan ah, yes that's a very valid point. Not just the awk one either, all approaches here break if a duplicate sequence already exists with a _N. In that case, the extra verbiage added by python is actually worth it :P. I added a disclaimer, thanks. — terdon, Jul 04 '17 at 13:56
