How to merge transcript sequences with the same name in a FASTA file?

Question

Suppose we have a FASTA file like this:

>Seq1
GTTGAGAGGTGTATGGACACGAAAAACGAAACTGTATCCCGTGTTTAGCAAAGAAATCAT
>Seq1
AAAAACGAAACTGTATCCCGTGTTT
>Seq2
CGTGTTTAGCAAAGAAAT

I want to produce

>Seq1
GTTGAGAGGTGTATGGACACGAAAAACGAAACTGTATCCCGTGTTTAGCAAAGAAATCATAAAAACGAAACTGTATCCCGTGTTT
>Seq2
CGTGTTTAGCAAAGAAAT

score 4 · Answer 1 · answered Jul 19 '18 at 12:52

This is a scenario where awk excels and yields a much simpler solution than the other examples here (though the logic is essentially equivalent to terdon’s Perl solution):

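awk '
    /^>/  { id = $0 }                 # header line: remember the current ID
    !/^>/ { seq[id] = seq[id] $0 }    # sequence line: append to that ID
    # for-in visits array keys in an unspecified order
    END   { for (id in seq) print id "\n" seq[id] }
' input.fasta \
> output.fasta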

That said, for anything beyond a quick’n’dirty application I strongly recommend using a proper FASTA parser, not an ad-hoc solution.
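As a rough illustration of what the parser-based approach could look like, here is a minimal Python sketch. The names `read_fasta` and `merge_by_name` are hypothetical and not taken from any library; in a real project an established parser, such as Biopython's `SeqIO.parse`, would more likely take the place of `read_fasta`. The sketch assumes that records with identical header lines should be concatenated in file order, and that the merged result fits in memory.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_by_name(records):
    """Concatenate sequences sharing a header, keeping first-seen order."""
    merged = {}
    for header, seq in records:
        merged[header] = merged.get(header, "") + seq
    return merged


# Small in-memory example (made-up data, first record wrapped over two lines).
example = io.StringIO(
    ">Seq1\nGTTGAGAGG\nTGTATGG\n>Seq1\nAAAAACG\n>Seq2\nCGTGTTT\n"
)
merged = merge_by_name(read_fasta(example))
for header, seq in merged.items():
    print(">{}\n{}".format(header, seq))
```

Separating parsing from merging means the merge step never has to care whether sequences are wrapped across several lines.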

phngs · Accepted Answer · 2018-07-19T05:11:33.500

This Python script is certainly not the most elegant solution, but it gives the desired result:

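import sys

seqs = {}  # header line -> concatenated sequence
with open(sys.argv[1], "r") as fh:
    curr = None
    for line in fh:
        line = line.strip()
        if line.startswith(">"):
            curr = line
            # first time this header is seen: start with an empty sequence
            seqs.setdefault(curr, "")
        elif curr is not None and line:
            # sequence line: append to the most recent header
            seqs[curr] += line

for ident, seq in seqs.items():
    print("{}\n{}".format(ident, seq))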

If you saved the script as seq_merge.py, run it with:

python seq_merge.py your_file.fasta > result.fasta

score 1 · Answer 3 · answered Jul 19 '18 at 07:11

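Assuming two lines per FASTA record (or linearize the file first), you can use datamash. Note that sort reorders lines sharing the same name, so the pieces of Seq1 are concatenated in sorted order rather than file order; adding -s (stable sort, available in GNU sort) would keep the file order: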

$ cat jeter.fa | paste - - | sort -t $'\t' -k1,1 | datamash -g 1 collapse 2 | tr -d ',' | tr "\t" "\n"

>Seq1
AAAAACGAAACTGTATCCCGTGTTTGTTGAGAGGTGTATGGACACGAAAAACGAAACTGTATCCCGTGTTTAGCAAAGAAATCAT
>Seq2
CGTGTTTAGCAAAGAAAT
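The pipeline above relies on every record occupying exactly two lines. For files with wrapped sequences, the linearizing step could look something like the following Python sketch. The function name `linearize` is hypothetical, and the sketch assumes that any line starting with ">" is a header and that everything else up to the next header belongs to the same sequence.

```python
import io


def linearize(handle):
    """Yield header and sequence lines so that each record is two lines."""
    chunks = []
    have_record = False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if have_record:
                # emit the joined sequence of the previous record
                yield "".join(chunks)
            yield line
            chunks, have_record = [], True
        elif have_record and line:
            chunks.append(line)
    if have_record:
        yield "".join(chunks)


# Made-up example with sequences wrapped over two lines each.
wrapped = io.StringIO(">Seq1\nGTTGAG\nAGGTGT\n>Seq2\nCGTG\nTTTA\n")
lines = list(linearize(wrapped))
print("\n".join(lines))
```

The output of such a step has one header line followed by one sequence line per record, which is the layout that `paste - -` expects.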

score 1 · Answer 4 · answered Jul 19 '18 at 10:28

If you don't care about the order of the sequences and your file is small enough to fit in RAM, you can do:

$ perl -nle 'if(/^>/){
                $n=$_
             }
             else{
                $seq{$n}.=$_
             }
             END{
                print "$_\n$seq{$_}" for keys(%seq)
             }' file.fa
>Seq1
GTTGAGAGGTGTATGGACACGAAAAACGAAACTGTATCCCGTGTTTAGCAAAGAAATCATAAAAACGAAACTGTATCCCGTGTTT
>Seq2
CGTGTTTAGCAAAGAAAT

Or, if you are into the whole brevity thing:

perl -nle '/^>/?($n=$_):($s{$n}.=$_);}{print"$_\n$s{$_}"for keys%s' file.fa
