Suppose we have a FASTA file like this:

>Seq1
GTTGAGAGGTGTATGGACACGAAAAACGAAACTGTATCCCGTGTTTAGCAAAGAAATCAT
>Seq1
AAAAACGAAACTGTATCCCGTGTTT
>Seq2
CGTGTTTAGCAAAGAAAT

I want to merge the sequences that share an ID, to produce:

>Seq1
GTTGAGAGGTGTATGGACACGAAAAACGAAACTGTATCCCGTGTTTAGCAAAGAAATCATAAAAACGAAACTGTATCCCGTGTTT
>Seq2
CGTGTTTAGCAAAGAAAT
sksahu

4 Answers

This is a scenario where awk excels and yields a much simpler solution than the other examples here (though the logic is essentially equivalent to terdon’s Perl solution):

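awk '
    # Header line: remember it as the current record ID.
    /^>/  { id = $0 }
    # Sequence line: append it to the sequence stored under that ID.
    !/^>/ { seq[id] = seq[id] $0 }
    # Note: "for (id in seq)" visits the IDs in an unspecified order.
    END   { for (id in seq) print id "\n" seq[id] }
' input.fasta \
> output.fasta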

That said, for anything beyond a quick’n’dirty application I strongly recommend using a proper FASTA parser, not an ad-hoc solution.
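To illustrate one way an ad-hoc solution can trip up: FASTA headers often carry a description after the identifier, so two records for the same sequence need not have identical header lines. Below is a minimal Python sketch, assuming the ID is the first whitespace-delimited token of the header and that sequences may be wrapped over several lines. The name merge_by_id is hypothetical, and this is still no substitute for a proper parser.

```python
def merge_by_id(lines):
    """Concatenate sequences that share an ID, keeping first-seen order."""
    merged = {}  # dicts preserve insertion order in Python 3.7+
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first token after ">".
            fields = line[1:].split()
            current = fields[0] if fields else ""
            merged.setdefault(current, [])
        elif current is not None:
            # Sequence line (possibly one of several wrapped lines).
            merged[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in merged.items()}
```

It accepts any iterable of lines, so an open file handle can be passed directly, and each resulting ID/sequence pair can then be printed as ">" plus the ID, followed by the sequence.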

Konrad Rudolph

This Python script is certainly not the most elegant solution, but it will give you the desired result:

import sys

seqs = {}
with open(sys.argv[1], "r") as fh:
    curr = ""
    for line in fh:
        if line.startswith(">"):
            curr = line
            if curr not in seqs:
                seqs[curr] = ""
        else:
            seqs[curr] += line.strip()

for ident, seq in seqs.items():
    print("{}{}".format(ident, seq))

If you saved the script as seq_merge.py, run it with:

python seq_merge.py your_file.fasta > result.fasta
phngs

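Assuming two lines per FASTA record (or you can linearize first), you can use datamash. Note that the fragments are concatenated in sorted order rather than file order (compare Seq1 below with the desired output), because sort falls back to comparing whole lines when the keys tie; adding -s (stable sort) should keep the file order.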

$ cat jeter.fa | paste - - | sort -t $'\t' -k1,1 | datamash -g 1 collapse 2 | tr -d ',' | tr "\t" "\n"

>Seq1
AAAAACGAAACTGTATCCCGTGTTTGTTGAGAGGTGTATGGACACGAAAAACGAAACTGTATCCCGTGTTTAGCAAAGAAATCAT
>Seq2
CGTGTTTAGCAAAGAAAT
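
The "linearize" step mentioned above can be done in many ways. As a rough sketch in Python (the name linearize is hypothetical and not part of datamash or any library), the following joins wrapped sequence lines so that every record becomes one header plus one sequence string:

```python
def linearize(lines):
    """Yield (header, sequence) pairs with wrapped sequence lines joined."""
    header, parts = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        elif header is not None:
            parts.append(line.strip())
    # Emit the final record.
    if header is not None:
        yield header, "".join(parts)
```

Printing each header and sequence on its own line gives the two-lines-per-record layout that the paste step expects.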
Pierre

If you don't care about the order of the sequences and your file is small enough to fit in RAM, you can do:

$ perl -nle 'if(/^>/){
                $n=$_
             }
             else{
                $seq{$n}.=$_
            } 
            END{
                print "$_\n$seq{$_}" for keys(%seq)
            }' file.fa 
>Seq1
GTTGAGAGGTGTATGGACACGAAAAACGAAACTGTATCCCGTGTTTAGCAAAGAAATCATAAAAACGAAACTGTATCCCGTGTTT
>Seq2
CGTGTTTAGCAAAGAAAT

Or, if you are into the whole brevity thing:

perl -nle '/^>/?($n=$_):($s{$n}.=$_);}{print"$_\n$s{$_}"for keys%s' file.fa 
terdon