How do I rename fasta headers?

Question

I am trying to use the seqkit replace command to replace chromosome names in the format chr_I, chr_II, ... to I, II, .... I am using the following command:

seqkit replace -p "(.)" --replacement "{kv}" --kv-file keyValues.txt mySequence.fasta

My keyValues.txt file contains the following:

chr_I   I
chr_II  II
...

The two columns are separated by a tabulation.

I get the following output:

[INFO] read key-value file: keyValues.txt
[INFO] 6 pairs of key-value loaded
[ERRO] pattern "(.)" matches multiple targets in "chr_I", this will cause chaos

I don't know what I am doing wrong. I chose the (.) pattern to match the whole header but it seems to be wrong. Any help would be appreciated.

Update

All the headers are shown below:

grep '>' mySequence.fasta
>chr_I
>chr_II
>chr_III
>chr_IV
>chr_V
>chr_X

Show some of the headers, the pattern argument is probably not suitable. I am guessing that you'll get headers with duplicate names now (which will cause chaos). — benn, Jun 21 '18 at 10:38
Oh, so the key is a regular expression? I tried putting a dollar at the end of the key names, but I get the same error. — Biomagician, Jun 21 '18 at 12:10

benn · Accepted Answer · 2018-06-21T12:39:35.483

10

Why not use sed instead?

sed -e 's/chr_I/I/' -e 's/chr_V/V/' -e 's/chr_X/X/' mySequence.fasta > mySeq.fasta

Or even simpler:

sed 's/chr_//' mySequence.fasta > mySeq.fasta

edited Jun 21 '18 at 12:39

answered Jun 21 '18 at 12:09

benn

3,571
9
28

1

Worked. I just don't know sed. I need to learn it. – Biomagician Jun 21 '18 at 12:12
1

Great! It works a bit like grep (which you know how to use), but then with replacement function. – benn Jun 21 '18 at 12:16
2

I would suggest using sed 's/^chr_//' instead. Anchoring the pattern to the beginning of the line should speed things up for large files. Otherwise, sed will have to scan the entire length of each line in case there's a match. – terdon Jun 21 '18 at 16:00
2

@terdon, good point. But does it work with fasta? Remember the > at the beginning of the header. – benn Jun 21 '18 at 17:21
3

@b.nota Duh! I should learn to read :) Indeed, sed 's/^>chr_/>/ would be better. – terdon Jun 21 '18 at 22:09
Minor correction for the last: sed 's/^>chr_/>/' ( in order to keep the >) – Sebastian Müller Jun 22 '18 at 05:46

score 2 · Answer 2 · answered Jun 27 '18 at 17:41

I like using awk when there's any kind of lookup involved: something like

awk 'FNR==NR {
    hash[">"$1]=$2;
    next;
}

if ($1 ~ /^>/) {
    print ">"hash[$1];
} else {
    print $1;
}' keyValues.txt mySequence.fasta

Basically the FNR==NR check tells awk to work only on the first file (useful primer here), keyValues.txt and create an association for each key (eg. chr_I) with its value (eg. I). The rest of the code after the next works only on mySequence.fasta, printing out the lookup value only if the line is a fasta header, as checked by the $1 ~ /^>/ condition.

score 1 · Answer 3 · answered Apr 22 '23 at 07:55

Your issue is that the pattern . matches a single character. To match the whole header, use .*:

seqkit replace -p "(.*)" --replacement "{kv}" --kv-file keyValues.txt mySequence.fasta

Alternatively as pointed out by others you may not really need keyValues.txt here as you're just removing the chr_ prefix:

seqkit replace -p "chr_(.*)" --replacement '$1' mySequence.fasta

(Don't forget the single quotes for $1 else the shell will expand it)

score 1 · Answer 4 · edited Jun 25 '18 at 07:58

1

Here is an alternative to sed: cat test | tr -d "chr_"

The pipe is to output the file, while the tr command is to translate and change the file

edited Jun 25 '18 at 07:58

llrs

4,693
1
18
42

answered Jun 24 '18 at 17:38

aerijman

645
5
14

The use of the cat command is completely redundant here (and imposes a runtime overhead). Just do tr -d 'chr_' < test. – Konrad Rudolph Jun 25 '18 at 10:20

How do I rename fasta headers?

Update

4 Answers4