7

I am trying to use the seqkit replace command to replace chromosome names in the format chr_I, chr_II, ... to I, II, .... I am using the following command:

seqkit replace -p "(.)" --replacement "{kv}" --kv-file keyValues.txt mySequence.fasta

My keyValues.txt file contains the following:

chr_I   I
chr_II  II
...

The two columns are separated by a tabulation.

I get the following output:

[INFO] read key-value file: keyValues.txt
[INFO] 6 pairs of key-value loaded
[ERRO] pattern "(.)" matches multiple targets in "chr_I", this will cause chaos

I don't know what I am doing wrong. I chose the (.) pattern to match the whole header but it seems to be wrong. Any help would be appreciated.

Update

All the headers are shown below:

grep '>' mySequence.fasta
>chr_I
>chr_II
>chr_III
>chr_IV
>chr_V
>chr_X
Biomagician
  • 2,459
  • 16
  • 30

4 Answers4

10

Why not use sed instead?

sed -e 's/chr_I/I/' -e 's/chr_V/V/' -e 's/chr_X/X/' mySequence.fasta > mySeq.fasta

Or even simpler:

sed 's/chr_//' mySequence.fasta > mySeq.fasta
benn
  • 3,571
  • 9
  • 28
  • 1
    Worked. I just don't know sed. I need to learn it. – Biomagician Jun 21 '18 at 12:12
  • 1
    Great! It works a bit like grep (which you know how to use), but then with replacement function. – benn Jun 21 '18 at 12:16
  • 2
    I would suggest using sed 's/^chr_//' instead. Anchoring the pattern to the beginning of the line should speed things up for large files. Otherwise, sed will have to scan the entire length of each line in case there's a match. – terdon Jun 21 '18 at 16:00
  • 2
    @terdon, good point. But does it work with fasta? Remember the > at the beginning of the header. – benn Jun 21 '18 at 17:21
  • 3
    @b.nota Duh! I should learn to read :) Indeed, sed 's/^>chr_/>/ would be better. – terdon Jun 21 '18 at 22:09
  • Minor correction for the last: sed 's/^>chr_/>/' ( in order to keep the >) – Sebastian Müller Jun 22 '18 at 05:46
2

I like using awk when there's any kind of lookup involved: something like

awk 'FNR==NR {
    hash[">"$1]=$2;
    next;
}

if ($1 ~ /^>/) {
    print ">"hash[$1];
} else {
    print $1;
}' keyValues.txt mySequence.fasta

Basically the FNR==NR check tells awk to work only on the first file (useful primer here), keyValues.txt and create an association for each key (eg. chr_I) with its value (eg. I). The rest of the code after the next works only on mySequence.fasta, printing out the lookup value only if the line is a fasta header, as checked by the $1 ~ /^>/ condition.

flatley176
  • 126
  • 5
1

Your issue is that the pattern . matches a single character. To match the whole header, use .*:

seqkit replace -p "(.*)" --replacement "{kv}" --kv-file keyValues.txt mySequence.fasta

Alternatively as pointed out by others you may not really need keyValues.txt here as you're just removing the chr_ prefix:

seqkit replace -p "chr_(.*)" --replacement '$1' mySequence.fasta

(Don't forget the single quotes for $1 else the shell will expand it)

bricoletc
  • 161
  • 3
1

Here is an alternative to sed: cat test | tr -d "chr_"

The pipe is to output the file, while the tr command is to translate and change the file

llrs
  • 4,693
  • 1
  • 18
  • 42
aerijman
  • 645
  • 5
  • 14