Counting a specific consecutive character with its occurrence position and length

Question

I have a sequence file and want to find every run of consecutive "N" characters, reporting each run's start position and its length. Say I have a file named mySequence.fasta like this:

>sequence-1 
ATCGCTAGCATNNNNNNNNNNNNNNCTAGCATCATGCNNNNNNATACGCATCACANNNNNNNNNCgcatATCAC

and the expected output should look like this:

Position 12 N 14
Position 38 N 6
Position 56 N 9

Kindly help me solve this with awk or sed, given that my file is named mySequence.fasta.

Why are you tied to awk and sed? There are many tools in programming languages like python, perl and java that are designed for this. — Bioathlete, Aug 31 '17 at 14:24

benn · Answer 1 · 2017-08-31T11:20:13.063

2

I couldn't come up with a one-liner, but here is a two-liner instead.

To get the start positions of the N repeats (grep -b reports 0-based byte offsets, so 1 is added to get 1-based positions):

grep N mySequence.fasta | grep -b -o N | awk -F: 'BEGIN{p=-2} $1!=p+1{print $1+1} {p=$1}'
12
38
56

To get the length of the repeats:

grep N mySequence.fasta | awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print length($i)}'
14
6
9
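
The two steps can probably also be folded into a single awk pass that prints the position and length together, in the format asked for in the question. The following is only a sketch under some assumptions: the function name n_runs is made up for illustration, header lines are taken to start with ">", only uppercase N is counted, and positions are 1-based and restart on every line (so a sequence wrapped over several lines would need to be joined first).

```shell
# Hypothetical helper (the name n_runs is illustrative, not a standard tool).
# Prints "Position <1-based start> N <length>" for every run of N.
# Reads the files given as arguments, or standard input if there are none.
n_runs() {
    awk '
        /^>/ { next }                  # skip FASTA header lines
        {
            seq = $0
            offset = 0                 # characters already consumed from this line
            # match() sets RSTART (1-based) and RLENGTH for the leftmost run of N
            while (match(seq, /N+/)) {
                print "Position", offset + RSTART, "N", RLENGTH
                offset += RSTART + RLENGTH - 1
                seq = substr(seq, RSTART + RLENGTH)   # continue after this run
            }
        }
    ' "$@"
}
```

With the function defined, `n_runs mySequence.fasta` should print one line per run of N. The loop relies on awk's built-in match(), so it does not need grep at all.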

With thanks to Stack Overflow.

edited Aug 31 '17 at 11:20

answered Aug 31 '17 at 11:13

benn

