
I have a sequence file and want to find every run of consecutive "N" characters, reporting each run's start position and length. Say I have a file named mySequence.fasta like this:

>sequence-1 
ATCGCTAGCATNNNNNNNNNNNNNNCTAGCATCATGCNNNNNNATACGCATCACANNNNNNNNNCgcatATCAC

and the expected output should look like this:

Position 12 N 14
Position 38 N 6
Position 56 N 9

Please help me solve this with awk or sed, using my file name mySequence.fasta.

Daniel Standage
user1414
    Why are you tied to awk and sed? Programming languages like Python, Perl, and Java have many tools designed for this. – Bioathlete Aug 31 '17 at 14:24

1 Answer


I couldn't come up with a one-liner, but here is a two-liner instead.

To get the start position of each N repeat (grep -b reports 0-based byte offsets, so 1 is added to give 1-based positions):

grep N mySequence.fasta | grep -b -o N | awk -F: 'NR==1 || $1!=p+1 {print $1+1} {p=$1}'
12
38
56

To get the length of each repeat:

grep N mySequence.fasta | awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print length($i)}'
14
6
9
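
If both values are wanted on one line, in the format asked for in the question, the two steps can probably be folded into a single awk pass. The following is only a minimal sketch, not a tested pipeline: the function name n_runs is made up for illustration, it assumes only uppercase "N" should be counted, and it assumes positions are 1-based and restart at every ">" header. Sequence lines belonging to one record are joined first, so a run that spans a line break in a wrapped FASTA file is counted once.

```shell
# n_runs (hypothetical helper name): report each run of consecutive "N".
# Output per run: Position <1-based start> N <run length>
# Assumptions: uppercase N only; positions restart at every ">" header.
n_runs() {
    awk '
        # Scan one joined sequence and print every run of N it contains.
        function report(s,    offset, start) {
            offset = 0
            # match() sets RSTART/RLENGTH for the leftmost run of N in s
            while (match(s, /N+/)) {
                start = offset + RSTART
                print "Position", start, "N", RLENGTH
                # skip past this run and remember how much was consumed
                offset += RSTART + RLENGTH - 1
                s = substr(s, RSTART + RLENGTH)
            }
        }
        /^>/ { report(seq); seq = ""; next }   # header: flush previous record
             { seq = seq $0 }                  # sequence line: append to record
        END  { report(seq) }                   # flush the last record
    ' "$@"
}
```

Calling n_runs mySequence.fasta on the example file should give positions 12, 38 and 56 with lengths 14, 6 and 9, matching the output of the two commands above. The header line itself is not printed, so for a multi-record file you would need to add it to the output to tell the records apart.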

With thanks to Stack Overflow.

benn