Counting hexamers in fasta sequence and identify its structure (and interruptions)

Question

I have a lot of fasta files, each one with thousand of reads containing the hexameric motif "CCCTCT". The hexameric motif is highly continuous in most cases but interruptions may occur. I need to count the hexameric motif keeping the read ID and identify their structure (CCCTCT)n. These interruptions are real and do not come from Sequencing error

Example Sequence:(interruption in bold)

ACATTTTTTTTCCACATCTGATGTGGAAAAAAAAAAAAATGAAATAGCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCTCCCCACGGTCTCCCTCTCATGCGGAGCCAAAGCTGGACTGTACTGCT

Strategy 1: Count the CCCTCT motif

grep -on "CCCTCT" fasta.fa | cut -f 1 -d ":" | sort | uniq -c

Output: 23 CCCTCT

As you can see we have one interruption (CCT) in this sequence which the codes doesn't detect.

Strategy 2: Identify reads with the same structure and count repeats in each read

from Bio import SeqIO
from collections import defaultdict
dedup_records = defaultdict(list)
for record in SeqIO.parse("fasta.fa", "fasta"):
    # Use the sequence as the key and then have a list of id's as the value
    dedup_records[str(record.seq)].append(record.id)
with open("Output.fasta", 'w') as output:
    for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
        # Join the ids and write them out as the fasta
        output.write(">{}_counts{}\n".format(ids[0], len(ids)))
        output.write(seq + "\n")

Output: list ranking of reads with identical structures but does not show the structure

>ID/ccs_counts2
CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCT.....

Can you please give me an idea of how to obtain the following output? example of a sequence of 23 hexamers with interruptions.

(CCCTCT)20 [INTERRUPTION] (CCCTCT)3

I have used expansion hunter, repeatFinder and repeatAnalysis-tool by PacBio but they were not particularly sensitive for this case. Any help would be appreciated.

I've also mulled on this sort of puzzle in the past (when dealing with microsatellites) and I feel like there should be some elegant approaches that could fit. How strict do you want the definition of a repeat region and interruption between repeats? For example, more than one perfect tandem repeat triggers a match, and less than one repeat-length mismatch before it resumes counts as an interruption (rather than the end of the whole repeat)? Or even simpler, just however many perfect matches with any-length interruptions between? — Jesse, Nov 05 '21 at 17:10
Thanks for your thoughts. I have to be either flexible to detect some changes (if there is some) and strict to be able to report those sequences following the patterns. I guess I just figured it out. — Amaru, Nov 08 '21 at 18:16

score 1 · Answer 1 · answered Nov 07 '21 at 03:31

I was going to write some clever code counting k-mers, but this problem can easily be solved using the handy sequentially-operating feature of python's str.replace() method.

This code doesn't work if you need to look at the motif occurring on the other strand. Just the forward strand.

#!/usr/bin/env python3
mystring = "CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCC

TCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCT

CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCT"
motif = "CCCTCT"
This replaces the motif with 1s
newstring = mystring.replace(motif, "1")
results in 1111111111111111111111111111111111111CCT11
silly, right?
This whole block just parses that string and prints everything
the way that OP wanted.
done = False
space=""
accumulator = 0
while not done:
    if len(newstring) == 0:
        done = True
        break
    else:
        if newstring[0] == "1":
            accumulator += 1
        else:
            if accumulator != 0:
                print("{}({}){} ".format(space, motif, accumulator), end = "")
                accumulator = 0
            print(newstring[0], end = "")
            space = " "
    newstring = newstring[1::]
This is our check after exiting the while loop.
This only fires if we stepped off cleanly at the
end of a motif.
if accumulator > 0:
    print("{}({}){} ".format(space, motif, accumulator), end = "")
print()

Prints, (CCCTCT)37 CCT (CCCTCT)2, matching OP's spec.

That looks good, but what if I have thousand of reads so I can use fasta/bam files as input. This looks fine when I have a read, am I right? — Amaru, Nov 19 '21 at 19:59
Not sure what you mean by "this looks fine when I have a read", but this code can be put into a method and worked into your framework you already have. — conchoecia, Nov 21 '21 at 00:56

Counting hexamers in fasta sequence and identify its structure (and interruptions)

1 Answers1

This replaces the motif with 1s

results in 1111111111111111111111111111111111111CCT11

silly, right?

This whole block just parses that string and prints everything

the way that OP wanted.

This is our check after exiting the while loop.

This only fires if we stepped off cleanly at the

end of a motif.