I have a lot of fasta files, each one with thousand of reads containing the hexameric motif "CCCTCT". The hexameric motif is highly continuous in most cases but interruptions may occur. I need to count the hexameric motif keeping the read ID and identify their structure (CCCTCT)n. These interruptions are real and do not come from Sequencing error
Example Sequence:(interruption in bold)
ACATTTTTTTTCCACATCTGATGTGGAAAAAAAAAAAAATGAAATAGCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCTCCCCACGGTCTCCCTCTCATGCGGAGCCAAAGCTGGACTGTACTGCT
Strategy 1: Count the CCCTCT motif
grep -on "CCCTCT" fasta.fa | cut -f 1 -d ":" | sort | uniq -c
Output:
23 CCCTCT
As you can see we have one interruption (CCT) in this sequence which the codes doesn't detect.
Strategy 2: Identify reads with the same structure and count repeats in each read
from Bio import SeqIO
from collections import defaultdict
dedup_records = defaultdict(list)
for record in SeqIO.parse("fasta.fa", "fasta"):
# Use the sequence as the key and then have a list of id's as the value
dedup_records[str(record.seq)].append(record.id)
with open("Output.fasta", 'w') as output:
for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
# Join the ids and write them out as the fasta
output.write(">{}_counts{}\n".format(ids[0], len(ids)))
output.write(seq + "\n")
Output: list ranking of reads with identical structures but does not show the structure
>ID/ccs_counts2
CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCT.....
Can you please give me an idea of how to obtain the following output? example of a sequence of 23 hexamers with interruptions.
(CCCTCT)20 [INTERRUPTION] (CCCTCT)3
I have used expansion hunter, repeatFinder and repeatAnalysis-tool by PacBio but they were not particularly sensitive for this case. Any help would be appreciated.