4

I have a lot of fasta files, each one with thousand of reads containing the hexameric motif "CCCTCT". The hexameric motif is highly continuous in most cases but interruptions may occur. I need to count the hexameric motif keeping the read ID and identify their structure (CCCTCT)n. These interruptions are real and do not come from Sequencing error

Example Sequence:(interruption in bold)

ACATTTTTTTTCCACATCTGATGTGGAAAAAAAAAAAAATGAAATAGCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCTCCCCACGGTCTCCCTCTCATGCGGAGCCAAAGCTGGACTGTACTGCT

Strategy 1: Count the CCCTCT motif

grep -on "CCCTCT" fasta.fa | cut -f 1 -d ":" | sort | uniq -c

Output: 23 CCCTCT

As you can see we have one interruption (CCT) in this sequence which the codes doesn't detect.

Strategy 2: Identify reads with the same structure and count repeats in each read

from Bio import SeqIO
from collections import defaultdict

dedup_records = defaultdict(list) for record in SeqIO.parse("fasta.fa", "fasta"): # Use the sequence as the key and then have a list of id's as the value dedup_records[str(record.seq)].append(record.id) with open("Output.fasta", 'w') as output: for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True): # Join the ids and write them out as the fasta output.write(">{}_counts{}\n".format(ids[0], len(ids))) output.write(seq + "\n")

Output: list ranking of reads with identical structures but does not show the structure

>ID/ccs_counts2
CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCT.....

Can you please give me an idea of how to obtain the following output? example of a sequence of 23 hexamers with interruptions.

(CCCTCT)20 [INTERRUPTION] (CCCTCT)3

I have used expansion hunter, repeatFinder and repeatAnalysis-tool by PacBio but they were not particularly sensitive for this case. Any help would be appreciated.

Daniel Standage
  • 5,080
  • 15
  • 50
Amaru
  • 41
  • 2
  • I've also mulled on this sort of puzzle in the past (when dealing with microsatellites) and I feel like there should be some elegant approaches that could fit. How strict do you want the definition of a repeat region and interruption between repeats? For example, more than one perfect tandem repeat triggers a match, and less than one repeat-length mismatch before it resumes counts as an interruption (rather than the end of the whole repeat)? Or even simpler, just however many perfect matches with any-length interruptions between? – Jesse Nov 05 '21 at 17:10
  • Thanks for your thoughts. I have to be either flexible to detect some changes (if there is some) and strict to be able to report those sequences following the patterns. I guess I just figured it out. – Amaru Nov 08 '21 at 18:16

1 Answers1

1

I was going to write some clever code counting k-mers, but this problem can easily be solved using the handy sequentially-operating feature of python's str.replace() method.

This code doesn't work if you need to look at the motif occurring on the other strand. Just the forward strand.

#!/usr/bin/env python3

mystring = "CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCC
TCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCT
CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCT"

motif = "CCCTCT"

This replaces the motif with 1s

newstring = mystring.replace(motif, "1")

results in 1111111111111111111111111111111111111CCT11

silly, right?

This whole block just parses that string and prints everything

the way that OP wanted.

done = False space="" accumulator = 0 while not done: if len(newstring) == 0: done = True break else: if newstring[0] == "1": accumulator += 1 else: if accumulator != 0: print("{}({}){} ".format(space, motif, accumulator), end = "") accumulator = 0 print(newstring[0], end = "") space = " " newstring = newstring[1::]

This is our check after exiting the while loop.

This only fires if we stepped off cleanly at the

end of a motif.

if accumulator > 0: print("{}({}){} ".format(space, motif, accumulator), end = "") print()

Prints, (CCCTCT)37 CCT (CCCTCT)2, matching OP's spec.

conchoecia
  • 3,141
  • 2
  • 16
  • 40
  • That looks good, but what if I have thousand of reads so I can use fasta/bam files as input. This looks fine when I have a read, am I right? – Amaru Nov 19 '21 at 19:59
  • Not sure what you mean by "this looks fine when I have a read", but this code can be put into a method and worked into your framework you already have. – conchoecia Nov 21 '21 at 00:56