Searching motifs in sequence and their frequencies

Question

This is a two part question. I am searching for a motif, and in that search I wanted to also find the total number of sequences in my FASTA file, but the code I wrote is not yielding that please see the attempt below. The second part is when I have found the matching motifs, as well as the frequencies of the matches how do i go about to exclude the reference sequence matches and only get matches and frequncies from sequence2-sequence5.

> Ref
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL
>Sequence1
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP
LRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE
DFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNST
ITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAF
ELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKA
ENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFE
EKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLN
PPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR
>Sequence2
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS
NAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKVFNQPKAKAKELGIEICLMYVEIE
KGESVQEELLKGLDNKNPKIIVACIETLRKALSEFGSKIISLKPIIKVLPKLFESRDKAV
RDEAKLFAIEIYRWNRDAVKHTLQNINSVQLKELEEEWVKLPTGAPKPSRFLRSQQELEA
KLEQQQSAGGDAEGGGDDGDEVPQVDAYELLDAVEILSKLPKDFYDKIEAKKWQERKEAL
EAVEVLVKNPKLEAGDYADLVKALKKVVGKDTNVMLVALAAKCLTGLAVGLRKKFGQYAG
HVVPTILEKFKEKKPQVVQALQEAIDAIFLTTTLQNISEDVLAVMDNKNPTIKQQTSLFI
ARSFRHCTSSTLPKSLLKPFCAALLKHINDSAPEVRDAAFEALGTALKVVGEKSVNPFLA
> Sequence3
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL
>Sequence5
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS
NAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKVFNQPKAKAKELGIEICLMYVEIE
KGESVQEELLKGLDNKNPKIIVACIETLRKALSEFGSKIISLKPIIKVLPKLFESRDKAV
RDEAKLFAIEIYRWNRDAVKHTLQNINSVQLKELEEEWVKLPTGAPKPSRFLRSQQELEA
KLEQQQSAGGDAEGGGDDGDEVPQVDAYELLDAVEILSKLPKDFYDKIEAKKWQERKEAL
EAVEVLVKNPKLEAGDYADLVKALKKVVGKDTNVMLVALAAKCLTGLAVGLRKKFGQYAG
HVVPTILEKFKEKKPQVVQALQEAIDAIFLTTTLQNISEDVLAVMDNKNPTIKQQTSLFI
ARSFRHCTSSTLPKSLLKPFCAALLKHINDSAPEVRDAAFEALGTALKVVGEKSVNPFLA

The code to search the peptides is below

import re
import collections
m_list =[]
from Bio import SeqIO
input_file = 'Sequences2.fasta'
num_sequences = 0
fasta_sequences = SeqIO.parse(open(input_file),'fasta')
for fasta in fasta_sequences:
    num_sequence=+1
    name, sequence = fasta.id, str(fasta.seq)
    matches=re.finditer(r"P[A-Z]{2}P..",sequence)
    for m in matches:
        match_id = f'{m.start()}_{m.end()}_{m.group()}'
        m_list.append(match_id)
        print(name, m.start(), m.end(), m.group())        
print("total sequences",num_sequences)
print("frequency",collections.Counter(m_list))

It yields the following output below, which I am happy with but now it doesn't give me the total number of sequences:

Ref 73 79 PFLPGK
Ref 281 287 PLPPKL
Ref 288 294 PSSPAH
Sequence3 73 79 PFLPGK
Sequence3 281 287 PLPPKL
Sequence3 288 294 PSSPAH
total sequences 0
frequency Counter({'73_79_PFLPGK': 2, '281_287_PLPPKL': 2, '288_294_PSSPAH': 2})

The total sequences should be 5. I used the code below to find out. I do not know where I went wrong since it is not giving me that on the more compact code above.

sequence = open('Sequences2.fasta')
x = 0
for line in sequence:
    if line.startswith('>'):
        x += 1
sequence.close()
print(x)

The second part of my question is how do I modify my code such that, the matches and frequencies of the reference sequence(Ref) are excluded. The idea would be to get an output such as the one below. I do not even know whether that is possible (A disclaimer this second part is a question on a tutorial that I don't know how to do. I am still learning)

Sequence3 73 79 PFLPGK
Sequence3 281 287 PLPPKL
Sequence3 288 294 PSSPAH
total sequences 5  #this would be the total number of sequences in file including the refrence
frequency Counter({'73_79_PFLPGK': 1, '281_287_PLPPKL': 1, '288_294_PSSPAH': 1}) #frequency of matches, excluding the reference

score 4 · Answer 1 · answered Dec 30 '22 at 16:03

for the second part re-arrange the for m in matches block to :

for m in matches:
    match_id = f'{m.start()}_{m.end()}_{m.group()}'


    print(name, m.start(), m.end(), m.group())  

    if num_sequences &gt; 1:       #### added if clause so that appends only after first sequence to m_list

            m_list.append(match_id)

score 3 · Accepted Answer · answered Dec 30 '22 at 15:18

The first part appears to be a simple bug if there is no copy/paste error in the code:

Original code,

num_sequence=0
fasta_sequences = SeqIO.parse(open(input_file),'fasta')
for fasta in fasta_sequences:
    num_sequence=+1 # < its here
print (num_sequence)

The code I'd recommend is

num_sequence=0
fasta_sequences = SeqIO.parse(input_file,'fasta')
for fasta in fasta_sequences:
    num_sequence += 1 # corrected
print (num_sequence)

I will look at the second part later.

Searching motifs in sequence and their frequencies

2 Answers2