python 3 solution
This python program will get you what I think you want. It is not elegant given that it (A) stores the whole sequence in memory and (B) spends most of its time slicing strings rather than working with a buffer. This program is not recommended for FASTA files for which the largest sequence will not fit well in RAM.
All that you need to do to modify it for other genomes or other binding sites is to change the three lines that begin with forseq, revseq, and filename.
from Bio import SeqIO
from Bio.Seq import Seq
forseq = Seq("tttaca".upper())
revseq = Seq("gtcttt".upper())
filename = "NC_003030.fasta"
forcom = forseq.reverse_complement()
revcom = revseq.reverse_complement()
print("seq\tstart\tstop\tgap\tdir\tseq")
for record in SeqIO.parse(filename, "fasta"):
for i in range(len(record.seq)):
for v1, v2, direc in [ (forseq, revseq, "F"), (revcom, forcom, "R") ]:
if record.seq[i:i+len(v1)] == v1:
for j in range(10,25):
if record.seq[i+len(v1)+j:i+len(v2)+j+len(v2)] == v2:
print("{}\t{}\t{}\t{}\t{}\t{}".format(
record.id, i,
i + len(v1)+j+len(v2)-1,
j, direc,
record.seq[i: i+len(v1) + j + len(v2)] ))
Here are the first few lines of example output from the first fasta file that you linked.
seq start stop gap dir seq
NC_003030.1 97512 97543 20 F TTTACAAAAGAAAAGATATGCATAATGTCTTT
NC_003030.1 283173 283205 21 F TTTACACTTATGTTTATATAGTTCAACGTCTTT
NC_003030.1 303104 303125 10 R AAAGACGCAAGCATAATGTAAA
NC_003030.1 1127691 1127725 23 F TTTACAAGGTGTTACTGATAGAACAAAGGGTCTTT
NC_003030.1 1301958 1301988 19 F TTTACATAAAAAGATGATAGTTTTTGTCTTT
python 3 with redundancy
If you need sequence redundancy, this version works. I used the redundancy codes available from www.reverse-complement.com You may use the redundant characters in [ R, Y, S, W, K, M, B, V, D, H, N ] or lowercase variants. I'm not sure how to handle the '-' character at the moment but am open to suggestions. Code and results below.
Again, the algorithm isn't great given all the for loops so other answers are appreciated!
from Bio.Seq import Seq
forseq = "tttaca" # Just use python strings here. No Seq() as above
revseq = "ntnntn"
filename = "NC_003030.fasta"
revcomp = {'A':'T', 'C':'G', 'G':'C', 'T':'A',
'R':'Y', 'Y':'R', 'S':'S', 'W':'W',
'K':'M', 'M':'K', 'B':'V', 'V':'B',
'D':'H', 'H':'D', 'N':'N'}
redund = {'A':['A'], 'C':['C'], 'G':['G'], 'T':['T'],
'R':['A','G'], 'Y':['C','T'],
'S':['C','G'], 'W':['A','T'],
'K':['G','T'], 'M':['A','C'],
'B':['C','G','T'], 'V':['A','C','G'],
'D':['A','G','T'], 'H':['A','C','T'],
'N':['A','C','G','T']}
def revcom(seq):
return "".join(revcomp.get(base, base) for base in reversed(seq))
def seqmatch(query, reference):
"""Checks if the query sequence (perhaps a promoter with redundant bases)
matches the reference. Both query and reference must be the same length.
Returns true if there is a match (redundancy allowed), returns false
otherwise.
"""
assert len(query) == len(reference)
for i in range(len(query)):
if reference[i] in redund[query[i]]:
pass
else:
return False
return True
forseq = forseq.upper()
revseq = revseq.upper()
forcom = revcom(forseq)
revcom = revcom(revseq)
print("seq\tstart\tstop\tgap\tdir\tseq")
for record in SeqIO.parse(filename, "fasta"):
for i in range(len(record.seq)):
for v1, v2, direc in [ (forseq, revseq, "F"), (revcom, forcom, "R") ]:
if seqmatch( v1, record.seq[i:i+len(v1)] ):
for j in range(10,25):
if seqmatch( v2, record.seq[i+len(v1)+j:i+len(v2)+j+len(v2)] ):
print("{}\t{}\t{}\t{}\t{}\t{}".format(
record.id, i,
i + len(v1)+j+len(v2)-1,
j, direc,
record.seq[i: i+len(v1) + j + len(v2)] ))
Here is some sample output:
seq start stop gap dir seq
NC_003030.1 733 767 23 R GATGAAGATCAAGAAACCGATACAAACAATGTAAA
NC_003030.1 736 767 20 R GAAGATCAAGAAACCGATACAAACAATGTAAA
NC_003030.1 739 767 17 R GATCAAGAAACCGATACAAACAATGTAAA
NC_003030.1 742 767 14 R CAAGAAACCGATACAAACAATGTAAA
NC_003030.1 743 767 13 R AAGAAACCGATACAAACAATGTAAA
NC_003030.1 2665 2697 21 R TATTAAAATTTTAGAGGATGATGGTGATGTAAA
NC_003030.1 2668 2697 18 R TAAAATTTTAGAGGATGATGGTGATGTAAA
NC_003030.1 3798 3822 13 R CACCAGAAGACTTAAAAATTGTAAA
NC_003030.1 3801 3822 10 R CAGAAGACTTAAAAATTGTAAA