How to find all variable-length seqs with an exact 5' and 3' match in a FASTA file

Question

Context

I am interested in finding all of the promoters specific to a particular sigma factor. I have identified the -35 and -10 sites from the literature; the binding sites are capitalized and marked with carets below:

  -35 site               -10 site
5' TTTACAtatttatttcagacaacGTCTTT 3'
   ^^^^^^                 ^^^^^^

The CAPITALIZED and ^emphasized^ 5' and 3' sequence components are the conserved sequence that must have an exact match. The lower-case bases are the variable 10-25bp region for which the sequence does not matter.

The sigma factor is from C. difficile and I am particularly interested in searching C. acetobutylicum; however, it would also be interesting to see where else this sequence crops up. Obviously such a short sequence will arise often by random chance, so it is important to define the spacer region, which might be anywhere between 15-25 nucleotides and may be any sequence.

In particular, I would like to find these binding regions from this sequence and this one.

Desired output

A text file that contains the locations and sequences of all of the variable-width sequences as defined above. Something like:

seq   start  stop  gap  dir  seq
seq1  1      22    10   F    TTTACAagtcagatctGTCTTT
seq2  2      23    11   F    TTTACAagtcagactctGTCTTT
seq2  122    101   10   R    AAAGACatgctagccaTGTAAA

Thank you.

Is this the same as your other question "BLAST Promotor -35 to -10"? If so, please consider to close/delete the other one. Thanks. — user172818, Dec 12 '18 at 01:15

conchoecia · Accepted Answer · 2018-12-17T20:45:55.277

python 3 solution

This Python program will get you what I think you want. It is not elegant given that it (A) stores the whole sequence in memory and (B) spends most of its time slicing strings rather than working with a buffer. This program is not recommended for FASTA files in which the largest sequence will not fit comfortably in RAM.

All that you need to do to modify it for other genomes or other binding sites is to change the three lines that begin with forseq, revseq, and filename.

from Bio import SeqIO
from Bio.Seq import Seq

forseq = Seq("tttaca".upper())  # -35 site
revseq = Seq("gtcttt".upper())  # -10 site
filename = "NC_003030.fasta"

# Reverse complements, used to find sites on the opposite strand
forcom = forseq.reverse_complement()
revcom = revseq.reverse_complement()
print("seq\tstart\tstop\tgap\tdir\tseq")
for record in SeqIO.parse(filename, "fasta"):
    for i in range(len(record.seq)):
        for v1, v2, direc in [ (forseq, revseq, "F"), (revcom, forcom, "R") ]:
            if record.seq[i:i+len(v1)] == v1:
                # range(10,25) tries gaps of 10..24; use range(10,26) to include 25
                for j in range(10,25):
                    # The second site starts right after the first site plus the gap
                    if record.seq[i+len(v1)+j:i+len(v1)+j+len(v2)] == v2:
                        # start and stop are 0-based, inclusive coordinates
                        print("{}\t{}\t{}\t{}\t{}\t{}".format(
                            record.id, i,
                            i + len(v1)+j+len(v2)-1,
                            j, direc,
                            record.seq[i: i+len(v1) + j + len(v2)] ))

Here are the first few lines of example output from the first fasta file that you linked.

seq          start    stop     gap  dir  seq
NC_003030.1  97512    97543    20   F    TTTACAAAAGAAAAGATATGCATAATGTCTTT
NC_003030.1  283173   283205   21   F    TTTACACTTATGTTTATATAGTTCAACGTCTTT
NC_003030.1  303104   303125   10   R    AAAGACGCAAGCATAATGTAAA
NC_003030.1  1127691  1127725  23   F    TTTACAAGGTGTTACTGATAGAACAAAGGGTCTTT
NC_003030.1  1301958  1301988  19   F    TTTACATAAAAAGATGATAGTTTTTGTCTTT
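
For anyone wanting an alternative to the nested loops, here is a minimal sketch that uses only Python's built-in re module on a plain string. It is not part of the original script: the function name find_sites and its parameters are made up for illustration, and it assumes the sequence has already been read into a string (for example with str(record.seq)). The default gap range of 10 to 24 mirrors what range(10,25) scans in the script above.

```python
import re

def find_sites(seq, left="TTTACA", right="GTCTTT", min_gap=10, max_gap=24):
    """Return sorted (start, stop, gap, hit) tuples for one strand.

    Coordinates are 0-based and inclusive. All names here are
    illustrative assumptions, not part of the original script.
    """
    seq = seq.upper()
    hits = []
    for gap in range(min_gap, max_gap + 1):
        # The lookahead (?=...) consumes no characters, so overlapping
        # sites are all reported; one pattern is built per gap length so
        # that every valid gap for a given start is listed.
        pattern = re.compile(
            "(?=({}.{{{}}}{}))".format(re.escape(left), gap, re.escape(right))
        )
        for m in pattern.finditer(seq):
            hit = m.group(1)
            hits.append((m.start(), m.start() + len(hit) - 1, gap, hit))
    return sorted(hits)

# Toy sequence: a 10-base spacer between the two sites, padded on both ends
toy = "GG" + "TTTACA" + "A" * 10 + "GTCTTT" + "CC"
for start, stop, gap, hit in find_sites(toy):
    print(start, stop, gap, "F", hit, sep="\t")
```

For the reverse strand, the same call with left="AAAGAC" and right="TGTAAA" (the reverse complements of the two sites) should report the R hits. This scans the sequence once per gap length, so it trades the Python-level loops for several regex passes.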

python 3 with redundancy

If you need sequence redundancy, this version works. I used the redundancy codes (the IUPAC ambiguity codes) available from www.reverse-complement.com. You may use the redundant characters in [ R, Y, S, W, K, M, B, V, D, H, N ] or lowercase variants. I'm not sure how to handle the '-' character at the moment but am open to suggestions. Code and results below.

Again, the algorithm isn't great given all the for loops so other answers are appreciated!

from Bio import SeqIO

forseq = "tttaca" # Just use python strings here. No Seq() as above
revseq = "ntnntn"
filename = "NC_003030.fasta"

revcomp = {'A':'T', 'C':'G', 'G':'C', 'T':'A',
           'R':'Y', 'Y':'R', 'S':'S', 'W':'W',
           'K':'M', 'M':'K', 'B':'V', 'V':'B',
           'D':'H', 'H':'D', 'N':'N'}
redund = {'A':['A'], 'C':['C'], 'G':['G'], 'T':['T'],
          'R':['A','G'], 'Y':['C','T'],
          'S':['C','G'], 'W':['A','T'],
          'K':['G','T'], 'M':['A','C'],
          'B':['C','G','T'], 'V':['A','C','G'],
          'D':['A','G','T'], 'H':['A','C','T'],
          'N':['A','C','G','T']}

def reverse_complement(seq):
    """Reverse-complements a string; characters without an entry are kept."""
    return "".join(revcomp.get(base, base) for base in reversed(seq))

def seqmatch(query, reference):
    """Checks if the query sequence (perhaps a promoter with redundant bases)
    matches the reference.

    Returns True if there is a match (redundancy allowed), returns False
    otherwise, including when the lengths differ (a slice cut short at the
    end of a record).
    """
    if len(query) != len(reference):
        return False
    for i in range(len(query)):
        if reference[i] not in redund[query[i]]:
            return False
    return True

forseq = forseq.upper()
revseq = revseq.upper()
# Named reverse_complement() so the revcom variable does not shadow the function
forcom = reverse_complement(forseq)
revcom = reverse_complement(revseq)
print("seq\tstart\tstop\tgap\tdir\tseq")
for record in SeqIO.parse(filename, "fasta"):
    for i in range(len(record.seq)):
        for v1, v2, direc in [ (forseq, revseq, "F"), (revcom, forcom, "R") ]:
            if seqmatch( v1, record.seq[i:i+len(v1)] ):
                # range(10,25) tries gaps of 10..24; use range(10,26) to include 25
                for j in range(10,25):
                    # The second site starts right after the first site plus the gap
                    if seqmatch( v2, record.seq[i+len(v1)+j:i+len(v1)+j+len(v2)] ):
                        # start and stop are 0-based, inclusive coordinates
                        print("{}\t{}\t{}\t{}\t{}\t{}".format(
                            record.id, i,
                            i + len(v1)+j+len(v2)-1,
                            j, direc,
                            record.seq[i: i+len(v1) + j + len(v2)] ))

Here is some sample output:

seq          start  stop  gap  dir  seq
NC_003030.1  733    767   23   R    GATGAAGATCAAGAAACCGATACAAACAATGTAAA
NC_003030.1  736    767   20   R    GAAGATCAAGAAACCGATACAAACAATGTAAA
NC_003030.1  739    767   17   R    GATCAAGAAACCGATACAAACAATGTAAA
NC_003030.1  742    767   14   R    CAAGAAACCGATACAAACAATGTAAA
NC_003030.1  743    767   13   R    AAGAAACCGATACAAACAATGTAAA
NC_003030.1  2665   2697  21   R    TATTAAAATTTTAGAGGATGATGGTGATGTAAA
NC_003030.1  2668   2697  18   R    TAAAATTTTAGAGGATGATGGTGATGTAAA
NC_003030.1  3798   3822  13   R    CACCAGAAGACTTAAAAATTGTAAA
NC_003030.1  3801   3822  10   R    CAGAAGACTTAAAAATTGTAAA
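
The per-base lookup in seqmatch can also be expressed as a regular expression, which may be easier to combine with other pattern matching. A minimal sketch, assuming plain Python strings rather than Seq objects; the names IUPAC_CLASSES, iupac_to_regex and motif_matches are hypothetical and not part of the script above:

```python
import re

# Each ambiguity code becomes a regex character class (assumed mapping,
# following the same code table as the redund dictionary).
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "V": "[ACG]",
    "D": "[AGT]", "H": "[ACT]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate a motif with ambiguity codes into a regex string."""
    return "".join(IUPAC_CLASSES[base] for base in motif.upper())

def motif_matches(motif, reference):
    """True if reference matches the motif over its full length."""
    return re.fullmatch(iupac_to_regex(motif), reference.upper()) is not None

print(iupac_to_regex("ntnntn"))
print(motif_matches("ntnntn", "GTCTTT"))
```

As with the dictionary version, a motif character outside the table (such as '-') raises a KeyError, so gap characters are still not handled here.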

Thank you so much for this. I really appreciate it. I am however having some issues running the script: — syn_bio_delta, Dec 11 '18 at 23:16
I have installed biopython. Do I need to download 'NC_003030.fasta' and specify the directory? — syn_bio_delta, Dec 11 '18 at 23:18
Yes, you need to change 'NC_003030.fasta' to the path of your file, such as filename = "/I/saved/my/file/here/NC_003030.fasta" — conchoecia, Dec 11 '18 at 23:29
Hello, someone would like to use this code too so I thought I had better check with you for permission to share - even though this is an open source forum it seems like the proper thing to do. Also, I will almost certainly use this in my thesis and will need to reference you. Can you provide a name/affiliation? — syn_bio_delta, Dec 14 '18 at 14:29
Thanks! That is awfully considerate of you to ask. If you would like to do that, while totally not necessary, I found a few relevant SE questions: academia.stackexchange.com/questions/40325/ and meta.stackexchange.com/questions/191943 . A quick search of my username on some other sites popular with scientists/programmers and you may find my contact info. I'm also working on my thesis and usually read 100+ SE questions per day - this is a good thing for me to consider too! — conchoecia, Dec 14 '18 at 18:25
No problem. I guess since you are a dry lab scientist it's very common for you to share scripts with your colleagues without thinking much of it, it would be unrealistic to trace and record the origin of every line of code perhaps. I am a wet lab scientist and I'm not really familiar with the customs amongst working coders. I will be more or less pasting the code into my M&M so it really wouldn't be right to not reference it in my case. — syn_bio_delta, Dec 15 '18 at 16:21
I do have one more question. Only two bases appear to be conserved in the -10 sequence. How might I specify a variable base. I have tried using an 'x' an 'n' or a '-' eg 'ntnntn' none of these variations produce a result, but I do not receive an error message either. — syn_bio_delta, Dec 15 '18 at 16:25
@Ryan_J_Hope many are both wetlab and bioinformaticians. You're doing a great thing by taking charge of both research fronts! If you would like an expanded answer allowing the gap character '-' maybe update the question? A version allowing redundancy but no gaps is above in the latest edits. — conchoecia, Dec 17 '18 at 20:44
