
I've been given a FASTA file containing a large DNA sequence (over 115,000 nucleotides long), and I am tasked with finding a single large open reading frame (ORF) within it using Biopython.

I'm aware similar questions have been asked before, but I can't find anything on actually identifying the source of my ORF, so apologies for any overlap.

Using other sources and the Biopython cookbook, I've translated my sequence in all six reading frames (three per strand) and found the open reading frames and their positions within the sequence:

def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end-aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame+aa_start*3
                        end = min(seq_len,frame+aa_end*3+3)
                    else:
                        start = seq_len-frame-aa_end*3-3
                        end = seq_len-frame-aa_start*3
                    answer.append((start, end, strand,
                                   trans[aa_start:aa_end]))
                aa_start = aa_end+1
    answer.sort()
    return answer

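# record is assumed to be a SeqRecord (e.g. read with Bio.SeqIO); table and
# min_pro_len are the translation table and minimum protein length set earlier.
orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)

for start, end, strand, pro in orf_list:
    print("%s...%s - length %i, strand %i, %i:%i"
          % (pro[:30], pro[-3:], len(pro), strand, start, end))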

OUTPUT

TNRQVYGGTLQSLRTGTGIYSRLASSPTNR...CEA - length 160, strand 1, 7950:8433
GIPPGRTEGLGRYVHGESIFLEATLVPEPQ...VQS - length 171, strand -1, 19275:19791
ISVHLRRYFSVLLRAPVALNADSRVAGPLD...ITP - length 190, strand 1, 34079:34652
GCVNGNFPDYRVAVDDPGALVVGGELGQAL...VNT - length 771, strand -1, 39335:41651
LLKLLASRLTWLLVLSAALGFLTSVSYRLG...SPG - length 235, strand -1, 39358:40066
PGDVIVSKVPADNLRPRMSEINDFTNSIII...VLV - length 826, strand 1, 39362:41843
GPGRQSSSAYERDQRLHQQHHHQRERHQVR...QRS - length 764, strand 1, 39385:41680
LLLSSTSSASCWLKKFFFRLLLLSRSVALL...LAL - length 158, strand -1, 40459:40936
TRLIIRPPSHTTLYYDPYPQPLKVSSSVPC...RIT - length 159, strand -1, 54663:55143
IGAFSRGYNQGWVLIEAQEVRTTSPLFPRS...RDQ - length 201, strand 1, 56104:56710
KARGSSLGAIYCRRLKRRSLPSWQIESFNC...TNS - length 150, strand -1, 62497:62950
RRPARAGAELNKDDIRDTCLLSCSEVDNKV...VTL - length 168, strand 1, 72339:72846
SGPTTVRTSAPQRCWRTSLKPKLSLSDSTE...PVA - length 177, strand -1, 114494:115028
ERELWLETGPPTPLWGTGPDCSRAALNRQR...PRR - length 157, strand 1, 114953:115427

However, I'm unsure whether these open reading frames include the ribosome binding site (consensus sequence AAGGAGGTG, occurring 6 to 9 nucleotides upstream, on the 5' side of the coding strand). Is there any way to extend my code to check for this, or is it already handled by Biopython?
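One possible way to check this, as a minimal sketch rather than a definitive method: the function above only splits the translation at stop codons, so it does not look for a ribosome binding site itself. The sketch below takes the first methionine in an ORF as a candidate start codon (an assumption), pulls out the nucleotides just upstream of it, and scores how well a 9-nt window matches AAGGAGGTG when 6 to 9 nucleotides separate the window from the start codon (my reading of "6 to 9 nucleotides upstream", also an assumption). The helper names `upstream_of_first_met` and `best_sd_match` are hypothetical, and the sequence is passed as a plain string, e.g. `str(record.seq)`.

```python
SD_CONSENSUS = "AAGGAGGTG"  # consensus quoted in the question


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def upstream_of_first_met(seq, start, end, strand, protein, flank=20):
    """Return up to `flank` nt preceding the first Met codon of an ORF.

    (start, end, strand, protein) is one tuple from find_orfs_with_trans.
    Treating the first "M" as the start codon is an assumption.
    """
    m = protein.find("M")
    if m == -1:
        return None  # no candidate start codon in this ORF
    if strand == 1:
        coding = seq
        codon_pos = start + 3 * m
    else:
        # On the reverse complement the ORF begins at len(seq) - end.
        coding = reverse_complement(seq)
        codon_pos = (len(seq) - end) + 3 * m
    return coding[max(0, codon_pos - flank):codon_pos]


def best_sd_match(upstream, motif=SD_CONSENSUS, min_gap=6, max_gap=9):
    """Best motif-like window ending min_gap..max_gap nt before the start codon.

    Returns (number of matching positions, gap, window sequence).
    """
    best = (0, None, "")
    for gap in range(min_gap, max_gap + 1):
        window_end = len(upstream) - gap
        window_start = window_end - len(motif)
        if window_start < 0:
            continue  # not enough upstream sequence for this gap
        site = upstream[window_start:window_end]
        score = sum(a == b for a, b in zip(site, motif))
        if score > best[0]:
            best = (score, gap, site)
    return best
```

For each tuple in `orf_list`, calling `best_sd_match(upstream_of_first_met(str(record.seq), start, end, strand, pro))` gives a match count out of 9; how many mismatches to tolerate is a judgment call, since real sites rarely match the consensus exactly.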

I now have this output, but I have no idea how to link it to an online resource such as NCBI to find the source of my ORF (i.e. which protein it encodes). Is there any way to use Biopython for this as well?
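A minimal sketch of one way this could be approached, assuming network access and that an online BLAST search is acceptable: Biopython's `Bio.Blast.NCBIWWW.qblast` submits a query to NCBI BLAST, and `Bio.Blast.NCBIXML.read` parses the XML result. The helper names `longest_orf` and `blast_protein`, the choice of the `nr` database, and the limit of five hits are all assumptions for illustration. In the output above, the longest candidate is the 826-residue ORF on strand 1 at 39362:41843.

```python
def longest_orf(orf_list):
    """Return the (start, end, strand, protein) tuple with the longest protein."""
    return max(orf_list, key=lambda orf: len(orf[3]))


def blast_protein(protein, database="nr", max_hits=5):
    """Run an online blastp search and return (hit title, E-value) pairs.

    Needs Biopython and an internet connection; a search can take minutes.
    The import is deferred so longest_orf can be used without either.
    """
    from Bio.Blast import NCBIWWW, NCBIXML

    handle = NCBIWWW.qblast("blastp", database, protein,
                            hitlist_size=max_hits)
    blast_record = NCBIXML.read(handle)
    handle.close()
    # Report the E-value of the first (best) HSP for each hit.
    return [(aln.title, aln.hsps[0].expect)
            for aln in blast_record.alignments]
```

Usage would look like `start, end, strand, pro = longest_orf(orf_list)` followed by `blast_protein(pro)`; the hit titles then suggest which known protein the ORF most resembles.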

daenwaels