
I have a Python project that requires a function which, given a degenerate DNA sequence (for example: KKGTACACCAG) and a molecular weight interval, returns a list of all unambiguous sequences represented by the given sequence (encoding based on the IUPAC alphabet: http://www.bioinformatics.org/sms/iupac.html).

Is there any function in Biopython to convert a DNA sequence from degenerate to unambiguous base pairs?

Also, is it possible to use the molecular weight function from the Bio.SeqUtils library on an ambiguous DNA sequence?

gringer
Ke Keiss

2 Answers

from Bio.Seq import Seq
coding_dna = Seq("AAYGANAGYCARAGYAAR")
print(coding_dna.translate())

This gives

NXSQSK

So .translate() reads the degenerate codons and returns the unambiguous amino acid wherever one exists. Where it cannot, e.g. GAN, which is split between Asp and Glu, it returns X. Thus GAR or GAY is fine, but GAN gives X.
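To make the reasoning behind the X concrete, here is a minimal sketch in plain Python. It is not how Biopython implements .translate(); the names translate_codon and translate_sketch are hypothetical, the codon table is deliberately partial (only the codons this example needs), and any codon with more than one possible amino acid is simply reported as X.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Partial standard codon table: only the codons needed for this example.
CODONS = {
    "AAC": "N", "AAT": "N", "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E", "GAC": "D", "GAT": "D",
    "AGC": "S", "AGT": "S", "CAA": "Q", "CAG": "Q",
}


def translate_codon(codon):
    """Return the single amino acid a degenerate codon encodes, or 'X'."""
    # Expand the degenerate codon into every concrete codon it represents.
    concrete = ("".join(bases) for bases in product(*(IUPAC[b] for b in codon)))
    amino_acids = {CODONS[c] for c in concrete}
    # Exactly one amino acid means the codon is unambiguous at protein level.
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate_sketch(dna):
    """Translate a degenerate sequence codon by codon (complete codons only)."""
    usable = len(dna) - len(dna) % 3
    return "".join(translate_codon(dna[i:i + 3]) for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate_sketch("AAYGANAGYCARAGYAAR"))
```

GAR expands to GAA and GAG (both Glu) and GAY to GAC and GAT (both Asp), so each resolves to one letter, while GAN covers all four codons and therefore both amino acids.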

You would then use itertools, mapping (map) each amino acid against the triplet code (import Bio.Data.CodonTable), to obtain all combinations. Admittedly this doesn't work for X, and stop codons are a special case (3 possibilities).

Bio.Data.CodonTable.list_possible_proteins would solve the X problem, but stop codons remain a special case.


I really did like the Stack Overflow link in the comments, with the double map function ... very slick.

M__
  • I am missing the SO link, the one with "the double map function ...". Could you add it to your answer? – pippo1980 Jul 14 '23 at 17:20
  • @pippo1980ONSTRIKE I can't do it now, but I'll look at it early next week. Please check the answer below from acvill – M__ Jul 14 '23 at 17:23
  • Cheers, but R is like ancient Greek to me. I am following this question, though. – pippo1980 Jul 14 '23 at 17:28

M__'s approach is good, but, as explained, some subset of the unambiguous DNA sequences will not be recovered, because reverse translation is itself ambiguous due to the degeneracy of the genetic code. I don't know a Python approach, but here's an R function that returns a vector of all the possible unambiguous DNA strings:

make_unambiguous <- function(dna) {
  require(tidyverse)
  require(S4Vectors)
  iupac <- tibble(code = c("A", "C", "G", "T",
                           "R", "Y", "S", "W",
                           "K", "M", "B", "D",
                           "H", "V", "N"),
                  base = c("A", "C", "G", "T",
                           "AG", "CT", "GC", "AT",
                           "GT", "AC", "CGT", "AGT",
                           "ACT", "ACG", "ACGT"))

  tibble(dna) |>
    separate_rows(dna, sep = '(?<=.)(?=.)') |>
    left_join(iupac, by = c("dna" = "code")) |>
    pull(base) |>
    str_split("") |>
    expand.grid(stringsAsFactors = FALSE) |>
    unite(col = sequence, sep = "") |>
    as_vector()
}

make_unambiguous(dna = "AAYGANAGYCARAGYAAR")
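For readers working in Python, the same idea can be sketched with itertools.product from the standard library. This is an untested-in-production sketch, not a Biopython function: it reuses the name make_unambiguous and the IUPAC table from the R code above, and filter_by_weight with its weight_fn parameter is a hypothetical helper added to cover the molecular weight interval from the question.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one stands for
# (same table as the R tibble above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def make_unambiguous(dna):
    """Yield every unambiguous sequence encoded by a degenerate DNA string."""
    # One string of candidate bases per position; unknown symbols raise KeyError.
    choices = [IUPAC[symbol] for symbol in dna.upper()]
    # The Cartesian product picks one base per position in every possible way.
    for combination in product(*choices):
        yield "".join(combination)


def filter_by_weight(sequences, weight_fn, low, high):
    """Keep sequences whose weight, as computed by weight_fn, lies in [low, high].

    weight_fn is supplied by the caller (hypothetical parameter); it takes a
    sequence string and returns a number.
    """
    return [seq for seq in sequences if low <= weight_fn(seq) <= high]


if __name__ == "__main__":
    for seq in make_unambiguous("KKGTACACCAG"):
        print(seq)
```

The number of results is the product of the option counts at each position, so it grows quickly with the number of degenerate symbols; the generator avoids holding them all in memory at once. For the weight interval, Bio.SeqUtils.molecular_weight could be passed as weight_fn on the expanded sequences; as far as I know it rejects ambiguous letters, which is why the expansion has to come first.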

acvill