Is there any function in Biopython to convert a DNA sequence from ambiguous to unambiguous?

Question

I have a Python project that requires a function which, given a degenerate DNA sequence (for example, KKGTACACCAG) and a molecular weight interval, returns a list of all unambiguous sequences represented by the given sequence (encoded with the IUPAC alphabet: http://www.bioinformatics.org/sms/iupac.html).

Is there any function in Biopython to convert a DNA sequence from degenerate to unambiguous base pairs?

Also, is it possible to use the molecular weight function from the Bio.SeqUtils library on an ambiguous DNA sequence?

M__ · Answer 1 · 2022-02-19T03:17:43.647

3

from Bio.Seq import Seq
coding_dna = Seq("AAYGANAGYCARAGYAAR")
print(coding_dna.translate())

This gives

NXSQSK

So .translate() reads a degenerate codon and returns the unambiguous amino acid wherever one exists. Where it cannot, it returns X: GAN, for example, is split between Asp (GAY) and Glu (GAR). Thus GAR or GAY is fine, but GAN gives X.

You would then use itertools, mapping (map) each amino acid against the triplet code (import Bio.Data.CodonTable), to obtain all permutations. Admittedly this doesn't work for X, and stop codons are a special case (3 possibilities).

Bio.Data.CodonTable.list_possible_proteins would solve the X problem, but stop codons remain a special case.
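To make the X case concrete, here is a minimal sketch of the idea in plain Python: expand each ambiguous codon into its concrete codons and collect the residues they encode. It does not call Biopython; the names IUPAC_DNA, CODON_TABLE, possible_residues and residues_per_codon are hypothetical helpers introduced for illustration, and the standard genetic code is assumed.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (an assumption), with codons enumerated in
# T, C, A, G order at each position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def possible_residues(codon):
    """Return the sorted residues (or '*') that an ambiguous codon can encode."""
    concrete = product(*(IUPAC_DNA[base] for base in codon))
    return sorted({CODON_TABLE["".join(c)] for c in concrete})


def residues_per_codon(dna):
    """Split dna into codons (ignoring a trailing partial codon) and expand each."""
    usable = len(dna) - len(dna) % 3
    return [possible_residues(dna[i:i + 3]) for i in range(0, usable, 3)]


print(residues_per_codon("AAYGANAGYCARAGYAAR"))
# [['N'], ['D', 'E'], ['S'], ['Q'], ['S'], ['K']]
```

The second codon, GAN, comes back as both D and E, which is exactly the position that .translate() reports as X. Stop codons show up as '*' here and still need whatever special handling the application requires.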

I really did like the Stack Overflow link in the comments, with the double map function ... very slick.

edited Feb 19 '22 at 03:17

answered Feb 18 '22 at 22:24

I am missing the SO link, the one with the double map function ... Could you add it to your answer? – pippo1980 Jul 14 '23 at 17:20
@pippo1980ONSTRIKE I can't do it now, but I'll look at it early next week. Please check the answer below from acvill – M__ Jul 14 '23 at 17:23
cheers, but R is like ancient greek to me, I am following this question though – pippo1980 Jul 14 '23 at 17:28

acvill · Answer 2 · 2023-08-16T16:26:04.057

M__'s approach is good but, as explained, some of the unambiguous DNA sequences will not be recovered, because reverse translation is itself ambiguous: the genetic code is degenerate. I don't know a Python approach, but here's an R function that returns a vector of all the possible unambiguous DNA strings:

make_unambiguous <- function(dna) {
  require(tidyverse)
  require(S4Vectors)
  # lookup table: IUPAC code -> the bases it can stand for
  iupac <- tibble(code = c("A", "C", "G", "T",
                           "R", "Y", "S", "W",
                           "K", "M", "B", "D",
                           "H", "V", "N"),
                  base = c("A", "C", "G", "T",
                           "AG", "CT", "GC", "AT",
                           "GT", "AC", "CGT", "AGT",
                           "ACT", "ACG", "ACGT"))
  tibble(dna) |>
    # one row per nucleotide
    separate_rows(dna, sep = '(?<=.)(?=.)') |>
    left_join(iupac, by = c("dna" = "code")) |>
    pull(base) |>
    # list of candidate bases per position
    str_split("") |>
    # every combination of one base per position
    expand.grid(stringsAsFactors = FALSE) |>
    # paste each combination back into a single string
    unite(col = sequence, sep = "") |>
    as_vector()
}
make_unambiguous(dna = "AAYGANAGYCARAGYAAR")
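For readers who would rather stay in Python, the same per-position expansion can be sketched with itertools.product from the standard library. This is a rough counterpart to the R function, not a Biopython API, and it may or may not match the Stack Overflow answer mentioned earlier in the thread. IUPAC_DNA is a hand-written table (Biopython ships a comparable mapping as Bio.Data.IUPACData.ambiguous_dna_values, which could probably be used instead), and count_unambiguous is a hypothetical helper added for illustration.

```python
from itertools import product
from math import prod  # Python 3.8+

# Hand-written IUPAC table: each code maps to the bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_unambiguous(dna):
    """Number of unambiguous sequences, computed without generating them."""
    return prod(len(IUPAC_DNA[code]) for code in dna.upper())


def make_unambiguous(dna):
    """Lazily yield every unambiguous sequence represented by dna.

    Raises KeyError if dna contains a character outside the table above.
    """
    options = [IUPAC_DNA[code] for code in dna.upper()]
    for combination in product(*options):
        yield "".join(combination)


print(count_unambiguous("AAYGANAGYCARAGYAAR"))   # 128
print(list(make_unambiguous("KKGTACACCAG")))
# ['GGGTACACCAG', 'GTGTACACCAG', 'TGGTACACCAG', 'TTGTACACCAG']
```

The count grows multiplicatively with each ambiguous position, so checking count_unambiguous first and iterating the generator (rather than building a list) is the safer choice for long or N-rich sequences. The order differs from the R output because expand.grid varies the first position fastest, whereas product varies the last.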
