0

I have a list of sequences:

CAGGTAGCC
CCGGTCAGA
AGGGTTTGA
TTGGTGAGG
CAAGTATGA
ACTGTATGC
CTGGTAACC

Each sequence is nine nucleotides long. I want to calculate the dinucleotide frequency of AA, CG ..etc for each of the dinucleotide positions. Similar problem shared with one nucleotide in this post.

For example, 1st two column will have these bases CA,CC,AG,TT etc, and second two should be AG,CG,GG,TG and so on. Output should be something like this

AA   0.25    0.34    0.56    0.43    0.00    0.90    0.45    0.34    0.31
CC   0.45    0.40    0.90    0.00    0.40    0.90    0.30    0.25    0.2
GC   0.00    0.00    0.34    1.00    0.30    0.30    0.35    0.90    0.1
TC   0.24    0.56    0.00    0.00    1.00    0.34    0.45    0.35    0.36

AA, CC, GC and TC could be any dinucleotide combination.

What I tried from the last post: mainly taken from @Devon

import itertools
hl = []

#making all possible dinucleotide combinations bases=['A','T','G','C'] k = 2

kmers = [''.join(p) for p in itertools.product(bases, repeat=k)] kmers = dict.fromkeys(kmers, 0) print(kmers) for i in range(9): hl.append(kmers)

my seq file

f = open("H3K27ac.seq") nLines = 0

for line in f: id = 0

trying to read dinucleotide at each loop

for (fst, sec) in zip(line[0::2], line[1::2]):
    id = id + 1
    hl[id][fst+sec] += 1
nLines += 1

f.close()

nLines = float(nLines) for char in ['AA', 'TT', 'CC', 'GG', 'CG', 'GC']:
print("{}\t{}".format(char, "\t".join(["{:0.2f}".format(x[char]/nLines) for x in hl])))

By somehow it's estimation the total count of all dinucleotide in whole file. I am trying to make it more general so that it can be used for tri- and any further number of nucleotides.

kashiff007
  • 103
  • 3

1 Answers1

2

This should do the trick. The following code works for any aligned character set, but assumes that all sequences should be the same length.

import itertools

with open("f.txt") as seqs: sequences = list(line.strip() for line in seqs.readlines())

Just in case sequence lengths are different

n = min(len(s) for s in sequences) if any(n != len(s) for s in sequences): print("Sequence lengths are different!")

Get all bases.

bases = set(c for seq in sequences for c in seq)

k = 2 kmers = [''.join(p) for p in itertools.product(bases, repeat=k)] pos_to_freq = []

for i in range(n - k + 1): kmers_dict = dict.fromkeys(kmers, 0) for s in sequences: kmers_dict[s[i:i+k]] += 1/len(sequences) pos_to_freq.append(kmers_dict)

for kmer in kmers: print("\t".join([kmer] + [str(round(d[kmer], 3)) for d in pos_to_freq]))

Output for your example is below (I apologize for the formatting):

AA  0   0.143   0   0   0   0.143   0   0
AT  0   0   0   0   0   0.286   0   0
AC  0.143   0   0   0   0   0   0.143   0
AG  0.143   0.143   0.143   0   0   0.143   0.286   0
TA  0   0   0   0   0.571   0   0   0
TT  0.143   0   0   0   0.143   0.143   0   0
TC  0   0   0   0   0.143   0   0   0
TG  0   0.286   0.143   0   0.143   0   0.429   0
CA  0.286   0   0   0   0   0.143   0   0
CT  0.143   0.143   0   0   0   0   0   0
CC  0.143   0   0   0   0   0   0   0.286
CG  0   0.143   0   0   0   0   0   0
GA  0   0   0   0   0   0.143   0   0.429
GT  0   0   0   1.0 0   0   0   0
GC  0   0   0   0   0   0   0.143   0.143
GG  0   0.143   0.714   0   0   0   0   0.143
Throckmorton
  • 810
  • 6
  • 15