I have a list of sequences:
CAGGTAGCC
CCGGTCAGA
AGGGTTTGA
TTGGTGAGG
CAAGTATGA
ACTGTATGC
CTGGTAACC
Each sequence is nine nucleotides long. I want to calculate the dinucleotide frequency of AA, CG ..etc for each of the dinucleotide positions. Similar problem shared with one nucleotide in this post.
For example, 1st two column will have these bases CA,CC,AG,TT etc, and second two should be AG,CG,GG,TG and so on. Output should be something like this
AA 0.25 0.34 0.56 0.43 0.00 0.90 0.45 0.34 0.31
CC 0.45 0.40 0.90 0.00 0.40 0.90 0.30 0.25 0.2
GC 0.00 0.00 0.34 1.00 0.30 0.30 0.35 0.90 0.1
TC 0.24 0.56 0.00 0.00 1.00 0.34 0.45 0.35 0.36
AA, CC, GC and TC could be any dinucleotide combination.
What I tried from the last post: mainly taken from @Devon
import itertools
hl = []
#making all possible dinucleotide combinations
bases=['A','T','G','C']
k = 2
kmers = [''.join(p) for p in itertools.product(bases, repeat=k)]
kmers = dict.fromkeys(kmers, 0)
print(kmers)
for i in range(9):
hl.append(kmers)
my seq file
f = open("H3K27ac.seq")
nLines = 0
for line in f:
id = 0
trying to read dinucleotide at each loop
for (fst, sec) in zip(line[0::2], line[1::2]):
id = id + 1
hl[id][fst+sec] += 1
nLines += 1
f.close()
nLines = float(nLines)
for char in ['AA', 'TT', 'CC', 'GG', 'CG', 'GC']:
print("{}\t{}".format(char, "\t".join(["{:0.2f}".format(x[char]/nLines) for x in hl])))
By somehow it's estimation the total count of all dinucleotide in whole file. I am trying to make it more general so that it can be used for tri- and any further number of nucleotides.