calculating kmer nucleotide frequency per column

Question

I have a list of sequences:

CAGGTAGCC
CCGGTCAGA
AGGGTTTGA
TTGGTGAGG
CAAGTATGA
ACTGTATGC
CTGGTAACC

Each sequence is nine nucleotides long. I want to calculate the dinucleotide frequency of AA, CG ..etc for each of the dinucleotide positions. Similar problem shared with one nucleotide in this post.

For example, 1st two column will have these bases CA,CC,AG,TT etc, and second two should be AG,CG,GG,TG and so on. Output should be something like this

AA   0.25    0.34    0.56    0.43    0.00    0.90    0.45    0.34    0.31
CC   0.45    0.40    0.90    0.00    0.40    0.90    0.30    0.25    0.2
GC   0.00    0.00    0.34    1.00    0.30    0.30    0.35    0.90    0.1
TC   0.24    0.56    0.00    0.00    1.00    0.34    0.45    0.35    0.36

AA, CC, GC and TC could be any dinucleotide combination.

What I tried from the last post: mainly taken from @Devon

import itertools
hl = []
#making all possible dinucleotide combinations
bases=['A','T','G','C']
k = 2
kmers = [''.join(p) for p in itertools.product(bases, repeat=k)]
kmers = dict.fromkeys(kmers, 0)
print(kmers)
for i in range(9):
    hl.append(kmers)
my seq file
f = open("H3K27ac.seq")
nLines = 0
for line in f:
    id = 0
trying to read dinucleotide at each loop
for (fst, sec) in zip(line[0::2], line[1::2]):
    id = id + 1
    hl[id][fst+sec] += 1
nLines += 1

f.close()
nLines = float(nLines)
for char in ['AA', 'TT', 'CC', 'GG', 'CG', 'GC']:

    print("{}\t{}".format(char, "\t".join(["{:0.2f}".format(x[char]/nLines) for x in hl])))

By somehow it's estimation the total count of all dinucleotide in whole file. I am trying to make it more general so that it can be used for tri- and any further number of nucleotides.

score 2 · Accepted Answer · answered Aug 04 '21 at 20:36

This should do the trick. The following code works for any aligned character set, but assumes that all sequences should be the same length.

import itertools
with open("f.txt") as seqs:
    sequences = list(line.strip() for line in seqs.readlines())
Just in case sequence lengths are different
n = min(len(s) for s in sequences)
if any(n != len(s) for s in sequences):
    print("Sequence lengths are different!")
Get all bases.
bases = set(c for seq in sequences for c in seq)
k = 2 
kmers = [''.join(p) for p in itertools.product(bases, repeat=k)]
pos_to_freq = []
for i in range(n - k + 1):
    kmers_dict = dict.fromkeys(kmers, 0)
    for s in sequences:
        kmers_dict[s[i:i+k]] += 1/len(sequences)
    pos_to_freq.append(kmers_dict)
for kmer in kmers:
    print("\t".join([kmer] + [str(round(d[kmer], 3)) for d in pos_to_freq]))

Output for your example is below (I apologize for the formatting):

AA  0   0.143   0   0   0   0.143   0   0
AT  0   0   0   0   0   0.286   0   0
AC  0.143   0   0   0   0   0   0.143   0
AG  0.143   0.143   0.143   0   0   0.143   0.286   0
TA  0   0   0   0   0.571   0   0   0
TT  0.143   0   0   0   0.143   0.143   0   0
TC  0   0   0   0   0.143   0   0   0
TG  0   0.286   0.143   0   0.143   0   0.429   0
CA  0.286   0   0   0   0   0.143   0   0
CT  0.143   0.143   0   0   0   0   0   0
CC  0.143   0   0   0   0   0   0   0.286
CG  0   0.143   0   0   0   0   0   0
GA  0   0   0   0   0   0.143   0   0.429
GT  0   0   0   1.0 0   0   0   0
GC  0   0   0   0   0   0   0.143   0.143
GG  0   0.143   0.714   0   0   0   0   0.143

Works perfect with python3. Many thanks – kashiff007 Aug 04 '21 at 20:57 — kashiff007, Aug 04 '21 at 20:57

calculating kmer nucleotide frequency per column

my seq file

trying to read dinucleotide at each loop

1 Answers1

Just in case sequence lengths are different

Get all bases.