Using the RPKM values of genes in a metagenome sample, I want to group these genes into established categories (for example KEGG or COG). My goal is to determine which categories are better represented in each sample.
Consider the example table below, with genes, samples (s1-s5) and categories. Each gene has an expression value per sample, and the Category column classifies it.
import sys
import numpy

dictionary = {}
table = open(sys.argv[1], "r")
next(table)  # skip the header line once, before iterating over the data rows
for line in table:
    cols = line.strip().split("\t")
    gene = cols[0]
    s1 = float(cols[1])
    s2 = float(cols[2])
    s3 = float(cols[3])
    s4 = float(cols[4])
    s5 = float(cols[5])
    counts = [s1, s2, s3, s4, s5]
    # keep only the first token of the category column, without quotes
    kegg = cols[6].strip().split(" ")[0].replace('"', '')
    if kegg not in dictionary:
        dictionary[kegg] = [counts]
    else:
        dictionary[kegg].append(counts)
table.close()
#print(dictionary)
for k, v in sorted(dictionary.items()):
    m = numpy.array(v)  # one row per gene, one column per sample
    print(k, [sum(m[:, i]) for i in range(5)])  # range(number_of_samples)
Original table whose column values are to be summed per category:
gene s1 s2 s3 s4 s5 Category
name01 0 2 2 0 0 A
name02 3 0 1 0 0 A
name03 0 2 1 0 0 A
name04 0 0 1 0 0 B
name05 5 0 1 0 0 C
name06 1 0 0 0 0 D
name07 2 0 0 0 0 D
name08 0 0 3 0 0 E
name09 1 0 0 0 0 F
name10 1 0 0 0 0 F
name11 3 0 0 0 0 F
The result obtained is this:
"""
# eliminated from the question
Cat s1 s2 s3 s4 s5
A 0.0 2.0 1.0 0.0 0.0
B 0.0 0.0 1.0 0.0 0.0
C 5.0 0.0 1.0 0.0 0.0
D 2.0 0.0 0.0 0.0 0.0
E 0.0 0.0 3.0 0.0 0.0
F 3.0 0.0 0.0 0.0 0.0
"""
But my expected (obtained) table is this:
Cat s1 s2 s3 s4 s5
A 3.0 4.0 4.0 0.0 0.0
B 0.0 0.0 1.0 0.0 0.0
C 5.0 0.0 1.0 0.0 0.0
D 3.0 0.0 0.0 0.0 0.0
E 0.0 0.0 3.0 0.0 0.0
F 5.0 0.0 0.0 0.0 0.0
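As a cross-check, the expected sums can be reproduced directly from the sample rows. This is only a minimal sketch: the rows are typed in by hand rather than read from a file, and the names `rows`, `grouped` and `totals` are made up for illustration.

```python
# Rows copied from the sample table: (gene, [s1, s2, s3, s4, s5], category)
rows = [
    ("name01", [0, 2, 2, 0, 0], "A"),
    ("name02", [3, 0, 1, 0, 0], "A"),
    ("name03", [0, 2, 1, 0, 0], "A"),
    ("name04", [0, 0, 1, 0, 0], "B"),
    ("name05", [5, 0, 1, 0, 0], "C"),
    ("name06", [1, 0, 0, 0, 0], "D"),
    ("name07", [2, 0, 0, 0, 0], "D"),
    ("name08", [0, 0, 3, 0, 0], "E"),
    ("name09", [1, 0, 0, 0, 0], "F"),
    ("name10", [1, 0, 0, 0, 0], "F"),
    ("name11", [3, 0, 0, 0, 0], "F"),
]

# Collect every gene's counts under its category (a list of rows per category).
grouped = {}
for gene, counts, category in rows:
    grouped.setdefault(category, []).append([float(c) for c in counts])

# zip(*count_rows) transposes the rows, so each tuple holds one sample's values.
totals = {}
for category, count_rows in sorted(grouped.items()):
    totals[category] = [sum(column) for column in zip(*count_rows)]
    print(category, *totals[category])
```

Each category keeps all of its rows, so the per-sample sums are taken over every gene in the category and not only the last one read.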
If you append a list to a list, you get a list inside a list. extend is probably what you want. See https://stackoverflow.com/a/252711/1878788. Besides, you might be interested in collections.defaultdict: dictionary = defaultdict(list) and then: dictionary[kegg].extend(counts), without need to test whether it already has a kegg entry. – bli Nov 08 '18 at 12:21
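To illustrate the difference between the two list methods and the `defaultdict` idea from the comment, here is a small sketch. The variable names (`nested`, `flat`, `by_category`, `sample_rows`) and the three-sample rows are assumptions chosen for the example, not part of the original script.

```python
from collections import defaultdict

# append adds its argument as ONE element, so lists end up nested.
nested = []
nested.append([0, 2, 2])
nested.append([3, 0, 1])
print(nested)  # [[0, 2, 2], [3, 0, 1]]

# extend adds each element of its argument, so the result stays flat.
flat = []
flat.extend([0, 2, 2])
flat.extend([3, 0, 1])
print(flat)  # [0, 2, 2, 3, 0, 1]

# defaultdict(list) creates an empty list the first time a key is used,
# so no "if key not in dictionary" test is needed.
sample_rows = [
    ("A", [0.0, 2.0, 2.0]),
    ("A", [3.0, 0.0, 1.0]),
    ("B", [0.0, 0.0, 1.0]),
]
by_category = defaultdict(list)
for category, counts in sample_rows:
    by_category[category].append(counts)

print(dict(by_category))
```

Which method fits depends on the next step: `append` keeps one row per gene, which a column-wise sum can work with, while `extend` produces a single flat list per category in which the per-sample boundaries are no longer explicit.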