Group genes by functional category, summing expression values

Question

Using RPKM values for genes in metagenome samples, I want to group the genes into established categories (for example KEGG or COG). My goal is to determine which categories are best represented in each sample.

Consider the example table with genes, samples (s1-s5) and categories. Each gene has an expression value per sample, and the Category column classifies it.

import sys
import numpy

dictionary = {}

with open(sys.argv[1], "r") as table:
    next(table)  # skip the header line
    for line in table:
        cols = line.strip().split("\t")
        gene = cols[0]
        # expression values for samples s1..s5
        counts = [float(x) for x in cols[1:6]]
        kegg = cols[6].strip().split(" ")[0].replace('"', '')

        # keep one list of counts per gene under its category
        if kegg not in dictionary:
            dictionary[kegg] = [counts]
        else:
            dictionary[kegg].append(counts)

for k, v in sorted(dictionary.items()):
    m = numpy.array(v)  # rows: genes of this category, columns: samples
    print(k, m.sum(axis=0).tolist())  # column-wise sum, one value per sample

Original table whose column values are summed:

gene    s1  s2  s3  s4  s5  Category
name01  0   2   2   0   0   A
name02  3   0   1   0   0   A
name03  0   2   1   0   0   A
name04  0   0   1   0   0   B
name05  5   0   1   0   0   C
name06  1   0   0   0   0   D
name07  2   0   0   0   0   D
name08  0   0   3   0   0   E
name09  1   0   0   0   0   F
name10  1   0   0   0   0   F
name11  3   0   0   0   0   F

The result obtained is this:

"""
# eliminated from the question

Cat s1  s2  s3  s4  s5
A   0.0 2.0 1.0 0.0 0.0
B   0.0 0.0 1.0 0.0 0.0
C   5.0 0.0 1.0 0.0 0.0
D   2.0 0.0 0.0 0.0 0.0
E   0.0 0.0 3.0 0.0 0.0
F   3.0 0.0 0.0 0.0 0.0
"""

But the table I expected (and now obtain) is this:

Cat s1  s2  s3  s4  s5
A   3.0 4.0 4.0 0.0 0.0
B   0.0 0.0 1.0 0.0 0.0
C   5.0 0.0 1.0 0.0 0.0
D   3.0 0.0 0.0 0.0 0.0
E   0.0 0.0 3.0 0.0 0.0
F   5.0 0.0 0.0 0.0 0.0

Have you looked into using the pandas library? It is specifically designed to do these transformations. — Bioathlete, Nov 07 '18 at 19:49
Not yet. Thank you for the suggestion. If you have an idea of how to use it, I would be grateful. — F.Lira, Nov 07 '18 at 21:49
I suspect you do not build your dictionary correctly. If you append a list to a list, you get a list inside a list. extend is probably what you want. See https://stackoverflow.com/a/252711/1878788 Besides, you might be interested in collections.defaultdict: dictionary = defaultdict(list) and then dictionary[kegg].extend(counts), with no need to test whether it already has a kegg entry. — bli, Nov 08 '18 at 12:21
@bli I corrected the script and updated the question. Now it works. — F.Lira, Nov 08 '18 at 13:11
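
For illustration, here is a minimal sketch of the collections.defaultdict idea from the comments. It is not the asker's corrected script. It assumes the first four rows of the example table are already parsed into tuples, and the names rows, per_category and totals are made up for this example. Because the sums are taken per sample afterwards, the sketch uses append (one list per gene) rather than extend, which would flatten all values of a category into a single list.

```python
from collections import defaultdict

# First four rows of the question's table, already parsed:
# (gene, [s1, s2, s3, s4, s5], category)
rows = [
    ("name01", [0, 2, 2, 0, 0], "A"),
    ("name02", [3, 0, 1, 0, 0], "A"),
    ("name03", [0, 2, 1, 0, 0], "A"),
    ("name04", [0, 0, 1, 0, 0], "B"),
]

# defaultdict(list) creates the empty list on first access,
# so no "if category not in ..." test is needed.
per_category = defaultdict(list)
for _gene, counts, category in rows:
    per_category[category].append(counts)

# zip(*lists) walks the per-gene lists column by column (sample by sample).
totals = {
    category: [sum(column) for column in zip(*gene_counts)]
    for category, gene_counts in per_category.items()
}

for category in sorted(totals):
    print(category, totals[category])
```

With these four rows, category A sums to 3, 4, 4, 0, 0, which matches the first row of the expected table.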

bli · Accepted Answer · 2018-11-08T12:27:57.697

As suggested in the comments, the pandas module is quite convenient for this type of work.

Here is how you could do it (assuming you have tab-separated values as input, and you want the same file format for the output):

import pandas as pd

counts = pd.read_table("counts.tsv")
# At this point, counts looks as follows:
#       gene  s1  s2  s3  s4  s5 Category
# 0   name01   0   2   2   0   0        A
# 1   name02   3   0   1   0   0        A
# 2   name03   0   2   1   0   0        A
# 3   name04   0   0   1   0   0        B
# 4   name05   5   0   1   0   0        C
# 5   name06   1   0   0   0   0        D
# 6   name07   2   0   0   0   0        D
# 7   name08   0   0   3   0   0        E
# 8   name09   1   0   0   0   0        F
# 9   name10   1   0   0   0   0        F
# 10  name11   3   0   0   0   0        F

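# numeric_only=True leaves the text column "gene" out of the sums
# (newer pandas versions may otherwise concatenate the gene names)
summed_by_cat = counts.groupby("Category").sum(numeric_only=True)
# At this point, summed_by_cat looks as follows: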
#           s1  s2  s3  s4  s5
# Category                    
# A          3   4   4   0   0
# B          0   0   1   0   0
# C          5   0   1   0   0
# D          3   0   0   0   0
# E          0   0   3   0   0
# F          5   0   0   0   0

# Change the index name to get the desired first column header in the output:
summed_by_cat.index.name = "Cat"

# Write to a file
summed_by_cat.to_csv("by_cat.tsv", sep="\t")

The resulting file contains the following:

Cat s1  s2  s3  s4  s5
A   3   4   4   0   0
B   0   0   1   0   0
C   5   0   1   0   0
D   3   0   0   0   0
E   0   0   3   0   0
F   5   0   0   0   0

You could also load your table using counts = pd.read_table("counts.tsv", index_col="gene"), in which case counts would have the gene names as its index instead of the default numbers:

        s1  s2  s3  s4  s5 Category
gene                               
name01   0   2   2   0   0        A
name02   3   0   1   0   0        A
name03   0   2   1   0   0        A
name04   0   0   1   0   0        B
name05   5   0   1   0   0        C
name06   1   0   0   0   0        D
name07   2   0   0   0   0        D
name08   0   0   3   0   0        E
name09   1   0   0   0   0        F
name10   1   0   0   0   0        F
name11   3   0   0   0   0        F

This doesn't seem to affect the final outcome.
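
That claim can be checked with a short sketch. To avoid needing a file on disk, it assumes the example table is held in a string (the names TSV and sum_by_category are made up for this illustration), and it reads the data with pd.read_csv and sep="\t". Passing numeric_only=True is also an assumption of this sketch, used so that the text column gene stays out of the sums in both variants.

```python
import io

import pandas as pd

# Inline, tab-separated copy of the question's table.
TSV = (
    "gene\ts1\ts2\ts3\ts4\ts5\tCategory\n"
    "name01\t0\t2\t2\t0\t0\tA\n"
    "name02\t3\t0\t1\t0\t0\tA\n"
    "name03\t0\t2\t1\t0\t0\tA\n"
    "name04\t0\t0\t1\t0\t0\tB\n"
    "name05\t5\t0\t1\t0\t0\tC\n"
    "name06\t1\t0\t0\t0\t0\tD\n"
    "name07\t2\t0\t0\t0\t0\tD\n"
    "name08\t0\t0\t3\t0\t0\tE\n"
    "name09\t1\t0\t0\t0\t0\tF\n"
    "name10\t1\t0\t0\t0\t0\tF\n"
    "name11\t3\t0\t0\t0\t0\tF\n"
)


def sum_by_category(frame):
    """Sum the numeric sample columns within each category."""
    return frame.groupby("Category").sum(numeric_only=True)


# Variant 1: default integer index, "gene" is an ordinary text column.
default_index = pd.read_csv(io.StringIO(TSV), sep="\t")
# Variant 2: gene names used as the index.
gene_index = pd.read_csv(io.StringIO(TSV), sep="\t", index_col="gene")

by_cat_default = sum_by_category(default_index)
by_cat_gene = sum_by_category(gene_index)

# True when both variants give identical per-category sums.
print(by_cat_default.equals(by_cat_gene))
```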

score 1 · Answer 2 · answered Nov 07 '18 at 19:58

Below is code to do that, though if these are really gene counts then I highly suggest you provide more details about why you're trying to do this before you actually do it.

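#!/usr/bin/env python
# Assumes rows of the same category are adjacent, as in the example table.
print("Cat s1  s2  s3  s4  s5")
v = []  # running per-sample sums for the current category
category = None
with open("counts.txt") as table:
    for line in table:
        if line.startswith("gene"):  # skip the header
            continue
        cols = line.split()
        if not cols:  # skip blank lines
            continue
        if cols[-1] == category:
            # same category as the previous row: add to the running sums
            for idx in range(5):
                v[idx] += int(cols[idx + 1])
        else:
            # a new category starts: print the finished one first
            if category:
                print("{}   {}".format(category, " ".join(["{:.1f}".format(x) for x in v])))
            v = [int(x) for x in cols[1:-1]]
            category = cols[-1]
# print the last category
print("{}   {}".format(category, " ".join(["{:.1f}".format(x) for x in v])))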
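
The loop above accumulates sums while consecutive rows share a category, so it relies on the input being grouped by category. If that cannot be guaranteed, one possible variant (a sketch, not part of the original answer) is to sort the rows by category first and let itertools.groupby collect them. The rows are hard-coded and deliberately shuffled here, and the names lines, records and totals are made up for the illustration.

```python
from itertools import groupby
from operator import itemgetter

# Header plus the first four rows of the question's table, in shuffled order.
lines = [
    "gene s1 s2 s3 s4 s5 Category",
    "name04 0 0 1 0 0 B",
    "name01 0 2 2 0 0 A",
    "name03 0 2 1 0 0 A",
    "name02 3 0 1 0 0 A",
]

# Skip the header and split each row on whitespace.
records = [line.split() for line in lines[1:]]

# groupby only merges adjacent items, so sort by the last column first.
records.sort(key=itemgetter(-1))

totals = {}
for category, group in groupby(records, key=itemgetter(-1)):
    per_gene = [[int(x) for x in cols[1:-1]] for cols in group]
    # zip(*per_gene) iterates sample by sample across the genes.
    totals[category] = [sum(column) for column in zip(*per_gene)]

for category, sums in totals.items():
    print(category, *sums)
```

Sorting loads the whole table into memory, unlike the streaming loop, which should be fine for tables of this size.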
