nested dictionary of bin sizes from groupby multiple columns

Question

>>> df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3], 'b': [5,5,1,1,3,3,3,1,2,1,1,1]})
>>> df
    a  b
0   1  5
1   1  5
2   1  1
3   1  1
4   2  3
5   2  3
6   2  3
7   2  1
8   3  2
9   3  1
10  3  1
11  3  1
>>> df.groupby(['a','b']).size().to_dict()
{(1, 5): 2, (3, 2): 1, (2, 3): 3, (3, 1): 3, (1, 1): 2, (2, 1): 1}

What I am getting is the count of each a and b combination, keyed by a tuple of the pair. What I am trying to get is a nested dictionary:

{1: {5: 2, 1: 2}, 2: {3: 3, 1: 1}, 3: {2: 1, 1: 3} }

score 3 · Accepted Answer · answered Apr 19 '18 at 15:35

You'll need an additional groupby inside a dict comprehension:

i = df.groupby(['a','b']).size().reset_index(level=1)
j = {k : dict(g.values) for k, g in i.groupby(level=0)}

print(j)
{
    1: {1: 2, 5: 2}, 
    2: {1: 1, 3: 3}, 
    3: {1: 3, 2: 1}
}

cs95

330,695
80
606
657

score 2 · Answer 2 · answered Apr 19 '18 at 15:44

You can use collections.defaultdict for an O(n) solution.

from collections import defaultdict

df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3], 'b': [5,5,1,1,3,3,3,1,2,1,1,1]})

d = defaultdict(lambda: defaultdict(int))

for i, j in map(tuple, df.values):
    d[i][j] += 1

# defaultdict(<function __main__.<lambda>>,
#             {1: defaultdict(int, {1: 2, 5: 2}),
#              2: defaultdict(int, {1: 1, 3: 3}),
#              3: defaultdict(int, {1: 3, 2: 1})})
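If plain dictionaries are wanted rather than nested defaultdicts, the result can be converted afterwards. The following is only a minimal sketch of that idea, not part of the original answer: it assumes the same `df` as in the question, iterates with `zip` over the two columns instead of `df.values`, and the name `result` is introduced here purely for illustration.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3],
                   'b': [5,5,1,1,3,3,3,1,2,1,1,1]})

d = defaultdict(lambda: defaultdict(int))

# Single pass over the rows; .tolist() yields plain Python ints as keys
for i, j in zip(df['a'].tolist(), df['b'].tolist()):
    d[i][j] += 1

# Convert the inner and outer defaultdicts into ordinary dicts
# ('result' is a hypothetical name, not from the original answer)
result = {k: dict(v) for k, v in d.items()}

print(result)
# {1: {5: 2, 1: 2}, 2: {3: 3, 1: 1}, 3: {2: 1, 1: 3}}
```

On Python 3.7+ the keys appear in first-seen order, which happens to match the layout requested in the question.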

jpp

147,904
31
244
302

Thanks for your answer. That is the approach I am currently using. I was just wondering whether pandas offers a vectorised way to achieve this – Tony Apr 19 '18 at 15:55
My solution is *not* vectorised, it is a pure Python loop. – jpp Apr 19 '18 at 16:07
@Tony as a general rule, don't assume `groupby` or `apply` means `vectorized`... it doesn't. jpp is right to highlight O(n) solutions. However, cᴏʟᴅsᴘᴇᴇᴅ has provided an O(n) solution as well. If performance is an issue, make sure to say so in your question. It will inform us how to answer. And jpp is right again to suggest you should test this on your data. It is wrong to assume that a simple for loop is **always** worse. – piRSquared Apr 19 '18 at 16:08
@piRSquared I didn't mention it in my question because in my mind the simplest solution would involve something similar to this: [link](https://stackoverflow.com/questions/41998624/how-to-convert-pandas-dataframe-to-nested-dictionary) that I could just not figure out myself. You are right in that I should be more explicit in my request. Thanks for your answers – Tony Apr 19 '18 at 16:12
I'll go on as to why I like this approach. Much of the overhead involved in looping (even when O(n)) is the creation of objects. In my solution and cᴏʟᴅsᴘᴇᴇᴅ's, we are creating Pandas objects within a comprehension. jpp's solution avoids that overhead and simply adds to an existing key. This should be efficient – piRSquared Apr 19 '18 at 16:12
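The advice in these comments to test on your own data could be sketched roughly as below. This is an illustration only: the wrapper names `nested_groupby`, `nested_defaultdict` and `nested_counter` are hypothetical, the defaultdict variant iterates with `zip` rather than `df.values`, and the 12-row frame from the question is far too small to say anything about real workloads, so substitute your own data before drawing conclusions.

```python
import timeit
from collections import Counter, defaultdict

import pandas as pd

df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3],
                   'b': [5,5,1,1,3,3,3,1,2,1,1,1]})

def nested_groupby(df):
    # groupby + size, then a second groupby inside a dict comprehension
    i = df.groupby(['a', 'b']).size().reset_index(level=1)
    return {k: dict(g.values) for k, g in i.groupby(level=0)}

def nested_defaultdict(df):
    # pure Python loop that increments existing keys
    d = defaultdict(lambda: defaultdict(int))
    for i, j in zip(df['a'].tolist(), df['b'].tolist()):
        d[i][j] += 1
    return {k: dict(v) for k, v in d.items()}

def nested_counter(df):
    # Counter over (a, b) pairs, then a groupby on the first index level
    s = pd.Series(Counter(zip(df.a, df.b)))
    return {n: g.xs(n).to_dict() for n, g in s.groupby(level=0)}

# Time each approach; absolute numbers depend on machine and data
for func in (nested_groupby, nested_defaultdict, nested_counter):
    seconds = timeit.timeit(lambda: func(df), number=100)
    print(f"{func.__name__}: {seconds:.4f} s for 100 runs")
```

All three functions should return equal dictionaries for the same input, so checking that first is a cheap sanity test before comparing timings.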

score 2 · Answer 3 · answered Apr 19 '18 at 15:53

from collections import Counter
import pandas as pd

s = pd.Series(Counter(zip(df.a, df.b)))
{
    n: d.xs(n).to_dict()
    for n, d in s.groupby(level=0)
}

{1: {1: 2, 5: 2}, 2: {1: 1, 3: 3}, 3: {1: 3, 2: 1}}

piRSquared

265,629
48
427
571
