2
df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3], 'b': [5,5,1,1,3,3,3,1,2,1,1,1,]})
>>> df
    a  b
0   1  5
1   1  5
2   1  1
3   1  1
4   2  3
5   2  3
6   2  3
7   2  1
8   3  2
9   3  1
10  3  1
11  3  1
>>> df.groupby(['a','b']).size().to_dict()
{(1, 5): 2, (3, 2): 1, (2, 3): 3, (3, 1): 3, (1, 1): 2, (2, 1): 1}

What I am getting is the count of each a and b combination, keyed by a tuple of the pair, but what I am trying to get is:

{1: {5: 2, 1: 2}, 2: {3: 3, 1: 1}, 3: {2: 1, 1: 3} }
jpp
Tony

3 Answers

3

You'll need an additional groupby inside a dict comprehension:

i = df.groupby(['a','b']).size().reset_index(level=1)
j = {k : dict(g.values) for k, g in i.groupby(level=0)}

print(j)
{
    1: {1: 2, 5: 2}, 
    2: {1: 1, 3: 3}, 
    3: {1: 3, 2: 1}
}
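A possible variant of the same idea, written out step by step. This is only a sketch: the names `sizes` and `nested` are illustrative and not from the answer, and it assumes a pandas version that provides `Series.droplevel`. The `int(...)` calls are there so the result holds plain Python integers rather than NumPy ones.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'b': [5, 5, 1, 1, 3, 3, 3, 1, 2, 1, 1, 1]})

# Count each (a, b) pair; the result is a Series with a two-level index.
sizes = df.groupby(['a', 'b']).size()

# Group that Series by its first index level, then drop the level from
# each group so only the 'b' values remain as the inner keys.
nested = {
    int(a): {int(b): int(n) for b, n in grp.droplevel(0).items()}
    for a, grp in sizes.groupby(level=0)
}

print(nested)
```

Whether this reads better than `reset_index` plus `dict(g.values)` is a matter of taste; both group the counts a second time.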
cs95
2

You can use collections.defaultdict for an O(n) solution.

from collections import defaultdict
import pandas as pd

df = pd.DataFrame({'a': [1,1,1,1,2,2,2,2,3,3,3,3], 'b': [5,5,1,1,3,3,3,1,2,1,1,1,]})

d = defaultdict(lambda: defaultdict(int))

for i, j in map(tuple, df.values):
    d[i][j] += 1

# defaultdict(<function __main__.<lambda>>,
#             {1: defaultdict(int, {1: 2, 5: 2}),
#              2: defaultdict(int, {1: 1, 3: 3}),
#              3: defaultdict(int, {1: 3, 2: 1})})
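If a plain nested `dict` is wanted rather than nested `defaultdict` objects, the result can be converted afterwards. The following is a minimal sketch under that assumption; the names `counts` and `plain` are illustrative, and iterating over the two columns with `zip` and `tolist()` is a substitution for `df.values` so that the keys are plain Python integers.

```python
from collections import defaultdict
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'b': [5, 5, 1, 1, 3, 3, 3, 1, 2, 1, 1, 1]})

counts = defaultdict(lambda: defaultdict(int))

# One pass over the rows; tolist() yields plain Python ints.
for a, b in zip(df['a'].tolist(), df['b'].tolist()):
    counts[a][b] += 1

# Convert the inner and outer defaultdicts to ordinary dicts.
plain = {a: dict(inner) for a, inner in counts.items()}

print(plain)
```

The conversion is a second, much shorter pass over the distinct keys, so the overall cost stays linear in the number of rows.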
jpp
  • Thanks for your answer. That is the approach I am currently using. I was just wondering whether pandas tools offer a vectorised approach to achieving this – Tony Apr 19 '18 at 15:55
  • My solution is *not* vectorised, it is a pure Python loop. – jpp Apr 19 '18 at 16:07
  • 1
    @Tony as a general rule, don't assume `groupby` or `apply` means `vectorized`... it doesn't. jpp is right to highlight O(n) solutions. However, cᴏʟᴅsᴘᴇᴇᴅ has provided an O(n) solution as well. If performance is an issue, make sure to say so in your question. It will inform us how to answer. And jpp is right again to suggest you should test this on your data. It is wrong to assume that a simple for loop is **always** worse. – piRSquared Apr 19 '18 at 16:08
  • @piRSquared I didn't mention it in my question because in my mind the simplest solution would involve something similar to this: [link](https://stackoverflow.com/questions/41998624/how-to-convert-pandas-dataframe-to-nested-dictionary) that I could just not figure out myself. You are right in that I should be more explicit in my request. Thanks for your answers – Tony Apr 19 '18 at 16:12
  • I'll go on as to why I like this approach. Much of the overhead involved in looping (even when O(n)) is the creation of objects. In my solution and cᴏʟᴅsᴘᴇᴇᴅ's, we are creating Pandas objects within a comprehension. jpp's solution avoids that overhead and simply adds to an existing key. This should be efficient – piRSquared Apr 19 '18 at 16:12
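The advice in the comments to test on your own data could be followed with something like the sketch below. The function names `with_groupby` and `with_defaultdict` are hypothetical wrappers around the two approaches above, and the 12-row frame is far too small for meaningful timings; substitute your real data before drawing conclusions.

```python
import timeit
from collections import defaultdict
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'b': [5, 5, 1, 1, 3, 3, 3, 1, 2, 1, 1, 1]})

def with_groupby(frame):
    # groupby inside a dict comprehension
    sizes = frame.groupby(['a', 'b']).size().reset_index(level=1)
    return {k: dict(g.values) for k, g in sizes.groupby(level=0)}

def with_defaultdict(frame):
    # pure Python loop over the rows
    d = defaultdict(lambda: defaultdict(int))
    for i, j in zip(frame['a'].tolist(), frame['b'].tolist()):
        d[i][j] += 1
    return {k: dict(v) for k, v in d.items()}

# Both approaches should agree before their speed is compared.
assert with_groupby(df) == with_defaultdict(df)

for fn in (with_groupby, with_defaultdict):
    seconds = timeit.timeit(lambda: fn(df), number=100)
    print(f"{fn.__name__}: {seconds:.4f} s for 100 runs")
```

No timings are quoted here because they depend on the pandas version, the machine, and above all the size and cardinality of the data.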
2
from collections import Counter
import pandas as pd

s = pd.Series(Counter(zip(df.a, df.b)))
{
    n: d.xs(n).to_dict()
    for n, d in s.groupby(level=0)
}

{1: {1: 2, 5: 2}, 2: {1: 1, 3: 3}, 3: {1: 3, 2: 1}}
piRSquared