0

Is it possible to succinctly perform a groupby and aggregation on a DataFrame whose columns are a MultiIndex?

For instance, on the DataFrame below:

cols = pd.MultiIndex.from_tuples([("ColGrp1", "col1"), 
                                  ("ColGrp1", "col2"), 
                                  ("ColGrp2", "agg1"), 
                                  ("ColGrp2", "agg2")
                                  ])

df = pd.DataFrame([["a", "a", "a", "b", "b"],
                   ["c", "c", "c", "d", "d"],
                   [1, 2, 3, 4, 5],
                   [10, 20, 30, 40, 50]]
                  ).T

df.columns = cols

Which looks like:

  ColGrp1      ColGrp2
     col1 col2    agg1 agg2
0       a    c       1   10
1       a    c       2   20
2       a    c       3   30
3       b    d       4   40
4       b    d       5   50

If we then wanted to group by all the columns of ColGrp1 while aggregating all the columns of ColGrp2. Say, taking their sum & mean -- so the result would look like:

                                ColGrp2
                                   agg1      agg2
                                    sum mean  sum  mean
(ColGrp1, col1) (ColGrp1, col2)
a               c                     6  2.0   60  20.0
b               d                     9  4.5   90  45.0

I'm wondering if there's a succinct syntax for this? I had hoped something like the below would work:

df.groupby("ColGrp1")["ColGrp2"].agg(["sum", "mean"])

Or:

df.groupby("ColGrp1", level=0)["ColGrp2"].agg(["sum", "mean"])

However these both give errors -- ValueError: Grouper for 'ColGrp1' not 1-dimensional and KeyError: 'ColGrp2' respectively.

I was able to do it, using lists of tuples like so:

grp_cols = [('ColGrp1', col) for col in df['ColGrp1']]
agg_cols = [('ColGrp2', col) for col in df['ColGrp2']]
df.groupby(grp_cols)[agg_cols].agg(["sum", "mean"])

But that felt inelegant...

Interestingly, creating the following GroupByObject doesn't throw an error:

g = df.groupby("ColGrp1", level=0)["ColGrp2"]

The error is only when I apply an aggregation function -- g.agg(...) as above, or g.sum() which gives a slightly different error (KeyError: "None of [Index(['ColGrp2'], dtype='object')] are in the [columns]").

0 Answers0