Is it possible to succinctly perform a groupby and aggregation on a DataFrame whose columns are a MultiIndex?
For instance, on the DataFrame below:
cols = pd.MultiIndex.from_tuples([("ColGrp1", "col1"),
("ColGrp1", "col2"),
("ColGrp2", "agg1"),
("ColGrp2", "agg2")
])
df = pd.DataFrame([["a", "a", "a", "b", "b"],
["c", "c", "c", "d", "d"],
[1, 2, 3, 4, 5],
[10, 20, 30, 40, 50]]
).T
df.columns = cols
Which looks like:
ColGrp1 ColGrp2
col1 col2 agg1 agg2
0 a c 1 10
1 a c 2 20
2 a c 3 30
3 b d 4 40
4 b d 5 50
If we then wanted to group by all the columns of ColGrp1 while aggregating all the columns of ColGrp2. Say, taking their sum & mean -- so the result would look like:
ColGrp2
agg1 agg2
sum mean sum mean
(ColGrp1, col1) (ColGrp1, col2)
a c 6 2.0 60 20.0
b d 9 4.5 90 45.0
I'm wondering if there's a succinct syntax for this? I had hoped something like the below would work:
df.groupby("ColGrp1")["ColGrp2"].agg(["sum", "mean"])
Or:
df.groupby("ColGrp1", level=0)["ColGrp2"].agg(["sum", "mean"])
However these both give errors -- ValueError: Grouper for 'ColGrp1' not 1-dimensional and KeyError: 'ColGrp2' respectively.
I was able to do it, using lists of tuples like so:
grp_cols = [('ColGrp1', col) for col in df['ColGrp1']]
agg_cols = [('ColGrp2', col) for col in df['ColGrp2']]
df.groupby(grp_cols)[agg_cols].agg(["sum", "mean"])
But that felt inelegant...
Interestingly, creating the following GroupByObject doesn't throw an error:
g = df.groupby("ColGrp1", level=0)["ColGrp2"]
The error is only when I apply an aggregation function -- g.agg(...) as above, or g.sum() which gives a slightly different error (KeyError: "None of [Index(['ColGrp2'], dtype='object')] are in the [columns]").