
I am new to Python. Currently, I have a dataset that contains more than 40 columns that need to be grouped and aggregated. I was wondering if Python has a function similar to the CUBE() function in SQL. If not, how can I get the expected output? I'd really appreciate it if someone could answer. Thanks!

Below is an example that I simplified to 2 columns only (Country & Attribute A):

| CustomerID | Country |Attribute A|Sales| No.of product| No. of transaction|
| ---------- | --------|-----------|-----|--------------|-------------------|
| 1          | US      |A          |20   |2             |2                  |
| 2          | US      |B          |25   |3             |3                  |
|3           |CA       |A          |100  |10            |10                 |
|4           |CA       |B          |50   |5             |5                  |
|5           |UK       |A          |40   |4             |4                  |

Expected Output:

| Country|Attribute A|Sum of Sales|Total no. of product| Total no. of transaction| Total no. of customer|
|--------|-----------|------------|--------------------|-------------------------|----------------------|
|US      |(null)     |45          |5                   |5                        |2                     |
|CA      |(null)     |150         |15                  |15                       |2                     |
|UK      |(null)     |40          |4                   |4                        |1                     |
|(null)  |A          |160         |16                  |16                       |3                     |
|(null)  |B          |75          |8                   |8                        |2                     |
|US      |A          |20          |2                   |2                        |1                     |
|US      |B          |25          |3                   |3                        |1                     |
|CA      |A          |100         |10                  |10                       |1                     |
|CA      |B          |50          |5                   |5                        |1                     |
|UK      |A          |40          |4                   |4                        |1                     |
    Could you describe what exactly is the cube function in SQL – Prats Feb 02 '22 at 14:38
  • @Prats The "expected output" gives a sense of what's going on. It's a "groupby" operation over multiple columns whereby the result (in this case the sum) is shown for every *subset* of values for the two columns of interest. The `US ... (null)` row, for instance, gives sums for over all entries with Country=US. It looks like the expected output is missing a `(null) (null)` row, which would correspond to the overall total. – Ben Grossmann Feb 02 '22 at 15:00
  • @BenGrossmann Yes. Sorry for missing that combination in the output. Do you have any idea how Python can achieve this expected output with a simple function? I saw the comments; it seems it would be complicated to do in Python because my real dataset has many more columns than the dataset in this example. – aaah_ooooh9305 Feb 03 '22 at 03:24
  • I don't know of any such function. I've come up with a way to make this work in general, though. – Ben Grossmann Feb 03 '22 at 04:37

3 Answers


Here's an extension of the other two answers that generalizes to arbitrarily many columns.

import pandas as pd
from itertools import combinations

def cube_sum(df, cols):
    """Aggregate over every subset of `cols` (including the empty one), like SQL's CUBE."""
    dfs = []
    # Group by each subset of the grouping columns, from the full set
    # down to single columns, and sum the remaining (numeric) columns.
    for n in range(len(cols), 0, -1):
        for subset in combinations(cols, n):
            dfs.append(df.groupby(list(subset)).sum().reset_index())
    # The empty subset: a single grand-total row over all rows.
    dfs.append(df.drop(cols, axis=1).sum().to_frame().T)
    return pd.concat(dfs)


df = pd.DataFrame(
    {'CustomerID':[1,2,3,4,5],
    'Country':['US','US','CA','CA','UK'],
    'Attribute A':list("ABABA"),
    'Sales':[20,25,100,50,40],
    'No. of Product':[2,3,10,5,4],
    'No. of Transaction':[2,3,10,5,4]}).set_index('CustomerID')
all_groups = cube_sum(df,["Attribute A","Country"])

The result:

  Attribute A Country  Sales  No. of Product  No. of Transaction
0           A      CA    100              10                  10
1           A      UK     40               4                   4
2           A      US     20               2                   2
3           B      CA     50               5                   5
4           B      US     25               3                   3
0           A     NaN    160              16                  16
1           B     NaN     75               8                   8
0         NaN      CA    150              15                  15
1         NaN      UK     40               4                   4
2         NaN      US     45               5                   5
0         NaN     NaN    235              24                  24
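
The expected output in the question also has a "Total no. of customer" column, which a plain `sum()` won't produce. As a rough sketch (not part of the original answer, reusing the imports above, and assuming `CustomerID` is the unique index so that each row is one customer), the same loop can carry a custom aggregation; `cube_agg` below is a hypothetical extension of `cube_sum`:

def cube_agg(df, cols):
    # Like cube_sum, but also counts customers per group.
    # Assumes each row of df is one distinct customer (CustomerID is the unique index).
    dfs = []
    for n in range(len(cols), 0, -1):
        for subset in combinations(cols, n):
            grouped = df.groupby(list(subset))
            agg = grouped.sum()
            agg['No. of Customer'] = grouped.size()  # rows per group = customers
            dfs.append(agg.reset_index())
    # Grand-total row over all rows.
    total = df.drop(cols, axis=1).sum().to_frame().T
    total['No. of Customer'] = len(df)
    dfs.append(total)
    return pd.concat(dfs)

all_with_counts = cube_agg(df, ["Attribute A", "Country"])

Keep in mind that cubing k columns means 2^k groupby passes, so a full cube over all 40 columns mentioned in the question (2^40 subsets) is not feasible; in practice you would cube only the handful of columns you actually need.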
Ben Grossmann

I have updated my answer based on Acccumulation's answer to use `reset_index()`. I also added `sort=True` to avoid the warning `FutureWarning: Sorting because non-concatenation axis is not aligned` (credit).

import pandas as pd

data = {
    "Country": ["US", "US", "CA", "CA", "UK"],
    "Attribute A": ["A", "B", "A", "B", "A"],
    "Sales": [20, 25, 100, 50, 40],
    "No.of product": [2, 3, 10, 5, 4],
    "No. of transaction": [2, 3, 10, 5, 4],
}
df = pd.DataFrame(data)
country_null = df.groupby(["Attribute A"]).sum().reset_index()
attr_a_null = df.groupby(["Country"]).sum().reset_index()
both_groups = df.groupby(["Country", "Attribute A"]).sum().reset_index()
all_groups = both_groups.append(country_null, sort=True).append(attr_a_null, sort=True)

print(all_groups)

Output:

  Attribute A Country  No. of transaction  No.of product  Sales
0           A      CA                  10             10    100
1           B      CA                   5              5     50
2           A      UK                   4              4     40
3           A      US                   2              2     20
4           B      US                   3              3     25
0           A     NaN                  16             16    160
1           B     NaN                   8              8     75
0         NaN      CA                  15             15    150
1         NaN      UK                   4              4     40
2         NaN      US                   5              5     45
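
A side note: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions the same stacking has to go through `pd.concat`, roughly:

# On pandas >= 2.0, DataFrame.append no longer exists; concatenate the pieces instead.
all_groups = pd.concat([both_groups, country_null, attr_a_null], sort=True)
print(all_groups)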
A D
  • Thanks. So there is no simple function to achieve the expected output? Actually, in my case, I have over 40 columns that need to be grouped, and the output would be much more complicated. Let's say I added one more column, Attribute B, to the sample dataset; there would be more combinations in the output. If I follow the code you suggested, would I need to consider all possible outcomes and write them into df.groupby one by one? – aaah_ooooh9305 Feb 03 '22 at 03:21

There are some answers here: Pandas Pivot tables row subtotals

A D's answer can be modified to get output that more closely matches what you asked for:

country_null = df.groupby(["Attribute A"]).sum().reset_index()
attr_a_null = df.groupby(["Country"]).sum().reset_index()
both_groups = df.groupby(["Country", "Attribute A"]).sum().reset_index()
all_groups = both_groups.append(country_null).append(attr_a_null)
Acccumulation