How to use pandas to groupby and aggregate the excel table in this example?

Question

I have two tables, these two tables are the data of sales staff and sales managers from January to February:

The name of table is 【1】commission--finish.xlsx, the contents of the January table are as follows ：

The name of table is 【2】commission--finish.xlsx, the contents of the February table are as follows:

I want to calculate the sum of the Indicators and the sum of ActualSales of the GeneralRepresentation is in the hospital he is in charge of.

The code is :

import os
import pandas as pd
import openpyxl

sales_catalog = r"D:\\Hospital"

df_one = pd.DataFrame()

monthlist = []

for root,dir,files in os.walk(sales_catalog):
    for file in files:
        if file.endswith('finish.xlsx') and file.startswith('【'):
            excel_input_address = os.path.join(sales_catalog, file)
            monthlist.append(file[1:-24])
            df_one = df_one.append(pd.read_excel(excel_input_address, sheet_name = 'sheet1', skiprows=2))
        else:
            pass

monthlist.sort(key=int)
lastmonth = monthlist[-1]

df_one = df_one.loc[:,['City', 'MarketingManager', 'GeneralRepresentation', 'Hospital', 'Indicators', 'Examination\nBilling', 'Digestion\nBilling']].dropna(axis=0,subset = ['Hospital'])

df_two = df_one.groupby(['Hospital']).agg({'Indicators': 'sum', 'Examination\nBilling': 'sum', 'Digestion\nBilling': 'sum'}).reset_index().rename(columns={'Indicators': 'Indicators∑（1-' + lastmonth + 'month）'})

df_two['ActualSales∑（1-' + lastmonth + 'month）'] = df_two['Examination\nBilling'] + df_two['Digestion\nBilling']

df_two = df_two.loc[:,['Hospital', 'Indicators∑（1-' + lastmonth + 'month）', 'ActualSales∑（1-' + lastmonth + 'month）']].dropna(axis=0,subset = ['Hospital'])

df_three = df_one.loc[:,['City', 'Hospital', 'MarketingManager', 'GeneralRepresentation']].dropna(axis=0,subset = ['Hospital']).drop_duplicates()

df_four = pd.merge(df_two,df_three,on="Hospital",how="outer")

df_four = df_four.loc[:,['City', 'Hospital', 'Indicators∑（1-' + lastmonth + 'month）', 'ActualSales∑（1-' + lastmonth + 'month）', 'MarketingManager', 'GeneralRepresentation']]

init_month_column = 1

for month in monthlist:
    file = "【" + month + "】commission--finish.xlsx"
    excel_address = os.path.join(sales_catalog, file)
    df_months = pd.read_excel(excel_address, sheet_name = 'sheet1', skiprows=2)
    df_months = df_months.loc[:,['Hospital', 'Examination\nBilling', 'Digestion\nBilling']].dropna(axis=0,subset = ['Hospital'])
    df_months.drop_duplicates(subset=['Hospital'],keep='first',inplace=True)
    df_months = df_months.rename(columns={'Digestion\nBilling':'Digestion', 'Examination\nBilling':'Examination'})
    df_months.insert(init_month_column, month + "month", df_months['Digestion'] + df_months['Examination'])
    df_four = pd.merge(df_four,df_months,on='Hospital',how="outer")

df_four = df_four.rename(columns={"Digestion_x":"Digestion","Examination_x":"Examination","Digestion_y":"Digestion","Examination_y":"Examination"})


try:
    df_four.to_excel('D:\\Hospital\\SalesCommission.xlsx', sheet_name='sheet1')
    print('The data has been saved in the table, please check the table!')
except PermissionError as e:
    print('*****Note: The current data is not saved in the table! Reason: The table has been opened, please close the table and execute the program before the data can be saved!*****')

After I run the code, the result is as follows:

The above result table data is incorrect. When there are multiple GeneralRepresentation in the same hospital, the data of Indicators∑(1-2month) and ActualSales∑(1-2month) is wrong, so it becomes the sum of Indicators and ActualSales of multiple GeneralRepresentation. At the same time, the monthly Examination and Digestion are also wrong.

The result table I want is as follows, :

The Indicators∑(1-2 month), ActualSales∑(1-2month), the monthly Examination and Digestion data of the GeneralRepresentation of the same hospital must be calculated separately.

May I ask if there is a problem with my line of code?

df_two = df_one.groupby(['Hospital']).agg({'Indicators':'sum','Examination\nBilling':'sum','Digestion\nBilling ':'sum'}).reset_index().rename(columns={'Indicators':'Indicators∑(1-' + lastmonth +'month)'})

How should I modify the code to achieve my purpose?

[jfyi](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Danila Ganchar, Dec 29 '21 at 15:36

How to use pandas to groupby and aggregate the excel table in this example?

0 Answers0