With sample data and code below, I'm trying to groupby year-month and find top K columns having smallest absolute values with target columns in the end of each month:
import pandas as pd
import numpy as np
from statistics import stdev
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
k = 3 # set k as 3
value_cols = df.columns[df.columns.str.endswith('_values')]
For example, if we set k=3, then for February 2013, we find A_values, B_values and E_values having smallest abs values with target, then using them as abs_1, abs_2 and abs_3 for this month. Same logic for other year-month. For rows which have on data for last day of month, then results for that month will be NaNs since we only take last day's data to calculate abs() values.
My trial code, but it does not calculate the abs values of last day of each month. Also it raises an error: TypeError: nsmallest() missing 1 required positional argument: 'columns'.
def calculate_abs(x1, x2):
x2 = np.array(x2)
res = np.mean(np.abs(x1 - x2), axis=0) # Here np.mean for each month, but I only need to take last day's difference abs
# print(res)
return res
def find_topK_smallest_abs(group):
abss = calculate_abs(group[value_cols], group[['target']])
print(abss)
cols = abss.nsmallest(k).index
out_cols = [f'abs_{i+1}' for i in range(k)]
rv = group.loc[:, cols]
rv.columns = out_cols
return rv
df.groupby(pd.Grouper(freq='M')).apply(find_topK_smallest_abs)
How could I achieve that correctly? Should I use resample('M')? Sincere thanks at advance.
EDIT: to filter last day of each month:
df.loc[df.groupby(df.index.to_period('M')).apply(lambda x: x.index.max())]
Out:
A_values B_values C_values D_values E_values target
2013-02-28 9.623960 6.165574 0.866300 5.612724 6.165247 9.638430
2013-03-31 2.717608 6.296843 0.087207 1.681396 4.635223 9.347712
2013-04-30 1.849324 2.510323 4.442528 1.106902 2.794064 5.927628
2013-05-26 4.930571 2.338785 9.823048 6.893207 6.312104 5.273122
Reference link: