0

With sample data and code below, I'm trying to groupby year-month and find top K columns having smallest absolute values with target columns in the end of each month:

import pandas as pd
import numpy as np
from statistics import stdev

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])

k = 3    # set k as 3

value_cols = df.columns[df.columns.str.endswith('_values')]

For example, if we set k=3, then for February 2013, we find A_values, B_values and E_values having smallest abs values with target, then using them as abs_1, abs_2 and abs_3 for this month. Same logic for other year-month. For rows which have on data for last day of month, then results for that month will be NaNs since we only take last day's data to calculate abs() values.

My trial code, but it does not calculate the abs values of last day of each month. Also it raises an error: TypeError: nsmallest() missing 1 required positional argument: 'columns'.

def calculate_abs(x1, x2):
    x2 = np.array(x2)
    res = np.mean(np.abs(x1 - x2), axis=0) # Here np.mean for each month, but I only need to take last day's difference abs
    # print(res)
    return res

def find_topK_smallest_abs(group):
    abss = calculate_abs(group[value_cols], group[['target']])
    print(abss)
    cols = abss.nsmallest(k).index
    out_cols = [f'abs_{i+1}' for i in range(k)]
    rv = group.loc[:, cols]
    rv.columns = out_cols
    return rv

df.groupby(pd.Grouper(freq='M')).apply(find_topK_smallest_abs)

How could I achieve that correctly? Should I use resample('M')? Sincere thanks at advance.

EDIT: to filter last day of each month:

df.loc[df.groupby(df.index.to_period('M')).apply(lambda x: x.index.max())]

Out:

            A_values  B_values  C_values  D_values  E_values    target
2013-02-28  9.623960  6.165574  0.866300  5.612724  6.165247  9.638430
2013-03-31  2.717608  6.296843  0.087207  1.681396  4.635223  9.347712
2013-04-30  1.849324  2.510323  4.442528  1.106902  2.794064  5.927628
2013-05-26  4.930571  2.338785  9.823048  6.893207  6.312104  5.273122

Reference link:

Groupby year-month and drop columns with all NaNs in Python

ah bon
  • 7,903
  • 7
  • 43
  • 86

0 Answers0