0

I have a large df in which column 'cik' specifies a group (group elements cik1, cik2, etc.) and column 'ddate' specifies a date. Each 'cik' element can occur hundreds of times, and the 'ddate' for each element can differ between rows.

I'm trying to drop all rows in which ddate is not the most recent date for each group. That means I want to keep ALL rows, for which the date is the most recent for each group.

The df looks like this:

df_num_target = pd.DataFrame([['cik1', 20220331], ['cik1', 20210331], ['cik1', 20200331], ['cik2', 20220519], ['cik2', 20220519], ['cik2', 20220519], ['cik3', 20221224], ['cik3', 20221224], ['cik3', 20190331]], columns=['cik', 'ddate'])

    cik     ddate
0   cik1    20220331
1   cik1    20210331
2   cik1    20200331
3   cik2    20220519
4   cik2    20220519
5   cik2    20220519
6   cik3    20221224
7   cik3    20221224
8   cik3    20190331

Expected output for 'df_drop':

    cik     ddate
0   cik1    20220331
3   cik2    20220519
4   cik2    20220519
5   cik2    20220519
6   cik3    20221224
7   cik3    20221224

Note that for 'cik1' only one row remains, for 'cik2' three remain, and for 'cik3' two remain.

(This seemed a very similar question to mine, but the answers there didn't solve my problem: Get the row(s) which have the max value in groups using groupby Applied to my df, the answers there yield only a single row per group whereas I need to yield many rows per group)

0 Answers0