I have a large df in which column 'cik' specifies a group (group elements cik1, cik2, etc.) and column 'ddate' specifies a date. Each 'cik' element can occur hundreds of times, and the 'ddate' can differ between rows of the same group.
I'm trying to drop every row whose ddate is not the most recent date within its group. In other words, I want to keep ALL rows whose date is the most recent for their group, even if that is more than one row per group.
The df looks like this:
import pandas as pd

df_num_target = pd.DataFrame(
    [['cik1', 20220331], ['cik1', 20210331], ['cik1', 20200331],
     ['cik2', 20220519], ['cik2', 20220519], ['cik2', 20220519],
     ['cik3', 20221224], ['cik3', 20221224], ['cik3', 20190331]],
    columns=['cik', 'ddate'])
cik ddate
0 cik1 20220331
1 cik1 20210331
2 cik1 20200331
3 cik2 20220519
4 cik2 20220519
5 cik2 20220519
6 cik3 20221224
7 cik3 20221224
8 cik3 20190331
Expected output for 'df_drop':
cik ddate
0 cik1 20220331
3 cik2 20220519
4 cik2 20220519
5 cik2 20220519
6 cik3 20221224
7 cik3 20221224
Note that for 'cik1' only one row remains, for 'cik2' three remain, and for 'cik3' two remain.
(This seemed like a very similar question to mine, but the answers there didn't solve my problem: Get the row(s) which have the max value in groups using groupby. Applied to my df, those answers yield only a single row per group, whereas I need to keep multiple rows per group.)
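For illustration, here is a minimal sketch of the filtering I have in mind, assuming ddate can be compared numerically in its YYYYMMDD integer form (so the numeric max per group is also the most recent date). I don't know whether a transform-based mask like this is the idiomatic way to do it:

# Boolean mask: True where a row's ddate equals the maximum ddate of its 'cik' group
mask = df_num_target['ddate'] == df_num_target.groupby('cik')['ddate'].transform('max')

# Keep every row flagged by the mask (can be several rows per group when there are ties)
df_drop = df_num_target[mask]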