I have a df like this:

>>> df1

        col_1   col_2    size_col  other_col
0        aaa     abc       4          zxc
1        aaa     abc       3          xcv
2        aaa     abc       1          cvb
3        bbb     bbc       7          vbn
4        bbb     bbc       3          bnm
5        ccc     cbc       1          asd
6        ddd     dbc       9          sdf
7        ccc     cbc       3          dfg
8        ccc     cbc       1          fgh

and want a df like this:

>>> df2

        col_1   col_2    size_col  other_col
0        aaa     abc       4          zxc
3        bbb     bbc       7          vbn
6        ddd     dbc       9          sdf
7        ccc     cbc       3          dfg

Explanation:
I want to drop all rows where col_1 and col_2 have duplicate values, retaining the row where size_col is greatest within each duplicate bunch. So, from the example above, for the rows where col_1 and col_2 have aaa and abc, I need to retain the row where size_col has the biggest value. Put another way, I need to group by the col_1 and col_2 columns, then, for each group, retain only the row where size_col has the biggest value for that group.

How do I do this efficiently for a df with around 5 million rows and 7 columns?

1 Answer

Use:

# keep, for each (col_1, col_2) group, the row with the largest size_col
df1.loc[df1.groupby(['col_1', 'col_2'])['size_col'].idxmax()]
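
A runnable sketch on the question's sample data (the frame construction, the trailing sort_index call, and the alternative further below are additions to the one-liner above, not part of the original answer):

import pandas as pd

# Rebuild the sample frame from the question.
df1 = pd.DataFrame({
    'col_1': ['aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'ccc', 'ddd', 'ccc', 'ccc'],
    'col_2': ['abc', 'abc', 'abc', 'bbc', 'bbc', 'cbc', 'dbc', 'cbc', 'cbc'],
    'size_col': [4, 3, 1, 7, 3, 1, 9, 3, 1],
    'other_col': ['zxc', 'xcv', 'cvb', 'vbn', 'bnm', 'asd', 'sdf', 'dfg', 'fgh'],
})

# idxmax returns, per (col_1, col_2) group, the index label of the row
# with the largest size_col; loc then selects those full rows, and
# sort_index restores the original row order.
df2 = df1.loc[df1.groupby(['col_1', 'col_2'])['size_col'].idxmax()].sort_index()
print(df2)  # rows 0, 3, 6, 7, matching the desired output

For 5 million rows, a common alternative is to sort once on size_col and keep the last duplicate per key, which avoids the per-group idxmax lookup; note that on ties it may keep a different row than idxmax does:

df2 = (df1.sort_values('size_col')
          .drop_duplicates(['col_1', 'col_2'], keep='last')
          .sort_index())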