Remove duplicate rows in pandas dataframe based on condition

Question

            is_avail   valu data_source
2015-08-07     False  0.282    source_a
2015-08-07     False  0.582    source_b
2015-08-23     False  0.296    source_a
2015-09-08     False  0.433    source_a
2015-10-01      True  0.169    source_b

In the dataframe above, I want to remove the duplicate rows (i.e. row where the index is repeated) by retaining the row with a higher value in the valu column.

I can remove rows with duplicate indexes like this:

df = df[~df.index.duplicated()]. But how to remove based on condition specified above?

this might help: http://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries — Paul H, May 05 '17 at 22:21

score 8 · Accepted Answer · answered May 05 '17 at 22:21

8

You can use groupby on index after sorting the df by valu.

df.sort_values(by='valu', ascending=False).groupby(level=0).first()
Out[1277]: 
           is_avail   valu data_source
2015-08-07    False  0.582    source_b
2015-08-23    False  0.296    source_a
2015-09-08    False  0.433    source_a
2015-10-01     True  0.169    source_b

answered May 05 '17 at 22:21

Allen Qin

18,332
6
47
59

This is the better solution! However, you don't need `ascending=False`. I'd change it to `df.sort_values('valu').groupby(level=0).tail(1)` – piRSquared May 05 '17 at 22:36

score 5 · Answer 2 · answered May 05 '17 at 22:21

Using drop_duplicates with keep='last'

df.rename_axis('date').reset_index() \
    .sort_values(['date', 'valu']) \
    .drop_duplicates('date', keep='last') \
    .set_index('date').rename_axis(df.index.name)

           is_avail   valu data_source
2015-08-07    False  0.582    source_b
2015-08-23    False  0.296    source_a
2015-09-08    False  0.433    source_a
2015-10-01     True  0.169    source_b

Remove duplicate rows in pandas dataframe based on condition

2 Answers2

Linked