1

I have a dataframe with two columns ID and Salary

data = {'ID':[1,2,3,4,2],'salary':[1e3,1.2e3,1e3,2e3,1.5e3]}

+----+--------+  
| ID | salary |  
+----+--------+  
| 1  | 1000.0 |  
+----+--------+  
| 2  | 1200.0 |  
+----+--------+  
| 3  | 1000.0 |  
+----+--------+      
| 4  | 2000.0 |     
+----+--------+      
| 2  | 1500.0 |  
+----+--------+  

In this data frame some ID are duplicated, ID=2 in this case. I want to keep the highest salary for each duplicate ID.

+----+--------+  
| ID | salary |  
+----+--------+  
| 1  | 1000.0 |  
+----+--------+  
| 2  | 1500.0 |  
+----+--------+  
| 3  | 1000.0 |  
+----+--------+      
| 4  | 2000.0 |     
+----+--------+      
Zeugma
  • 29,409
  • 8
  • 62
  • 77

3 Answers3

3

You can perform a groupby on 'ID' and take the maximum:

df = df.groupby('ID', as_index=False)['salary'].max()

The resulting output:

   ID  salary
0   1  1000.0
1   2  1500.0
2   3  1000.0
3   4  2000.0
root
  • 29,713
  • 5
  • 67
  • 80
3

This question is a duplicate from Python : Getting the Row which has the max value in groups using groupby

Here's a one-liner:

df.groupby('ID', sort=False)['salary'].max()

And here's another one:

df.sort_values('salary', ascending=False).drop_duplicates(['ID'])    
Community
  • 1
  • 1
rafaelvalle
  • 6,173
  • 3
  • 32
  • 34
1

a more general solution for dataframes with more than two columns in which you want all rows with the max salary per id.

data = {'ID':[1,2,3,4,2],'salary':[1e3,1.2e3,1e3,2e3,1.5e3]}
df = pd.DataFrame(data).assign(more_data=range(5))

df.loc[df.groupby('ID').salary.idxmax()]

   ID  salary  more_data
0   1  1000.0          0
4   2  1500.0          4
2   3  1000.0          2
3   4  2000.0          3
piRSquared
  • 265,629
  • 48
  • 427
  • 571