Remove duplicates values on one column in a data frame using conditions

Question

I have a dataframe with two columns ID and Salary

data = {'ID':[1,2,3,4,2],'salary':[1e3,1.2e3,1e3,2e3,1.5e3]}

+----+--------+  
| ID | salary |  
+----+--------+  
| 1  | 1000.0 |  
+----+--------+  
| 2  | 1200.0 |  
+----+--------+  
| 3  | 1000.0 |  
+----+--------+      
| 4  | 2000.0 |     
+----+--------+      
| 2  | 1500.0 |  
+----+--------+

In this data frame some ID are duplicated, ID=2 in this case. I want to keep the highest salary for each duplicate ID.

+----+--------+  
| ID | salary |  
+----+--------+  
| 1  | 1000.0 |  
+----+--------+  
| 2  | 1500.0 |  
+----+--------+  
| 3  | 1000.0 |  
+----+--------+      
| 4  | 2000.0 |     
+----+--------+

score 3 · Answer 1 · answered Jan 11 '17 at 00:40

3

You can perform a groupby on 'ID' and take the maximum:

df = df.groupby('ID', as_index=False)['salary'].max()

The resulting output:

   ID  salary
0   1  1000.0
1   2  1500.0
2   3  1000.0
3   4  2000.0

answered Jan 11 '17 at 00:40

root

29,713
5
67
80

score 3 · Answer 2 · edited May 23 '17 at 11:52

3

This question is a duplicate from Python : Getting the Row which has the max value in groups using groupby

Here's a one-liner:

df.groupby('ID', sort=False)['salary'].max()

And here's another one:

df.sort_values('salary', ascending=False).drop_duplicates(['ID'])

edited May 23 '17 at 11:52

Community

1
1

answered Jan 11 '17 at 00:41

rafaelvalle

6,173
3
32
34

score 1 · Accepted Answer · answered Jan 11 '17 at 02:01

a more general solution for dataframes with more than two columns in which you want all rows with the max salary per id.

data = {'ID':[1,2,3,4,2],'salary':[1e3,1.2e3,1e3,2e3,1.5e3]}
df = pd.DataFrame(data).assign(more_data=range(5))

df.loc[df.groupby('ID').salary.idxmax()]

   ID  salary  more_data
0   1  1000.0          0
4   2  1500.0          4
2   3  1000.0          2
3   4  2000.0          3

Remove duplicates values on one column in a data frame using conditions

3 Answers3