2

I have a dataframe with duplicate rows except for one value. I want to filter them out and only keep the row with the higher value.

User_ID - Skill - Year_used
1 - skill_a - 2017
1 - skill_b - 2015
1 - skill_a - 2018
2 - skill_c - 2011

etc.

So for example rows with skill_a and the same User_ID need to be compared and only the one with the latest year should be kept.

transform.('count')

Only gives me the amount of rows of the group by User_ID.

value_counts()

Only gives me a series I can't merge back to the df.

Nay ideas?

Thank you

Sam_Ste
  • 205
  • 2
  • 13

2 Answers2

1

One option is to groupby the Skill and keep the max Year_used:

df.groupby(['User_ID','Skill']).Year_used.max().reset_index()

     User_ID    Skill  Year_used
0        1  skill_a       2018
1        1  skill_b       2015
2        2  skill_c       2011
yatu
  • 80,714
  • 11
  • 64
  • 111
1

You can use drop_duplicates by sorting a column to keep max

df = df.sort_values('Year_used').drop_duplicates(['User_ID','Skill'], keep='last')
Sociopath
  • 12,395
  • 17
  • 43
  • 69