I am trying to sort my data by popularity, that is, by how often each value occurs in the Name column.

Right now, I'm doing this:

df['Count'] = df.apply(lambda x: len(df[df['Name'] == x['Name']]), axis=1)
df[df['Count'] > 50][['Name', 'Description', 'Count']].drop_duplicates('Name').sort_values('Count', ascending=False).head(100)

However, this query is very slow: it takes hours to run.

What would be a more efficient way to do this?

if __name__ is None

3 Answers

The solution I have been looking for is:

df['Count'] = df.groupby('Name')['Name'].transform('count')

Big thanks to @Lynob for providing a link with an answer.
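
For anyone landing here later, a minimal end-to-end sketch of how this slots into the original query. The sample data and the `min_count` threshold are made up for illustration (the question uses 50); only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; column names match the question, values are invented.
df = pd.DataFrame({
    "Name": ["a", "b", "a", "c", "b", "a"],
    "Description": ["first a", "first b", "second a", "only c", "second b", "third a"],
})

# Count rows per name once per group, instead of rescanning the frame for every row.
df["Count"] = df.groupby("Name")["Name"].transform("count")

min_count = 1  # assumption: stands in for the threshold of 50 in the question

result = (
    df[df["Count"] > min_count][["Name", "Description", "Count"]]
    .drop_duplicates("Name")
    .sort_values("Count", ascending=False)
    .head(100)
)
print(result)
```

The apply version compares every row against the whole frame, so its cost grows roughly with the square of the row count, while the groupby version only has to pass over the data once to build the groups.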

if __name__ is None

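You can use Series.value_counts, which returns the number of occurrences of each value, sorted from most to least frequent by default.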

import pandas as pd

df = pd.DataFrame([[0, 1], [1, 0], [1, 1]], columns=['a', 'b'])
print(df['b'].value_counts())

outputs

1    2
0    1
Name: b, dtype: int64
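
Applied to the question, this could look roughly like the sketch below. The data is made up, and the threshold of 1 stands in for the 50 from the question.

```python
import pandas as pd

# Hypothetical data standing in for df['Name'] from the question.
names = pd.Series(["a", "b", "a", "c", "b", "a"], name="Name")

counts = names.value_counts()  # one entry per distinct name, most frequent first
popular = counts[counts > 1]   # assumption: 1 stands in for the question's 50
top = popular.head(100)        # already ordered, so no extra sort is needed
print(top)
```

This gives one row per name directly, so there is no need for drop_duplicates. If the Description column is also needed, the counts would still have to be joined back onto the original frame.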
Alex

Try this:

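import pandas as pd

a = ["jim"] * 5 + ["jane"] * 10 + ["john"] * 15
n = pd.Series(a)

# Compute the counts once, then keep the names that occur more than 5 times.
counts = n.value_counts()
sorted(counts[counts > 5].index)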

['jane', 'john']
Merlin