Select rows based on frequency of values in a column; one-liner or faster way?

Question

I want to split a dataset, which requires a minimum number of samples per class, so I need to filter a DataFrame by the column that holds the class labels. If a class occurs fewer times than some threshold, all of its rows should be dropped.

>>> df = pd.DataFrame([[1,2,3], [4,5,6], [0,0,6]])
>>> df
   0  1  2
0  1  2  3
1  4  5  6
2  0  0  6

>>> filter_on_col(df, col=2, threshold=6)  # Removes first row
   0  1  2
0  4  5  6
1  0  0  6

I can use df[2].value_counts() to get the frequency of each value in column 2, and then find which values meet my threshold with:

>>> df[2].value_counts() >= 2
6     True
3    False

and then the logic for figuring out the rest is pretty easy.
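For illustration, "the rest" could be sketched as below. `filter_on_col` is the hypothetical helper from the example above, and reading the threshold as a minimum count of 2 is an assumption. This is one possible sketch, not necessarily the most efficient approach.

```python
import pandas as pd

def filter_on_col(df, col, threshold):
    """Keep rows whose value in `col` occurs at least `threshold` times."""
    counts = df[col].value_counts()                # value -> number of occurrences
    frequent = counts[counts >= threshold].index   # values that are common enough
    return df[df[col].isin(frequent)]              # original row index is preserved

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])
# Assumed threshold: a value must appear at least twice to be kept.
result = filter_on_col(df, col=2, threshold=2)
print(result)
```

Note that this version keeps the original row labels; chaining `.reset_index(drop=True)` would renumber them from 0.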

But I feel like there's an elegant, Pandas one-liner here that I can do, or maybe a more efficient method.

My question is similar to "Select rows from a DataFrame based on values in a column in pandas", but the tricky part is that I'm relying on value frequency rather than the values themselves.

If I understand you correctly, you are looking for `df[df.groupby(2)[2].transform('size') > 6]` — Erfan, Jun 06 '19 at 23:16
@cs95 No, that doesn't, because it only handles one number (6), but what if there are other values that occur more than twice? Sorry, my example had a bug. — Dave Liu, Jun 06 '19 at 23:19
@Erfan Yes, that's exactly what I was looking for! If this question reopens, I'll gladly accept a formal answer post from you. — Dave Liu, Jun 06 '19 at 23:29
@cs95 Also, you're getting the value of the column. "the tricky part is that I'm relying on value frequency rather than the values themselves" — Dave Liu, Jun 06 '19 at 23:30

score 1 · Accepted Answer · edited Jun 18 '19 at 17:56


So this is a one-liner:

# Assuming the parameters of your specific example posed above.
col=2; thresh=2

df[df[col].isin(df[col].value_counts().ge(thresh).loc[lambda x : x].index)]

Out[303]: 
   0  1  2
1  4  5  6
2  0  0  6
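To make the chain easier to follow, here is a sketch that unpacks it into named steps, assuming the comparison is meant to be `>=` (`Series.ge`) and using the same example frame. The intermediate variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])
col = 2
thresh = 2

counts = df[col].value_counts()                    # 6 occurs twice, 3 occurs once
is_frequent = counts.ge(thresh)                    # boolean Series indexed by value
keep_values = is_frequent.loc[lambda x: x].index   # only the values flagged True
filtered = df[df[col].isin(keep_values)]           # rows whose value is frequent enough
print(filtered)
```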

Or another one-liner:

df[df.groupby(col)[col].transform('count') >= thresh]
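As a quick sanity check, the following sketch compares the groupby-based filter with the `isin`-based one on the example frame. It assumes both use an inclusive comparison (`>=`) with `thresh = 2`; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])
col = 2
thresh = 2

# transform('count') returns a Series aligned with df's index:
# each row receives the size of the group its value belongs to.
group_sizes = df.groupby(col)[col].transform('count')
by_transform = df[group_sizes >= thresh]

# The value_counts/isin variant, for comparison.
by_isin = df[df[col].isin(df[col].value_counts().ge(thresh).loc[lambda x: x].index)]

print(by_transform.equals(by_isin))
```

Because `transform` already yields one count per row, the mask can be applied directly without an `isin` lookup.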

answered Jun 07 '19 at 01:45 by BENY · edited Jun 18 '19 at 17:56 by Dave Liu
