I want to split a dataset, but the split requires a minimum number of samples per class, so I want to filter a DataFrame by the column that holds the class labels. If a class occurs fewer times than some threshold, all rows of that class should be dropped.

>>> df = pd.DataFrame([[1,2,3], [4,5,6], [0,0,6]])
>>> df
   0  1  2
0  1  2  3
1  4  5  6
2  0  0  6

>>> filter_on_col(df, col=2, threshold=2)  # Removes the first row: 3 occurs only once
   0  1  2
0  4  5  6
1  0  0  6

I can use `df[2].value_counts()` to get the frequency of each value in column 2, and then find which values pass my threshold with:

>>> df[2].value_counts() >= 2
6     True
3    False

and then the logic for figuring out the rest is pretty easy.
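
For reference, the remaining logic can be sketched roughly as below. `filter_on_col` is the hypothetical helper named in the example above, and treating the threshold as an inclusive minimum count is an assumption, not something the question fixes.

```python
import pandas as pd

def filter_on_col(df, col, threshold):
    """Keep rows whose value in `col` occurs at least `threshold` times.

    Hypothetical helper; the inclusive comparison (>=) is an assumption.
    """
    counts = df[col].value_counts()                # value -> number of occurrences
    frequent = counts[counts >= threshold].index   # values that are common enough
    return df[df[col].isin(frequent)]

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])
result = filter_on_col(df, col=2, threshold=2)
```

This version keeps the original row labels; chaining `.reset_index(drop=True)` would renumber them from 0.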

But I feel like there's an elegant pandas one-liner for this, or maybe a more efficient method.

My question is pretty similar to: Select rows from a DataFrame based on values in a column in pandas, but the tricky part is that I'm relying on value frequency rather than the values themselves.

  • Am I missing something or does `df[df[2] >= 6]` not work? – cs95 Jun 06 '19 at 23:15
  • If I understand you correctly, you are looking for `df[df.groupby(2)[2].transform('size') > 6]` – Erfan Jun 06 '19 at 23:16
  • @cs95 No, that doesn't, because it only handles one number (6), but what if there are other values that occur more than twice? Sorry, my example had a bug. – Dave Liu Jun 06 '19 at 23:19
  • @Erfan Yes, that's exactly what I was looking for! If this question reopens, I'll gladly accept a formal answer post from you. – Dave Liu Jun 06 '19 at 23:29
  • @cs95 Also, you're getting the value of the column. "the tricky part is that I'm relying on value frequency rather than the values themselves" – Dave Liu Jun 06 '19 at 23:30
  • @Erfan I've reopened the question, go for it – cs95 Jun 06 '19 at 23:39

1 Answer

So this is a one-liner:

# Assuming the parameters of your specific example posed above.
col=2; thresh=2

df[df[col].isin(df[col].value_counts().ge(thresh).loc[lambda x : x].index)]

Out[303]: 
   0  1  2
1  4  5  6
2  0  0  6

Or another one-liner:

df[df.groupby(col)[col].transform('count') >= thresh]
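
To show how the `groupby`/`transform` approach behaves with more than two classes, here is a small sketch. The `feature` and `label` columns and the `min_count` name are made up for illustration, and the inclusive comparison is an assumption about the intended threshold.

```python
import pandas as pd

# Hypothetical data: class "a" occurs 3 times, "b" twice, "c" once.
df = pd.DataFrame({
    "feature": [10, 20, 30, 40, 50, 60],
    "label": ["a", "b", "a", "c", "a", "b"],
})
min_count = 2  # assumed inclusive minimum number of samples per class

# transform("count") broadcasts each group's count back onto its rows,
# so the result aligns with df and can be used directly as a row mask.
per_row_count = df.groupby("label")["label"].transform("count")
filtered = df[per_row_count >= min_count]
```

Because the counts are already aligned with the rows, no `isin` lookup is needed; only the single `c` row is dropped here.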