Subset of dataframe according to Count value of column

Question

Table(df):

customer_id    Order_date 
    1       2015-01-16      
    1       2015-01-19      
    2       2014-12-21      
    2       2015-01-10      
    1       2015-01-10
    3       2018-01-18
    3       2017-03-04
    4       2019-11-05
    4       2010-01-01
    3       2019-02-03
    3       2020-01-01
    3       2018-01-01

Output needed: A subset of the df where customer_IDs have more than 3 order_dates. (skipping 2,4 and 5 customer ID since they have less than 3 order_dates)

 Customer_id  Number_of_Order_dates
     1          3
     3          5

I have tried groupby but it has not worked to create a subset. Please help.

Codes tried so far which failed:

df[df['days'].count()>3]

and one more I tried which is incorrect:

 df1=df.groupby('customer_id')['order_date'].count()
 df[df1.iloc[:,1]]

Could you post your `groupby` code attempt? – sshashank124 Dec 27 '19 at 16:11 — sshashank124, Dec 27 '19 at 16:11

score 6 · Accepted Answer · answered Dec 27 '19 at 16:20

6

IIUC

df.groupby('customer_id')['Order_date'].nunique().loc[lambda x : x>=3].reset_index()
Out[94]: 
   customer_id  Order_date
0            1           3
1            3           5

answered Dec 27 '19 at 16:20

BENY

296,997
19
147
204

1

this is much much better than mine – anky Dec 27 '19 at 16:21

score 4 · Answer 2 · answered Dec 27 '19 at 16:14

You can use:

df.groupby('customer_id').filter(lambda x: 
     (x['Order_date'].nunique()>=3)).groupby('customer_id').count()

Or:

(df[df.groupby('customer_id')['Order_date'].transform('nunique').ge(3)]
   .groupby('customer_id').count())

             Order_date
customer_id            
1                     3
3                     5

score 2 · Answer 3 · answered Dec 27 '19 at 16:24

2

Using GroupBy.nunique with DataFrame.query:

df.groupby('customer_id')['Order_date'].nunique().reset_index().query('Order_date >= 3')

   customer_id  Order_date
0            1           3
2            3           5

answered Dec 27 '19 at 16:24

Erfan

36,997
6
53
68

score 1 · Answer 4 · answered Dec 27 '19 at 17:09

with `dict`

d = {}
for c, o in zip(*map(df.get, df)):
    d.setdefault(c, set()).add(o)

pd.DataFrame(
    [(c, len(o)) for c, o in d.items() if len(o) >= 3],
    columns=[*df]
)

   customer_id  Order_date
0            1           3
1            3           5

with `pd.factorize` and `np.bincount`

i, u = df.drop_duplicates().customer_id.factorize()
c = np.bincount(i)

pd.DataFrame(
    [(u_, c_) for u_, c_ in zip(u, c) if c_ > 2],
    columns=[*df]
)

   customer_id  Order_date
0            1           3
1            3           5

score 0 · Answer 5 · answered Dec 27 '19 at 16:14

0

a brute force approach is to add the groupby as a new column (pointers), with a name like num_dates and then restrict the overall df like this:

result = my_df[my_df['num_dates'] > 3]

answered Dec 27 '19 at 16:14

treetopper

36
4

Subset of dataframe according to Count value of column

5 Answers5

with dict

with pd.factorize and np.bincount

with `dict`

with `pd.factorize` and `np.bincount`