3

Table(df):

customer_id    Order_date 
    1       2015-01-16      
    1       2015-01-19      
    2       2014-12-21      
    2       2015-01-10      
    1       2015-01-10
    3       2018-01-18
    3       2017-03-04
    4       2019-11-05
    4       2010-01-01
    3       2019-02-03
    3       2020-01-01
    3       2018-01-01

Output needed: A subset of the df where customer_IDs have more than 3 order_dates. (skipping 2,4 and 5 customer ID since they have less than 3 order_dates)

 Customer_id  Number_of_Order_dates
     1          3
     3          5

I have tried groupby but it has not worked to create a subset. Please help.

Codes tried so far which failed:

df[df['days'].count()>3]

and one more I tried which is incorrect:

 df1=df.groupby('customer_id')['order_date'].count()
 df[df1.iloc[:,1]]
anky
  • 71,373
  • 8
  • 36
  • 64
noob
  • 3,013
  • 2
  • 17
  • 49

5 Answers5

6

IIUC

df.groupby('customer_id')['Order_date'].nunique().loc[lambda x : x>=3].reset_index()
Out[94]: 
   customer_id  Order_date
0            1           3
1            3           5
BENY
  • 296,997
  • 19
  • 147
  • 204
4

You can use:

df.groupby('customer_id').filter(lambda x: 
     (x['Order_date'].nunique()>=3)).groupby('customer_id').count()

Or:

(df[df.groupby('customer_id')['Order_date'].transform('nunique').ge(3)]
   .groupby('customer_id').count())

             Order_date
customer_id            
1                     3
3                     5
anky
  • 71,373
  • 8
  • 36
  • 64
2

Using GroupBy.nunique with DataFrame.query:

df.groupby('customer_id')['Order_date'].nunique().reset_index().query('Order_date >= 3')

   customer_id  Order_date
0            1           3
2            3           5
Erfan
  • 36,997
  • 6
  • 53
  • 68
1

with dict

d = {}
for c, o in zip(*map(df.get, df)):
    d.setdefault(c, set()).add(o)

pd.DataFrame(
    [(c, len(o)) for c, o in d.items() if len(o) >= 3],
    columns=[*df]
)

   customer_id  Order_date
0            1           3
1            3           5

with pd.factorize and np.bincount

i, u = df.drop_duplicates().customer_id.factorize()
c = np.bincount(i)

pd.DataFrame(
    [(u_, c_) for u_, c_ in zip(u, c) if c_ > 2],
    columns=[*df]
)

   customer_id  Order_date
0            1           3
1            3           5
piRSquared
  • 265,629
  • 48
  • 427
  • 571
0

a brute force approach is to add the groupby as a new column (pointers), with a name like num_dates and then restrict the overall df like this:

result = my_df[my_df['num_dates'] > 3]

treetopper
  • 36
  • 4