Cumulative Set in PANDAS

Question

I have a dataframe of tweets. I want to group it by date and generate a column containing the cumulative set of all unique users who have posted up to that date. None of the existing functions (e.g., cumsum) appear to work for this. Here's a sample of the original tweet dataframe, where the index (created_at) is in datetime format:

In [3]: df
Out[3]: 
            screen_name 
created_at  
04-01-16    Bob 
04-01-16    Bob
04-01-16    Sally
04-01-16    Sally
04-02-16    Bob
04-02-16    Miguel
04-02-16    Tim

I can collapse the dataset by date and get a column with the unique users per day:

In [4]: df[['screen_name']].groupby(df.index.date).aggregate(lambda x: set(list(x)))

Out[4]:             screen_name
        2016-04-01  {Bob, Sally}
        2016-04-02  {Bob, Miguel, Tim}

So far so good. But what I'd like is to have a "cumulative set" like this:

                    Cumulative_list_up_to_this_date   Cumulative_number_of_unique_users
        2016-04-01  {Bob, Sally}                      2
        2016-04-02  {Bob, Sally, Miguel, Tim}         4

Ultimately, what I am really interested in is the cumulative number in the last column so I can plot it. I've considered looping over dates and other things but can't seem to find a good way. Thanks in advance for any help.

score 8 · Accepted Answer · answered Sep 21 '16 at 17:43


Sets cannot be added with +, but lists can, because + concatenates them. So collect each day's users into a list, take the cumulative sum (which concatenates the daily lists in date order), and finally apply the set constructor to get rid of duplicates.

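# If agg raises "ValueError: Function does not reduce", use .apply(list) instead.
cum_names = (df['screen_name'].groupby(df.index.date)
                              .agg(lambda x: list(x))   # one list of names per day
                              .cumsum()                 # running concatenation of the lists
                              .apply(set))              # drop duplicate names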
# 2016-04-01                 {Bob, Sally}
# 2016-04-02    {Bob, Miguel, Tim, Sally}
# dtype: object
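
If cumsum over a column of lists misbehaves on a given pandas version, the same running sets can be built in plain Python. The following is a minimal sketch, not part of the original answer: it rebuilds the question's sample data (the frame construction and the names daily_sets and running_sets are assumptions for illustration) and uses itertools.accumulate with set.union in place of cumsum.

```python
from itertools import accumulate

import pandas as pd

# Hypothetical reconstruction of the sample data from the question.
df = pd.DataFrame(
    {"screen_name": ["Bob", "Bob", "Sally", "Sally", "Bob", "Miguel", "Tim"]},
    index=pd.to_datetime(["2016-04-01"] * 4 + ["2016-04-02"] * 3),
)
df.index.name = "created_at"

# One set of unique users per calendar day, in date order.
daily_sets = {
    day: set(names)
    for day, names in df["screen_name"].groupby(df.index.date)
}

# Running union of the daily sets. set.union returns a new set at each
# step, so the sets for earlier days are not mutated.
running_sets = pd.Series(
    list(accumulate(daily_sets.values(), set.union)),
    index=list(daily_sets),
)

# Number of distinct users seen up to and including each day.
running_counts = running_sets.apply(len)
```

This avoids concatenating ever-growing lists, which may matter when there are many tweets per day.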

cum_count = cum_names.apply(len)
# 2016-04-01    2
# 2016-04-02    4
# dtype: int64
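
Since the question ultimately only needs the cumulative count for plotting, the sets can be skipped entirely. The sketch below is an alternative under stated assumptions, not the answer's method: it keeps each user's first tweet, counts first appearances per day, and takes an ordinary numeric cumsum. The sample frame and the names first_seen, new_per_day and cum_unique are made up for illustration.

```python
import pandas as pd

# Hypothetical reconstruction of the sample data from the question.
df = pd.DataFrame(
    {"screen_name": ["Bob", "Bob", "Sally", "Sally", "Bob", "Miguel", "Tim"]},
    index=pd.to_datetime(["2016-04-01"] * 4 + ["2016-04-02"] * 3),
)
df.index.name = "created_at"

# Sort chronologically so that drop_duplicates keeps each user's first tweet.
first_seen = df.sort_index().reset_index().drop_duplicates("screen_name")

# Number of users appearing for the first time on each day.
new_per_day = first_seen.groupby(first_seen["created_at"].dt.date).size()

# Days without any new user are absent from new_per_day, so reindex over
# every day present in the data before accumulating.
all_days = sorted(set(df.index.date))
cum_unique = new_per_day.reindex(all_days, fill_value=0).cumsum()
```

The result is a numeric Series indexed by date, so cum_unique.plot() should work directly if matplotlib is installed.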


Alicia Garcia-Raboso


That is brilliant! I didn't know that the *cumsum( )* function would create a cumulative list. Exactly what I needed. Thanks! – Gregory Saxton Sep 21 '16 at 17:57

This answer no longer works since the list operation in the agg results in a "ValueError: Function does not reduce". Using .apply(list) does work. – Brian Keegan Jan 06 '18 at 01:03
@BrianKeegan Could you provide more info on what version of pandas breaks the above approach? We may be able to update my answer to reflect that. – Alicia Garcia-Raboso Jan 07 '18 at 18:46
This is one of the best finds I have seen so far for pandas! – RDizzl3 May 20 '21 at 18:12
