131

I would like to add a cumulative sum column to my Pandas dataframe so that:

name | day       | no
-----|-----------|----
Jack | Monday    | 10
Jack | Tuesday   | 20
Jack | Tuesday   | 10
Jack | Wednesday | 50
Jill | Monday    | 40
Jill | Wednesday | 110

becomes:

Jack | Monday     | 10  | 10
Jack | Tuesday    | 30  | 40
Jack | Wednesday  | 50  | 90
Jill | Monday     | 40  | 40
Jill | Wednesday  | 110 | 150

I tried various combos of df.groupby and df.agg(lambda x: cumsum(x)) to no avail.

kc2819
  • 1,323
  • 2
  • 9
  • 5
  • Are you really sure that you want aggregation over week days? That loses the index, and also the cumulative sum makes less sense if there are multiple weeks. The answers by dmitry-andreev and @vjayky calculates cumsum over the sequence of days for each name instead. Think of how this could be extended if there were a date column too, that the entries could be sorted by before grouping and aggregating. – Elias Hasle Nov 08 '18 at 08:09

6 Answers6

126

This should do it, need groupby() twice:

df.groupby(['name', 'day']).sum() \
  .groupby(level=0).cumsum().reset_index()

Explanation:

print(df)
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110

# sum per name/day
print( df.groupby(['name', 'day']).sum() )
                 no
name day           
Jack Monday      10
     Tuesday     30
     Wednesday   50
Jill Monday      40
      Wednesday  110

# cumulative sum per name/day
print( df.groupby(['name', 'day']).sum() \
         .groupby(level=0).cumsum() )
                 no
name day           
Jack Monday      10
     Tuesday     40
     Wednesday   90
Jill Monday      40
     Wednesday  150

The dataframe resulting from the first sum is indexed by 'name' and by 'day'. You can see it by printing

df.groupby(['name', 'day']).sum().index 

When computing the cumulative sum, you want to do so by 'name', corresponding to the first index (level 0).

Finally, use reset_index to have the names repeated.

df.groupby(['name', 'day']).sum().groupby(level=0).cumsum().reset_index()

   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   40
2  Jack  Wednesday   90
3  Jill     Monday   40
4  Jill  Wednesday  150
user2314737
  • 24,359
  • 17
  • 91
  • 104
CT Zhu
  • 49,083
  • 16
  • 110
  • 125
  • 3
    Thanks for the answer. I did have some queries though: 1. Can you please explain what does 'level = [0]' mean? 2. Also, as you can see, you had row numbers in your data frame before and these row numbers go away once you do the cumulative sum. Is there a way to have them back? – user3694373 Oct 16 '14 at 21:31
  • 5
    1), The index number has to go, as the cumsums are from multiple rows, like the 2nd number, 40, is 10+20+10, which index value should it get? 1, 2 or 3? So, let's keep using `name` and `day` as `multiIndex`, which make better sense (`reset_index()` to get `int` index, if desired). 2), the `level=[0]` means `groupby` is to operate by the 1st level of `MultiIndex`, namely column `name`. – CT Zhu Oct 17 '14 at 00:23
  • Thanks CT. I understood that later and tried reset_index() to solve my issue. Thanks for the detailed explanation! – user3694373 Oct 17 '14 at 01:59
  • 6
    There's a subtle bug: the first `groupby()` defaults to sorting the keys, so if you add a Jack-Thursday row at the bottom of the input dataset you'll get unexpected results. And since `groupby()` can work with level names I find `df.groupby(['name', 'day'], sort=False).sum().groupby(by='name').cumsum().reset_index()` less cryptic. – Nickolay May 27 '18 at 22:57
  • How do you rename the column? – Jonathan Lam Jul 24 '20 at 04:05
  • Does ordering of columns in `groupby` matter? – scarface Dec 07 '21 at 22:00
  • Hi, could you take a look at this question https://stackoverflow.com/questions/70954791/identifying-statistical-outliers-with-pandas-groupby-and-reduce-rows-into-diffe – Aaditya Ura Feb 02 '22 at 11:35
69

Modification to @Dmitry's answer. This is simpler and works in pandas 0.19.0:

print(df) 

 name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110

df['no_csum'] = df.groupby(['name'])['no'].cumsum()

print(df)
   name        day   no  no_csum
0  Jack     Monday   10       10
1  Jack    Tuesday   20       30
2  Jack    Tuesday   10       40
3  Jack  Wednesday   50       90
4  Jill     Monday   40       40
5  Jill  Wednesday  110      150
vjayky
  • 954
  • 9
  • 15
  • 3
    This seems to be the most simple solution if you [don't need the two-step aggregation](https://stackoverflow.com/questions/22650833/pandas-groupby-cumulative-sum#comment88124762_34130874), as requested in the question. – Nickolay May 27 '18 at 22:59
  • The only part I don't particularly like is that it converted my int dtype into a float. – Chris Farr Feb 14 '19 at 15:21
  • This should be the accepted answer for the cumsum in group part. @ChrisFarr It doesn't seem to convert to float anymore for me as of pandas 1.0.3. – Louis Yang Nov 03 '20 at 00:34
57

This works in pandas 0.16.2

In[23]: print df
        name          day   no
0      Jack       Monday    10
1      Jack      Tuesday    20
2      Jack      Tuesday    10
3      Jack    Wednesday    50
4      Jill       Monday    40
5      Jill    Wednesday   110
In[24]: df['no_cumulative'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
In[25]: print df
        name          day   no  no_cumulative
0      Jack       Monday    10             10
1      Jack      Tuesday    20             30
2      Jack      Tuesday    10             40
3      Jack    Wednesday    50             90
4      Jill       Monday    40             40
5      Jill    Wednesday   110            150
Dmitry Andreev
  • 922
  • 7
  • 7
  • 1
    Showing how to add it back to the df is really helpful. I tried using a transform, but that didn't play nicely with cumsum(). – zerovector May 26 '16 at 10:09
  • 3
    Note that this answer (seems equivalent to the [simpler solution by @vjayky](https://stackoverflow.com/a/49578219/1026)) does not aggregate by `name` and `day` before calculating the cumulative sum by `name` (note: there are 2 rows for Jack+Tuesday in the result). This is what makes it simpler than [the answer by CT Zhu](https://stackoverflow.com/a/22651188/1026). – Nickolay May 27 '18 at 22:09
11

you should use

df['cum_no'] = df.no.cumsum()

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.cumsum.html

Another way of doing it

import pandas as pd
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
           'C2' : [1,2,3,4,5]})
df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())
df

enter image description here

sushmit
  • 4,021
  • 2
  • 30
  • 35
  • 4
    This calculates a global running total, instead of a separate sum for each group separately. So Jill-Monday gets assigned a value of 130 (`90`, as the sum of all Jack's values, + `40`, the value for Jill-Monday). – Nickolay May 27 '18 at 22:46
  • @Nickolay just added another answer let me know if it works – sushmit May 30 '18 at 17:42
  • I am not sure if it calculates global running total as per my example row 3 gets a value of 4 – sushmit Nov 25 '18 at 18:45
  • Why do I use lambda x: x.cumsum() here, instead of pandas.series.cumsum()? – Jinhua Wang Oct 05 '19 at 10:33
7

Instead of df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum() (see above) you could also do a df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()

  • df.groupby(by=['name','day']).sum() is actually just moving both columns to a MultiIndex
  • as_index=False means you do not need to call reset_index afterwards
Christoph
  • 71
  • 1
  • 2
  • Thanks for posting this, it helped me understand what's going on here! Note that `groupby().sum()` is not _just_ moving both columns to MultiIndex -- it also sums up the two values for Jack+Tuesday. And `as_index=False` doesn't seem to have any effect in this case, since the index was already set before the `groupby`. And since `groupby().cumsum()` nukes the name/day from the data frame's columns, you have to either add the resulting numeric column to the original data frame (like vjayky and Dmitry suggested), or move name/day to index, and reset_index afterwards. – Nickolay May 27 '18 at 22:39
2

data.csv:

name,day,no
Jack,Monday,10
Jack,Tuesday,20
Jack,Tuesday,10
Jack,Wednesday,50
Jill,Monday,40
Jill,Wednesday,110

Code:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
print(df)
df = df.groupby(['name', 'day'])['no'].sum().reset_index()
print(df)
df['cumsum'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
print(df)

Output:

   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   30
2  Jack  Wednesday   50
3  Jill     Monday   40
4  Jill  Wednesday  110
   name        day   no  cumsum
0  Jack     Monday   10      10
1  Jack    Tuesday   30      40
2  Jack  Wednesday   50      90
3  Jill     Monday   40      40
4  Jill  Wednesday  110     150
Aaj Kaal
  • 1,080
  • 1
  • 7
  • 8