Cumulative sum that resets when 0 is encountered

Question

I would like to do a cumulative sum on a field but reset the aggregated value whenever a 0 is encountered.

Here is an example of what I want:

df <- data.frame(campaign = letters[1:4],
                 date = c("jan", "feb", "march", "april"),
                 b = c(1, 0, 1, 1),
                 whatiwant = c(1, 0, 1, 2))
df

  campaign  date b whatiwant
1        a   jan 1         1
2        b   feb 0         0
3        c march 1         1
4        d april 1         2

The answers to [this question I asked a couple of weeks ago](http://stackoverflow.com/questions/32247414/create-sequential-counter-that-restarts-on-a-condition-within-panel-data-groups) should help you solve this problem. — ulfelder, Sep 10 '15 at 12:36
Related: [Create counter within consecutive runs of certain values](https://stackoverflow.com/questions/5012516/create-counter-within-consecutive-runs-of-certain-values) — Henrik, May 20 '20 at 12:19

David Arenburg · Accepted Answer · 2015-09-19T22:09:42.553

22

Another base R option is simply

with(df, ave(b, cumsum(b == 0), FUN = cumsum))
## [1] 1 0 1 2

This splits column b into groups, starting a new group at every 0, and computes the cumulative sum of b within each group.
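
To make the grouping step concrete, here is a small sketch that prints the intermediate grouping vector. The input `b2` is a made-up vector for illustration, not data from the question.

```r
# Hypothetical example vector (not from the question), including non-binary values
b2 <- c(1, 2, 0, 3, 1, 0, 0, 4)

# Each 0 starts a new group: the group id increases by one at every 0
grp <- cumsum(b2 == 0)
grp
# [1] 0 0 1 1 1 2 3 3

# ave() applies cumsum() within each group and returns values in the original order
res <- ave(b2, grp, FUN = cumsum)
res
# [1] 1 3 0 3 4 0 0 4
```

Because every group begins with the 0 that triggered it, the per-group cumulative sum restarts from 0 at exactly those positions.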

Another solution uses a recent data.table version (v 1.9.6+):

library(data.table) ## v 1.9.6+
setDT(df)[, whatiwant := cumsum(b), by = rleid(b == 0L)]
#    campaign  date b whatiwant
# 1:        a   jan 1         1
# 2:        b   feb 0         0
# 3:        c march 1         1
# 4:        d april 1         2
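
For readers without data.table, the run ids that `rleid(b == 0L)` groups by can be approximated in base R with `rle()`. This is only a sketch of the run-id idea, not data.table's implementation, and `run_id` is a hypothetical helper name.

```r
# Hypothetical helper: number consecutive runs of equal values 1, 2, 3, ...
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

b <- c(1, 0, 1, 1)
ids <- run_id(b == 0)
ids
# [1] 1 2 3 3

# Cumulative sum within each run, mirroring the by = rleid(...) grouping
ave(b, ids, FUN = cumsum)
# [1] 1 0 1 2
```

Here the zeros and the non-zero stretches land in separate runs, so the sum restarts after every block of zeros.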

Some benchmarks, as requested in the comments:

set.seed(123)
x <- sample(0:1e3, 1e7, replace = TRUE)
system.time(res1 <- ave(x, cumsum(x == 0), FUN = cumsum))
# user  system elapsed 
# 1.54    0.24    1.81 
system.time(res2 <- Reduce(function(x, y) if (y == 0) 0 else x+y, x, accumulate=TRUE))
# user  system elapsed 
# 33.94    0.39   34.85 
library(data.table)
system.time(res3 <- data.table(x)[, whatiwant := cumsum(x), by = rleid(x == 0L)])
# user  system elapsed 
# 0.20    0.00    0.21 

identical(res1, as.integer(res2))
## [1] TRUE
identical(res1, res3$whatiwant)
## [1] TRUE

edited Sep 19 '15 at 22:09

answered Sep 10 '15 at 12:39

David Arenburg


This, annoyingly, needs to calculate `cumsum` twice. :-/ – Konrad Rudolph Sep 10 '15 at 12:42
Can you try with `with(rle(df1$b!=0), sequence(lengths)*rep(values, lengths))` – akrun Sep 10 '15 at 12:57
@akrun I'm getting different results. Maybe you're right and we are wrong, dunno. – David Arenburg Sep 10 '15 at 13:02
I coded it based on the assumption that the column values were 0, 1. Your example is different so it wouldn't work. – akrun Sep 10 '15 at 13:03
@akrun oh, so maybe undelete your answer and put that as an assumption. Your solution should be very efficient in that case I'd guess. – David Arenburg Sep 10 '15 at 13:04
@DavidArenburg It's okay. I think your solutions are general. – akrun Sep 10 '15 at 13:05
Thanks, it is perfect. I also tried yours, akrun, and it works pretty well too. – patpat Sep 10 '15 at 14:48
@patpat akrun's solution works only if your column is binary. – David Arenburg Sep 10 '15 at 14:56
@akrun your `rle` solution will work if you do `rle(data ==0)` or something to that effect, making it essentially binary. (And since `rle` is my favorite Rswissarmycodeknife, I hope you'll edit and undelete your answer :-) – Carl Witthoft Sep 10 '15 at 15:51
@CarlWitthoft Thanks for the comments. I did try with some general data such as `v2 – akrun Sep 10 '15 at 16:08

score 12 · Answer 2 · answered Sep 11 '15 at 13:06

Another late idea:

ff = function(x)
{
    cs = cumsum(x)
    cs - cummax((x == 0) * cs)
}
ff(c(0, 1, 3, 0, 0, 5, 2))
#[1] 0 1 4 0 0 5 7
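
To see why this works, it can help to print the intermediate vectors. The sketch below traces the same input step by step. The last few lines illustrate a caveat that the answer does not state: the trick appears to rely on the running total at the zero positions never decreasing, so with negative values it can differ from the `ave` approach. The vectors `x1` and `x2` are illustrative names.

```r
x1 <- c(0, 1, 3, 0, 0, 5, 2)

cs <- cumsum(x1)              # running total, ignoring resets
cs
# [1]  0  1  4  4  4  9 11

at_zero <- (x1 == 0) * cs     # running total at each 0, and 0 elsewhere
at_zero
# [1] 0 0 0 4 4 0 0

offset <- cummax(at_zero)     # total at the most recent 0, carried forward
offset
# [1] 0 0 0 4 4 4 4

cs - offset                   # remove everything accumulated before the last reset
# [1] 0 1 4 0 0 5 7

# Caveat (made-up input with a negative value): cummax() no longer tracks the latest 0
x2 <- c(5, 0, -10, 0, 3)
cumsum(x2) - cummax((x2 == 0) * cumsum(x2))
# [1]   5   0 -10 -10  -7
ave(x2, cumsum(x2 == 0), FUN = cumsum)
# [1]   5   0 -10   0   3
```

For non-negative input, as in the question, the two approaches agree.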

And to compare:

library(data.table)
ffdt = function(x) 
    data.table(x)[, whatiwant := cumsum(x), by = rleid(x == 0L)]$whatiwant

## 'x' is the benchmark vector from the accepted answer;
## convert it to double because 'cumsum' on the integer vector overflows
x = as.numeric(x)
identical(ff(x), ffdt(x))
#[1] TRUE
microbenchmark::microbenchmark(ff(x), ffdt(x), times = 25)
#Unit: milliseconds
#    expr      min       lq   median       uq      max neval
#   ff(x) 315.8010 362.1089 372.1273 386.3892 405.5218    25
# ffdt(x) 374.6315 407.2754 417.6675 447.8305 534.8153    25

score 5 · Answer 3 · answered Sep 10 '15 at 12:37

You could use the Reduce function with a custom function that returns 0 when the new value encountered is 0 and otherwise adds the new value to the accumulated value:

Reduce(function(x, y) if (y == 0) 0 else x+y, c(1, 0, 1, 1), accumulate=TRUE)
# [1] 1 0 1 2
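
The same logic can be spelled out as an explicit loop, which may make the accumulate step easier to follow. This is only an illustrative sketch; `cumsum_reset_loop` is a hypothetical name, not an existing function.

```r
# Hypothetical helper spelling out what the Reduce() call above does
cumsum_reset_loop <- function(v) {
  out <- numeric(length(v))
  acc <- 0
  for (i in seq_along(v)) {
    # Reset the accumulator on a 0, otherwise keep adding
    acc <- if (v[i] == 0) 0 else acc + v[i]
    out[i] <- acc
  }
  out
}

cumsum_reset_loop(c(1, 0, 1, 1))
# [1] 1 0 1 2
```

Like the `Reduce` version, this visits the elements one at a time in R code, which is consistent with the slower timing reported in the accepted answer's benchmark.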

score 0 · Answer 4 · answered Oct 21 '20 at 05:58

hutilscpp::cumsum_reset is designed for this purpose. The first argument is a logical vector indicating when the cumulative sum should continue. The second argument is the input to the cumulative sum itself:

library(hutilscpp)
b <- c(1, 0, 1, 1)
cumsum_reset(as.logical(b), b)

On my machine, compared to the data.table function above, this use of cumsum_reset is about 3 times faster.
