0

I have a dataset in which I need to keep the first incidence in the database and remove what happens in 30 days window after and redo the process again. Here is a demonstration

enter image description here

I want to keep all the rows with arrows and exclude the other ones.

Any help would be really appreciated.

Thanks

Phil
  • 5,491
  • 3
  • 26
  • 61
  • March 1 2011 is < 30 days after Feb 1 2011, but you have not removed March 1, so your image does not match the description. – IceCreamToucan Feb 13 '22 at 17:22
  • I might use some buffer 27 to 30 days is acceptable for me. – VanillaIcecream Feb 13 '22 at 17:34
  • So you want some random number of days between 27 and 30 removed after each date?? – IceCreamToucan Feb 13 '22 at 17:35
  • Not random. I want any record within (let's say 25 days) from the first occurrence to be removed. Then I am doing the same thing starting from the date after the first 25 days. So my thinking is as follows - Keep a record A for ID 1 - Delete any records within 25 days since record A - Keep a record X for ID 2 - Delete any records within 25 days since record X and so on – VanillaIcecream Feb 13 '22 at 17:40
  • If you use 25 days, August 4 will not be removed, as in your example. – IceCreamToucan Feb 13 '22 at 17:42
  • Please see the edited comment – VanillaIcecream Feb 13 '22 at 17:45
  • Yes, but your arrows show that you are not doing that. August 4 is more than 25 days, but that record is not kept. – IceCreamToucan Feb 13 '22 at 17:47
  • I might be mistaken, but I hope I clarified what I aim for in the comments. – VanillaIcecream Feb 13 '22 at 17:48
  • This question needs a [reproducible example](https://stackoverflow.com/q/5963269/5325862), and the code you've tried—or at least the description you added in comments—would help make it clearer – camille Feb 13 '22 at 18:05

1 Answers1

0
# Example data
set.seed(2022)
n <- 20
df <- data.frame(
  dt = rep(as.Date('2010-06-01'), n) + cumsum(sample(1:20, n, TRUE))
)

df
#>            dt
#> 1  2010-06-05
#> 2  2010-06-24
#> 3  2010-07-08
#> 4  2010-07-19
#> 5  2010-07-23
#> 6  2010-07-29
#> 7  2010-08-12
#> 8  2010-08-21
#> 9  2010-09-04
#> 10 2010-09-11
#> 11 2010-09-29
#> 12 2010-10-15
#> 13 2010-10-20
#> 14 2010-10-21
#> 15 2010-11-09
#> 16 2010-11-10
#> 17 2010-11-12
#> 18 2010-11-19
#> 19 2010-12-01
#> 20 2010-12-16
library(dplyr, warn.conflicts = FALSE)
library(purrr)

cutoff <- 25

df %>% 
  # if date is < cutoff days of first date, maintain the same group
  # else create a new group
  group_by(g = accumulate(dt, ~ if (.y - .x < cutoff) .x else .y)) %>% 
  # for each group select the first row
  slice_head(n = 1) %>% 
  # ungroup and remove grouping variable
  ungroup() %>% 
  select(-g)
#> # A tibble: 7 × 1
#>   dt        
#>   <date>    
#> 1 2010-06-05
#> 2 2010-07-08
#> 3 2010-08-12
#> 4 2010-09-11
#> 5 2010-10-15
#> 6 2010-11-09
#> 7 2010-12-16

Created on 2022-02-13 by the reprex package (v2.0.1)

Or, using data.table::rowid

library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(data.table, warn.conflicts = FALSE)

cutoff <- 25

df %>% 
  filter(rowid(accumulate(dt, ~ if (.y - .x < cutoff) .x else .y)) == 1)
#>           dt
#> 1 2010-06-05
#> 2 2010-07-08
#> 3 2010-08-12
#> 4 2010-09-11
#> 5 2010-10-15
#> 6 2010-11-09
#> 7 2010-12-16

Created on 2022-02-13 by the reprex package (v2.0.1)

IceCreamToucan
  • 26,789
  • 2
  • 15
  • 32