How to retain the first observation in every 30 days in the dataset for each ID?

Question

I have a dataset in which I need to keep the first incidence in the database and remove what happens in 30 days window after and redo the process again. Here is a demonstration

I want to keep all the rows with arrows and exclude the other ones.

Any help would be really appreciated.

Thanks

March 1 2011 is < 30 days after Feb 1 2011, but you have not removed March 1, so your image does not match the description. — IceCreamToucan, Feb 13 '22 at 17:22
So you want some random number of days between 27 and 30 removed after each date?? — IceCreamToucan, Feb 13 '22 at 17:35
Not random. I want any record within (let's say 25 days) from the first occurrence to be removed. Then I am doing the same thing starting from the date after the first 25 days. So my thinking is as follows - Keep a record A for ID 1 - Delete any records within 25 days since record A - Keep a record X for ID 2 - Delete any records within 25 days since record X and so on — VanillaIcecream, Feb 13 '22 at 17:40
If you use 25 days, August 4 will not be removed, as in your example. — IceCreamToucan, Feb 13 '22 at 17:42
Yes, but your arrows show that you are not doing that. August 4 is more than 25 days, but that record is not kept. — IceCreamToucan, Feb 13 '22 at 17:47
I might be mistaken, but I hope I clarified what I aim for in the comments. — VanillaIcecream, Feb 13 '22 at 17:48
This question needs a [reproducible example](https://stackoverflow.com/q/5963269/5325862), and the code you've tried—or at least the description you added in comments—would help make it clearer — camille, Feb 13 '22 at 18:05

score 0 · Accepted Answer · answered Feb 13 '22 at 18:02

# Example data
set.seed(2022)
n <- 20
df <- data.frame(
  dt = rep(as.Date('2010-06-01'), n) + cumsum(sample(1:20, n, TRUE))
)

df
#>            dt
#> 1  2010-06-05
#> 2  2010-06-24
#> 3  2010-07-08
#> 4  2010-07-19
#> 5  2010-07-23
#> 6  2010-07-29
#> 7  2010-08-12
#> 8  2010-08-21
#> 9  2010-09-04
#> 10 2010-09-11
#> 11 2010-09-29
#> 12 2010-10-15
#> 13 2010-10-20
#> 14 2010-10-21
#> 15 2010-11-09
#> 16 2010-11-10
#> 17 2010-11-12
#> 18 2010-11-19
#> 19 2010-12-01
#> 20 2010-12-16

library(dplyr, warn.conflicts = FALSE)
library(purrr)

cutoff <- 25

df %>% 
  # if date is < cutoff days of first date, maintain the same group
  # else create a new group
  group_by(g = accumulate(dt, ~ if (.y - .x < cutoff) .x else .y)) %>% 
  # for each group select the first row
  slice_head(n = 1) %>% 
  # ungroup and remove grouping variable
  ungroup() %>% 
  select(-g)
#> # A tibble: 7 × 1
#>   dt        
#>   <date>    
#> 1 2010-06-05
#> 2 2010-07-08
#> 3 2010-08-12
#> 4 2010-09-11
#> 5 2010-10-15
#> 6 2010-11-09
#> 7 2010-12-16

^{Created on 2022-02-13 by the reprex package (v2.0.1)}

Or, using data.table::rowid

library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(data.table, warn.conflicts = FALSE)

cutoff <- 25

df %>% 
  filter(rowid(accumulate(dt, ~ if (.y - .x < cutoff) .x else .y)) == 1)
#>           dt
#> 1 2010-06-05
#> 2 2010-07-08
#> 3 2010-08-12
#> 4 2010-09-11
#> 5 2010-10-15
#> 6 2010-11-09
#> 7 2010-12-16

^{Created on 2022-02-13 by the reprex package (v2.0.1)}

How to retain the first observation in every 30 days in the dataset for each ID?

1 Answers1