20

I have a large data set of time periods, defined by a 'start' and and an 'end' column. Some of the periods overlap.

I would like to combine (flatten / merge / collapse) all overlapping time periods to have one 'start' value and one 'end' value.

Some example data:

  ID      start        end
1  A 2013-01-01 2013-01-05
2  A 2013-01-01 2013-01-05
3  A 2013-01-02 2013-01-03
4  A 2013-01-04 2013-01-06
5  A 2013-01-07 2013-01-09
6  A 2013-01-08 2013-01-11
7  A 2013-01-12 2013-01-15

Desired result:

  ID      start        end
1  A 2013-01-01 2013-01-06
2  A 2013-01-07 2013-01-11
3  A 2013-01-12 2013-01-15

What I have tried:

  require(dplyr)
  data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"), 
    start = structure(c(1356998400, 1356998400, 1357084800, 1357257600, 
    1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200, 
    1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct", 
    "POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA, 
-7L), class = "data.frame")

remove.overlaps <- function(data){
data2 <- data
for ( i in 1:length(unique(data$start))) {
x3 <- filter(data2, start>=data$start[i] & start<=data$end[i])
x4 <- x3[1,]
x4$end <- max(x3$end)
data2 <- filter(data2, start<data$start[i] | start>data$end[i])
data2 <- rbind(data2,x4)  
}
data2 <- na.omit(data2)}

data <- remove.overlaps(data)
Uwe
  • 39,148
  • 11
  • 82
  • 123
Jonno Bourne
  • 1,779
  • 1
  • 22
  • 43

5 Answers5

21

Here's a possible solution. The basic idea here is to compare lagged start date with the maximum end date "until now" using the cummax function and create an index that will separate the data into groups

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))

# Source: local data frame [3 x 4]
# Groups: ID
# 
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15
David Arenburg
  • 89,637
  • 17
  • 130
  • 188
  • Thanks for such a great answer! Question though, when I used the function on the real data set the dates ended up kept in second format I had to wrap the summarise variables in as.POSIXct() to convert them back any ideas why? – Jonno Bourne Mar 10 '15 at 15:01
  • Not sure what you mean by that... When I save the result in some variable , both `start` and `end` are of class `POSIXct`... – David Arenburg Mar 10 '15 at 15:09
  • 2
    Btw if you use multiple ID's you have to arrange by arrange(data, ID, start) as lag is not affected by grouping and so may take the dates from outside the ID group messing up the final structure. This wasn't part of the question but I found out the hard way afterwords. – Jonno Bourne May 22 '15 at 13:35
  • What does the `[-n()]` do? I was able to adapt this to my own needs (similar situation but with an allowance of 90 days between dates still counting as "overlapping") but I had to copy the `[-n()]` verbatim without really understanding what it does. – Dannid Feb 01 '19 at 23:34
  • Aha! I figured it out. (it's removing the last item in the `cumsum` to accommodate the added `0` at the beginning of the vector.) – Dannid Feb 01 '19 at 23:43
  • @DavidArenburg I think your solution makes the assumption that the data is ordered by the start date. If you change the order of the rows, the result might also changes. I'll therefore suggest adding arrange(ID,start) after the grouping is done. – peer May 02 '19 at 12:52
  • @peer, yes, this was already mentioed in the comments above and the other answers. – David Arenburg May 02 '19 at 20:16
  • Following up on @Dannid 's comment, how can one adapt this code to allow touching or very close intervals to be merged, i.e. let's say <5-day gap? – Fred-LM Aug 28 '19 at 15:10
14

@David Arenburg's answer is great - but I ran into an issue where an earlier interval ended after a later interval - but using last in the summarise call resulted in the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end)

data %>%
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))

Also, as @Jonno Bourne mentioned, sorting by start and any grouping variables is important before applying the method.

David Arenburg
  • 89,637
  • 17
  • 130
  • 188
zack
  • 4,835
  • 1
  • 18
  • 25
6

For the sake of completeness, the IRanges package on Bioconductor has some neat functions which can be used to deal with date or date time ranges. One of it is the reduce() function which merges overlapping or adjacent ranges.

However, there is a drawback because IRanges works on integer ranges (hence the name), so the convenience of using IRanges functions comes at the expense of converting Date or POSIXct objects to and fro.

Also, it seems that dplyr doesn't play well with IRanges (at least judged by my limited experience with dplyr) so I use data.table:

library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)

setDT(data)[, {
  ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
  .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
       ID      start        end
   <fctr>     <POSc>     <POSc>
1:      A 2013-01-01 2013-01-06
2:      A 2013-01-07 2013-01-11
3:      A 2013-01-12 2013-01-15

A code variant is

setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
  , lapply(.SD, as_datetime), .SDcols = -"width"], 
  by = ID]

In both variants the as_datetime() from the lubridate packages is used which spares to specify the origin when converting numbers to POSIXct objects.

Would be interesting to see a benchmark comparision of the IRanges approaches vs David's answer.

Uwe
  • 39,148
  • 11
  • 82
  • 123
  • Other than collapsing rows that have overlapping intervals, if I would also like to take the minimum value of another column, how can we do that? e.g. `data – HNSKD Jun 03 '20 at 07:21
  • @HNSKD, this should be posted as a separate question with its own [mcve], please. But a quick answer is: `library(data.table); setDT(data)[order(start, end), grp := cumsum(cummax(shift(as.numeric(end), fill = 0)) < as.numeric(start))][, .(start = min(start), end = max(end), value = min(value)), by = grp]` – Uwe Jun 03 '20 at 07:41
2

It looks like I'm a little late to the party, but I took @zach's code and re-wrote it using data.table below. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRange method because the package is not yet available for R 3.5.1)

Also, fwiw, the accepted answer doesn't capture the edge case in which one date range is totally within another (e.g., 2018-07-07 to 2017-07-14 is within 2018-05-01 to 2018-12-01). @zach's answer does capture that edge case.

library(data.table)

start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")

# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]

# set keys (also orders)
setkey(d2, ID, start_col, end_col)

# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col, 
                  END_DT = end_col, 
                  indx = c(0, cumsum(as.numeric(lead(start_col)) > cummax(as.numeric(end_col)))[-.N])),
               keyby=ID
               ][,.(start=min(START_DT), 
                    end = max(END_DT)),
                 by=c("ID","indx")
                 ]
enmyj
  • 321
  • 3
  • 13
1

I think that you can solve this problem pretty nicely with dplyr and the ivs package, which is designed for working with interval vectors, exactly like what you have here. It is inspired by IRanges, but is more suitable for use in the tidyverse and is completely generic so it can handle date intervals automatically (no need to convert to numeric and back).

The key is to combine the start/end boundaries into a single interval vector column, and then use iv_groups(). This merges all of the overlapping intervals in the interval vector and returns the intervals that remain after the overlaps have been merged.

It seems like you want to do this by ID, so I've also grouped by ID.

library(ivs)
library(dplyr)

data <- tribble(
  ~ID,       ~start,         ~end,
  "A", "2013-01-01", "2013-01-05",
  "A", "2013-01-01", "2013-01-05",
  "A", "2013-01-02", "2013-01-03",
  "A", "2013-01-04", "2013-01-06",
  "A", "2013-01-07", "2013-01-09",
  "A", "2013-01-08", "2013-01-11",
  "A", "2013-01-12", "2013-01-15"
) %>%
  mutate(
    start = as.Date(start),
    end = as.Date(end)
  )

data
#> # A tibble: 7 × 3
#>   ID    start      end       
#>   <chr> <date>     <date>    
#> 1 A     2013-01-01 2013-01-05
#> 2 A     2013-01-01 2013-01-05
#> 3 A     2013-01-02 2013-01-03
#> 4 A     2013-01-04 2013-01-06
#> 5 A     2013-01-07 2013-01-09
#> 6 A     2013-01-08 2013-01-11
#> 7 A     2013-01-12 2013-01-15

# Combine `start` and `end` into a single interval vector column
data <- data %>%
  mutate(interval = iv(start, end), .keep = "unused")

# Note that this is a half-open interval!
data  
#> # A tibble: 7 × 2
#>   ID                    interval
#>   <chr>               <iv<date>>
#> 1 A     [2013-01-01, 2013-01-05)
#> 2 A     [2013-01-01, 2013-01-05)
#> 3 A     [2013-01-02, 2013-01-03)
#> 4 A     [2013-01-04, 2013-01-06)
#> 5 A     [2013-01-07, 2013-01-09)
#> 6 A     [2013-01-08, 2013-01-11)
#> 7 A     [2013-01-12, 2013-01-15)

# It seems like you'd want to group by ID, so lets do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
  group_by(ID) %>%
  summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#>   ID                    interval
#>   <chr>               <iv<date>>
#> 1 A     [2013-01-01, 2013-01-06)
#> 2 A     [2013-01-07, 2013-01-11)
#> 3 A     [2013-01-12, 2013-01-15)

Created on 2022-04-05 by the reprex package (v2.0.1)

Davis Vaughan
  • 1,508
  • 5
  • 14