3

I am trying to collapse a series of intervals into fewer, equally meaningful intervals.

Consider for example this list of intervals

Intervals = list(
  c(23,34),
  c(45,48),
  c(31,35),
  c(7,16),
  c(5,9),
  c(56,57),
  c(55,58)
)

Because the intervals overlap, the same intervals can be described with few vectors. Plotting these intervals make obvious that a list of 4 vectors would be enough

plot(1,1,type="n",xlim=range(unlist(Intervals)),ylim=c(0.9,1.1))
segments(
    x0=sapply(Intervals,"[",1),
    x1=sapply(Intervals,"[",2),
    y0=rep(1,length(Intervals)),
    y1=rep(1,length(Intervals)),
    lwd=10
    )

enter image description here

How can I reduce my Intervals list to carry the same info than the one displayed on the plot? (performance matter)

The desired outputs for the above example is

Intervals = list(
  c(5,16)
  c(23,35),
  c(45,48),
  c(55,58)
)
zx8754
  • 46,390
  • 10
  • 104
  • 180
Remi.b
  • 16,247
  • 25
  • 74
  • 153
  • Quite a few hits when googling "R collapse overlapping intervals". Didn't they work the way you desired? Please show us your trials. It demonstrates that you've taken the time to try to help yourself, and it saves us from reiterating obvious answers, and it helps you get a more specific and relevant answer. I bet these could be updated with a nice `data.table::foverlaps` answer... Cheers. – Henrik Jan 26 '16 at 07:47
  • Not the same, but _possibly_ some relevant pieces in [this Q&A](http://stackoverflow.com/questions/27574775/is-it-possible-to-use-the-r-data-table-funcion-foverlaps-to-find-the-intersectio) – Henrik Jan 26 '16 at 08:00
  • @zx8754 how does this merit the `bioinformatics` tag? – sindri_baldur Jun 26 '19 at 10:07
  • 1
    @sindri_baldur Feel free to vote to rollback, but Remi's field is genetics, and accepted solution from Ven (also genetics) is from bioconductor IRranges package. Plus I came across to this post form other bioinformatics related post. – zx8754 Jun 26 '19 at 10:22

2 Answers2

4

What you need is the reduce function in the IRanges package.

In.df <- do.call(rbind, Intervals)
library(IRanges)

In.ir <- IRanges(In.df[, 1], In.df[,2])

out.ir <- reduce(In.ir)
out.ir
# IRanges of length 4
#     start end width
# [1]     5  16    12
# [2]    23  35    13
# [3]    45  48     4
# [4]    55  58     4
Ven Yao
  • 3,540
  • 2
  • 26
  • 41
0

One option with base R:

First I put your list in a data.frame:

ints <- as.data.frame(do.call(rbind, Intervals))
names(ints) <- c('start', 'stop')

so it looks like

  start stop
1    23   34
2    45   48
3    31   35
4     7   16
5     5    9
6    56   57
7    55   58

Now, two for loops compare with between, and expand an interval when a crossover is found:

for(x in 1:nrow(ints)){
  for(y in 1:nrow(ints)){
    if(between(ints$start[x], ints$start[y], ints$stop[y])){
      ints$start[x] <- ints$start[y]
      if(ints$stop[y] > ints$stop[x]){
        ints$stop[x] <- ints$stop[y]
      } else {
        ints$stop[y] <- ints$stop[x]
      }
    }
  }
}

which alters ints to

> ints
  start stop
1    23   35
2    45   48
3    23   35
4     5   16
5     5   16
6    55   58
7    55   58

Simplify to unique cases:

ints <- unique(ints, margin = 1)

and put them in order

ints <- ints[order(ints$start),]

which leaves you with

> ints
  start stop
4     5   16
1    23   35
2    45   48
6    55   58

If you want it back in a list like the original,

Intervals <- lapply(1:nrow(ints), function(x)c(ints[x,1], ints[x,2]))

(Note: You can certainly do this with *apply instead of for, Booleans instead of between, and the original list instead of a data.frame, but, well, this is readable. Rewrite/optimize as you like.)

alistaire
  • 40,464
  • 4
  • 71
  • 108
  • Side note: `stop` is probably a bad name for a variable, but I got too far in to change it. Sorry! – alistaire Jan 26 '16 at 04:10
  • Your comment is totally unclear. `stop` is not a variable, it is the name of a column. –  Jan 26 '16 at 04:26
  • From `?data.frame`: "A data frame is a list of variables of the same number of rows with unique row names, given class "data.frame"." That said, `stop` is a bad name regardless. – alistaire Jan 26 '16 at 04:35
  • Try to access `stop` typing `stop` in your R environment. –  Jan 26 '16 at 06:03
  • `between` is a dplyr function btw – dash2 Nov 05 '21 at 11:04