
I have a data frame that looks like:

x y group
1 2  1 
1 3  1
1 4  2
1 5  2
1 6  3
...

For each group, I would like to find the distance to its 'nearest' group. Here, the nearest group is the one with the shortest distance to it, and the distance between two groups is the shortest distance between any member of one group and any member of the other. For example, the distances between all members of group 1 and all members of group 2 are:

(1,2) -> (1,4) = 2
(1,2) -> (1,5) = 3
(1,3) -> (1,4) = 1
(1,3) -> (1,5) = 2

1 is the shortest, so the distance between groups 1 and 2 is 1. By the same idea, the distances between all members of group 1 and all members of group 3 are:

(1,2) -> (1,6) = 4
(1,3) -> (1,6) = 3

so the distance between groups 1 and 3 is 3. Since 3 > 1, the nearest neighbor of group 1 is group 2, and the distance is 1. I would like to apply this metric to a really large dataset. I can get the result with nested for loops, but that is apparently very slow. Is there a faster solution? Appreciated!
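For reference, a minimal sketch of the slow nested-loop baseline (illustrative only; variable names are mine, and the distance is the Euclidean metric, as clarified in the comments below):

d <- data.frame(x = 1, y = 2:6, group = c(1, 1, 2, 2, 3))
groups <- sort(unique(d$group))
ng <- length(groups)
D <- matrix(Inf, ng, ng, dimnames = list(groups, groups))
for (a in seq_len(ng - 1)) {
  for (b in (a + 1):ng) {
    pa <- d[d$group == groups[a], ]
    pb <- d[d$group == groups[b], ]
    ## shortest Euclidean distance over all member pairs of the two groups
    for (i in seq_len(nrow(pa))) {
      for (j in seq_len(nrow(pb))) {
        dij <- sqrt((pa$x[i] - pb$x[j])^2 + (pa$y[i] - pb$y[j])^2)
        D[a, b] <- D[b, a] <- min(D[a, b], dij)
      }
    }
  }
}
diag(D) <- 0
D  # distance between groups 1 and 2 is 1; between groups 1 and 3 it is 3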

DigiPath
  • Is column `x` necessary? So far it has been a constant `1`. If it is necessary, how would the distance be calculated? – Sweepy Dodo Mar 01 '22 at 22:59
  • Yes, x is necessary. I set it to 1 just for the sake of this example (it makes the distance easier to compute). The distance is the Euclidean distance, that is, sqrt[(x1 - x2)^2 + (y1 - y2)^2]. – DigiPath Mar 01 '22 at 23:12

4 Answers


Here is an approach that loops over pairs of groups but is at least vectorized within pairs:

d <- data.frame(x = 1L, y = 2:6, group = c(1L, 1L, 2L, 2L, 3L))
m <- do.call(rbind, d[c("x", "y")])  # 2-by-n matrix with one point per column
l <- lapply(split(seq_len(ncol(m)), d$group), function(j) m[, j, drop = FALSE])
rm(m); gc()

distance <- function(x, y) {
    ## pair every column of 'x' with every column of 'y' (recycling 'y'
    ## column-wise), then take the smallest Euclidean distance
    j <- rep(seq_len(ncol(x)), each = ncol(y))
    min(sqrt(colSums((x[, j, drop = FALSE] - as.vector(y))^2)))
}

D <- outer(l, l, Vectorize(distance))
D
##   1 2 3
## 1 0 1 3
## 2 1 0 1
## 3 3 1 0

I would avoid outer, though, since it doesn't take advantage of the properties of the distance function, namely that distance(x, x) == 0 and distance(x, y) == distance(y, x) for all groups x and y. To obtain the outer result more efficiently, I would do:

D <- matrix(0, length(l), length(l))
D[lower.tri(D)] <- combn(length(l), 2L, function(i) distance(l[[i[1L]]], l[[i[2L]]]))
D <- D + t(D)
D
##      [,1] [,2] [,3]
## [1,]    0    1    3
## [2,]    1    0    1
## [3,]    3    1    0
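
Once D is in hand, each group's nearest neighbour and its distance can be read off cheaply. A minimal sketch, assuming the row/column order of D follows the group order:

diag(D) <- Inf                      # a group should not be its own nearest neighbour
nearest <- apply(D, 1L, which.min)  # index of the nearest group, per row
data.frame(group = seq_len(nrow(D)),
           nearest = nearest,
           distance = D[cbind(seq_len(nrow(D)), nearest)])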
Mikael Jagan
  • Hi Mikael, thank you for the answer! I agree that we should take advantage of the symmetry of the distance matrix; however, your latter solution does not seem to be vectorized, which may also be very slow. – DigiPath Mar 02 '22 at 15:31
  • Have you tested it? Perhaps you have misunderstood what function `Vectorize` is doing in the example with `outer`? `Vectorize` does not magically improve performance. Whether it is used in the example with `combn` or not has no relation to computation time... – Mikael Jagan Mar 02 '22 at 18:07
  • The whole operation takes ~45 minutes on my machine. I used a data frame with 500,000 rows (500 groups, 1000 rows each). You could reduce that time further by parallelizing the loop over groups. – Mikael Jagan Mar 02 '22 at 21:05
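
Regarding the parallelization mentioned in the last comment: a minimal sketch using parallel::mclapply (an assumption on my part; it is fork-based, so it needs a Unix-alike, and mc.cores is a tuning choice):

library(parallel)

pairs <- combn(length(l), 2L, simplify = FALSE)  # all group pairs, as a list
vals <- mclapply(pairs,
                 function(i) distance(l[[i[1L]]], l[[i[2L]]]),
                 mc.cores = max(1L, detectCores() - 1L))
D <- matrix(0, length(l), length(l))
D[lower.tri(D)] <- unlist(vals)  # combn enumerates pairs in lower.tri order
D <- D + t(D)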

You can compute the distance between each pair of x, y points using stats::dist(). After a little manipulation of the results using {broom} and {dplyr}, you can find the minimum distance within each pair of groups.

library(dplyr)
library(broom)

df <- data.frame(
  x = rep(1, 5),
  y = 2:6,
  group = c(1, 1, 2, 2, 3)
)

item_groups <- df %>% 
  transmute(item = factor(row_number()), group)

dist(df[c("x", "y")]) %>% 
  broom::tidy() %>% 
  left_join(item_groups, by = c("item1" = "item")) %>% 
  left_join(item_groups, by = c("item2" = "item"), suffix = c(".1", ".2")) %>% 
  group_by(group.1, group.2) %>% 
  filter(group.1 != group.2, distance == min(distance))

#> # A tibble: 3 x 5
#> # Groups:   group.1, group.2 [3]
#>   item1 item2 distance group.1 group.2
#>   <fct> <fct>    <dbl>   <dbl>   <dbl>
#> 1 2     3            1       1       2
#> 2 2     5            3       1       3
#> 3 4     5            1       2       3

Created on 2022-03-01 by the reprex package (v2.0.1)
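
If you ultimately want one nearest group per group rather than all pairwise minima, a possible final step (a sketch; `pair_mins` is a hypothetical name for the filtered tibble above):

# pair_mins: the filtered tibble produced by the pipeline above (hypothetical name)
bind_rows(
  pair_mins %>% ungroup() %>% transmute(group = group.1, nearest = group.2, distance),
  pair_mins %>% ungroup() %>% transmute(group = group.2, nearest = group.1, distance)
) %>% 
  group_by(group) %>% 
  slice_min(distance, n = 1, with_ties = FALSE)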

zephryl
  • Hi, thank you for the answer. I tried to replicate this method, but there is an error message saying vector memory is exhausted, perhaps because the dataset is very large (~570k data points). – DigiPath Mar 01 '22 at 23:29
  • @DigiPath Hmm -- did that happen with your original solution? There definitely seems to be a speed-memory trade-off: [`dist` "operates with multiple objects of that size in order to speed the computation up"](https://stackoverflow.com/questions/45828087/r-running-out-of-memory-during-time-series-distance-computation). I'm not very well-versed in optimizing for memory, but that answer I linked may be a good starting point. – zephryl Mar 01 '22 at 23:39
  • @DigiPath I wonder if you could split the difference by iterating over pairs of `group`s, using something like my code to find the minimum distance for each group pair, and saving just the min distance for each group pair. That wouldn't require R to hold the distances among all 570k^2 points in memory all at once, but still provide some efficiency boost via `dist`. – zephryl Mar 02 '22 at 00:04
  • @DigiPath Check to see which line of code is causing the error. In the past, I have run into this issue when doing a costly join operation. If that's the case, I was able to squeak by using `data.table` instead of `dplyr`. – J.Moon Mar 02 '22 at 01:51
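
A rough sketch of that pair-by-pair idea from the comments (`min_dist_between` is a hypothetical helper; only one pair's distance matrix is held in memory at a time):

# hypothetical helper: smallest distance between members of two given groups
min_dist_between <- function(df, g1, g2) {
  sub <- df[df$group %in% c(g1, g2), ]
  dm <- as.matrix(dist(sub[c("x", "y")]))
  min(dm[sub$group == g1, sub$group == g2])
}

group_pairs <- combn(sort(unique(df$group)), 2, simplify = FALSE)
sapply(group_pairs, function(p) min_dist_between(df, p[1], p[2]))
#> [1] 1 3 1   # pairs (1,2), (1,3), (2,3)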

Does this help?

library(tidyverse)
data <- tribble(
  ~x, ~y, ~group,
  1, 2, 1,
  1, 3, 1,
  1, 4, 2,
  1, 5, 2,
  1, 6, 3
)

data %>% 
  mutate(sum_of_x_y = x + y) %>% 
  group_by(group) %>% 
  summarize(min_group = min(sum_of_x_y))

#   group min_group
#   <dbl>     <dbl>
# 1     1         3
# 2     2         5
# 3     3         7
berliiiin

Here's another way, using data.table (with magrittr's %>% for chaining):

library(data.table)
library(magrittr)

# example data from the question; this answer calls the group column "grp"
df <- data.table(x = 1, y = 2:6, grp = c(1, 1, 2, 2, 3))
g <- length(unique(df$grp))

matrix(
  df[, `:=`(con = 1)][df, allow.cartesian = TRUE, on = "con"] %>%  # full cartesian self-join
    .[, dist := sqrt((x - i.x)^2 + (y - i.y)^2)] %>% 
    .[, min(dist), by = .(grp, i.grp)] %>% 
    .[order(grp, i.grp), V1], g, g)

Output:

     [,1] [,2] [,3]
[1,]    0    1    3
[2,]    1    0    1
[3,]    3    1    0

If you have too many points for the full cartesian join to fit in memory, you can instead do the join separately for each pair of groups:

df[,con:=1]

func <- function(df) {
  df[df,allow.cartesian=T,on="con"] %>% 
    .[,dist:=sqrt((x-i.x)^2 + (y-i.y)^2)] %>% 
    .[grp!=i.grp, min(dist), by=.(grp,i.grp)][1,V1]
}

grps = unique(df$grp)
vals = apply(combn(grps,2), 2, \(p) func(df[grp %in% p]))
M = matrix(0, length(grps), length(grps))
M[lower.tri(M)] <- vals
M = M + t(M)  # mirror the lower triangle; filling upper.tri directly misplaces values once there are more than 3 groups
M

     [,1] [,2] [,3]
[1,]    0    1    3
[2,]    1    0    1
[3,]    3    1    0
langtang