
Here is my data:

mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))

Some rows are duplicated, and several other rows contain the same elements in a different order. I wish to remove all rows that contain the same set of elements, whether those elements are in the same order (duplicated rows) or a different order. This will retain only the first row, c(3, 5).
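
For reference, the matrix prints as below; only the first row should survive:

mymat
#      [,1] [,2]
# [1,]    3    5
# [2,]    6    9
# [3,]    9    6
# [4,]    9    6
# [5,]    1    4
# [6,]    4    1
# [7,]    1    4

# desired result
#      [,1] [,2]
# [1,]    3    5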

I checked previous questions here and here. However, my requirement is that all such rows be removed, rather than keeping one of them.

My question also differs from this one, which removes all duplicated rows, in that I am looking not just for duplicated rows but also for rows that contain the same set of elements in a different order. For example, rows c(6, 9) and c(9, 6) should both be removed since they contain the same set of elements.

I am looking for solutions that do not use a for loop, since my real data is large and a for loop may be slow.

Note: My full data has 40k rows and 2 columns.

Patrick

5 Answers


You can sort the data rowwise and use duplicated -

tmp <- t(apply(mymat, 1, sort))
tmp[!(duplicated(tmp) | duplicated(tmp, fromLast = TRUE)), , drop = FALSE]

#     [,1] [,2]
#[1,]    3    5
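
For reference, after the row-wise sort the order-swapped pairs become identical rows, which is why the two duplicated() passes flag every occurrence:

tmp
#      [,1] [,2]
# [1,]    3    5
# [2,]    6    9
# [3,]    6    9
# [4,]    6    9
# [5,]    1    4
# [6,]    1    4
# [7,]    1    4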
Ronak Shah

I added a little more data to show that the matrix format is retained:

mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 10, 12, 13, 14, 5, 9, 6, 6, 4, 1, 4, 11, 13, 12, 15), .Dim = c(11L, 2L))

# stack the matrix on top of its column-swapped version and flag duplicates scanning from both ends
dup <- duplicated(rbind(mymat, mymat[, c(2, 1)]))
dup_fromLast <- duplicated(rbind(mymat, mymat[, c(2, 1)]), fromLast = TRUE)

# keep only the original rows (the first half of the stack) that are not flagged in either scan
mymat_duprm <- mymat[!(dup_fromLast | dup)[1:(length(dup) / 2)], ]

mymat_duprm
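
With the extended data, this should leave the three rows whose element sets occur only once:

#      [,1] [,2]
# [1,]    3    5
# [2,]   10   11
# [3,]   14   15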
Harrison Jones

As a matrix:

tmp <- apply(mymat, 1, function(z) toString(sort(z)))
mymat[ave(tmp, tmp, FUN = length) == "1",, drop = FALSE]
#      [,1] [,2]
# [1,]    3    5

The drop = FALSE is required only because (at least with this sample data) the filtering results in a single row. While I doubt your real data (with 40k rows) would reduce to this, I recommend keeping it anyway ("just in case"; it's simply defensive programming).
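
To illustrate, here is a minimal sketch of what happens on this sample data without it:

# without drop = FALSE, the single matching row is simplified to a plain vector
mymat[ave(tmp, tmp, FUN = length) == "1", ]
# [1] 3 5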

r2evans

Benchmarking a couple of new solutions along with a few already posted:

library(Rfast)
library(microbenchmark)

mymat <- matrix(sample(100, 4000, replace = TRUE), nrow = 2000)

noDup <- function(m) {
  return(!(duplicated(m) | duplicated(m, fromLast = TRUE)))
}

combounique1 <- function(m) {
  return(m[noDup(rowSort(m)),])
}

combounique2 <- function(m) {
  msum <- rowsums(m)
  return(m[noDup(rowsums(m^2) + msum + (msum - 3)*abs(m[,1] - m[,2])),])
}

combounique3 <- function(m) {
  return(m[noDup(rowsums(m + 1/m)),])
}

combounique4 <- function(m) {
  # similar to Harrison Jones, but correct
  return(m[noDup(rbind(m, m[m[,1] != m[,2], 2:1]))[1:nrow(m)],])
}

combounique5 <- function(m) {
  # similar to Ronak Shah, but maintains ordering within rows
  tmp <- t(apply(m, 1, sort))
  return(m[noDup(tmp),])
}

r2evans <- function(m) {
  tmp <- apply(m, 1, function(z) toString(sort(z)))
  return(m[ave(tmp, tmp, FUN = length) == "1",, drop = FALSE])
}

microbenchmark(mymat1 <- combounique1(mymat),
               mymat2 <- combounique2(mymat),
               mymat3 <- combounique3(mymat),
               mymat4 <- combounique4(mymat),
               mymat5 <- combounique5(mymat),
               mymat6 <- r2evans(mymat))

                          expr     min       lq      mean   median       uq      max neval
 mymat1 <- combounique1(mymat)  7129.9  7642.30  9236.841  8205.45  9467.70  28363.7   100
 mymat2 <- combounique2(mymat)   171.0   197.30   219.341   215.75   225.45    385.5   100
 mymat3 <- combounique3(mymat)   144.2   166.95   187.340   182.50   192.30    306.7   100
 mymat4 <- combounique4(mymat) 14263.1 15343.90 17938.061 16417.30 19043.30  34884.9   100
 mymat5 <- combounique5(mymat) 48230.9 50773.75 57662.463 55041.90 60968.35 193804.2   100
      mymat6 <- r2evans(mymat) 66180.3 70835.30 78642.552 77299.85 81992.60 161034.5   100

> all(sapply(list(mymat1, mymat2, mymat3, mymat4, mymat5, mymat6), FUN = identical, mymat1))
[1] TRUE

Note that combounique2 and combounique3 are only strictly correct for integer values. The idea is to use a symmetric pairing function to get a unique value for each pair of integers and then use duplicated on that (see https://math.stackexchange.com/questions/3162166/what-function-symmetric-and-has-unique-solution).
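
As a quick sketch of the symmetry behind combounique3 (uniqueness for integer pairs is the subject of the linked question):

# the pairing value is identical for (6, 9) and (9, 6), so duplicated() flags both rows
6 + 1/6 + 9 + 1/9
# [1] 15.27778
9 + 1/9 + 6 + 1/6
# [1] 15.27778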

jblood94

You can just use the following line of code:

mymat <- mymat[!mymat[,1] %in% mymat[,2], , drop = FALSE]

Output:

mymat
#>      [,1] [,2]
#> [1,]    3    5

Created on 2021-09-24 by the reprex package (v0.3.0)

lovalery