Remove all duplicates by multiple variables with dplyr

Question

I'm trying to remove all duplicated rows based on multiple variables using dplyr, that is, to drop every copy of a duplicated row rather than keep one. Here's how I do it without dplyr:

dat = data.frame(id=c(1,1,2),date=c(1,1,1))
dat = dat[!(duplicated(dat[c('id','date')]) | duplicated(dat[c('id','date')],fromLast=TRUE)),]

It should return only the row with id 2.

akrun · Answer 1 · 2019-08-01T14:07:23.193

This can be done with a group_by/filter operation in the tidyverse: group by the columns of interest, then keep only the groups that contain a single row. Here group_by_all is used because every column in the dataset is a grouping column; group_by_at can be used instead when only a selection of columns is needed.

library(dplyr)
dat %>% 
   group_by_all() %>%
   filter(n()==1)

Or simply use group_by with the column names:

dat %>% 
   group_by(id, date) %>%
   filter(n() == 1)
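
The scoped verbs group_by_all and group_by_at are superseded in dplyr 1.0.0 and later. As a minimal sketch, assuming dplyr >= 1.0.0, the same filter can be written with across(). The `value` column and the `key_cols` vector are hypothetical names added for illustration; they are not in the question.

```r
library(dplyr)

# Example data; `value` is a hypothetical extra column that is not part of the key
dat <- data.frame(id = c(1, 1, 2), date = c(1, 1, 1), value = c("a", "b", "c"))

# Columns that define a duplicate (hypothetical helper vector)
key_cols <- c("id", "date")

res <- dat %>%
  group_by(across(all_of(key_cols))) %>%  # group by the key columns only
  filter(n() == 1) %>%                    # keep groups that contain a single row
  ungroup()                               # drop the grouping from the result

res
```

Because filter keeps whole rows, columns outside the key (here `value`) are preserved.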

If the OP intended to use the duplicated function:

dat %>%
  filter_at(vars(id, date),
        any_vars(!(duplicated(.)|duplicated(., fromLast = TRUE))))
# id date
#1  2    1
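
One caveat: filter_at applies the predicate to each column separately, so the call above keeps a row when any single column holds a unique value, which is not the same as testing the id/date combination. A sketch of a variant that checks the combination, mirroring the base R code in the question, might look like this. The four-row example data are made up to show the difference.

```r
library(dplyr)

# Made-up data: rows 3 and 4 share the same id/date pair, rows 1 and 2 do not
dat <- data.frame(id = c(1, 1, 2, 2), date = c(1, 2, 1, 1))

res <- dat %>%
  # duplicated() on a data frame compares whole rows, i.e. the id/date pair
  filter(!(duplicated(data.frame(id, date)) |
           duplicated(data.frame(id, date), fromLast = TRUE)))

res
```

Here rows 1 and 2 are kept because their id/date pairs occur once, even though the id value 1 appears twice.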

edited Aug 01 '19 at 14:07

answered Aug 01 '19 at 14:00

akrun


or `dat %>% count(id, date) %>% filter(n < 2)` – Roman Aug 01 '19 at 14:03
@Jimbou. One issue with `count` would be that it summarises the data and removes the other columns – akrun Aug 01 '19 at 14:04
you are right. `add_count(id, date)` would be the better option. But my feelings lead me to `dat %>% distinct_all()` – Roman Aug 01 '19 at 14:07
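
The three suggestions in these comments behave differently, which a small sketch can make concrete. The `value` column is a hypothetical extra column added for illustration. Note that distinct_all keeps one copy of each duplicated row rather than removing all copies, so it does not answer the question as asked.

```r
library(dplyr)

# Example data; `value` is a hypothetical extra column
dat <- data.frame(id = c(1, 1, 2), date = c(1, 1, 1), value = c("a", "b", "c"))

# count() collapses to one row per id/date pair, so `value` is dropped
counted <- dat %>% count(id, date) %>% filter(n < 2)

# add_count() keeps every row and column and appends the group size as `n`
kept <- dat %>% add_count(id, date) %>% filter(n == 1) %>% select(-n)

# distinct_all() keeps one copy of each duplicated row instead of removing all copies
deduped <- dat %>% select(id, date) %>% distinct_all()
```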
