38

Say I have a dataframe df with two or more columns, is there an easy way to use unique() or other R function to create a subset of unique combinations of two or more columns?

I know I can use sqldf() and write an easy "SELECT DISTINCT var1, var2, ... varN" query, but I am looking for an R way of doing this.

It occurred to me to try ftable coerced to a dataframe and use the field names, but I also get the cross tabulations of combinations that don't exist in the dataset:

uniques <- as.data.frame(ftable(df$var1, df$var2))
zx8754
  • 46,390
  • 10
  • 104
  • 180
wahalulu
  • 1,417
  • 2
  • 17
  • 23

4 Answers4

55

unique works on data.frame so unique(df[c("var1","var2")]) should be what you want.

Another option is distinct from dplyr package:

df %>% distinct(var1, var2) # or distinct(df, var1, var2)

Note:

For older versions of dplyr (< 0.5.0, 2016-06-24) distinct required additional step

df %>% select(var1, var2) %>% distinct

(or oldish way distinct(select(df, var1, var2))).

Marek
  • 47,613
  • 13
  • 96
  • 119
22

@Marek's answer is obviously correct, but may be outdated. The current dplyrversion (0.7.4) allows for an even simpler code:

Simply use:

df %>% distinct(var1, var2)

If you want to keep all columns, add

df %>% distinct(var1, var2, .keep_all = TRUE)
tjebo
  • 17,289
  • 5
  • 42
  • 72
5

To KEEP all other variables in df use this:

unique_rows <- !duplicated(df[c("var1","var2")])

unique.df <- df[unique_rows,]

Another less recommended method is using row.names() #(see David's comment below):

unique_rows <- row.names(unique(df[c("var1","var2")]))

unique.df <- df[unique_rows,]
sbaniwal
  • 327
  • 4
  • 6
1

In addition to answers above, the data.table version:

setDT(df)

unique_dt = unique(df, by = c('var1', 'var2'))
Zaki
  • 91
  • 3