I have comma-separated strings in which the same word may occur several times:

df <- data.frame(
  words = c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
            NA,
            "like,like,so,many,times,like,so,one,no,no,no,bathroom"))

I would like to reduce each string in words so that every word appears only once. I've tried this regex, but the result is far from what I need:

library(stringr)
str_extract_all(df$words, "(?<=\\s|^)(\\w+)(?=,|$)(?!\\1+)")
[[1]]
[1] "if"

[[2]]
[1] NA

[[3]]
[1] "like"

The result I need to get (preferably with a regex answer) is this:

[[1]]
[1] "if,go,to,and,don't,is,give,my"

[[2]]
[1] NA

[[3]]
[1] "like,so,many,times,one,no,bathroom"
Chris Ruehlemann

2 Answers

Here is a base R solution using gsub:

df$words <- gsub("(?<![^,])(.*?),(?=.*\\1)", "", df$words, perl=TRUE)
df

                               words
1      and,if,don't,is,give,to,my,go
2                               <NA>
3 many,times,like,so,one,no,bathroom

Data:

df <- data.frame(words = c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
                           NA,
                           "like,like,so,many,times,like,so,one,no,no,no,bathroom"))

Here is an explanation of the regex pattern:

(?<![^,])  assert that what precedes is either a comma or the start of the string
(.*?)      match AND capture a word, up until reaching
,          the nearest following comma
(?=.*\\1)  then assert that we can still find this same word later on
           in the string, indicating that what we just matched is a duplicate

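We then replace each such match with the empty string, which removes it from the input. Because a word is dropped whenever it occurs again later in the string, the last occurrence of each word is the one that survives, which is why the order of the output above differs from the order requested in the question.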
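One caveat with the pattern above: the lookahead (?=.*\\1) only checks that the captured text appears somewhere later in the string, not that it appears as a whole word, so "no" in "no,know" would be treated as a duplicate. The following is a minimal sketch of a stricter variant, not part of the original answer. The function name dedupe_keep_last is hypothetical, and the sketch assumes words never contain commas. Like the original, it keeps the last occurrence of each word.

```r
# Hypothetical helper: remove a word when the same whole word occurs later.
# Assumes comma-separated words with no embedded commas.
dedupe_keep_last <- function(x) {
  # (?<![^,])         at the start of the string or right after a comma
  # ([^,]+),          capture one whole word and the comma that follows it
  # (?=(?:[^,]*,)*    lookahead: skip zero or more complete words ...
  #    \\1(?:,|$))    ... then find the same word, ended by a comma or the end
  gsub("(?<![^,])([^,]+),(?=(?:[^,]*,)*\\1(?:,|$))", "", x, perl = TRUE)
}

dedupe_keep_last("no,know")     # "no,know"  (nothing removed)
dedupe_keep_last("no,know,no")  # "know,no"
dedupe_keep_last(NA)            # NA
```

Applied to the data above, df$words <- dedupe_keep_last(df$words) should give the same result as the original pattern, since none of those words is a substring of a later word.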

Tim Biegeleisen
lapply(strsplit(df$words, ",") , function(x) paste(unique(x), collapse = ","))

# [[1]]
# [1] "if,go,to,and,don't,is,give,my"
# 
# [[2]]
# [1] "NA"
# 
# [[3]]
# [1] "like,so,many,times,one,no,bathroom"
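
As the output shows, the missing value comes back as the string "NA" rather than a real NA, and the result is a list rather than a character vector. A minimal sketch of a variant that keeps NA intact and can be assigned straight back to the column might look like this. The function name dedupe_keep_first is hypothetical, and the as.character() call is an assumption to cover the case where the column was created as a factor.

```r
# Hypothetical helper: keep the first occurrence of each word, preserve NA.
dedupe_keep_first <- function(x) {
  x <- as.character(x)  # strsplit() needs character input, not a factor
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    # strsplit() returns a single NA for a missing input string
    if (anyNA(parts)) return(NA_character_)
    paste(unique(parts), collapse = ",")
  }, character(1), USE.NAMES = FALSE)
}

words <- c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
           NA,
           "like,so,many,times,like,so,one,no,no,no,bathroom")
dedupe_keep_first(words)
# [1] "if,go,to,and,don't,is,give,my"      NA
# [3] "like,so,many,times,one,no,bathroom"
```

Because unique() keeps the first occurrence, the word order matches the output requested in the question.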
sindri_baldur