Shortest way to remove duplicate words from string

Question

I have this string:

x <- c("A B B C")

[1] "A B B C"

I am looking for the shortest way to get this:

[1] "A B C"

I have tried the approach from "Removing duplicate words in a string in R":

paste(unique(x), collapse = ' ')

[1] "A B B C"
# does not work

Background: in a data frame column, I want to count only the unique words in each string.

You need to split based on your code `paste(unique(unlist(strsplit(x, " "))), collapse = " ")# [1] "A B C"` — akrun, Jun 03 '22 at 19:17

akrun · Accepted Answer · 2022-06-03T19:32:34.487

A regex-based approach could be shorter: match one or more non-whitespace characters (\\S+) followed by a whitespace character (\\s), capture that as a group, and then match one or more occurrences of the backreference (\\1+). In the replacement, specify the backreference so that only a single copy of the match is returned.

gsub("(\\S+\\s)\\1+", "\\1", x)
[1] "A B C"

Alternatively, split the string with strsplit, unlist the result, keep the unique words, and paste them back together:

paste(unique(unlist(strsplit(x, " "))), collapse = " ")
# [1] "A B C"
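For the data frame use case mentioned in the question, the split-and-unique idea can be applied to a whole column. The following is only a minimal sketch: the helper names `split_words`, `dedupe_words`, and `count_unique_words` and the column names are hypothetical, and it assumes words are separated by exactly one space. Because it does not rely on whitespace after each repeated word, it also handles a repeat at the very end of a string such as "A B B".

```r
# Hypothetical helpers; the names are illustrative and not from the answers.
# Assumption: words are separated by exactly one space.
split_words <- function(s) strsplit(s, " ", fixed = TRUE)

# Return each string with repeated words removed (first occurrence kept).
dedupe_words <- function(s) {
  vapply(split_words(s),
         function(w) paste(unique(w), collapse = " "),
         character(1))
}

# Return the number of distinct words in each string.
count_unique_words <- function(s) {
  vapply(split_words(s), function(w) length(unique(w)), integer(1))
}

df <- data.frame(text = c("A B B C", "A B B"), stringsAsFactors = FALSE)
df$deduped  <- dedupe_words(df$text)        # "A B C", "A B"
df$n_unique <- count_unique_words(df$text)  # 3, 2
df
```

Using vapply rather than sapply keeps the return type fixed, which is convenient when the result is assigned to a data frame column.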

score 4 · Answer 2 · answered Jun 03 '22 at 19:38

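Another possible solution, based on stringr::str_split. Pasting the unique words back together returns a single string rather than a character vector:

library(tidyverse)

str_split(x, " ") %>% unlist %>% unique %>% paste(collapse = " ")

#> [1] "A B C"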

PaulS

GKi · Answer 3 · 2022-06-04T03:45:53.317

score 3

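In case the duplicates are not adjacent, gsub with a lookahead also works. It removes every word that occurs again later in the string, so the last occurrence of each word is kept.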

x <- c("A B B C")
gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", x, perl=TRUE)
#[1] "A B C"

gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", "A B B A ABBA", perl=TRUE)
#[1] "B A ABBA"
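The two approaches differ in which copy of a repeated word survives, which matters when word order is important. A small sketch of the difference, reusing the pattern from this answer and the strsplit/unique approach from the accepted answer (the variable names `x2`, `keep_last`, and `keep_first` are illustrative):

```r
x2 <- "A B B A ABBA"

# Lookahead regex: drops a word whenever it appears again later,
# so the LAST occurrence of each word is kept.
keep_last <- gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", x2, perl = TRUE)

# Split and unique: keeps the FIRST occurrence of each word.
keep_first <- paste(unique(strsplit(x2, " ", fixed = TRUE)[[1]]), collapse = " ")

keep_last   # "B A ABBA"
keep_first  # "A B ABBA"
```

Both results contain the same set of words, so a count of unique words is unaffected by the choice.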

edited Jun 04 '22 at 03:45

answered Jun 03 '22 at 20:05

score 2 · Answer 4 · answered Jun 03 '22 at 19:41

You can use gsub with a backreference to collapse adjacent repeated words:

gsub("\\b(\\w+)(?:\\W+\\1\\b)+", "\\1", x)

Dohamed Desouky
