Suppose text data looks like this:

txt <- c("peter likes red", "mary likes green", "bob likes blue")

I want to reduce those strings to the words that appear in this controlled vocabulary:

voc <- c("peter", "mary", "bob", "red", "green", "blue")

The result should be a vector:

c("peter red", "mary green", "bob blue")

I can use the tm package, but that only gives me a document-term matrix:

library(tm)
foo <- VCorpus(VectorSource(txt))
inspect(DocumentTermMatrix(foo, control = list(dictionary = voc)))
Non-/sparse entries: 6/12
Sparsity           : 67%
Maximal term length: 5
Weighting          : term frequency (tf)

    Terms
Docs blue bob green mary peter red
   1    0   0     0    0     1   1
   2    0   0     1    1     0   0
   3    1   1     0    0     0   0

How can I get the vector solution with one string per vector element?

The solution should be fast. I'm also a big fan of base R.

EDIT: Comparison of solutions so far

On my data, James' solution is about four times faster than Sotos', but it runs out of memory when I go from length(txt) = 1k to 10k. Sotos' solution still runs at 10k.

My real data has length(txt) ~1M and length(voc) ~5k, so I estimate that Sotos' solution would take about 18 hours to finish, provided it does not run out of memory.

Isn't there anything faster?

hyco
  • Would a non regex approach suffice for your case? E.g. something like `sapply(strsplit(txt, " ", TRUE), function(x) paste(collapse = " ", x[x %in% voc]))` – alexis_laz Jan 24 '17 at 17:25
  • @alexis_laz you win! Your solution finishes in 10 minutes instead of 18 hours. Would you like to create a dedicated answer so I can mark it as the solution? – hyco Jan 24 '17 at 18:09
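
The non-regex approach from the comment above can be spelled out step by step. This is only a sketch: it assumes words are separated by single spaces with no punctuation attached, and `keep_voc` is a helper name made up for illustration.

```r
txt <- c("peter likes red", "mary likes green", "bob likes blue")
voc <- c("peter", "mary", "bob", "red", "green", "blue")

# Split each string on a literal space; fixed = TRUE bypasses the regex engine
tokens <- strsplit(txt, " ", fixed = TRUE)

# Hypothetical helper: keep the tokens found in voc and glue them back together
keep_voc <- function(x) paste(x[x %in% voc], collapse = " ")

# vapply() guarantees exactly one string per input element
res <- vapply(tokens, keep_voc, character(1))
res
# [1] "peter red"  "mary green" "bob blue"
```

Each string is tokenised once and checked with `%in%`, so no regular expression is built or evaluated per vocabulary term. Strings without any vocabulary word come back as `""`.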

2 Answers

A base only method is:

apply(sapply(paste0("\\b",voc,"\\b"), function(x) grepl(x,txt)), 1, function(x) paste(voc[x],collapse=" "))
[1] "peter red"  "mary green" "bob blue" 

The sapply part recreates the membership matrix you used the tm package for, while the apply iterates over its rows to pull out the relevant terms from the vocabulary to paste together.
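
To make the two stages visible, the one-liner can be split up as in the sketch below. The intermediate names `patterns` and `m` are introduced here only for illustration.

```r
txt <- c("peter likes red", "mary likes green", "bob likes blue")
voc <- c("peter", "mary", "bob", "red", "green", "blue")

# One word-bounded pattern per vocabulary term
patterns <- paste0("\\b", voc, "\\b")

# Stage 1: logical membership matrix, one row per string, one column per term
m <- sapply(patterns, function(p) grepl(p, txt))
colnames(m) <- voc
m

# Stage 2: for each row, paste together the terms whose column is TRUE
res <- apply(m, 1, function(hit) paste(voc[hit], collapse = " "))
res
# [1] "peter red"  "mary green" "bob blue"
```

Two things follow from this layout. The matched words come out in the order of `voc`, not in the order they appear in the text. And `m` holds length(txt) * length(voc) entries, so it grows quickly with large inputs. Note also that with a single input string `sapply` returns a vector rather than a matrix, so the `apply` step would need adjusting.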

James

You can use stringi

library(stringi)
sapply(stri_extract_all_regex(txt, paste0('\\b', voc, collapse = '|', '\\b')), paste, collapse = ' ')
#[1] "peter red"  "mary green" "bob blue" 

or full stringi

stri_paste_list(stri_extract_all_regex(txt, paste0('\\b', voc, collapse = '|', '\\b')), sep = ' ')
#[1] "peter red"  "mary green" "bob blue" 
Sotos