1

I have a data set like this, with a variable ("item") that contains comma-separated codes:

id  item
1    102, 103,401,
2    108,102,301
3    103, 108 , 405, 505, 708

For each id, I would like to get the frequencies of each separate item, like this:

id  102  103   104   108  301 401 ...
1    1    1                    1
2    1                 1    1
3         1            1

How can I do that?

peak
  • 88,177
  • 15
  • 123
  • 150

3 Answers3

3

We can do this with mtabulate from qdapTools

library(qdapTools)
cbind(dat['id'], mtabulate(strsplit(dat$item, '\\s*,\\s*')))
#  id 102 103 108 301 401 405 505 708
#1  1   1   1   0   0   1   0   0   0
#2  2   1   0   1   1   0   0   0   0
#3  3   0   1   1   0   0   1   1   1

NOTE: Data taken from @thelatemail's post.


Or another option (if we need a sparseMatrix)

library(Matrix)
#split the 'item' column to `list`
lst <- strsplit(dat$item, '\\s*,\\s*')
#get the `unique` elements after `unlist`ing.
Un1 <- sort(unique(unlist(lst)))
#create a `sparseMatrix` by specifying the row
#column index along with dim names (if needed)
sM <-  sparseMatrix(rep(dat$id, lengths(lst)), 
            match(unlist(lst), Un1), x= 1, 
             dimnames=list(dat$id, Un1))
sM
#    3 x 8 sparse Matrix of class "dgCMatrix"
#   102 103 108 301 401 405 505 708
#1   1   1   .   .   1   .   .   .
#2   1   .   1   1   .   .   .   .
#3   .   1   1   .   .   1   1   1

It can be converted to matrix by wrapping with as.matrix

as.matrix(sM)
#   102 103 108 301 401 405 505 708
#1   1   1   0   0   1   0   0   0
#2   1   0   1   1   0   0   0   0
#3   0   1   1   0   0   1   1   1
akrun
  • 789,025
  • 32
  • 460
  • 575
  • 1
    thank you, but when I run the code q – user2609451 Feb 06 '16 at 09:10
  • 1
    @user2609451 In the dataset that I took from thelatemail's post, the 'item' column is `character`. If you have a `factor` class, convert it to `character` and it should work fine (as `strsplit` needs a `character` class column) ie. `mtabulate(strsplit(as.character(text$item),...` – akrun Feb 06 '16 at 09:13
  • Sorry, starting from my url dati – user2609451 Feb 06 '16 at 09:40
  • I've resolved the above problem, but, when run the function sM , I receive this Error in i + (!(m.i || i1)) : non-numeric argument to binary operator – user2609451 Feb 06 '16 at 10:44
  • @user2609451 As I mentioned above, I was using thelatemail's data and it worked fine. Regarding the `$` operator problem, check whether you have a `matrix` as a dataset. – akrun Feb 06 '16 at 10:52
1

Use strsplit and then take advantage of a factor to make sure all the column counts are included.

spl <- strsplit(dat$item,"\\s*,\\s*")
ulevs <- sort(unique(unlist(spl)))
tab <- t(vapply(
  spl, 
  function(x) table(factor(x,levels=ulevs)),
  FUN.VALUE=numeric(length(ulevs))
))
cbind(dat["id"],tab)

#  id 102 103 108 301 401 405 505 708
#1  1   1   1   0   0   1   0   0   0
#2  2   1   0   1   1   0   0   0   0
#3  3   0   1   1   0   0   1   1   1

Data used:

dat <- read.table(text="id;item
1;102, 103,401,
2;108,102,301
3;103, 108 , 405, 505, 708",header=TRUE,sep=";",stringsAsFactors=FALSE)
thelatemail
  • 85,757
  • 12
  • 122
  • 177
  • 1
    Or a modification of your method will be `table(stack(setNames(spl, dat$id))[2:1])` – akrun Feb 06 '16 at 09:04
  • thank you..but when I run the first line I obtain this message Error in dat$item : $ operator is invalid for atomic vectors. I've imported my data set with this command dati – user2609451 Feb 06 '16 at 09:56
0

You can do this using strsplit function. Below is my solution

library(data.table)
id <- c(1:3)
item <- c("102, 103,401",  "108,102,301", "103, 108 , 405, 505, 708")
dataT <- data.table(id, item)

reqCol <- unlist(strsplit(dataT$item, split=","))
reqCol <- gsub(" ", "", reqCol)
reqCol <-  unique(reqCol)
reqColNames <- paste0("Col_", reqCol)

for(i in 1:length(reqCol)){
    expr1 <- parse(text = paste0(reqColNames[i], ":=0"))
    expr2 <- parse(text = paste0(reqColNames[i], ":=1"))
    dataT[, eval(expr1)]
    rowIndex <- grep(reqCol[i], dataT$item)
    dataT[rowIndex, eval(expr2)] 
}

I have used data.table instead of data.frame, because data.table is very fast as compared to data.frame.

Kumar Manglam
  • 2,650
  • 1
  • 16
  • 27