I've seen similar questions, but haven't found an answer for what I'm dealing with. I'm a first-time user, so please forgive me if there is a simple solution.
I'm using the R package "tm" and I am trying to create a term-by-document matrix out of the WebKB data found at: http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/
The data comes in several different folders, each corresponding to a topic, but I've combined the documents into one directory. Of all the documents, only one or two lie in more than one topic.
Okay, so here is what I've done:
b <- Corpus(DirSource("/Users/checkout/Downloads/webkb/z"), readerControl=list(language="eng", reader=readPlain))
b <- tm_map(b, removeNumbers)
b <- tm_map(b, removePunctuation)
b <- tm_map(b, stripWhitespace)
b <- tm_map(b, tolower, lazy = TRUE)
b <- tm_map(b, removeWords, stopwords("english"), lazy = TRUE)
So far so good, no errors. But when I run this next line
termByDoc <- TermDocumentMatrix(b)
I get the following error:
Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning messages:
1: In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
  scheduled core 1 encountered error in user code, all values of the job will be affected
2: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
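For reference, the only pattern I've seen suggested elsewhere (I haven't confirmed it fixes this exact error) is that passing a base function like tolower directly to tm_map can replace each document's content with a bare character vector, so later steps that expect a PlainTextDocument fail. Wrapping base functions in content_transformer() is supposed to keep the document class intact. A sketch of my pipeline reworked that way, under that assumption:

```r
# Hypothetical rework of the pipeline above, assuming the error comes from
# tolower() stripping the PlainTextDocument class off each document.
library(tm)

b <- Corpus(DirSource("/Users/checkout/Downloads/webkb/z"),
            readerControl = list(language = "eng", reader = readPlain))
b <- tm_map(b, removeNumbers)
b <- tm_map(b, removePunctuation)
b <- tm_map(b, stripWhitespace)
b <- tm_map(b, content_transformer(tolower))  # wrapped, so documents stay documents
b <- tm_map(b, removeWords, stopwords("english"))

termByDoc <- TermDocumentMatrix(b)
```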
If anyone could tell me what is going wrong, I'd appreciate it! Also, if there is a more efficient way to create this term-by-document matrix, I'm open to suggestions. Finally, I need to strip any links out of these HTML files; is there an R function that takes care of this? I didn't see one in the tm documentation, so suggestions on how to do that would also be appreciated.
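To frame the link question a bit: the only approach I've come across is a plain regex wrapped in content_transformer(), applied before the other cleaning steps. The URL and tag patterns below are my own rough guesses, not something built into tm, and are certainly not exhaustive:

```r
# Rough sketch: custom transformers that blank out URLs and HTML tags.
# The regexes are approximate; removeURLs and removeHTMLTags are my own names.
library(tm)

removeURLs     <- content_transformer(function(x) gsub("https?://\\S+|www\\.\\S+", " ", x))
removeHTMLTags <- content_transformer(function(x) gsub("<[^>]+>", " ", x))

b <- tm_map(b, removeHTMLTags)  # drop tags first, so URLs inside href= are exposed
b <- tm_map(b, removeURLs)
```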
Thanks for your time!