I've seen similar questions, but haven't found an answer for what I'm dealing with. I'm a first-time user, so please forgive me if there is a simple solution.
I'm using the R package "tm" and I am trying to create a term-by-document matrix out of the WebKB data found at: http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/
The data comes in several different folders, each corresponding to a topic, but I've combined the documents into one directory. Of all the documents, only one or two lie in more than one topic.
Okay, so here is what I've done:
b <- Corpus(DirSource("/Users/checkout/Downloads/webkb/z"), readerControl=list(language="eng", reader=readPlain))
b <- tm_map(b, removeNumbers)
b <- tm_map(b, removePunctuation)
b <- tm_map(b, stripWhitespace)
b <- tm_map(b, tolower, lazy = TRUE)
b <- tm_map(b, removeWords, stopwords("english"), lazy = TRUE)
So far so good, no errors. But when I run this next line
termByDoc <- TermDocumentMatrix(b)
I get the following error:
Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning messages:
1: In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
  scheduled core 1 encountered error in user code, all values of the job will be affected
2: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
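For reference, the only pattern I've seen suggested elsewhere (I haven't confirmed it fixes this exact error) is that passing a base function like tolower directly to tm_map can replace each document's content with a bare character vector, so later steps that expect a PlainTextDocument fail. Wrapping base functions in content_transformer() is supposed to keep the document class intact. A sketch of my pipeline reworked that way, under that assumption:

```r
# Hypothetical rework of the pipeline above, assuming the error comes from
# tolower() stripping the PlainTextDocument class off each document.
library(tm)

b <- Corpus(DirSource("/Users/checkout/Downloads/webkb/z"),
            readerControl = list(language = "eng", reader = readPlain))
b <- tm_map(b, removeNumbers)
b <- tm_map(b, removePunctuation)
b <- tm_map(b, stripWhitespace)
b <- tm_map(b, content_transformer(tolower))  # wrapped, so documents stay documents
b <- tm_map(b, removeWords, stopwords("english"))

termByDoc <- TermDocumentMatrix(b)
```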
If anyone could tell me what is going wrong, I'd appreciate it! Also, if there is a more efficient way to create this term-by-document matrix, I'm open to suggestions. Finally, I need to strip any links out of these HTML files; is there an R function that takes care of this? I didn't see one in the tm documentation, so suggestions on how to do that would also be appreciated.
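To frame the link question a bit: the only approach I've come across is a plain regex wrapped in content_transformer(), applied before the other cleaning steps. The URL and tag patterns below are my own rough guesses, not something built into tm, and are certainly not exhaustive:

```r
# Rough sketch: custom transformers that blank out URLs and HTML tags.
# The regexes are approximate; removeURLs and removeHTMLTags are my own names.
library(tm)

removeURLs     <- content_transformer(function(x) gsub("https?://\\S+|www\\.\\S+", " ", x))
removeHTMLTags <- content_transformer(function(x) gsub("<[^>]+>", " ", x))

b <- tm_map(b, removeHTMLTags)  # drop tags first, so URLs inside href= are exposed
b <- tm_map(b, removeURLs)
```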
Thanks for your time!