I am using R in RStudio and running the following code to perform a sentiment analysis on a set of unstructured texts. Since the texts contain some invalid characters (caused by emoticons and typos), I want to remove them before proceeding with the analysis.

An extract of my R code:

setwd("E:/sentiment")

doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)

# replace specific characters in doc1
  doc1<-gsub("[^\x01-\x7F]", "", doc1)

library(tm)

#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))

I get the following error message when I reach this line of code corpus<- iconv(doc1$Review.Text, to = 'utf-8'):

Error in doc1$Review.Text : $ operator is invalid for atomic vectors

I had a look at the following StackOverflow questions:

remove emoticons in R using tm package

Replace specific characters within strings

I have also tried the following to clean my texts before using the tm package, but I get the same error: doc1<-iconv(doc1, "latin1", "ASCII", sub="")

How can I solve this issue?

user3115933

1 Answer

With

doc1<-gsub("[^\x01-\x7F]", "", doc1)

you overwrite the object doc1. From then on it is no longer a data frame but a character vector; see:

doc1 <- gsub("[^\x01-\x7F]", "", iris)
str(doc1)

Now it is clear why

doc1$Species

produces the error.
You probably want to do:

doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
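
To see the difference without the original file, here is a minimal sketch. The small data frame is a made-up stand-in for book1.csv (only the column name Review.Text comes from the question; the Id column and the sample strings are assumptions), and the variable name flattened is just for illustration:

```r
# Hypothetical stand-in for book1.csv; only the Review.Text name is from the question.
doc1 <- data.frame(
  Id = 1:2,
  Review.Text = c("Great book \u00e9\u263a", "Plain ASCII text"),
  stringsAsFactors = FALSE
)

# Passing the whole data frame: gsub() coerces it with as.character(),
# which yields one string per column, so the data frame structure is lost.
flattened <- gsub("[^\x01-\x7F]", "", doc1)
is.data.frame(flattened)   # no longer a data frame
length(flattened)          # one element per column

# Cleaning only the text column keeps doc1 a data frame,
# so doc1$Review.Text remains valid afterwards.
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
is.data.frame(doc1)
```

After the column-wise replacement, the later call iconv(doc1$Review.Text, to = 'utf-8') should no longer hit the $ operator error, because doc1 is still a data frame.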
jogo