I am working with Burmese text and am trying to run a topic model in R, but R seems to have trouble rendering Burmese characters. When I load the data into a data.frame and build a corpus from it, the Burmese characters are rendered correctly:
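library(tm)  # provides DataframeSource() and Corpus()

# Read the CSV as UTF-8 and keep the text columns as character vectors
data <- read.csv("data.csv", fileEncoding = "UTF8", encoding = "UTF-8", stringsAsFactors = FALSE)
filenames <- data[, 2]  # document names
txts <- data[, 5]       # Burmese text
docs <- data.frame(docs = txts, row.names = filenames)
ds <- DataframeSource(docs)
cases <- Corpus(ds)
cases[[1]]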
လိုက်... #[the rest is a text file with properly rendered Burmese]
However, when I print the text directly from the data read from the CSV file, rather than through the corpus, several characters are displayed as Unicode escapes:
data[1,5]
လိုက\u103a
The rest of the output is a paragraph of text in which some accent marks are displayed incorrectly, as in this example.
I have checked the encodings using Encoding() and R confirms that in both cases I am using UTF-8.
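For reference, here is a minimal sketch of the kind of check I mean, extended to compare print() with writeLines(). The string `x` is a hypothetical stand-in for `data[1,5]`, built from the five code points of the word shown above, and I am only assuming that the escape might come from how print() displays the string rather than from the data itself. validUTF8() needs R 3.3.0 or later.

```r
# Hypothetical stand-in for data[1, 5]: the five code points of the word above
x <- "\u101c\u102d\u102f\u1000\u103a"

Encoding(x)               # declared encoding mark of the string
validUTF8(x)              # whether the bytes form valid UTF-8
utf8ToInt(x)              # integer code points; 4154 corresponds to U+103A
nchar(x, type = "chars")  # number of code points
nchar(x, type = "bytes")  # number of bytes in the UTF-8 representation

# print() may escape code points it treats as non-printable,
# while writeLines() writes the string without that escaping
print(x)
writeLines(x)
```

If the code points are intact and only print() shows `\u103a`, the data itself is probably fine and the issue is limited to display.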
FYI, I use a Mac running R64. A colleague who uses a PC did not encounter this issue, but we could not isolate the cause.
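In case it helps compare the two machines, this is a small sketch of the session details that could be collected on each one. It assumes the difference is related to locale or platform, which we have not confirmed.

```r
# Collect locale and platform details to compare between the Mac and the PC
info <- l10n_info()        # list describing the current locale
info[["UTF-8"]]            # TRUE if R is running in a UTF-8 locale
Sys.getlocale("LC_CTYPE")  # character-type locale in effect
R.version$platform         # platform R was built for
R.version.string           # R version
```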