15

I am scraping names from the web into a data frame.

For a name such as "Tomáš Rosický", I get the result "TomÃ¡Å¡ RosickÃ½".

I tried

Encoding("TomÃ¡Å¡ RosickÃ½") # returns "latin1"

but was not sure where to go from there to get the original name, with its accents, back. I played around with iconv without success.

I would be satisfied with (and might even prefer) an output of "Tomas Rosicky".

mathematical.coffee
pssguy
    How did you read the data.frame? Usually you can supply an encoding parameter such as `fileEncoding` to `read.table`. And as @Hong Ooi answered, UTF-8 seems to be the encoding you need. – Tommy Mar 01 '12 at 06:48

4 Answers

13

You've read in a page encoded in UTF-8. If `x` is your column of names, use `Encoding(x) <- "UTF-8"`.
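
As a minimal sketch of that fix, the snippet below rebuilds the name from its raw UTF-8 bytes (so the example does not depend on how your script file is encoded), declares it as latin1 to mimic the scraped result, and then re-declares it as UTF-8. The name `x` follows the answer; the byte vector is an illustration, not the asker's actual data.

```r
# UTF-8 bytes of "Tomáš Rosický"
bytes <- as.raw(c(0x54, 0x6f, 0x6d, 0xc3, 0xa1, 0xc5, 0xa1, 0x20,
                  0x52, 0x6f, 0x73, 0x69, 0x63, 0x6b, 0xc3, 0xbd))
x <- rawToChar(bytes)

# Mimic the scraped state: UTF-8 bytes wrongly declared as latin1
Encoding(x) <- "latin1"

# The fix: change only the declared encoding; the bytes are left untouched
Encoding(x) <- "UTF-8"
print(x)
```

`Encoding<-` does not convert anything, so it only helps when the underlying bytes really are UTF-8.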

Hong Ooi
6

You should use this:

df$colname <- iconv(df$colname, from="UTF-8", to="LATIN1")
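
Since the question says that plain "Tomas Rosicky" would be acceptable, one option is `iconv()` with a transliterating target. This is a sketch and not part of the answer above: the `//TRANSLIT` suffix is handled by the underlying iconv library, so the exact output depends on platform and locale, and some implementations return different substitutions or `NA`. Note also that ISO-8859-1 has no "š", so converting this particular name to LATIN1 may yield `NA`.

```r
# "Tomáš Rosický" written with Unicode escapes, so it is stored as UTF-8
x <- "Tom\u00e1\u0161 Rosick\u00fd"

# Ask iconv for the closest ASCII equivalents; output varies by platform
ascii <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
print(ascii)

# Guard against platforms where the conversion fails and returns NA
if (is.na(ascii)) message("Transliteration is not supported here")
```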
rink.attendant.6
Roadkill
5

To read the file correctly, use the scan function:

namb <- scan(file = 'g:/testcodering.txt', fileEncoding = 'UTF-8',
             what = character(), sep = '\n', allowEscapes = TRUE)
cat(namb)

This also works:

con <- file('g:/testcodering.txt', "r", encoding = 'UTF-8')
namc <- readLines(con)
close(con)
cat(namc)

Either approach reads the file with the accents intact.
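
To try this without a file of your own, here is a small round trip. It is a sketch under assumptions: a temporary file stands in for 'g:/testcodering.txt', and it uses the `encoding` argument of `readLines()`, which marks the strings as UTF-8 rather than re-encoding them the way a `file(..., encoding = )` connection does.

```r
# The name written with Unicode escapes, so R stores it as UTF-8
name <- "Tom\u00e1\u0161 Rosick\u00fd"

# Write the UTF-8 bytes unchanged to a temporary file (hypothetical stand-in)
path <- tempfile(fileext = ".txt")
writeLines(name, path, useBytes = TRUE)

# Read it back, declaring the lines as UTF-8
back <- readLines(path, encoding = "UTF-8")
print(back)

unlink(path)
```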

dpel
Mischa Vreeburg
3

A way to export accents correctly:

enc2utf8(as(dataframe$columnname, "character"))
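
A small illustration of what `enc2utf8()` does, assuming a value that is correctly declared as latin1 (the variable names and the sample bytes are hypothetical). Unlike `Encoding<-`, it actually converts the bytes.

```r
# "Tomás" as latin1 bytes: 0xe1 is "á" in ISO-8859-1
columnname <- rawToChar(as.raw(c(0x54, 0x6f, 0x6d, 0xe1, 0x73)))
Encoding(columnname) <- "latin1"

# Convert to UTF-8; "á" becomes the two bytes c3 a1
utf8 <- enc2utf8(as.character(columnname))
print(Encoding(utf8))
print(charToRaw(utf8))
```

If the input is wrongly declared (for example, UTF-8 bytes marked as latin1), `enc2utf8()` will convert the wrong thing, so the declared encoding should be fixed first.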
iulilia