R string, UTF-8 encoding: handling Swedish characters

Question

I have a problem displaying the Swedish characters ä, ö and å in a presentable way in R.
I get my data directly from an MS SQL database.
Here are some examples:

markets <- c("Caf\xe9                          ","Restaurang kv\xe4ll              ","Barnomsorg tillagningsk\xf6k     ","Folkh\xf6gskola                  ")

Then I use gsub to remove the padding spaces:

market=gsub(" ", "", markets,fixed = TRUE)

I got this error:
Error in gsub(" ", "", market, fixed = TRUE) :
input string 3 is invalid UTF-8

Then I used this command instead:
markets_new=gsub(" ", "", markets)

Then I get strange Chinese characters in the strings: "Caf攼㸹" "Restauranglunch+kv攼㸴ll" "Barnomsorgtillagningsk昼㸶k" "Folkh昼㸶gskola"

I tried changing the default settings of RStudio as described here: https://yihui.name/en/2018/11/biggest-regret-knitr/?fbclid=IwAR2E5Lp0zjS51fcdjgZ1tej0sg5EBxfG8sNitt-cUA2XEshnT3lNCHNQ3Do

It does not help. I also tried using gsub() to substitute the characters, but that does not seem to work either.

One more thing, if I use

write.csv(markets,'submarket product view.csv',row.names = F)

then in my CSV file I see the following:

"Caf<e9>                          "
"Restaurang kv<e4>ll              "
"Barnomsorg tillagningsk<f6>k     "
"Folkh<f6>gskola                  "
"Sm<f6>rg<e5>s/salladsrestaurang     "

I think <e9> is é (e with an acute accent), <e4> is ä, <f6> is ö, and <e5> is å.
Any suggestions for how to handle this?

It works fine as is in my Windows RGui 3.4.3 build. The problem is most likely with the locale. — Wiktor Stribiżew, Mar 07 '19 at 09:15
@WiktorStribiżew: it does not really work. When I use that command on one column in a data frame or tibble I get the result: Cafæ”¼ã¸¹, Restaurangostorkæ˜¼ã. But it works if I apply it only to this character vector. Any more suggestions? Thank you! — CloverCeline, Mar 08 '19 at 09:05
Provide [reproducible data](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Wiktor Stribiżew, Mar 08 '19 at 09:07
@WiktorStribiżew: hi Wiktor, here is the code: m % mutate(market=gsub(" ", "", `Encoding — CloverCeline, Mar 10 '19 at 13:06

score 1 · Accepted Answer · answered Mar 11 '19 at 07:55

Thanks to @Wiktor Stribiżew this solution works best:

df$m <- gsub(" ", "", `Encoding<-`(as.character(df$m), "latin1"),fixed = TRUE)
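A minimal sketch of why this helps, assuming the bytes delivered by the database really are Latin-1 (Windows-1252 would behave the same for these letters). Instead of only marking the encoding, it converts the strings to UTF-8 with `iconv()`, which is a swapped-in variant of the line above rather than the exact method from the answer. The names `markets_utf8`, `no_spaces`, `trimmed` and `df` are made up for illustration.

```r
# Sample values from the question: raw Latin-1 bytes, padded with spaces.
markets <- c("Caf\xe9          ", "Restaurang kv\xe4ll     ",
             "Barnomsorg tillagningsk\xf6k  ", "Folkh\xf6gskola       ")

# Assumption: the bytes are Latin-1. iconv() re-encodes them as UTF-8,
# after which gsub(fixed = TRUE) and trimws() accept them.
markets_utf8 <- iconv(markets, from = "latin1", to = "UTF-8")

no_spaces <- gsub(" ", "", markets_utf8, fixed = TRUE)  # drops every space
trimmed   <- trimws(markets_utf8)                       # drops only the padding

# Same idea for a data frame column; as.character() guards against factors.
df <- data.frame(m = markets, stringsAsFactors = FALSE)
df$m <- trimws(iconv(as.character(df$m), from = "latin1", to = "UTF-8"))
```

Note that `gsub(" ", "", ...)` also removes the spaces between words ("Restaurang kväll" becomes "Restaurangkväll"), so `trimws()` is probably closer to what is wanted if only the padding should go.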

answered Mar 11 '19 at 07:55

CloverCeline


score 0 · Answer 2 · answered Mar 07 '19 at 09:11

Try this:

Encoding(markets) <- "UTF-16"
markets <- trimws(markets)

#[1] "Café" "Restaurang kväll" "Barnomsorg tillagningskök" "Folkhögskola"

answered Mar 07 '19 at 09:11

Wimpel


This doesn't work for me, while `latin1` works, as I said in the comments. – nicola Mar 07 '19 at 09:13
@nicola weird.. works just fine for me. Any ideas why? selected locale? – Wimpel Mar 07 '19 at 09:14
Not really. It seems very strange that something like `UTF-16` might work here (as far as I know, `UTF-16` wants two bytes per character). – nicola Mar 07 '19 at 09:18
Thank you for the input. I tried both. If I treat 'markets' as a character vector, both `latin1` and `UTF-16` work. But currently I have a column called 'market' in a tibble that has this issue. Then I tried `test % head(100) %>% mutate(mar=gsub(" ", "", `Encoding% head(100) %>% mutate(mar=gsub(" ", "", `Encoding – CloverCeline Mar 07 '19 at 12:58
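
The point about byte width raised in the comments can be checked directly. The following is only a small sketch using one value from the question (`x` is a made-up name): it inspects the raw bytes, which suggests each accented letter is stored as a single byte, consistent with Latin-1 rather than UTF-8 or UTF-16.

```r
x <- "Caf\xe9"

charToRaw(x)   # one byte per character; the last byte is e9
validUTF8(x)   # FALSE: a lone e9 byte is not a valid UTF-8 sequence

# Declaring the encoding does not change the bytes, only how R reads them.
Encoding(x) <- "latin1"
Encoding(x)
nchar(x)       # 4 characters once R knows the bytes are Latin-1
```

If `validUTF8()` returns `FALSE` for a column, that is a hint that it needs to be declared or converted before string functions such as `gsub(fixed = TRUE)` will accept it.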
