
I am having trouble getting the Swedish characters ä, ö and å to display properly in R.
I get my data directly from an MS SQL database.
Here is an example:

markets <- c("Caf\xe9                          ","Restaurang kv\xe4ll              ","Barnomsorg tillagningsk\xf6k     ","Folkh\xf6gskola                  ")

Then I use gsub to remove the trailing spaces:

market=gsub(" ", "", markets,fixed = TRUE)

I got this error:
Error in gsub(" ", "", market, fixed = TRUE) :
input string 3 is invalid UTF-8

Then I tried this command instead:
markets_new=gsub(" ", "", markets)

Now the strings contain strange Chinese characters: "Caf攼㸹" "Restauranglunch+kv攼㸴ll" "Barnomsorgtillagningsk昼㸶k" "Folkh昼㸶gskola"

I tried changing the default settings of RStudio by following: https://yihui.name/en/2018/11/biggest-regret-knitr/?fbclid=IwAR2E5Lp0zjS51fcdjgZ1tej0sg5EBxfG8sNitt-cUA2XEshnT3lNCHNQ3Do

That did not help. I also tried using gsub() to substitute the characters, but that does not seem to work either.

One more thing, if I use

write.csv(markets,'submarket product view.csv',row.names = F)

then this is what I see in my CSV file:

"Caf<e9>                          "
"Restaurang kv<e4>ll              "
"Barnomsorg tillagningsk<f6>k     "
"Folkh<f6>gskola                  "
"Sm<f6>rg<e5>s/salladsrestaurang     " 

I think <e9> is é, <e4> is ä, <f6> is ö, and <e5> is å.
Any suggestions on how to fix this?

Mr Lister
CloverCeline

2 Answers


Thanks to @Wiktor Stribiżew this solution works best:

df$m <- gsub(" ", "", `Encoding<-`(as.character(df$m), "latin1"),fixed = TRUE) 
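A minimal sketch of the same idea on the sample vector from the question, assuming the bytes coming from the database really are latin1 (Windows-1252 would behave the same for these characters). The name `markets_nospace` is only illustrative. Note that removing every space also joins the words inside a value:

```r
# Sample data from the question: latin1 bytes written as \x escapes.
markets <- c("Caf\xe9                          ",
             "Restaurang kv\xe4ll              ",
             "Barnomsorg tillagningsk\xf6k     ",
             "Folkh\xf6gskola                  ")

# Declare the bytes as latin1 (assumption: that is what the database sent).
Encoding(markets) <- "latin1"

# Remove every space, as in the line above. fixed = TRUE no longer fails,
# because the strings are now translated from latin1 instead of being
# read as (invalid) UTF-8.
markets_nospace <- gsub(" ", "", markets, fixed = TRUE)

# Optionally re-encode to UTF-8 so the result does not depend on the locale.
markets_nospace <- enc2utf8(markets_nospace)
print(markets_nospace)
```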
CloverCeline

Try this:

Encoding(markets) <- "UTF-16"
markets <- trimws(markets)

#[1] "Café" "Restaurang kväll" "Barnomsorg tillagningskök" "Folkhögskola"  
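A possible variation, not the code above: this sketch converts the strings explicitly with `iconv()` instead of setting `Encoding()`, assuming the source bytes are latin1 (the encoding reported to work in the comments). The names `markets_utf8`, `markets_trimmed` and `out_file` are illustrative. `trimws()` strips only leading and trailing whitespace, so spaces between words are kept, and `fileEncoding` asks `write.csv()` to write the file as UTF-8:

```r
# Sample data from the question: latin1 bytes written as \x escapes.
markets <- c("Caf\xe9                          ",
             "Restaurang kv\xe4ll              ",
             "Barnomsorg tillagningsk\xf6k     ",
             "Folkh\xf6gskola                  ")

# Convert the bytes from latin1 to UTF-8 (assumption: the input is latin1).
markets_utf8 <- iconv(markets, from = "latin1", to = "UTF-8")

# Strip leading/trailing whitespace only; inner spaces are preserved.
markets_trimmed <- trimws(markets_utf8)
print(markets_trimmed)

# Write the cleaned values to a CSV file, requesting UTF-8 output.
# A temporary file is used here so the sketch does not touch real data.
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(market = markets_trimmed), out_file,
          row.names = FALSE, fileEncoding = "UTF-8")
```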
Wimpel
  • This doesn't work for me, while `latin1` works, as I said in the comments. – nicola Mar 07 '19 at 09:13
  • @nicola weird.. works just fine for me. Any ideas why? selected locale? – Wimpel Mar 07 '19 at 09:14
  • Not really. It seems very strange that something like `UTF-16` might work here (as far as I know, `UTF-16` wants two bytes per character). – nicola Mar 07 '19 at 09:18
  • Thank you for the input. I tried both. If I treat 'markets' as a character vector, both `latin1` and `UTF-16` work. But currently I have a column called 'market' in a tibble that has this issue. Then I tried `test % head(100) %>% mutate(mar=gsub(" ", "", `Encoding% head(100) %>% mutate(mar=gsub(" ", "", `Encoding – CloverCeline Mar 07 '19 at 12:58