Remove accents from a dataframe column in R

Question

I have a data.table called base, which has a term column:

class(base$term)
[1] "character"
length(base$term)
[1] 27486

I'm able to remove accents from a string, and from a vector of strings:

iconv("Millésime",to="ASCII//TRANSLIT")
[1] "Millesime"
iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
[1] "Millesime" "boulangere"

But for some reason, it does not work when I apply the very same function to my term column:

base$terme[2]
[1] "Millésime"
iconv(base$terme[2],to="ASCII//TRANSLIT")
[1] "MillACsime"

Does anybody know what is going on here?

Is it because `base$terme` is a factor? Try converting to `character` first or convert the levels maybe? — NJBurgo, Aug 25 '16 at 15:07
@NJBurgo According to the first output (assuming a typo), it’s of type `character`. — Konrad Rudolph, Aug 25 '16 at 15:08
Careful: I get a completely different result for your vectors: ``[1] "Mill'esime" "boulang`ere"`` The `iconv` documentation specifies that `TRANSLIT` gives different results on different systems (which is of course a bit useless). — Konrad Rudolph, Aug 25 '16 at 15:08
Try `iconv(base$terme[2],from="latin1",to="ASCII//TRANSLIT")`. If it doesn't work, please give the output of `Encoding(base$terme[2])`. — nicola, Aug 25 '16 at 15:15

Jeldrik · Answer 1 · 2019-06-23T15:16:30.833

37

It might be easier to use the stringi package. This way, you don't need to check the encoding beforehand. Furthermore, stringi is consistent across operating systems, whereas iconv is not.

library(data.table)
library(stringi)

base <- data.table(terme = c("Millésime", 
                             "boulangère", 
                             "üéâäàåçêëèïîì"))

base[, terme := stri_trans_general(str = terme, 
                                   id = "Latin-ASCII")]

> base
           terme
1:     Millesime
2:    boulangere
3: ueaaaaceeeiii

edited Jun 23 '19 at 15:16

answered Jun 14 '19 at 09:20


This is the only method which worked for me, thanks a lot! – Janus De Bondt Mar 17 '20 at 07:30
This is the only method that worked for me as well. However, if someone would like to fix the following one => `ds_itens$content – Luis Feb 24 '22 at 15:01

score 33 · Answer 2 · answered Aug 26 '16 at 07:57

OK, here is the way to solve the problem:

Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"

Thanks to @nicola
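
A minimal sketch of the same fix applied to a whole vector, assuming the strings may carry different declared encodings. The helper name `strip_accents_iconv` is hypothetical; `enc2utf8()`, `Encoding()` and `iconv()` are base R. Because the result of `//TRANSLIT` is platform-dependent (as noted in the comments on the question), no expected output is shown.

```r
# Hypothetical helper: normalise to UTF-8 first, then transliterate to ASCII.
strip_accents_iconv <- function(x) {
  # Re-encode each string to UTF-8 based on its declared encoding.
  x <- enc2utf8(as.character(x))
  # State the source encoding explicitly instead of relying on the locale.
  iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
}

# \u escapes produce strings that are marked as UTF-8.
terme <- c("Mill\u00e9sime", "boulang\u00e8re")

Encoding(terme)             # inspect the declared encoding first
strip_accents_iconv(terme)  # result depends on the platform's iconv
```

Checking `Encoding()` before converting is the step that revealed the problem here: the column was UTF-8, so `from` had to say so.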

hans glick


score 3 · Answer 3 · answered Feb 18 '19 at 16:18

You can apply this function:

rm_accent <- function(str, pattern = "all") {
  # Coerce non-character input (e.g. factors) to character
  if (!is.character(str))
    str <- as.character(str)

  pattern <- unique(pattern)

  if (any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"

  # Accented characters, grouped by accent type
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )

  # Unaccented replacements, in the same order as `symbols`
  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circunflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )

  accentTypes <- c("´", "`", "^", "~", "¨", "ç")

  # Option: remove all accents
  if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern))
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))

  # Otherwise remove only the requested accent types
  for (i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)

  return(str)
}
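
The core of the function above is `chartr()`, which maps each character of its first argument to the character at the same position in its second. A small sketch with a made-up input vector, written with `\u` escapes so that it does not depend on how the script file is encoded:

```r
# Accented characters to replace: é è ê ë
accented <- "\u00e9\u00e8\u00ea\u00eb"
# Replacements, position by position; must have as many characters as `accented`
plain <- "eeee"

x <- c("Mill\u00e9sime", "boulang\u00e8re")
chartr(accented, plain, x)
```

Characters that are not listed in the first argument pass through unchanged, which is why the lookup strings have to enumerate every accent you expect to meet.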

Unlike `iconv` this isn't vectorized and doesn't cover all possibilities. — Leonardo Siqueira, Apr 10 '19 at 19:29

score 3 · Answer 4 · edited Sep 02 '21 at 09:38

Three ways to remove accents, shown and compared below.
The data to play with:

library(data.table)
dtCases <- fread("https://raw.githubusercontent.com/ccodwg/Covid19Canada/master/retired_datasets/individual_level/cases_2021_1.csv", stringsAsFactors = FALSE)
dim(dtCases) #  751526     16

Benchmarking:

> system.time(dtCases [, city0 := health_region])
   user  system elapsed 
  0.009   0.001   0.012 
> system.time(dtCases [, city1 := base::iconv (health_region, to="ASCII//TRANSLIT")]) # or ... iconv (health_region, from="UTF-8", to="ASCII//TRANSLIT")
   user  system elapsed 
  0.165   0.001   0.200 
> system.time(dtCases [, city2 := textclean::replace_non_ascii (health_region)])
   user  system elapsed 
  9.108   0.063   9.351 
> system.time(dtCases [, city3 := stringi::stri_trans_general (health_region,id = "Latin-ASCII")])
   user  system elapsed 
   4.34    0.00    4.46

Result:

> dtCases[city0!=city1, city0:city3] %>% unique
                           city0                         city1                         city2                         city3
                          <char>                        <char>                        <char>                        <char>
1:                      Montréal                      Montreal                      Montreal                      Montreal
2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord

Conclusion:

In this benchmark, base::iconv() is the fastest method and therefore the one I prefer. It was tested on French words only, not on other languages.

score 2 · Answer 5 · answered Aug 13 '21 at 22:30

Here is a version of Jeldrik's solution revised for data frames. Note that the := operator belongs to data.table and does not work on a plain data.frame.

library(stringi)

base <- data.frame(terme = c("Millésime", 
                             "boulangère", 
                             "üéâäàåçêëèïîì"))

base$terme = stri_trans_general(str = base$terme, id = "Latin-ASCII")
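
If several columns need cleaning, the same call can be applied to every character column at once. This is only a sketch, assuming stringi is installed; the data frame `df` and its `ville` and `n` columns are made up for illustration.

```r
library(stringi)

# Hypothetical data frame with two text columns and one integer column
df <- data.frame(
  terme = c("Mill\u00e9sime", "boulang\u00e8re"),
  ville = c("Montr\u00e9al", "Qu\u00e9bec"),
  n     = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Find the character columns and transliterate only those
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], stri_trans_general, id = "Latin-ASCII")
df
```

Selecting columns with `vapply(df, is.character, logical(1))` leaves numeric columns untouched, so the integer column `n` keeps its type.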
