How to remove accents from values in columns?

Question

How do I change the special characters to the usual alphabet letters? This is my dataframe:

In [56]: cities
Out[56]:

Table Code  Country         Year        City        Value       
240         Åland Islands   2014.0      MARIEHAMN   11437.0 1
240         Åland Islands   2010.0      MARIEHAMN   5829.5  1
240         Albania         2011.0      Durrës      113249.0
240         Albania         2011.0      TIRANA      418495.0
240         Albania         2011.0      Durrës      56511.0

I want it to look like this:

In [56]: cities
Out[56]:

Table Code  Country         Year        City        Value       
240         Aland Islands   2014.0      MARIEHAMN   11437.0 1
240         Aland Islands   2010.0      MARIEHAMN   5829.5  1
240         Albania         2011.0      Durres      113249.0
240         Albania         2011.0      TIRANA      418495.0
240         Albania         2011.0      Durres      56511.0

Possible duplicate of [What is the best way to remove accents in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) — phuclv, Apr 04 '18 at 13:50

score 77 · Answer 1 · answered Jun 20 '16 at 15:39

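The pandas way is to chain the vectorised str.normalize, str.encode and str.decode methods. NFKD normalization splits each accented letter into a base letter plus a combining mark, encoding to ASCII with errors='ignore' drops the marks, and decoding turns the bytes back into text: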

In [60]:
df['Country'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

Out[60]:
0    Aland Islands
1    Aland Islands
2          Albania
3          Albania
4          Albania
Name: Country, dtype: object

To do this for every string (object dtype) column:

In [64]:
cols = df.select_dtypes(include=['object']).columns  # np.object was removed in NumPy 1.24
df[cols] = df[cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
df

Out[64]:
   Table Code        Country    Year       City      Value
0         240  Aland Islands  2014.0  MARIEHAMN  11437.0 1
1         240  Aland Islands  2010.0  MARIEHAMN  5829.5  1
2         240        Albania  2011.0     Durres   113249.0
3         240        Albania  2011.0     TIRANA   418495.0
4         240        Albania  2011.0     Durres    56511.0
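The chain above can be wrapped in a small helper. This is only a minimal sketch: the name strip_accents_frame and the sample data are made up for illustration, and it checks each column with pd.api.types.is_string_dtype rather than select_dtypes, which is a substitution on my part and not what the answer uses.

```python
import pandas as pd


def strip_accents_frame(df):
    """Return a copy of df with accents removed from every text column."""
    out = df.copy()
    for col in out.columns:
        # Skip numeric columns; only string-like columns have a .str accessor
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = (
                out[col]
                .str.normalize("NFKD")                 # split letters from combining marks
                .str.encode("ascii", errors="ignore")  # drop everything non-ASCII
                .str.decode("utf-8")                   # bytes back to text
            )
    return out


# Hypothetical sample resembling the question's data
cities = pd.DataFrame(
    {
        "Country": ["Åland Islands", "Albania"],
        "City": ["MARIEHAMN", "Durrës"],
        "Year": [2014.0, 2011.0],
    }
)
clean = strip_accents_frame(cities)
print(clean)
```

One caveat worth knowing: letters that have no Unicode decomposition, such as "ø" or "ß", are not converted to a base letter. With errors='ignore' they are dropped from the string entirely.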

This should be the selected answer, correct solution to the problem. — Arthur, May 17 '18 at 10:18

score 7 · Answer 2 · answered Jan 06 '18 at 14:35

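An example with a pandas Series, using the third-party unidecode package:

import unidecode

def remove_accents(a):
    # In Python 3 the values are already str; the original
    # a.decode('utf-8') is only needed for Python 2 byte strings
    return unidecode.unidecode(a)

df['column'] = df['column'].apply(remove_accents)

Here unidecode transliterates each value to its closest ASCII representation.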

answered Jan 06 '18 at 14:35 by Caio Andrian

score 4 · Answer 3 · answered Jun 20 '16 at 15:29

This is for Python 2.7. For converting to ASCII you might want to try:

import unicodedata

unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'
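In Python 3, encode returns bytes rather than a string, so the same idea needs one extra decode step. A minimal sketch, where the function name to_ascii is hypothetical:

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode() yields bytes in Python 3, so decode back to str
    return decomposed.encode("ascii", "ignore").decode("ascii")


result = to_ascii("Durrës Åland Islands")
print(result)
```

Applied to a column, this could be used as df['Country'].apply(to_ascii), assuming the column holds only strings.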

answered Jun 20 '16 at 15:29 by advance512

score 1 · Answer 4 · answered May 17 '21 at 17:42

I wanted to remove the accents from all the column names, so I used:

df.columns = df.columns.str.normalize('NFKD').str.encode('ascii',errors='ignore').str.decode('utf-8')

answered May 17 '21 at 17:42 by Joselin Ceron

score -10 · Accepted Answer · edited May 23 '17 at 10:29

Use this code:

df['Country'] = df['Country'].str.replace(u"Å", "A")
df['City'] = df['City'].str.replace(u"ë", "e")

You would then need to repeat this for every special character and every column.
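If you do go the manual route, one way to avoid a separate replace call per character is a translation table. This is a sketch under assumptions: the replacements mapping is hypothetical and would have to list every character that occurs in your data.

```python
import pandas as pd

# Hypothetical mapping; extend it with every character you need to replace
replacements = {"Å": "A", "ë": "e"}
table = str.maketrans(replacements)

df = pd.DataFrame(
    {
        "Country": ["Åland Islands", "Albania"],
        "City": ["MARIEHAMN", "Durrës"],
    }
)

# Apply the same table to each text column in one pass per column
for col in ["Country", "City"]:
    df[col] = df[col].str.translate(table)

print(df)
```

Characters missing from the mapping are left unchanged, so this approach silently misses anything you did not list.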

answered Jun 20 '16 at 15:45 by Blind0ne, edited May 23 '17 at 10:29
