Easy way to remove accents from a Unicode string?

Question

I want to change this sentence:

Et ça sera sa moitié.

To :

Et ca sera sa moitie.

Is there an easy way to do this in Java, like I would in Objective-C?

NSString *str = @"Et ça sera sa moitié.";
NSData *data = [str dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *newStr = [[NSString alloc] initWithData:data encoding:NSASCIIStringEncoding];

This has nothing to do with UTF-8 or any other character encoding. – Jesper Apr 27 '15 at 08:30

score 171 · Accepted Answer · edited Mar 13 '19 at 17:48

Finally, I've solved it by using the Normalizer class.

import java.text.Normalizer;

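public static String stripAccents(String s)
{
    // Decompose each accented character into its base letter plus combining marks.
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    // Delete the combining marks, leaving only the base letters.
    s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return s;
}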
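
To make the two steps easier to follow, here is a minimal, self-contained sketch of what NFD does before the regex runs. The class name `NfdDemo` and the helper `decompose` are made up for illustration; only `java.text.Normalizer` and the regex come from the answer.

```java
import java.text.Normalizer;

final class NfdDemo {
    // Split precomposed characters into a base letter followed by combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Same approach as the answer: decompose, then delete the combining marks.
    static String stripAccents(String s) {
        return decompose(s).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String e = "\u00e9"; // e with acute accent as one precomposed code point
        System.out.println(e.length());            // 1
        System.out.println(decompose(e).length()); // 2: 'e' followed by U+0301
        System.out.println(stripAccents("Et \u00e7a sera sa moiti\u00e9.")); // Et ca sera sa moitie.
    }
}
```

The regex only ever sees the decomposed form, which is why the base letters survive and nothing but the marks is deleted.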

edited Mar 13 '19 at 17:48

Abdull


answered Mar 03 '13 at 20:58

Rob


Is this removing characters, or replacing accented characters with their unaccented equivalents? I'm asking because this: `replaceAll("[^\\p{ASCII}]", "")` looks like replacement with nothing (removing). – Kamil Mar 03 '13 at 21:26
You're right, I just edited my answer (the aim is of course to replace and not remove the characters). – Rob Mar 03 '13 at 21:28
In order to correctly transform some strings, I used **`Form.NFKD`** ("Compatibility decomposition.") – Anthony O. Jul 24 '13 at 12:31
I used this to normalize the filename. – Diego Macario Apr 08 '16 at 12:56
It seems that Normalizer is deprecated and Normalizer2 should be used instead: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer2.html – ykatchou Feb 12 '19 at 12:14
@ykatchou, what makes you believe `java.text.Normalizer` to be deprecated? – Abdull Mar 13 '19 at 17:49
See here: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer.html ("This API has been replaced by the Normalizer2 class and is only available for backward compatibility") – ykatchou Sep 03 '19 at 13:33
@ykatchou you refer to "com.ibm.icu.text.Normalizer", but the answer is about "java.text.Normalizer". – David S. Sep 30 '19 at 12:20
But it changes the meaning of text in some languages: stripAccents("йод,ëлка,wäre") //иод,елка,ware. How can I remove only acute accents, or any selected set of diacritics? – KursikS Oct 08 '20 at 11:15
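
One possible way to address the last comment is to delete only a chosen combining mark from the decomposed string and then recompose with NFC, so that every other diacritic is folded back into its letter. This is a sketch under that assumption, not something proposed in the thread; `SelectiveStripper`, `stripMark`, and `stripAcute` are hypothetical names.

```java
import java.text.Normalizer;

final class SelectiveStripper {
    // Removes only the given combining mark, leaving all other diacritics in place.
    static String stripMark(String s, char combiningMark) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String filtered = decomposed.replace(String.valueOf(combiningMark), "");
        // Recompose so the remaining marks are folded back into precomposed letters.
        return Normalizer.normalize(filtered, Normalizer.Form.NFC);
    }

    // Convenience wrapper for U+0301 COMBINING ACUTE ACCENT.
    static String stripAcute(String s) {
        return stripMark(s, '\u0301');
    }

    public static void main(String[] args) {
        // Only the acute accent is removed; the breve and the diaeresis are kept.
        System.out.println(stripAcute("caf\u00e9, \u0439\u043e\u0434, w\u00e4re"));
    }
}
```

Note that the final NFC pass also recomposes anything that was already decomposed in the input, so the result is always in NFC regardless of the form the caller passed in.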

Ondrej Bozek · Answer 2 · 2018-06-08T08:16:35.597

112

Maybe the easiest and safest way is to use StringUtils from Apache Commons Lang:

StringUtils.stripAccents(String input)

Removes diacritics (~= accents) from a string. The case will not be altered. For instance, 'à' will be replaced by 'a'. Note that ligatures will be left as is.

StringUtils.stripAccents()

edited Jun 08 '18 at 08:16

answered Mar 03 '13 at 21:23

Ondrej Bozek


Note that it's Apache Commons Lang3, not Commons Lang – Alexey Grigorev Oct 05 '15 at 11:30
It's good, but will not work for 'Ø'. – OlgaMaciaszek Nov 13 '17 at 15:12
The selected answer doesn't eliminate Polish `ł` and `Ł` from the string; this one does. – hopsey Apr 18 '19 at 21:15
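
Letters such as `ł` and `Ø` have no canonical decomposition into a base letter plus a combining mark, so the Normalizer-plus-regex approach leaves them untouched. If you cannot add a library, one option is a small fallback table applied after the normalization step. The sketch below assumes a hand-written, deliberately incomplete mapping; `FallbackStripper` and `fallback` are hypothetical names, and the four entries are examples only.

```java
import java.text.Normalizer;

final class FallbackStripper {
    static String stripAccents(String s) {
        // Step 1: the usual decomposition and removal of combining marks.
        String stripped = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: map letters that NFD cannot decompose.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            out.append(fallback(stripped.charAt(i)));
        }
        return out.toString();
    }

    // Illustrative, incomplete table: extend it for the letters you actually need.
    private static char fallback(char c) {
        switch (c) {
            case '\u0142': return 'l'; // LATIN SMALL LETTER L WITH STROKE
            case '\u0141': return 'L'; // LATIN CAPITAL LETTER L WITH STROKE
            case '\u00f8': return 'o'; // LATIN SMALL LETTER O WITH STROKE
            case '\u00d8': return 'O'; // LATIN CAPITAL LETTER O WITH STROKE
            default: return c;
        }
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("\u0141\u00f3d\u017a, S\u00f8ren")); // Lodz, Soren
    }
}
```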

score 11 · Answer 3 · answered Apr 08 '16 at 13:09

I guess the only difference is that I use a + and not a [] compared to the accepted solution. I think both work, but it's better to have it here as well.

String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
String accentRemoved = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
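
If this runs often, note that `String.replaceAll` compiles the regex on every call. A minimal sketch of the same two steps with the pattern compiled once is shown here; the class name `AccentStripper` and the constant `COMBINING_MARKS` are made up for illustration.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class AccentStripper {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    static String stripAccents(String input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(normalized).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Et \u00e7a sera sa moiti\u00e9.")); // Et ca sera sa moitie.
    }
}
```

With or without the `+`, the output is the same; the quantifier only lets a run of consecutive marks be removed in one match.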

score 7 · Answer 4 · answered Mar 20 '18 at 13:03

For Kotlin:

import java.text.Normalizer

fun stripAccents(s: String): String {
    // Decompose accented characters, then drop the combining marks.
    val decomposed = Normalizer.normalize(s, Normalizer.Form.NFD)
    return Regex("\\p{InCombiningDiacriticalMarks}+").replace(decomposed, "")
}

answered Mar 20 '18 at 13:03

Tristan Richard


Great, but it would be better as an extension function on String. – Jéwôm' Jun 23 '21 at 10:19

hertzsprung · Answer 5 · 2013-03-03T21:06:17.450

5

Assuming you are using Java 6 or newer, you might want to take a look at Normalizer, which can decompose accents, then use a regex to strip the combining accents.

Otherwise, you should be able to achieve the same result using ICU4J.

edited Mar 03 '13 at 21:06

answered Mar 03 '13 at 20:59

hertzsprung

