What does .NET's String.Normalize do?

Question

The MSDN article on String.Normalize states simply:

Returns a new string whose binary representation is in a particular Unicode normalization form.

It also sometimes refers to a "Unicode normalization form C."

I'm just wondering, what does that mean? How is this function useful in real-life situations?

+1 nice question, curious about that myself. – Adam Houldsworth Jul 20 '10 at 08:22

Hans Kesting · Answer 1 · 2020-07-16T12:21:23.733

One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.

For instance, an "à" can be codepoint 224 ("Latin small letter A with grave"), or codepoint 97 ("Latin small letter A") followed by codepoint 768 ("Combining grave accent"). A char-by-char comparison would see these as different. Normalising both strings to the same form first lets the comparison succeed.
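To make that concrete, here is a minimal sketch of the comparison. The class and member names (NormalizationComparison, OrdinalEquals, NormalizedEquals) are made up for illustration; only string.Normalize, string.Equals and NormalizationForm come from the framework. Form C is an arbitrary choice here; any form works as long as both sides use the same one.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class NormalizationComparison
{
    // "à" as a single precomposed code point (U+00E0, decimal 224).
    public const string Composed = "\u00E0";

    // "à" as "a" (U+0061, decimal 97) followed by the
    // combining grave accent (U+0300, decimal 768).
    public const string Decomposed = "a\u0300";

    // Ordinal comparison looks only at the raw UTF-16 code units.
    public static bool OrdinalEquals(string left, string right)
    {
        return string.Equals(left, right, StringComparison.Ordinal);
    }

    // Brings both sides to the same normalization form (form C here)
    // before doing the same ordinal comparison.
    public static bool NormalizedEquals(string left, string right)
    {
        return string.Equals(
            left.Normalize(NormalizationForm.FormC),
            right.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }
}
```

With these definitions, OrdinalEquals(Composed, Decomposed) is false, while NormalizedEquals(Composed, Decomposed) is true.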

A side-effect is that this makes it possible to easily create a "remove accents" method.

using System.Globalization;
using System.Linq;
using System.Text;

public static string RemoveAccents(string input)
{
    // Normalizing to form D splits accented letters into letter + accents.
    string decomposed = input.Normalize(NormalizationForm.FormD);

    // Drop those accents (and any other non-spacing marks),
    // then create a new string from the remaining chars.
    return new string(decomposed
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray());
}

score 54 · Accepted Answer · answered Jul 20 '10 at 08:22

It makes sure that Unicode strings can be compared for equality, even if they represent the same text with different code point sequences.

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
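The "specified order" part can be seen with two combining marks on one letter. The sketch below is only an illustration: the class name CombiningMarkOrder and its members are hypothetical, and the sample uses "a" with a combining acute accent (U+0301) and a combining dot below (U+0323) typed in both orders.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class CombiningMarkOrder
{
    // "a" + combining acute accent (U+0301) + combining dot below (U+0323).
    public const string AcuteThenDot = "a\u0301\u0323";

    // The same letter and marks, with the two marks swapped.
    public const string DotThenAcute = "a\u0323\u0301";

    // Formats each UTF-16 code unit as four hex digits, separated by spaces.
    public static string ToHex(string text)
    {
        return string.Join(" ", text.Select(c => ((int)c).ToString("X4")));
    }

    // Normalizes to form D, then shows the resulting code units.
    public static string DecomposedHex(string text)
    {
        return ToHex(text.Normalize(NormalizationForm.FormD));
    }
}
```

The two raw strings differ, but DecomposedHex returns "0061 0323 0301" for both, because normalization sorts the marks into their canonical order. After that, a binary (ordinal) comparison treats them as equal.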

Excellent answer. Provided link is great! – GeReV Jul 20 '10 at 08:54

score 6 · Answer 3 · answered Jul 20 '10 at 08:33

In Unicode, a (composed) character can be represented either by a single code point or by a sequence of code points consisting of the base character and its accents.

Wikipedia gives the example of Vietnamese ế (U+1EBF) and its decomposed sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

string.Normalize() converts a string to one of the four Unicode normalization forms (C, D, KC and KD).
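As a rough sketch of what the forms do, the helper below prints the code units a string has after normalization. The class name NormalizationForms and its methods are invented for this example; the NormalizationForm enum values (FormC, FormD, FormKC, FormKD) are the framework's own.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class NormalizationForms
{
    // Formats each UTF-16 code unit as four hex digits, separated by spaces.
    public static string ToHex(string text)
    {
        return string.Join(" ", text.Select(c => ((int)c).ToString("X4")));
    }

    // Normalizes the text to the requested form and shows the result as hex.
    public static string Describe(string text, NormalizationForm form)
    {
        return ToHex(text.Normalize(form));
    }
}
```

For "\u1EBF" (ế), Describe returns "1EBF" with FormC and "0065 0302 0301" with FormD. The K forms additionally replace compatibility characters: for "\uFB01" (the "fi" ligature), FormC and FormD leave "FB01" unchanged, while FormKC and FormKD both give "0066 0069", a plain "f" followed by "i".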

score 5 · Answer 4 · answered Jul 20 '10 at 08:22

This link has a good explanation:

http://unicode.org/reports/tr15/#Norm_Forms

From what I can surmise, it's so you can compare two Unicode strings for equality.

answered Jul 20 '10 at 08:22

Adam Houldsworth
