Sunday, May 10, 2015

Character Normalization

What is Character Normalization?

Although each character has a unique code point, two characters or character sequences that render with the same or similar appearance on screen can be formed from two different sets of code points. From the human point of view they look equivalent, but from the computer point of view they are different because the code points differ. Character normalization is required to reconcile the expectations of both computer and human. Unicode defines two kinds of character equivalence: Canonical Equivalence and Compatibility Equivalence.

Canonical Equivalence

  • Fundamental equivalence
  • Same visual appearance when rendered correctly
System.out.println("é".equals("é")); //false
System.out.println("가".equals("가")); //false

  • Diacritical: é (U+00E9) vs. é (U+0065 U+0301)
  • Hangul: 가 (U+AC00) vs. 가 (U+1100 U+1161)
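To make such pairs compare as equal, both sides can be normalized to the same form first. A minimal sketch using java.text.Normalizer:

String composed = "\u00E9";    // é as a single code point (U+00E9)
String decomposed = "e\u0301"; // e followed by combining acute accent (U+0301)

System.out.println(composed.equals(decomposed)); // false
System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFC)
        .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC))); // true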

Compatibility Equivalence

  • Weak equivalence
  • May be visually distinguishable
  • Formatting information may be lost when a character is replaced by its compatibility equivalent
System.out.println("チ".equals("チ")); //false
System.out.println("fi".equals("fi")); //false

  • Ligature: fi (U+FB01) vs. fi (U+0066 U+0069)
  • Half-width Katakana: チ (U+FF81) vs. チ (U+30C1)
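Likewise, a short sketch showing that NFKC folds the ligature into its compatibility equivalent (note the ligature formatting is lost in the process):

String ligature = "\uFB01"; // fi ligature (U+FB01)
System.out.println(ligature.equals("fi")); // false
// NFKC replaces the ligature with the plain letters f and i
System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFKC).equals("fi")); // true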

Normalization Forms

Unicode defines four normalization forms to preserve equality when the cases mentioned above occur.

Normalization Form D (NFD) - Canonical Decomposition


import java.text.Normalizer;

public static void main(String[] args) {
    String str = "é가";

    // Code points before NFD: [e9] [ac00]
    printCodePoint(str);

    // Code points after NFD: [65] [301] [1100] [1161]
    printCodePoint(Normalizer.normalize(str, Normalizer.Form.NFD));
}

// Prints the UTF-16 code units of the string in hex; all characters in
// these examples are in the BMP, so each unit is also a code point.
static void printCodePoint(String str) {
    for (char c : str.toCharArray()) {
        System.out.format("%x ", (int) c);
    }
    System.out.println();
}

Normalization Form C (NFC) - Canonical Decomposition followed by Canonical Composition


String str = "é가e̊";
// Code points before NFC: [65 301] [1100 1161] [65 30a]
printCodePoint(str);

// Code points after NFC: [e9] [ac00] [65 30a]
printCodePoint(Normalizer.normalize(str, Normalizer.Form.NFC));

Note: e̊ does not have a precomposed form, hence it stays decomposed.

Normalization Form KD (NFKD) - Compatibility Decomposition


String str = "é가fiチ";
// Code points before NFKD: [e9] [ac00] [fb01] [ff81]
printCodePoint(str);

// Code points after NFKD: [65 301] [1100 1161] [66 69] [30c1]
printCodePoint(Normalizer.normalize(str, Normalizer.Form.NFKD));

Note: The description given by Unicode might be insufficient, because besides compatibility decomposition, canonical decomposition is also applied.

Normalization Form KC (NFKC) - Compatibility Decomposition followed by Canonical Composition


String str = "é가fiチ";
// Code points before NFKC: [65 301] [1100 1161] [fb01] [ff81]
printCodePoint(str);

// Code points after NFKC: [e9] [ac00] [66 69] [30c1]
printCodePoint(Normalizer.normalize(str, Normalizer.Form.NFKC));

Impact of Non-normalized Text

You may wonder why combining sequences are allowed at all when a single character could display exactly the same, for example "é". Well, the acute is an accent indicator and does not change the meaning of the character "e". Therefore, when searching for "e" in a text that contains "é" as a combining sequence, someone may expect to find the "e", even though it carries the acute accent. If the text instead uses the single precomposed character "é", that search for "e" will not find it.
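A minimal sketch of the search behavior described above, with the string literals written as explicit escapes for clarity:

String decomposed = "cafe\u0301"; // café, é as a combining sequence
String composed = "caf\u00E9";    // café, é as a single code point

System.out.println(decomposed.indexOf("e")); // 3 - the base letter "e" is found
System.out.println(composed.indexOf("e"));   // -1 - no bare "e" exists in the text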

When using the single character ("é"), we do not need to worry when applying character counting, truncation, deletion, insertion, and so on to the text. But when using the combining sequence, we need to handle those text operations properly. Below is a code sample for character count.

String text = "a\u0065\u0301iou"; // text: aéiou. User-perceived length is 5.
// Misleading output: the combining accent is counted separately
System.out.println(text.length()); //output: 6
System.out.println(text.codePointCount(0, text.length())); //output: 6

if (!Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
    String normalized = Normalizer.normalize(text, Normalizer.Form.NFC);
    // Valid output
    System.out.println(normalized.length()); //output: 5
}
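Note that NFC only helps when a precomposed character exists; as shown earlier, e̊ has none and stays decomposed, so length() would still over-count it. A hedged alternative sketch using java.text.BreakIterator, which counts user-perceived characters in either form:

import java.text.BreakIterator;

// Counts user-perceived characters (grapheme clusters), regardless of
// whether the text is normalized or has a precomposed form at all.
static int graphemeCount(String text) {
    BreakIterator boundary = BreakIterator.getCharacterInstance();
    boundary.setText(text);
    int count = 0;
    while (boundary.next() != BreakIterator.DONE) {
        count++;
    }
    return count;
}

// graphemeCount("a\u0065\u0301iou") returns 5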

Another program demonstrating text truncation.

public static void main(String[] args) {
    String text = "a\u0065\u0301iou"; //text: aéiou
    printSubString(text, 1); // When the starting index falls on a normal character.
    printSubString(text, 2); // When the starting index falls on a combining character.
}

static void printSubString(String text, int startingIndex) {
    System.out.println(text);
    // This could give a misleading result
    System.out.println(text.substring(startingIndex));

    if (!Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFC);
        System.out.println(normalized.substring(startingIndex));
    }
    System.out.println();
}

Output:
aéiou
éiou
éiou

aéiou
́iou
iou

In short, if combining sequences are a requirement, you have to take care of those text operations in order to avoid incorrect results.
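One possible way to guard truncation, sketched with the same BreakIterator approach as above (the safeSubstring helper is hypothetical, not part of any standard API):

import java.text.BreakIterator;

// Hypothetical helper: moves the starting index forward to the nearest
// grapheme boundary, so a combining mark is never cut off from its base.
static String safeSubstring(String text, int start) {
    BreakIterator boundary = BreakIterator.getCharacterInstance();
    boundary.setText(text);
    if (!boundary.isBoundary(start)) {
        start = boundary.following(start);
    }
    return text.substring(start);
}

// safeSubstring("a\u0065\u0301iou", 2) returns "iou" instead of "́iou"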

Related topics:
Character Encoding Terminologies
Unicode Support in Java
Encoding and Decoding
Endianness and Byte Order Mark (BOM)
Surrogate Characters Mechanism
Unicode Regular Expression
Text Collation

References:
http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html
http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
http://www.unicode.org/reports/tr15/tr15-23.html#Specification
http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html
