What is Character Normalization?
Each character has a unique code point, but two characters or character sequences that render with the same or a similar appearance on screen can be built from different sets of code points. To a human reader they look equivalent; to the computer they are different strings, because the code points differ. Character normalization reconciles the two views. Unicode defines two kinds of character equivalence: canonical equivalence and compatibility equivalence.
Canonical Equivalence
- Fundamental equivalence
- Gives the same visual appearance when rendered correctly
System.out.println("\u00E9".equals("e\u0301"));      // false: é vs. e + combining acute accent
System.out.println("\uAC00".equals("\u1100\u1161")); // false: 가 vs. its two conjoining jamo
Type | Precomposed | Decomposed |
---|---|---|
Diacritical | é (U+00E9) | é (U+0065 U+0301) |
Hangul | 가 (U+AC00) | 가 (U+1100 U+1161) |
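One way to make such comparisons behave as a reader would expect is to normalize both sides to the same form before calling equals. The sketch below assumes NFC as the common form; the class name CanonicalEquals and the helper canonicallyEqual are made up for illustration.

```java
import java.text.Normalizer;

public class CanonicalEquals {
    // Hypothetical helper: compares two strings after bringing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // é as one code point
        String decomposed = "e\u0301"; // e followed by the combining acute accent
        System.out.println(precomposed.equals(decomposed));             // false
        System.out.println(canonicallyEqual(precomposed, decomposed));  // true
        System.out.println(canonicallyEqual("\uAC00", "\u1100\u1161")); // true
    }
}
```

Normalizing both arguments, rather than assuming one of them is already normalized, keeps the helper safe to call on text from any source.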
Compatibility Equivalence
- Weaker equivalence
- The two forms may be visually distinguishable
- Formatting information may be lost when a character is replaced by its compatibility equivalent
System.out.println("\uFF81".equals("\u30C1")); // false: half-width ﾁ vs. full-width チ
System.out.println("\uFB01".equals("fi"));     // false: ﬁ ligature vs. the letters f and i
Type | Compatibility character | Compatibility equivalent |
---|---|---|
Ligature | ﬁ (U+FB01) | fi (U+0066 U+0069) |
Half-width Katakana | ﾁ (U+FF81) | チ (U+30C1) |
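The difference between the two equivalences shows up when the same input is passed through a canonical form and a compatibility form. The following is a minimal sketch, assuming NFKC as the folding form; the class name CompatibilityFold and the helper fold are illustrative only.

```java
import java.text.Normalizer;

public class CompatibilityFold {
    // Hypothetical helper: folds compatibility variants using NFKC.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01";  // ﬁ ligature
        String halfWidth = "\uFF81"; // half-width katakana chi
        String squared = "x\u00B2";  // x followed by superscript two

        // NFC only handles canonical equivalence, so the ligature is left alone.
        System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFC).equals("fi")); // false
        System.out.println(fold(ligature).equals("fi"));      // true
        System.out.println(fold(halfWidth).equals("\u30C1")); // true
        // Formatting is lost: the superscript becomes a plain digit.
        System.out.println(fold(squared));                    // x2
    }
}
```

The last line illustrates the loss of formatting information: once folded, the text no longer records that the 2 was a superscript.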
Normalization Forms
Unicode defines four normalization forms so that equivalent text can be brought to a single representation and compared reliably in the cases described above.
Normalization Form D (NFD) - Canonical Decomposition
import java.text.Normalizer;

public class NfdExample {
    public static void main(String[] args) {
        String str = "\u00E9\uAC00"; // é and 가, both precomposed
        // Code points before NFD: [e9] [ac00]
        printCodePoint(str);
        // Code points after NFD: [65 301] [1100 1161]
        printCodePoint(Normalizer.normalize(str, Normalizer.Form.NFD));
    }

    // Prints each UTF-16 code unit in hex; for the BMP characters used here
    // these are the same as the code points.
    static void printCodePoint(String str) {
        for (char c : str.toCharArray()) {
            System.out.format("%x ", (int) c);
        }
        System.out.println();
    }
}
Normalization Form C (NFC) - Canonical Decomposition followed by Canonical Composition
// e + U+0301, the jamo U+1100 U+1161, and e + U+030A (combining ring above)
String str = "e\u0301\u1100\u1161e\u030A";
// Code points before NFC: [65 301] [1100 1161] [65 30a]
printCodePoint(str);
// Code points after NFC: [e9] [ac00] [65 30a]
printCodePoint(Normalizer.normalize(str, Normalizer.Form.NFC));
Note: e̊ (U+0065 U+030A) has no precomposed form, so it stays decomposed.
Normalization Form KD (NFKD) - Compatibility Decomposition
// Precomposed é and 가, the ﬁ ligature, and half-width katakana ﾁ
String str = "\u00E9\uAC00\uFB01\uFF81";
// Code points before NFKD: [e9] [ac00] [fb01] [ff81]
printCodePoint(str);
// Code points after NFKD: [65 301] [1100 1161] [66 69] [30c1]
printCodePoint(Normalizer.normalize(str, Normalizer.Form.NFKD));
Note: The name given by Unicode understates what happens. NFKD does not apply compatibility decomposition alone; canonical decomposition is applied as well, which is why é and 가 are also decomposed here.
Normalization Form KC (NFKC) - Compatibility Decomposition followed by Canonical Composition
// Decomposed é and 가, the ﬁ ligature, and half-width katakana ﾁ
String str = "e\u0301\u1100\u1161\uFB01\uFF81";
// Code points before NFKC: [65 301] [1100 1161] [fb01] [ff81]
printCodePoint(str);
// Code points after NFKC: [e9] [ac00] [66 69] [30c1]
printCodePoint(Normalizer.normalize(str, Normalizer.Form.NFKC));
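To see the four forms side by side, the same input can be run through each of them. This is only a sketch: the class name AllForms and the helper hex are hypothetical, and the input is chosen so that it contains one canonically decomposable character and one compatibility character.

```java
import java.text.Normalizer;

public class AllForms {
    // Hypothetical helper: renders each UTF-16 code unit as hex, separated by spaces.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "\u00E9\uFB01"; // precomposed é followed by the ﬁ ligature
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + ": " + hex(Normalizer.normalize(str, form)));
        }
    }
}
```

For this input, NFD gives 65 301 fb01, NFC gives e9 fb01, NFKD gives 65 301 66 69 and NFKC gives e9 66 69. Only the two K forms touch the ligature, and only the two C forms leave é as a single code point.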
Impact of Non-normalized text
When é is stored as a single code point, operations such as character counting, truncation, deletion and insertion behave as expected. When it is stored as a combining sequence, any custom text operation has to treat the sequence as a unit. The code sample below counts characters.
String text = "a\u0065\u0301iou"; // text: aéiou, which a reader sees as 5 characters
// Misleading output: the combining accent is counted separately
System.out.println(text.length()); // output: 6
System.out.println(text.codePointCount(0, text.length())); // output: 6
if (!Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
    String normalized = Normalizer.normalize(text, Normalizer.Form.NFC);
    // Valid output
    System.out.println(normalized.length()); // output: 5
}
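Normalizing to NFC fixes the count only when a precomposed form exists. For sequences without one, a possible alternative is to count character boundaries with java.text.BreakIterator instead of code units. The sketch below is one way to do that; the class name GraphemeCount and the helper countUserCharacters are assumptions made for the example.

```java
import java.text.BreakIterator;

public class GraphemeCount {
    // Hypothetical helper: counts user-perceived characters by walking
    // the character boundaries reported by BreakIterator.
    static int countUserCharacters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a\u0065\u0301iou"; // aéiou with a decomposed é
        System.out.println(text.length());             // 6
        System.out.println(countUserCharacters(text)); // 5
    }
}
```

This approach leaves the stored text unchanged, which can matter when the original code point sequence has to be preserved.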
The next program shows the same problem with text truncation.
import java.text.Normalizer;

public class TruncationExample {
    public static void main(String[] args) {
        String text = "a\u0065\u0301iou"; // text: aéiou
        printSubString(text, 1); // starting index falls on a base character
        printSubString(text, 2); // starting index falls on the combining character
    }

    static void printSubString(String text, int startingIndex) {
        System.out.println(text);
        // This can give a misleading result
        System.out.println(text.substring(startingIndex));
        if (!Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            String normalized = Normalizer.normalize(text, Normalizer.Form.NFC);
            System.out.println(normalized.substring(startingIndex));
        }
        System.out.println();
    }
}
Output:
aéiou
éiou
éiou
aéiou
́iou
iou
In short, if your text may contain combining sequences, every text operation has to account for them in order to avoid incorrect results.
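When the text cannot be normalized first, truncation can instead be aligned to character boundaries. The following is a rough sketch under that assumption, using java.text.BreakIterator; the class name SafeSubstring and the helper substringAtBoundary are hypothetical, and moving the start index backwards (rather than forwards) is a design choice made for this example.

```java
import java.text.BreakIterator;

public class SafeSubstring {
    // Hypothetical helper: like substring(start), but moves start back to the
    // nearest character boundary so a combining mark is never split from its base.
    static String substringAtBoundary(String text, int start) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int safeStart = it.isBoundary(start) ? start : it.preceding(start);
        return text.substring(safeStart);
    }

    public static void main(String[] args) {
        String text = "a\u0065\u0301iou"; // aéiou with a decomposed é
        // Index 2 points at the combining accent; the helper backs up to index 1.
        System.out.println(substringAtBoundary(text, 2).equals("e\u0301iou")); // true
        System.out.println(substringAtBoundary(text, 3));                      // iou
    }
}
```

Backing up keeps the whole é in the result; a caller that prefers to drop the partially selected character could use following(start) in place of preceding(start).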
Related topics:
Character Encoding Terminologies
Unicode Support in Java
Encoding and Decoding
Endianness and Byte Order Mark (BOM)
Surrogate Characters Mechanism
Unicode Regular Expression
Text Collation
References:
http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html
http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
http://www.unicode.org/reports/tr15/tr15-23.html#Specification
http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html