Sunday, May 10, 2015

Text Collation

Collator - Locale specific texts comparison and sorting

Comparison between texts is one of the common operation that we will do when dealing with texts, especially when sorting a list of texts. java.text.String provides API such as compareTo() to compare two strings, character by character in lexicographical order. Java primitive character supports Unicode character. Comparing 2 character in Java basically is the subtraction between two code points. Read Character Encoding Terminologies to know what is code point. Therefore, it is not locale-specific. Use java.text.Collator in order to compare characters in locale-specific.

// Comparison based on code point. 122 - 246 = -124
System.out.println("z".compareTo("ö"));

//Swedish
System.out.println(Collator.getInstance(new Locale("sv","SE")).compare("z", "ö"));

//Germen
System.out.println(Collator.getInstance(new Locale("de","DE")).compare("z", "ö"));

ComparisonResultExplanation
Non locale-specific-124z less than ö
Swedish locale-1z less than ö
German locale1z greater than ö

Collator strength setting

There are 4 level of strength we could set upon Java Collator and it will behave differently. Refer to the execution result of the following code, it will be more easily to understand the different between each level.

public static void main(String[] args) {       
    compareWithStrength(Collator.PRIMARY);
    compareWithStrength(Collator.SECONDARY);
    compareWithStrength(Collator.TERTIARY); // default strength setting
    compareWithStrength(Collator.IDENTICAL);
}

static void compareWithStrength(int strength) {
    Collator collator = Collator.getInstance();
    collator.setStrength(strength);
    System.out.println(collator.compare("e", "f") == 0 ? "Equal" : "Different");
    System.out.println(collator.compare("e", "é") == 0 ? "Equal" : "Different");
    System.out.println(collator.compare("e", "E") == 0 ? "Equal" : "Different");
    System.out.println(collator.compare("u0002", "u0003") == 0 ? "Equal" : "Different");
    System.out.println(collator.compare("e", "e") == 0 ? "Equal" : "Different");
    System.out.println("-----");
}

The result of PRIMARY strength. Only the characters from the different bases are considered different.
efDifferent
eéEqual
eEEqual
U+0002U+0003Equal
eeEqual
The result of SECONDARY strength. On top of PRIMARY strength, characters from the same base but different accent are considered different.
efDifferent
eéDifferent
eEEqual
U+0002U+0003Equal
eeEqual
The result of TERTIARY strength. On top of SECONDARY strength, characters from the same base but different case are considered different. This strength level is the default strength of Collator.
efDifferent
eéDifferent
eEDifferent
U+0002U+0003Equal
eeEqual
The result of IDENTICAL strength. On top of TERTIARY strength, different control/non-display characters are considered different.
efDifferent
eéDifferent
eEDifferent
U+0002U+0003Different
eeEqual

Collator decomposition setting

We could also determine the way for Collator to handle Characters Normalization by setting the decomposition flag. Again, the execution result of the program below would help us to understand how it works.

public static void main(String[] args) {   
    Collator collator = Collator.getInstance();
    // default decomposition setting
    collator.setDecomposition(Collator.NO_DECOMPOSITION);
    collator.setStrength(Collator.IDENTICAL);
    compareWithDecomposition(collator); 

    collator = Collator.getInstance();
    collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
    compareWithDecomposition(collator);

    collator = Collator.getInstance();
    collator.setDecomposition(Collator.FULL_DECOMPOSITION);
    compareWithDecomposition(collator);
}

static void compareWithDecomposition (Collator collator) {
    // U+00E9 compare U+0065 U+0301
    System.out.println(collator.compare("é", "é") == 0 ? "Equal" : "Different"); 

    // U+FF81 compare U+30C1
    System.out.println(collator.compare("チ", "チ") == 0 ? "Equal" : "Different"); 
    System.out.println("-----");
}

The result of NO_DECOMPOSITION flag. This is the default decomposition flag for Collator. Collator with this flag will not apply canonical or compatibility decomposition to any characters during comparison.
é (U+00E9)é  (U+0065 U+0301)DifferentNo canonical decomposition applied, but depend on the strength set to the `Collator`,
only `IDENTICAL` strength will show they are different because without normalization,
both characters are indeed different. 
チ (U+FF81)チ (U+30C1)DifferentNo compatibility decomposition applied, therefore they are 2 different characters.
The result of CANONICAL_DECOMPOSITION flag. This flag will enable Collator to normalizes characters in Normalization Form D approach, which only support canonical decomposition.
é (U+00E9)é  (U+0065 U+0301)EqualCanonical decomposition applied, both characters get normalized and become equal.
チ (U+FF81)チ (U+30C1)DifferentNo compatibility decomposition applied, therefore they are 2 different characters.
The result of FULL_DECOMPOSITION flag. This flag will enable Collator to normalizes characters in Normalization Form KD approach. In other words, both compatibility and canonical decomposition will be applied onto characters whenever feasible.
é (U+00E9)é  (U+0065 U+0301)EqualCanonical decomposition applied, both character get normalized and become equal.
チ (U+FF81)チ (U+30C1)EqualComparability decomposition applied, both character get normalized and become equal.

RuleBasedCollator - Customize and reset collation rule

RuleBasedCollator allows us to create a customized table-based Collator. It is a concrete class that implements Collator. It could accept rule in text format and process it accordingly. You can find the usage information at RuleBasedCollator API doc. In this section, I will not go into every detail of RuleBasedCollator because they are somehow straight forward, except the Reset rule element. It will be good to give more example to demonstrate how this rule element function.

Reset rule element appear in "&" syntax in the rule and its purpose is to apply modification on top of the original rule. Imagine we only have 5 alphabets and below is the rule as well as the relationship between these characters. In other words, this rule is the sorting order for these characters.

Original rule - < a < b < c < d < e

Be aware that, the whole or some initial substring of the text argument right after the Reset element must already be present in the original rule sequence.

< a < b < c < d < e & e < f - OK because 'e' present in the original rule.
< a < b < c < d < e & ei < f - OK because 'e' of the initial 'ei' present in the original rule.
< a < b < c < d < e & f < e - java.text.ParseException thrown because 'f' never present in the original rule.

Valid modification could appear in different forms as below:

< a < b < c < d < e & c < b - Reset the part of the original rule by swapping character 'b' and 'c'.
< a < b < c < d < e & c < ch - Original rule expansion where 'ch' is injected right after 'c'.
< a < b < c < d < e & b < e - Original rule contraction where 'c' and 'd' has been excluded from the original rule. If 'c' and 'd' appear in the sorting, then it will be after 'e'. E.g. 'e' then 'c' then 'd'. Bear in mind that, the order of 'c' then 'd' is outside of the original rule, which probably fallback to code point based sorting.

You could play around with the listed rules above or new rules by calling the method below. Inspect the result and you sure will get how the Reset works.

static void compareWithRule(String rule) throws ParseException {
    RuleBasedCollator collator = new RuleBasedCollator(rule);
    String[] words = {"f", "a", "ch", "e", "d", "b", "ei", "c", "h"};
    String tmp;
    for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
            if (collator.compare(words[i], words[j]) > 0) {
                tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
    }
    for (String s : words) {
        System.out.println(s);
    }
}

CollationKey - Better performance sorting

Using Collator to sort a long list of texts is time-consuming because comparison take place for every character in the texts. In order to improve the performance of sorting process, Collator is able to generates CollationKey objects. Sorting list of 'CollationKey' objects is far faster than sorting list of texts. However, CollationKey generation requires time, therefore, use Collator for comparing texts only once as it is faster than using 'CollationKey'.

public static void main(String[] args) {
    List<String> texts = new ArrayList<>();
    texts.add("león");
    texts.add("elefante");
    texts.add("soportar");
    texts.add("serpiente");
    texts.add("leopardo");

    List<CollationKey> keys = new ArrayList<>();
    Collator collator = Collator.getInstance(new Locale("es"));
    for (String text : texts) {
        keys.add(collator.getCollationKey(text)); // generate CollationKey
    }

    sortTexts(keys);

    for (CollationKey key : keys) {
        System.out.println(key.getSourceString()); // get back the source string
    }
}

static void sortTexts(List<CollationKey> keys) {
    CollationKey tmp;
    int size = keys.size();
    for (int i = 0; i < size; i++) {
        for (int j = i + 1; j < size; j++) {
            if (keys.get(i).compareTo(keys.get(j)) > 0) {
                tmp = keys.get(i);
                keys.set(i, keys.get(j));
                keys.set(j, tmp);
            }
        }
    }
}

Result:
elefante
león
leopardo
serpiente
soportar

Bear in mind that, while we sort the collation keys, the result we would like to have is sorted texts. Therefore, remember to get back to source text by calling getSourceString() of every 'CollationKey' objects.

Related topics:
Character Encoding Terminologies
Unicode Support in Java
Encoding and Decoding
Endianness and Byte Order Mark (BOM)
Surrogate Characters Mechanism
Unicode Regular Expression
Characters Normalization
Locale Evolution

References:
http://docs.oracle.com/javase/7/docs/api/java/text/RuleBasedCollator.html
http://docs.oracle.com/javase/tutorial/i18n/text/collationintro.html

No comments: