Collator - Locale specific texts comparison and sorting
Comparison between texts is one of the common operation that we will do when dealing with texts, especially when sorting a list of texts. java.text.String provides API such as compareTo() to compare two strings, character by character in lexicographical order. Java primitive character supports Unicode character. Comparing 2 character in Java basically is the subtraction between two code points. Read Character Encoding Terminologies to know what is code point. Therefore, it is not locale-specific. Use java.text.Collator in order to compare characters in locale-specific.
// Comparison based on code point. 122 - 246 = -124 System.out.println("z".compareTo("ö")); //Swedish System.out.println(Collator.getInstance(new Locale("sv","SE")).compare("z", "ö")); //Germen System.out.println(Collator.getInstance(new Locale("de","DE")).compare("z", "ö"));
Comparison | Result | Explanation |
---|---|---|
Non locale-specific | -124 | z less than ö |
Swedish locale | -1 | z less than ö |
German locale | 1 | z greater than ö |
Collator strength setting
There are 4 level of strength we could set upon Java Collator and it will behave differently. Refer to the execution result of the following code, it will be more easily to understand the different between each level.
public static void main(String[] args) { compareWithStrength(Collator.PRIMARY); compareWithStrength(Collator.SECONDARY); compareWithStrength(Collator.TERTIARY); // default strength setting compareWithStrength(Collator.IDENTICAL); } static void compareWithStrength(int strength) { Collator collator = Collator.getInstance(); collator.setStrength(strength); System.out.println(collator.compare("e", "f") == 0 ? "Equal" : "Different"); System.out.println(collator.compare("e", "é") == 0 ? "Equal" : "Different"); System.out.println(collator.compare("e", "E") == 0 ? "Equal" : "Different"); System.out.println(collator.compare("u0002", "u0003") == 0 ? "Equal" : "Different"); System.out.println(collator.compare("e", "e") == 0 ? "Equal" : "Different"); System.out.println("-----"); }
The result of PRIMARY strength. Only the characters from the different bases are considered different.
e | f | Different |
e | é | Equal |
e | E | Equal |
U+0002 | U+0003 | Equal |
e | e | Equal |
e | f | Different |
e | é | Different |
e | E | Equal |
U+0002 | U+0003 | Equal |
e | e | Equal |
e | f | Different |
e | é | Different |
e | E | Different |
U+0002 | U+0003 | Equal |
e | e | Equal |
e | f | Different |
e | é | Different |
e | E | Different |
U+0002 | U+0003 | Different |
e | e | Equal |
Collator decomposition setting
We could also determine the way for Collator to handle Characters Normalization by setting the decomposition flag. Again, the execution result of the program below would help us to understand how it works.
public static void main(String[] args) { Collator collator = Collator.getInstance(); // default decomposition setting collator.setDecomposition(Collator.NO_DECOMPOSITION); collator.setStrength(Collator.IDENTICAL); compareWithDecomposition(collator); collator = Collator.getInstance(); collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION); compareWithDecomposition(collator); collator = Collator.getInstance(); collator.setDecomposition(Collator.FULL_DECOMPOSITION); compareWithDecomposition(collator); } static void compareWithDecomposition (Collator collator) { // U+00E9 compare U+0065 U+0301 System.out.println(collator.compare("é", "é") == 0 ? "Equal" : "Different"); // U+FF81 compare U+30C1 System.out.println(collator.compare("チ", "チ") == 0 ? "Equal" : "Different"); System.out.println("-----"); }
The result of NO_DECOMPOSITION flag. This is the default decomposition flag for Collator. Collator with this flag will not apply canonical or compatibility decomposition to any characters during comparison.
é (U+00E9) | é (U+0065 U+0301) | Different | No canonical decomposition applied, but depend on the strength set to the `Collator`, only `IDENTICAL` strength will show they are different because without normalization, both characters are indeed different. |
チ (U+FF81) | チ (U+30C1) | Different | No compatibility decomposition applied, therefore they are 2 different characters. |
é (U+00E9) | é (U+0065 U+0301) | Equal | Canonical decomposition applied, both characters get normalized and become equal. |
チ (U+FF81) | チ (U+30C1) | Different | No compatibility decomposition applied, therefore they are 2 different characters. |
é (U+00E9) | é (U+0065 U+0301) | Equal | Canonical decomposition applied, both character get normalized and become equal. |
チ (U+FF81) | チ (U+30C1) | Equal | Comparability decomposition applied, both character get normalized and become equal. |
RuleBasedCollator - Customize and reset collation rule
RuleBasedCollator allows us to create a customized table-based Collator. It is a concrete class that implements Collator. It could accept rule in text format and process it accordingly. You can find the usage information at RuleBasedCollator API doc. In this section, I will not go into every detail of RuleBasedCollator because they are somehow straight forward, except the Reset rule element. It will be good to give more example to demonstrate how this rule element function.
Reset rule element appear in "&" syntax in the rule and its purpose is to apply modification on top of the original rule. Imagine we only have 5 alphabets and below is the rule as well as the relationship between these characters. In other words, this rule is the sorting order for these characters.
Original rule - < a < b < c < d < e
Be aware that, the whole or some initial substring of the text argument right after the Reset element must already be present in the original rule sequence.
< a < b < c < d < e & e < f - OK because 'e' present in the original rule.
< a < b < c < d < e & ei < f - OK because 'e' of the initial 'ei' present in the original rule.
< a < b < c < d < e & f < e - java.text.ParseException thrown because 'f' never present in the original rule.
Valid modification could appear in different forms as below:
< a < b < c < d < e & c < b - Reset the part of the original rule by swapping character 'b' and 'c'.
< a < b < c < d < e & c < ch - Original rule expansion where 'ch' is injected right after 'c'.
< a < b < c < d < e & b < e - Original rule contraction where 'c' and 'd' has been excluded from the original rule. If 'c' and 'd' appear in the sorting, then it will be after 'e'. E.g. 'e' then 'c' then 'd'. Bear in mind that, the order of 'c' then 'd' is outside of the original rule, which probably fallback to code point based sorting.
You could play around with the listed rules above or new rules by calling the method below. Inspect the result and you sure will get how the Reset works.
static void compareWithRule(String rule) throws ParseException { RuleBasedCollator collator = new RuleBasedCollator(rule); String[] words = {"f", "a", "ch", "e", "d", "b", "ei", "c", "h"}; String tmp; for (int i = 0; i < words.length; i++) { for (int j = i + 1; j < words.length; j++) { if (collator.compare(words[i], words[j]) > 0) { tmp = words[i]; words[i] = words[j]; words[j] = tmp; } } } for (String s : words) { System.out.println(s); } }
CollationKey - Better performance sorting
Using Collator to sort a long list of texts is time-consuming because comparison take place for every character in the texts. In order to improve the performance of sorting process, Collator is able to generates CollationKey objects. Sorting list of 'CollationKey' objects is far faster than sorting list of texts. However, CollationKey generation requires time, therefore, use Collator for comparing texts only once as it is faster than using 'CollationKey'.
public static void main(String[] args) { List<String> texts = new ArrayList<>(); texts.add("león"); texts.add("elefante"); texts.add("soportar"); texts.add("serpiente"); texts.add("leopardo"); List<CollationKey> keys = new ArrayList<>(); Collator collator = Collator.getInstance(new Locale("es")); for (String text : texts) { keys.add(collator.getCollationKey(text)); // generate CollationKey } sortTexts(keys); for (CollationKey key : keys) { System.out.println(key.getSourceString()); // get back the source string } } static void sortTexts(List<CollationKey> keys) { CollationKey tmp; int size = keys.size(); for (int i = 0; i < size; i++) { for (int j = i + 1; j < size; j++) { if (keys.get(i).compareTo(keys.get(j)) > 0) { tmp = keys.get(i); keys.set(i, keys.get(j)); keys.set(j, tmp); } } } }
Result:
elefante
león
leopardo
serpiente
soportar
elefante
león
leopardo
serpiente
soportar
Bear in mind that, while we sort the collation keys, the result we would like to have is sorted texts. Therefore, remember to get back to source text by calling getSourceString() of every 'CollationKey' objects.
Related topics:
Character Encoding Terminologies
Unicode Support in Java
Encoding and Decoding
Endianness and Byte Order Mark (BOM)
Surrogate Characters Mechanism
Unicode Regular Expression
Characters Normalization
Locale Evolution
References:
http://docs.oracle.com/javase/7/docs/api/java/text/RuleBasedCollator.html
http://docs.oracle.com/javase/tutorial/i18n/text/collationintro.html
No comments:
Post a Comment