Sunday, May 10, 2015

Surrogate Characters Mechanism

What is the Surrogate Character Mechanism?

When Unicode was first introduced, it had a 16-bit code space with 65,536 code points in total. This was considered enough for its original goal, which was to cover all characters of the modern languages around the world. When Unicode 2.0 was published, the code space was expanded to approximately 21 bits with 1,114,112 code points, to cover characters that were never intended for Unicode before. This evolution split Unicode into 2 major ranges, Basic (U+0000 - U+FFFF) and Supplementary (U+10000 - U+10FFFF), where the Supplementary range is further divided into different planes. The only plane in the Basic range, called the Basic Multilingual Plane (BMP), keeps the code points of the original code space. The Supplementary range, on the other hand, consists of the Supplementary Multilingual Plane (SMP), the Supplementary Ideographic Plane (SIP), unassigned planes, the Supplementary Special-purpose Plane (SSP) and the Supplementary Private Use Area (SPUA).

Because 16-bit code units remained in use, Unicode introduced the surrogate character mechanism, which allows a 16-bit encoding scheme such as UTF-16 to encode and decode supplementary code points. Unicode reserves 2 code point areas in the BMP for this purpose:

Code point area      Name
U+D800 – U+DBFF      High Surrogates
U+DC00 – U+DFFF      Low Surrogates

A high surrogate code point followed by a low surrogate code point maps to one supplementary code point. For example,

U+D800 U+DC00 = U+10000
U+D800 U+DFFF = U+103FF
U+DBFF U+DC00 = U+10FC00
U+DBFF U+DFFF = U+10FFFF

An isolated high surrogate or low surrogate code point has no meaning and does not map to any character.
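The mapping in the examples above is plain arithmetic: the high surrogate carries the upper 10 bits and the low surrogate the lower 10 bits of the code point minus 0x10000. Below is a minimal sketch of that calculation. The class and method names (SurrogateMath, combine, split) are made up for illustration, and the code assumes its inputs are already valid surrogates or supplementary code points.

```java
public class SurrogateMath {

    // Combine a high and a low surrogate into a supplementary code point.
    // Assumes high is in U+D800..U+DBFF and low is in U+DC00..U+DFFF.
    public static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Split a supplementary code point (U+10000..U+10FFFF) into its surrogate pair.
    public static char[] split(int codePoint) {
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        System.out.printf("U+%X%n", combine('\uD800', '\uDC00')); // U+10000
        System.out.printf("U+%X%n", combine('\uDBFF', '\uDFFF')); // U+10FFFF
        char[] pair = split(0x1D160);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD60
    }
}
```

In real code there is no need to do this by hand; the sketch only shows what happens behind the scenes.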

Impact on the Java primitive char type

A char primitive in Java is 16 bits wide, which cannot hold a supplementary code point. Hence, Java also adopted the surrogate character mechanism: it uses 2 chars for a surrogate pair, 1 for the high surrogate and 1 for the low surrogate, to represent a character from the Supplementary range.

/** Valid, but the Java source file must be saved
 * (and compiled) in a Unicode encoding such as UTF-8. **/
System.out.println("𝅘𝅥𝅮"); // U+1D160

/** Valid. High surrogate character is followed
 * by low surrogate character in char array. **/
System.out.println(new char[]{'\uD834', '\uDD60'}); 

/** Valid. High surrogate character is followed
 * by low surrogate character in String. **/
System.out.println("\uD834\uDD60"); 

/** Valid. Supplementary codepoint is converted into
 * surrogate pair in char array. **/
System.out.println(Character.toChars(0x1D160)); 

/** Invalid. Causes a compilation error, because a char
 * literal cannot hold a supplementary character. **/
//System.out.println('𝅘𝅥𝅮'); 

/** Invalid. Output: ᴖ0. A Unicode escape sequence takes
 * only 4 hex digits, so this is U+1D16 followed by '0'. **/
System.out.println("\u1D160"); 

/** Invalid. Output: ??. The high and low surrogate
 * characters are in reversed order in this pair. **/
System.out.println("\uDD60\uD834"); 

/** Invalid. Output: ? ?. The surrogate pair is split
 * by a space character in between. **/
System.out.println("\uD834 \uDD60"); 

java.lang.Character provides an API for dealing with surrogate characters:

isHighSurrogate(char ch)
isLowSurrogate(char ch)
isSurrogate(char ch)
isSurrogatePair(char high, char low)
highSurrogate(int codePoint)
lowSurrogate(int codePoint)
toCodePoint(char high, char low)
and others..
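A short sketch of how the methods listed above fit together. The class name SurrogateApiDemo and the classify helper are made up for illustration; everything else is java.lang.Character (highSurrogate and lowSurrogate need Java 7 or later).

```java
public class SurrogateApiDemo {

    // Describe the surrogate role of a single 16-bit char.
    public static String classify(char ch) {
        if (Character.isHighSurrogate(ch)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(ch)) {
            return "low surrogate";
        }
        return "not a surrogate";
    }

    public static void main(String[] args) {
        int codePoint = 0x1D160;
        char high = Character.highSurrogate(codePoint); // \uD834
        char low = Character.lowSurrogate(codePoint);   // \uDD60

        System.out.println(classify(high));                       // high surrogate
        System.out.println(classify(low));                        // low surrogate
        System.out.println(classify('a'));                        // not a surrogate
        System.out.println(Character.isSurrogatePair(high, low)); // true
        System.out.println(Character.isSurrogatePair(low, high)); // false, wrong order
        System.out.println(
                Integer.toHexString(Character.toCodePoint(high, low))); // 1d160
    }
}
```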

Before Unicode 2.0, it was safe to say that 1 char in Java is 1 Unicode code point. Since Unicode 2.0 was published, that statement still holds for code points in the Basic range, but not for code points in the Supplementary range. Because of this, prefer methods that take a code point (int) as parameter over methods that take a char. The methods below can only check characters from the BMP, because they take a single char:

Character.isLetter(char ch)
Character.isDigit(char ch)
Character.getNumericValue(char ch)
Character.getType(char ch)
and others..

The methods below cover all Unicode characters:

Character.isLetter(int codePoint)
Character.isDigit(int codePoint)
Character.getNumericValue(int codePoint)
Character.getType(int codePoint)
and others..
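To see the difference between the two groups of methods, here is a minimal sketch that counts the letters in a string both ways. The class and method names are made up for illustration. The test string holds 'A' plus U+10400 (DESERET CAPITAL LETTER LONG I), a letter from the Supplementary range that is written as the surrogate pair \uD801\uDC00.

```java
public class CodePointLetters {

    // Checks each 16-bit char on its own, so both halves of a
    // surrogate pair are tested separately and never count as a letter.
    public static int countLettersByChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Walks the string one code point at a time, stepping over
    // 2 chars whenever the code point is a supplementary one.
    public static int countLettersByCodePoint(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                count++;
            }
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "A\uD801\uDC00";
        System.out.println(countLettersByChar(text));      // 1
        System.out.println(countLettersByCodePoint(text)); // 2
    }
}
```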

Impact on getting the length of a String

The same applies when we want to know the number of characters in a String or a char array. Simply reading length() or array.length can be misleading, because every surrogate pair counts as 2. Instead, use a code point based method to get the accurate count.

String text = "a©₂\uD834\uDD60"; // text is 4 characters long
System.out.println(text.length()); //output: 5
System.out.println(text.codePointCount(0, text.length())); //output: 4
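For a char array, or as an alternative for a String, the count can be taken as in the sketch below. The class and method names are made up for illustration, and the stream variant assumes Java 8 or later.

```java
public class CodePointLength {

    // Number of code points in a char array (a surrogate pair counts as 1).
    public static int lengthInCodePoints(char[] chars) {
        return Character.codePointCount(chars, 0, chars.length);
    }

    // Same count for a String, using the code point stream (Java 8+).
    public static long lengthViaStream(String text) {
        return text.codePoints().count();
    }

    public static void main(String[] args) {
        // 'a', copyright sign, subscript two, then one surrogate pair
        String text = "a\u00A9\u2082\uD834\uDD60";
        char[] chars = text.toCharArray();

        System.out.println(chars.length);              // 5
        System.out.println(lengthInCodePoints(chars)); // 4
        System.out.println(lengthViaStream(text));     // 4
    }
}
```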

Impact on String operations

Surrogate pairs also affect inserting, truncating or deleting characters in a String. The program below shows the problem when extracting a substring from a text that contains surrogate characters.

String text = "a\uD834\uDD60s\uD834\uDD60\uD834\uDD60©₂"; // text: a𝅘𝅥𝅮s𝅘𝅥𝅮𝅘𝅥𝅮©₂
int startIndex = 2;
int endIndex = 5;
System.out.println(text.substring(startIndex, endIndex)); //output: ?s?
System.out.println(new StringBuilder(text).substring(startIndex, endIndex)); //output: ?s?

Neither String nor StringBuilder handles this properly, because both work on char indexes and cut through the surrogate pairs. To avoid the issue above, use java.text.BreakIterator to determine the correct positions.

import java.text.BreakIterator;

public class SurrogateSubstring {

    public static void main(String[] args) {
        String text = "a\uD834\uDD60s\uD834\uDD60\uD834\uDD60©₂"; // text: a𝅘𝅥𝅮s𝅘𝅥𝅮𝅘𝅥𝅮©₂
        int startIndex = 2;
        int endIndex = 5;

        BreakIterator charIterator = BreakIterator.getCharacterInstance();
        System.out.println(
                subString(charIterator, text, startIndex, endIndex)); // output: s𝅘𝅥𝅮𝅘𝅥𝅮
    }

    // start and end count whole characters, not 16-bit chars.
    private static String subString(BreakIterator charIterator,
            String target, int start, int end) {
        int realStart = 0;
        int realEnd = 0;
        charIterator.setText(target);
        int boundary = charIterator.first();
        int i = 0; // index of the character that begins at this boundary
        while (boundary != BreakIterator.DONE) {
            if (i == start) {
                realStart = boundary;
            }
            if (i == end) {
                realEnd = boundary;
                break;
            }
            boundary = charIterator.next();
            i++;
        }
        return target.substring(realStart, realEnd);
    }
}

The same approach can be applied to deleting, inserting, and truncating strings that contain surrogate characters. By the way, BreakIterator does not only handle international characters; it also supports word and sentence boundaries.
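As an example of reusing the approach, here is a minimal sketch of truncation with BreakIterator. The class name SafeTruncate and the truncate method are made up for illustration; the idea is simply to walk the character boundaries and cut at the last one that still fits.

```java
import java.text.BreakIterator;

public class SafeTruncate {

    // Keep at most maxCharacters whole characters from the start of text,
    // never cutting through a surrogate pair.
    public static String truncate(String text, int maxCharacters) {
        BreakIterator charIterator = BreakIterator.getCharacterInstance();
        charIterator.setText(text);
        int end = charIterator.first();
        for (int i = 0; i < maxCharacters; i++) {
            int next = charIterator.next();
            if (next == BreakIterator.DONE) {
                break; // text is shorter than maxCharacters
            }
            end = next;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "a\uD834\uDD60s";
        // text.substring(0, 2) would keep only half of the surrogate pair.
        System.out.println(truncate(text, 2).length()); // 3
        System.out.println(truncate("abc", 10));        // abc
    }
}
```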

Related topics:
Character Encoding Terminologies
Unicode Support in Java
Encoding and Decoding
Endianness and Byte Order Mark (BOM)
Unicode Regular Expression
Characters Normalization
Text Collation

References:
http://unicode-table.com/
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates
https://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html
http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html
