Sunday, May 10, 2015

Unicode Support in Java

Java Compiler supports Unicode

A Java source file can be saved with a Unicode character encoding. In other words, we can write non-ASCII characters in a Java source file. We can use non-ASCII characters for String values, comments, variable names, method names, parameter names, class names, and even package names, and the source file will still compile without error, as long as we explicitly tell the compiler which character encoding the source file uses.

package 商业.豪志.样本;

public class 测试 {
   private int การทดสอบ = 1;
   public int テスト(int 테스트) {
       int i = Integer.parseInt("３"); // Note: ３ is FULLWIDTH DIGIT THREE (U+FF13)
       return การทดสอบ + 테스트 + i;
   }

   public static void main(String[] args) {
       测试 测试一 = new 测试();
       int kếtquả = 测试一.テスト(3);
       System.out.println("" + kếtquả); // Guess what is the result?
   }
}
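The file above must then be compiled with the encoding stated explicitly. Assuming the source file is saved as UTF-8, the command looks like this:

javac -encoding UTF-8 测试.java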

However, not all Unicode characters are eligible for naming. Only code points for which Character.isJavaIdentifierStart() (for the first character of an identifier) and Character.isJavaIdentifierPart() (for the remaining characters) return true are valid for naming. Honestly, I doubt the practicality of using Unicode characters for naming, but it is good to know what Java can do.

char c = '❤';
System.out.format("%s isJavaIdentifierStart >> %b, isJavaIdentifierPart >> %b%n", c, Character.isJavaIdentifierStart(c), Character.isJavaIdentifierPart(c));
// output: ❤ isJavaIdentifierStart >> false, isJavaIdentifierPart >> false

Java is built on the Unicode Specification

The Java specification and implementation have complied with the Unicode Standard since Unicode 1.0. Unicode 1.0 was a 16-bit encoding, and Java therefore treats all characters internally as 16-bit Unicode values. Does that mean Java can only deal with Unicode-encoded characters? What happens when a Java application receives a byte sequence encoded with a scheme other than Unicode? While Java internally uses a Unicode-based charset, it can also encode and decode with the non-Unicode encoding schemes listed here. Back to the question: by specifying the correct charset, the byte sequence will be decoded and transformed into 16-bit Unicode characters. The program below proves this.

import java.nio.charset.Charset;

char omega = 'Ω'; // Java's internal UTF-16 character
byte[] cp949bytes = {-91, -40}; // bytes encoding Ω in CP949
byte[] ibm935bytes = {14, 65, 120, 15}; // bytes encoding Ω in IBM935 (0x0E/0x0F are shift-out/shift-in)

int cp949omega = new String(cp949bytes, Charset.forName("Cp949")).codePointAt(0);
int ibm935omega = new String(ibm935bytes, Charset.forName("IBM935")).codePointAt(0);

System.out.println("Java internal code point: " + (int) omega); // output: 937
System.out.println("Decoded CP949 code point: " + (int) cp949omega); // output: 937
System.out.println("Decoded IBM935 code point: " + (int) ibm935omega); // output: 937
System.out.println(omega == cp949omega); // output: true
System.out.println(omega == ibm935omega); // output: true
System.out.println(cp949omega == ibm935omega); // output: true

As you can see, I am using the String constructor to decode the bytes and create a new string object. Internally, it is java.nio.charset.CharsetDecoder that does the decoding job. In the other direction, we can use java.nio.charset.CharsetEncoder to encode characters into a charset other than Unicode.
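For instance, here is a minimal sketch of the encoding direction using String.getBytes(Charset), which delegates to a CharsetEncoder internally; based on the decoding example above, the round trip should reproduce the original CP949 bytes:

byte[] encoded = "Ω".getBytes(Charset.forName("Cp949")); // encode the internal UTF-16 string into CP949
System.out.println(java.util.Arrays.toString(encoded)); // expected output: [-91, -40]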

Unicode character and code point in Java

In the era of Unicode 1.0, one char in Java was equivalent to one Unicode character; however, this is no longer true since Unicode 2.0. Read Surrogate Character Mechanism to find out why. Java classes such as java.lang.Character provide many useful APIs for dealing with Unicode characters. We need to be familiar with terms such as code point, surrogate, and charset in order to know why and how to use them correctly. Read Character Encoding Terminologies.
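As a quick illustration of the gap between char and Unicode character, consider U+1D11E (MUSICAL SYMBOL G CLEF), a code point outside the Basic Multilingual Plane that needs a surrogate pair:

String clef = new String(Character.toChars(0x1D11E)); // two chars form one character
System.out.println(clef.length()); // output: 2 (UTF-16 code units)
System.out.println(clef.codePointCount(0, clef.length())); // output: 1 (one code point)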

From Code Point to Character

A code point in Java can be represented with a Unicode escape sequence or with an integer value in any numeric base. The program below demonstrates different ways to turn the Ω code point into a character, besides the first statement, which prints the Ω character directly.

System.out.println('Ω'); // print the character directly

System.out.println('\u03A9'); // using Unicode escape in char
System.out.println("\u03A9"); // using Unicode escape in String

System.out.println((char)0x03A9); // cast code point (in hexadecimal base) to char
System.out.println((char)937); // cast code point (in decimal base) to char
System.out.println((char)01651); // cast code point (in octal base) to char
System.out.println((char)0b0000001110101001); // cast code point (in binary base) to char

// pass code point (in hexadecimal base) to toChars() method 
System.out.println(Character.toChars(0x03A9)); 
// pass code point (in decimal base) to toChars() method
System.out.println(Character.toChars(937)); 
// pass code point (in octal base) to toChars() method 
System.out.println(Character.toChars(01651)); 
// pass code point (in binary base) to toChars() method
System.out.println(Character.toChars(0b0000001110101001)); 

Writing non-ASCII characters directly
The Java source file must be saved with the correct character encoding, otherwise the non-ASCII characters may be changed to "?" characters and can never be recovered. We must explicitly tell the compiler which character encoding is used, and there is a risk that the source file cannot be compiled by other compilers.

Unicode escape
A Unicode escape starts with "\u" followed by the code point as four hexadecimal digits (16 bits). It is safer than writing the non-ASCII character directly, because the Java source file can stay in basic ASCII encoding and be compiled as usual. Moreover, Java recognizes it in both char and String literals and converts it to the corresponding character accordingly.

Casting code point to char
Internally, each char is a specific integer value (its code point). Hence, casting a code point to the char data type returns the corresponding character. However, this approach only covers characters from the Basic Multilingual Plane. Read Surrogate Character Mechanism to know why. To be more reliable, use the java.lang.Character API to get characters from code points.
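For example, here is a small sketch contrasting the two approaches with a supplementary code point (U+1F600, an emoji picked just for illustration):

int smiley = 0x1F600; // GRINNING FACE, outside the BMP
System.out.println((char) smiley); // wrong: the cast truncates the code point to 16 bits
System.out.println(Character.toChars(smiley)); // correct: produces the surrogate pair for 😀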

From Character to Code Point

Of course, we can also get the code point integer value back from a character. This is done by casting the character to a numeric primitive type. However, make sure you cast to a data type whose positive range covers the full range of the char data type.

public static void main(String[] args) {
    print((int)'Ω'); // output: 937 Ω
    print((short)'Ω'); // output: 937 Ω
    print((byte)'Ω'); // output: -87, because 937 overflows the byte range
}

private static void print(int codepoint) {
     System.out.format("Code point in decimal: %d, hex:%x, character: %s%n",
             codepoint, codepoint, (char)codepoint);
}

As the table below shows, casting a character to the int primitive data type is safe, because int covers the entire char range.

Type   Bits   Bytes   Minimum Range         Maximum Range
char   16     2       0                     2¹⁶-1 = 65535
byte   8      1       -2⁷ = -128            2⁷-1 = 127
short  16     2       -2¹⁵ = -32768         2¹⁵-1 = 32767
int    32     4       -2³¹ = -2147483648    2³¹-1 = 2147483647

Does a Java Application support Unicode?

Since the Java compiler and Java internals support Unicode, does that mean applications written in Java automatically support Unicode? Not unless we handle text properly in Java. It is very common to need to check character properties, for example to verify that a user entered only digit characters for a certain field.

public static boolean isDigit(char c) {
    return c >= '0' && c <= '9';
}

The method above limits the check to just 10 code points. It is good enough for some languages, but not in an internationalization context, because there are many more valid digit characters in other scripts. For example, passing ３, the fullwidth digit 3 (U+FF13), into the method above returns false even though it is a valid digit character. It is easy to overcome this issue by leaving the hard work to the java.lang.Character class: replace the method above with a call to Character.isDigit(char).
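A quick comparison, assuming the isDigit() helper above, shows the difference:

char fullwidthThree = '\uFF13'; // ３, FULLWIDTH DIGIT THREE
System.out.println(isDigit(fullwidthThree)); // output: false, the ASCII-only check rejects it
System.out.println(Character.isDigit(fullwidthThree)); // output: true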

java.lang.Character is a very useful class for handling internationalized text. It provides pairs of overloaded methods to check specific properties of a character.

public static boolean isDigit(char ch);
public static boolean isDigit(int codePoint);

public static boolean isLetter(char ch);
public static boolean isLetter(int codePoint);

public static boolean isUpperCase(char ch);
public static boolean isUpperCase(int codePoint);

// and so on

It is encouraged to use the APIs that take a codePoint instead of a char parameter, because not all characters fit into the 16-bit char data type.
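For example, U+1D7D8 (MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO, picked here for illustration) is a valid digit outside the BMP, so only the int overload can test it correctly:

int doubleStruckZero = 0x1D7D8; // does not fit into a single char
System.out.println(Character.isDigit(doubleStruckZero)); // output: true
// Casting it to char first would truncate the value and test a different character.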

One of the advantages of Unicode is that each character carries a set of properties, so characters can be categorized easily. In Java, Character.getType(char/codePoint) returns the category of a character according to the Unicode Specification. This is a useful method: once we know the character's category, we can deal with it accordingly.

// Yen character is a currency symbol
System.out.println(Character.CURRENCY_SYMBOL == Character.getType(0x00A5));
// New line is a control character
System.out.println(Character.CONTROL == Character.getType(0x000A)); 
// Fullwidth Forms Digit 3 is a digit character
System.out.println(Character.DECIMAL_DIGIT_NUMBER == Character.getType(0xff13)); 

Characters to Unicode escape sequences

Java provides a tool called native2ascii that converts non-ASCII characters in a file into Unicode escape sequences, so that the file can be saved in basic ASCII format. The content of the converted file can still be read correctly by Java. Below is the command to convert file content saved in GBK encoding into the corresponding Unicode escape sequences.
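Assuming an input file named chinese.txt (the file names here are just illustrative), the command is:

native2ascii -encoding GBK chinese.txt chinese-escaped.txt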


Make sure the specified file encoding is correct, otherwise the converted content will turn into something unexpected. Check the list of supported encodings in Supported Encodings.
