Sunday, May 10, 2015

Encoding and Decoding

How to perform encoding and decoding in Java?

Using Charset class

We use java.nio.charset.Charset to instantiate the encoding scheme that we want to use. It provides the following methods for encoding and decoding.

  • Encoding
    • ByteBuffer encode(CharBuffer cb)
    • ByteBuffer encode(String str)
  • Decoding
    • CharBuffer decode(ByteBuffer bb)
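
As a rough illustration of the methods listed above, the sketch below round-trips a string through Charset.encode and Charset.decode. The class and helper names (CharsetRoundTrip, encode, decode) are made up for this example, and it assumes java.nio.charset.StandardCharsets is available (Java 7 or later).

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class CharsetRoundTrip {

    // Encodes text into a ByteBuffer that is ready for reading.
    static ByteBuffer encode(String text, Charset charset) {
        return charset.encode(text);
    }

    // Decodes the remaining bytes of the buffer back into a String.
    static String decode(ByteBuffer bytes, Charset charset) {
        return charset.decode(bytes).toString();
    }

    public static void main(String[] args) {
        String original = "\u00A9 2015";
        ByteBuffer encoded = encode(original, StandardCharsets.UTF_8);
        // The copyright sign takes 2 bytes in UTF-8, the other 5 characters 1 byte each.
        System.out.println("Encoded length: " + encoded.remaining());
        String decoded = decode(encoded, StandardCharsets.UTF_8);
        System.out.println("Round trip ok: " + original.equals(decoded));
    }
}
```

Note that Charset.encode and Charset.decode silently replace input they cannot map, so they suit cases where lossy conversion is acceptable.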

Using String class

Charset approaches the task from the side of the encoding scheme. If we would rather start from the text itself, java.lang.String is the class to use, since it is what we handle text with most of the time. It provides the following API for encoding characters into bytes and decoding byte arrays into strings. Note that getBytes() and the constructors that take no charset argument use the platform's default charset.

  • Encoding
    • byte[] getBytes()
    • byte[] getBytes(Charset charset)
    • byte[] getBytes(String charsetName)
  • Decoding
    • String(byte[] bytes)
    • String(byte[] bytes, Charset charset)
    • String(byte[] bytes, int offset, int length)
    • String(byte[] bytes, int offset, int length, Charset charset)
    • String(byte[] bytes, int offset, int length, String charsetName)
    • String(byte[] bytes, String charsetName)
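
A minimal sketch of the String API listed above, assuming UTF-8 as the encoding. The class name StringCodecDemo and the helper decodeSlice are invented for this example. It also shows one practical difference between the overloads: the ones that take a charset name declare the checked UnsupportedEncodingException, while the ones that take a Charset do not.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class StringCodecDemo {

    // Decodes only part of a byte array, using the offset/length constructor.
    static String decodeSlice(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "caf\u00E9"; // "cafe" with an acute accent on the e

        byte[] viaCharset = text.getBytes(StandardCharsets.UTF_8);
        byte[] viaName = text.getBytes("UTF-8"); // may throw UnsupportedEncodingException

        System.out.println("Byte count: " + viaCharset.length);
        System.out.println("Same bytes: " + Arrays.equals(viaCharset, viaName));
        System.out.println("First three bytes decode to: " + decodeSlice(viaCharset, 0, 3));
        System.out.println("Round trip ok: "
                + text.equals(new String(viaCharset, StandardCharsets.UTF_8)));
    }
}
```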

Using CharsetEncoder and CharsetDecoder classes

Furthermore, if you want more control over encoding and decoding, you can use the java.nio.charset.CharsetEncoder and java.nio.charset.CharsetDecoder instances created by a java.nio.charset.Charset. For example,

String copyRight = "\u00A9";
Charset charset = Charset.forName("US-ASCII");
byte[] bytes = copyRight.getBytes(charset);
System.out.println(new String(bytes, charset)); // print default replacement "?"

The code sample above uses String to encode and decode a character. However, the encoding scheme (US-ASCII) cannot represent that character, so the encoder falls back to its default replacement, the "?" character. With CharsetEncoder and CharsetDecoder, we can choose the replacement ourselves.

String copyRight = "\u00A9";
Charset charset = Charset.forName("US-ASCII");
CharsetEncoder encoder = charset.newEncoder();
// When a character cannot be mapped, replace it instead of throwing an exception
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
encoder.replaceWith("!".getBytes(charset)); // use an exclamation mark as the replacement
ByteBuffer byteBuffer = encoder.encode(CharBuffer.wrap(copyRight));
byte[] bytes = new byte[byteBuffer.remaining()];
byteBuffer.get(bytes);
System.out.println(new String(bytes, charset)); // prints the new replacement "!"
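
The decoder side offers the same kind of control. As a sketch, a CharsetDecoder configured with CodingErrorAction.REPORT throws a CharacterCodingException instead of silently substituting a replacement, which is one way to check whether bytes are valid for a given charset. The class name StrictDecodeDemo and the helper isValid are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class StrictDecodeDemo {

    // Returns true if the bytes decode cleanly in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        // Decoders are stateful and not thread-safe, so create one per call.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Copyright = "\u00A9".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValid(utf8Copyright, StandardCharsets.UTF_8));    // true
        System.out.println(isValid(utf8Copyright, StandardCharsets.US_ASCII)); // false
    }
}
```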

What bytes does an encoded character produce?

Different encoding schemes produce different bytes for the same character/code point. This holds even for schemes built on the same standard, such as UTF-8 and UTF-16, which are both based on Unicode.

public static void main(String[] args) {
    printEncodedBytesForOmega(Charset.forName("UTF-8"));
    printEncodedBytesForOmega(Charset.forName("UTF-16BE"));
    printEncodedBytesForOmega(Charset.forName("IBM935"));
    printEncodedBytesForOmega(Charset.forName("Cp949"));
    printEncodedBytesForOmega(Charset.forName("MS932"));
}

private static void printEncodedBytesForOmega(Charset charset) {
    byte[] bytes = "Ω".getBytes(charset);
    System.out.format("%s >> ", charset.displayName());
    for (byte b : bytes) {
        System.out.format("%02x ", b);
    }
    System.out.println("");
}

The code sample above gives the following result:

Encoding Scheme | Encoded Bytes (Hex)
UTF-8           | ce a9
UTF-16BE        | 03 a9
x-IBM935        | 0e 41 78 0f
x-IBM949        | a5 d8
windows-31j     | 83 b6

Therefore, it is important to decode a byte stream with the same encoding scheme that was used to encode it. Otherwise, you may get unexpected characters or text that looks corrupted.

public static void main(String[] args) {
    String omega = "Ω";
    byte[] encodedBytes = omega.getBytes(Charset.forName("UTF-8"));
    String decodedString = new String(encodedBytes, Charset.forName("UTF-16BE"));
    System.out.format("Decoded string equals to original string: %b,%n"
            + "Decoded string: %s%n", decodedString.equals(omega), decodedString);
    // Decoded string equals to original string: false,
    // Decoded string: 캩
}

The length of the encoded byte sequence (the number of code units) is not fixed. For example, UTF-8 takes 1 to 4 code units (1 byte each) per code point, while UTF-16 takes 1 or 2 code units (2 bytes each) per code point.

Character | Unicode Name               | Code point | UTF-8 Code Units | UTF-16BE Code Units | UTF-32BE Code Units
a         | Latin Small Letter A       | U+0061     | 61               | 0061                | 00000061
©         | Copyright Sign             | U+00A9     | C2 A9            | 00A9                | 000000A9
₂         | Subscript Two              | U+2082     | E2 82 82         | 2082                | 00002082
𝅘𝅥𝅮         | Musical Symbol Eighth Note | U+1D160    | F0 9D 85 A0      | D834 DD60           | 0001D160
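
The figures in the table can be reproduced with a short program along the following lines. This is only a sketch: the class name CodeUnitDump and the helper toHex are invented here, and the lookup of "UTF-32BE" by name assumes the running JRE ships that charset (it is not one of the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class CodeUnitDump {

    // Encodes the text and renders the bytes as space-separated uppercase hex.
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a, copyright sign, subscript two, musical symbol eighth note (surrogate pair)
        String[] samples = { "a", "\u00A9", "\u2082", "\uD834\uDD60" };
        Charset utf32 = Charset.forName("UTF-32BE"); // assumption: available on this JRE
        for (String s : samples) {
            System.out.format("U+%04X | %s | %s | %s%n",
                    s.codePointAt(0),
                    toHex(s, StandardCharsets.UTF_8),
                    toHex(s, StandardCharsets.UTF_16BE),
                    toHex(s, utf32));
        }
    }
}
```

Unlike the table, this output prints every byte separately, so a 2-byte UTF-16 code unit such as D834 appears as "D8 34".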

Potential pitfall of decoding with a fixed-size byte buffer

Because of this, never assume that 1 byte equals 1 code point. Allocating a fixed number of bytes per code point does not work either. The program below demonstrates the problem.

public static void main(String[] args) throws IOException {
    Charset charset = Charset.forName("UTF-8");
    String original = "a©₂\uD834\uDD60";
    byte[] encoded = original.getBytes(charset);
    InputStream stream = new ByteArrayInputStream(encoded);
    String decoded = decodeFromStream(stream, charset);

    System.out.format("Original: %s%nDecoded:%s%nOriginal==Decoded:%b%n",
            original, decoded, original.equals(decoded));
}

private static String decodeFromStream(InputStream stream, Charset encoding)
throws IOException {
    StringBuilder builder = new StringBuilder();
    byte[] buffer = new byte[1]; // Changing the byte array size doesn't help
    while (true) {
        int r = stream.read(buffer);
        if (r < 0) {
            break;
        }
        String data = new String(buffer, 0, r, encoding);
        builder.append(data);
    }
    return builder.toString();
}

Result:
Original: a©₂𝅘𝅥𝅮
Decoded:a?????????
Original==Decoded:false
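
To see where the replacements come from, the sketch below decodes a UTF-8 sequence one byte at a time. Each fragment of a multi-byte sequence is malformed on its own, so String substitutes U+FFFD (the Unicode replacement character) for it; a console that cannot display U+FFFD will typically show "?" instead. The class name SplitDecodeDemo and the helper decodeByteByByte are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class SplitDecodeDemo {

    // Decodes every byte in isolation, mimicking a 1-byte read buffer.
    static String decodeByteByByte(byte[] encoded, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : encoded) {
            sb.append(new String(new byte[] { b }, charset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] copyright = "\u00A9".getBytes(StandardCharsets.UTF_8); // C2 A9
        String broken = decodeByteByByte(copyright, StandardCharsets.UTF_8);
        // Print the code points rather than the characters to avoid console issues.
        for (int i = 0; i < broken.length(); i++) {
            System.out.format("U+%04X%n", (int) broken.charAt(i));
        }
    }
}
```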

Solution

Instead of dealing with the encoded byte stream directly, use java.io.InputStreamReader to decode it. The reader buffers the bytes it needs internally, so a multi-byte sequence is never decoded in pieces. Note that read() returns one char (a UTF-16 code unit) at a time rather than a whole code point; a supplementary character such as U+1D160 arrives as two surrogate chars, which form the correct pair again once both are appended.

private static String decodeFromStream(InputStream stream, Charset encoding)
throws IOException {
    StringBuilder builder = new StringBuilder();
    InputStreamReader isr = new InputStreamReader(stream, encoding);
    while (true) {
        // read() returns a decoded char (UTF-16 code unit), not a raw encoded byte.
        int c = isr.read();
        if (c < 0) {
            break;
        }
        builder.append((char) c);
    }
    return builder.toString();
}

Result:
Original: a©₂𝅘𝅥𝅮
Decoded:a©₂𝅘𝅥𝅮
Original==Decoded:true
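
Reading one char per call is simple but slow for large inputs. A possible variation, sketched below, reads into a char array instead. Because the buffer holds already-decoded chars rather than raw bytes, its size does not affect correctness: even a surrogate pair split across two reads is reassembled once both halves are appended. The class name ChunkedReaderDemo, the helper decodeAll, and its bufferSize parameter are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class ChunkedReaderDemo {

    // Decodes the whole stream, reading up to bufferSize chars at a time.
    static String decodeAll(InputStream stream, Charset charset, int bufferSize)
            throws IOException {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[bufferSize];
        try (Reader reader = new InputStreamReader(stream, charset)) {
            int n;
            while ((n = reader.read(buffer, 0, buffer.length)) >= 0) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "a\u00A9\u2082\uD834\uDD60";
        byte[] encoded = original.getBytes(StandardCharsets.UTF_8);
        // A 1-char buffer is the worst case: the surrogate pair spans two reads.
        String decoded = decodeAll(new ByteArrayInputStream(encoded),
                StandardCharsets.UTF_8, 1);
        System.out.println("Original==Decoded:" + original.equals(decoded));
    }
}
```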

Related topics:
Character Encoding Terminologies
Unicode Support in Java
Endianness and Byte Order Mark (BOM)
Surrogate Characters Mechanism
Unicode Regular Expression
Characters Normalization
Text Collation

References:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetDecoder.html
http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html
