HauChee's Programming Notes: Encoding and Decoding

How to perform encoding and decoding in Java?

Using Charset class

We use java.nio.charset.Charset to instantiate the encoding scheme that we want to use. It provides the following methods for encoding and decoding.

Encoding

ByteBuffer encode(CharBuffer cb)
ByteBuffer encode(String str)

Decoding

CharBuffer decode(ByteBuffer bb)
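
As a minimal sketch of how these methods fit together, the round trip below encodes a string with Charset.encode and decodes it again with Charset.decode. The class name CharsetRoundTrip and its two helper methods are made up for illustration; only the Charset, ByteBuffer and CharBuffer calls come from the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class CharsetRoundTrip {
    // Encodes the text and copies exactly the encoded bytes out of the buffer.
    static byte[] encode(String text, Charset charset) {
        ByteBuffer encoded = charset.encode(text);
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        return bytes;
    }

    // Decodes the bytes back into text with the same charset.
    static String decode(byte[] bytes, Charset charset) {
        CharBuffer decoded = charset.decode(ByteBuffer.wrap(bytes));
        return decoded.toString();
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        byte[] bytes = encode("\u00A9", utf8);
        System.out.println(bytes.length);                         // 2
        System.out.println(decode(bytes, utf8).equals("\u00A9")); // true
    }
}
```

Copying encoded.remaining() bytes, rather than calling array() on the buffer, avoids picking up unused capacity at the end of the backing array.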

Using String class

While Charset approaches the task from the perspective of the encoding scheme, java.lang.String is the class to use if we would rather start from the text itself, since it is what handles texts/characters most of the time. It provides the following API for encoding characters and decoding byte arrays.

Encoding

byte[] getBytes()
byte[] getBytes(Charset charset)
byte[] getBytes(String charsetName)

Decoding

String (byte[] bytes)
String (byte[] bytes, Charset charset)
String (byte[] bytes, int offset, int length)
String (byte[] bytes, int offset, int length, Charset charset)
String (byte[] bytes, int offset, int length, String charsetName)
String (byte[] bytes, String charsetName)
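
A small sketch of the String API in use is shown below. It always passes an explicit Charset, because the overloads without a charset argument (getBytes() and String(byte[])) fall back to the platform default charset, which may differ between machines. The class name StringRoundTrip and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StringRoundTrip {
    // Encodes to UTF-8 and decodes again with the same charset.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes only a slice of the array using the offset/length constructor.
    static String decodeSlice(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" followed by the copyright sign encodes to 61 C2 A9 in UTF-8.
        byte[] bytes = "a\u00A9".getBytes(StandardCharsets.UTF_8);
        System.out.println(roundTrip("a\u00A9").equals("a\u00A9"));   // true
        System.out.println(decodeSlice(bytes, 1, 2).equals("\u00A9")); // true
    }
}
```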

Using CharsetEncoder and CharsetDecoder classes

Furthermore, if you want more control over encoding/decoding, you can use java.nio.charset.CharsetEncoder and java.nio.charset.CharsetDecoder instances, which are created by a java.nio.charset.Charset. For example,

String copyRight = "\u00A9";
Charset charset = Charset.forName("US-ASCII");
byte[] bytes = copyRight.getBytes(charset);
System.out.println(new String(bytes, charset)); // print default replacement "?"

The code sample above uses String to encode and decode a character. However, the encoding scheme (ASCII) cannot represent that character, so encoding falls back to the default replacement, which is the "?" mark. Using CharsetEncoder and CharsetDecoder, we can set the replacement to whatever we want.

String copyRight = "\u00A9";
Charset charset = Charset.forName("US-ASCII");
CharsetEncoder encoder = charset.newEncoder();
// When character is unmapped, trigger replace action
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
encoder.replaceWith("!".getBytes(charset)); // set Exclamation mark as replacement
ByteBuffer byteBuffer = encoder.encode(CharBuffer.wrap(copyRight.toCharArray()));
byte[] bytes = new byte[byteBuffer.limit()];
byteBuffer.get(bytes);
System.out.println(new String(bytes, charset)); // print new replacement "!"
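
Replacement is only one of the available actions. As a hedged sketch of another kind of control, the encoder can be told to report an unmappable character instead of replacing it, in which case encode throws a CharacterCodingException. The class name StrictEncoding and the method encodesCleanly are hypothetical names chosen for this illustration.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StrictEncoding {
    // Returns true if the whole text can be encoded, false if the encoder
    // reports a malformed or unmappable character.
    static boolean encodesCleanly(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(encodesCleanly("\u00A9", StandardCharsets.US_ASCII)); // false
        System.out.println(encodesCleanly("abc", StandardCharsets.US_ASCII));    // true
    }
}
```

Reporting suits cases where silently losing a character is worse than failing, for example when validating input before it is stored.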

What bytes does an encoded character produce?

Different encoding schemes give different encoded results for the same character/code point. This holds even for encoding schemes based on the same standard, such as UTF-8 and UTF-16, which are both based on the Unicode standard.

public static void main(String[] args) {
    printEncodedBytesForOmega(Charset.forName("UTF-8"));
    printEncodedBytesForOmega(Charset.forName("UTF-16BE"));
    printEncodedBytesForOmega(Charset.forName("IBM935"));
    printEncodedBytesForOmega(Charset.forName("Cp949"));
    printEncodedBytesForOmega(Charset.forName("MS932"));
}

private static void printEncodedBytesForOmega(Charset charset) {
    byte[] bytes = "Ω".getBytes(charset);
    System.out.format("%s >> ", charset.displayName());
    for (byte b : bytes) {
        System.out.format("%02x ", b);
    }
    System.out.println("");
}

The code sample above produces the following result:

Encoding Scheme	Encoded Bytes (Hex)
UTF-8	ce a9
UTF-16BE	03 a9
x-IBM935	0e 41 78 0f
x-IBM949	a5 d8
windows-31j	83 b6

Therefore, it is important to decode a byte stream with the same encoding scheme that was used to encode it. Otherwise, you may get unexpected characters, or text that looks corrupted.

public static void main(String[] args) {
    String omega = "Ω";
    byte[] encodedBytes = omega.getBytes(Charset.forName("UTF-8"));
    String decodedString = new String(encodedBytes, Charset.forName("UTF-16BE"));
    System.out.format("Decoded string equals to original string: %b,%n"
            + "Decoded string: %s%n", decodedString.equals(omega), decodedString);
    // Decoded string equals to original string: false,
    // Decoded string: 캩
}

The length of the encoded byte sequence, that is, the number of code units, is not fixed. For example, UTF-8 takes 1 to 4 code units (1 byte each) per code point, while UTF-16 takes 1 or 2 code units (2 bytes each) per code point.

Characters	Unicode Name	Code point	UTF-8 Code Units	UTF-16BE Code Units	UTF-32BE Code Units
a	Latin Small Letter A	U+0061	61	0061	00000061
©	Copyright sign	U+00A9	C2 A9	00A9	000000A9
₂	Subscript two	U+2082	E2 82 82	2082	00002082
𝅘𝅥𝅮	Musical Symbol Eighth Note	U+1D160	F0 9D 85 A0	D834 DD60	0001D160

Potential pitfall of a fixed-size byte buffer

Because of this, never assume that 1 byte is equivalent to 1 code point. Decoding a fixed number of bytes at a time does not work either, because a chunk boundary can fall in the middle of a multi-byte sequence. The program below demonstrates the problem.

public static void main(String[] args) throws IOException {
    Charset charset = Charset.forName("UTF-8");
    String original = "a©₂\uD834\uDD60";
    byte[] encoded = original.getBytes(charset);
    InputStream stream = new ByteArrayInputStream(encoded);
    String decoded = decodeFromStream(stream, charset);

    System.out.format("Original: %s%nDecoded:%s%nOriginal==Decoded:%b%n",
            original, decoded, original.equals(decoded));
}

private static String decodeFromStream(InputStream stream, Charset encoding)
        throws IOException {
    StringBuilder builder = new StringBuilder();
    byte[] buffer = new byte[1]; // A different fixed size doesn't help in general
    while (true) {
        int r = stream.read(buffer);
        if (r < 0) {
            break;
        }
        String data = new String(buffer, 0, r, encoding);
        builder.append(data);
    }
    return builder.toString();
}

Solution

Instead of dealing with the encoded byte stream directly, use java.io.InputStreamReader to decode it. The reader keeps incomplete byte sequences buffered internally, so each read() returns a fully decoded char. Note that a char is a UTF-16 code unit: a code point outside the Basic Multilingual Plane arrives as two consecutive surrogates, which the StringBuilder puts back together in order.

private static String decodeFromStream(InputStream stream, Charset encoding)
        throws IOException {
    StringBuilder builder = new StringBuilder();
    InputStreamReader isr = new InputStreamReader(stream, encoding);
    while (true) {
        // read() returns one decoded char (a UTF-16 code unit),
        // not a raw encoded byte.
        int ch = isr.read();
        if (ch < 0) {
            break;
        }
        builder.append((char) ch);
    }
    return builder.toString();
}
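
Reading one character per call is simple but not required. As a sketch under the same idea, the variant below reads into a deliberately tiny char array to show that chunk boundaries are harmless once a Reader does the decoding, and then counts code points in the result. The class name ChunkedDecode and the buffer size of 3 are arbitrary choices for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ChunkedDecode {
    static String decodeFromStream(InputStream stream, Charset encoding)
            throws IOException {
        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[3]; // tiny on purpose; any size works
        try (Reader reader = new InputStreamReader(stream, encoding)) {
            int n;
            while ((n = reader.read(buffer)) >= 0) {
                // Append only the chars actually read in this call.
                builder.append(buffer, 0, n);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "a\u00A9\u2082\uD834\uDD60";
        byte[] encoded = original.getBytes(StandardCharsets.UTF_8);
        String decoded = decodeFromStream(
                new ByteArrayInputStream(encoded), StandardCharsets.UTF_8);
        System.out.println(original.equals(decoded));                    // true
        System.out.println(decoded.length());                            // 5
        System.out.println(decoded.codePointCount(0, decoded.length())); // 4
    }
}
```

The last two lines show the difference between UTF-16 code units (5) and code points (4) for the same text.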

Sunday, May 10, 2015