Sunday, May 10, 2015

Endianness and Byte Order Mark (BOM)

What are Endianness and the Byte Order Mark (BOM)?

Encoding schemes such as UTF-16 and UTF-32 have to deal with the endianness of their code units. Each of their code units is larger than 8 bits, the unit in which a computer organizes its memory, so a single code unit spans several bytes and the order of those bytes determines how the data is interpreted. There are two types of endianness: Big-Endian and Little-Endian. In Big-Endian order, data is read or written starting from the most significant byte and ending with the least significant byte. This is the same way humans read and write numbers. Little-Endian does the opposite: it starts from the least significant byte and ends with the most significant byte. An easy way to remember them: Big-Endian puts the big end of the number first; Little-Endian puts the little end first.
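To make the two byte orders concrete, here is a minimal sketch that lays out the same 16-bit value in both orders using java.nio.ByteBuffer. The class name EndianDemo, the helper toHex, and the sample value 0x1234 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    // Writes a 16-bit value into a 2-byte buffer using the requested byte order
    // and returns the resulting bytes as hex.
    static String toHex(short value, ByteOrder order) {
        byte[] bytes = ByteBuffer.allocate(2).order(order).putShort(value).array();
        return String.format("%02x %02x", bytes[0], bytes[1]);
    }

    public static void main(String[] args) {
        short value = 0x1234; // 0x12 is the most significant byte, 0x34 the least
        System.out.println("Big-Endian:    " + toHex(value, ByteOrder.BIG_ENDIAN));    // 12 34
        System.out.println("Little-Endian: " + toHex(value, ByteOrder.LITTLE_ENDIAN)); // 34 12
    }
}
```

The value is the same in both cases; only the order in which its two bytes are stored differs.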

UTF-16 and UTF-32 rely on the Byte Order Mark (BOM), a special Unicode character that indicates the endianness of a byte stream. The BOM has code point U+FEFF, which is a 2-byte number: FE is the most significant byte and FF is the least significant byte. Looking only at the individual bytes, you might think FF is the most significant byte because it is bigger than FE. That is wrong: significance comes from position, not value. FE is the most significant byte because it contributes FE00, and adding FF gives FEFF. The BOM has to be placed at the start of the byte stream so that it can be examined first to determine the byte order of everything that follows: a stream that starts with FE FF is Big-Endian, and one that starts with FF FE is Little-Endian. The program below shows the code units produced when the BOM is encoded by different encoding schemes.

public static void main(String[] args) {
    printBom(Charset.forName("UTF-8"));
    printBom(Charset.forName("UTF-16"));
    printBom(Charset.forName("UTF-16LE"));
    printBom(Charset.forName("UTF-16BE"));
    printBom(Charset.forName("UTF-32"));
    printBom(Charset.forName("UTF-32LE"));
    printBom(Charset.forName("UTF-32BE"));
}

private static void printBom(Charset charset) {
    String bom = "\uFEFF";
    System.out.format("%s >> ", charset.displayName());
    for (byte b : bom.getBytes(charset)) {
        System.out.format("%02x ", b);
    }
    System.out.println("");
}

Result:

Encoding    Code Units
UTF-8       ef bb bf
UTF-16      feff feff
UTF-16LE    fffe
UTF-16BE    feff
UTF-32      0000feff
UTF-32LE    fffe0000
UTF-32BE    0000feff
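Since the BOM is examined first to determine the byte order, the byte patterns in the table can be turned into a small detector. The following is only a sketch: the class BomSniffer and its methods are hypothetical names, not part of the Java API. It also assumes that a stream starting with FF FE 00 00 is UTF-32LE, although the same bytes could be a UTF-16LE BOM followed by U+0000.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns the encoding implied by a leading BOM, or null when no BOM is found.
    // UTF-32 is checked before UTF-16 because FF FE 00 00 also begins with FF FE.
    static String detect(byte[] data) {
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data, read as unsigned values, with the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect("a".getBytes(StandardCharsets.UTF_16))); // UTF-16BE
        System.out.println(detect("a".getBytes(StandardCharsets.UTF_8)));  // null
    }
}
```

A null result only means that no BOM was found; it says nothing about which encoding the bytes actually use.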

How to use UTF-16?

When encoding, Java's UTF-16 charset always writes a Big-Endian BOM at the beginning, followed by the byte stream in Big-Endian order.

public static void main(String[] args) {
    String text = "a";
    for (byte b : text.getBytes(Charset.forName("UTF-16"))) {
        System.out.format("%02x ", b);
    }
    System.out.println("");
    //Result (in hexadecimal): fe ff 00 61
    //Result (in decimal): -2 -1 0 97
}

When decoding, UTF-16 examines the BOM to determine the endianness of the byte stream that follows. If no BOM is provided, it assumes the byte stream is in Big-Endian order.

public static void main(String[] args) {
    Charset utf16 = Charset.forName("UTF-16");
    byte[] beBytes = new byte[]{-2, -1, 00, 97};
    String beResult = new String(beBytes, utf16);

    byte[] leBytes = new byte[]{-1, -2, 97, 00};
    String leResult = new String(leBytes, utf16);

    byte[] noBomBeBytes = new byte[]{00, 97};
    String noBomBeResult = new String(noBomBeBytes, utf16);

    byte[] noBomLeBytes = new byte[]{97, 00};
    String noBomLeResult = new String(noBomLeBytes, utf16);

    System.out.format("Big Endian Decoded Result(beResult): %s%n"
            + "Little Endian Decoded Result(leResult): %s%n"
            + "No BOM Big Endian Decoded Result(noBomBeResult): %s%n"
            + "No BOM Little Endian Decoded Result(noBomLeResult): %s%n"
            + "beResult equals leResult: %b%n"
            + "beResult equals noBomBeResult: %b%n"
            + "beResult equals noBomLeResult: %b%n",
            beResult, leResult, noBomBeResult, noBomLeResult,
            beResult.equals(leResult), beResult.equals(noBomBeResult),
            beResult.equals(noBomLeResult));
}

Result:
Big Endian Decoded Result(beResult): a
Little Endian Decoded Result(leResult): a
No BOM Big Endian Decoded Result(noBomBeResult): a
No BOM Little Endian Decoded Result(noBomLeResult): 愀
beResult equals leResult: true
beResult equals noBomBeResult: true
beResult equals noBomLeResult: false

Potential pitfall of using UTF-16 with multiple texts

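Be careful when using UTF-16 to encode multiple texts into a single byte stream, because every call to getBytes adds its own BOM. A BOM that ends up in the middle of the stream is not treated as a byte order mark; it is decoded as an invisible U+FEFF character, so the result looks correct when printed but is not equal to the original text. The program below demonstrates the problem.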

public static void main(String[] args) throws IOException {
    Charset utf16 = Charset.forName("UTF-16");
    String ping = "ping";
    String pong = "pong";
    String pingpong = ping + pong;

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    baos.write(ping.getBytes(utf16));
    baos.write(pong.getBytes(utf16));

    byte[] encodedBytes = baos.toByteArray();
    printBytes(encodedBytes);
    String decodedResult = new String(encodedBytes, utf16);

    System.out.format("DecodeResult: %s%nDecodeResult equals original: %b%n",
            decodedResult, decodedResult.equals(pingpong));
}

private static void printBytes(byte[] bytes) {
    for (byte b : bytes) {
        System.out.format("%02x ", b);
    }
    System.out.println("");
}

Result:
fe ff 00 70 00 69 00 6e 00 67 fe ff 00 70 00 6f 00 6e 00 67
DecodeResult: pingpong
DecodeResult equals original: false

Solution

Dealing directly with encoded bytes is error-prone. We could handle the BOM issue introduced by UTF-16 above ourselves, but why not leave the hard work to the existing Java API? Use java.io.OutputStreamWriter instead: it encodes everything written to it as one stream, so the BOM is written only once.

public static void main(String[] args) throws IOException {
    Charset utf16 = Charset.forName("UTF-16");
    String ping = "ping";
    String pong = "pong";
    String pingpong = ping + pong;

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    OutputStreamWriter osw = new OutputStreamWriter(baos, utf16);
    osw.write(ping);
    osw.write(pong);
    osw.flush();

    byte[] encodedBytes = baos.toByteArray();
    printBytes(encodedBytes);
    String decodedResult = new String(encodedBytes, utf16);

    System.out.format("DecodeResult: %s%nDecodeResult equals original: %b%n",
            decodedResult, decodedResult.equals(pingpong));
}

Result:
fe ff 00 70 00 69 00 6e 00 67 00 70 00 6f 00 6e 00 67
DecodeResult: pingpong
DecodeResult equals original: true
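For comparison, this is roughly what handling the BOM ourselves could look like: write one Big-Endian BOM by hand, then encode every text with UTF-16BE, which never adds a BOM. This is only a sketch of one possible approach; the class ManualBomDemo and the method encodeAll are made-up names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ManualBomDemo {
    // Encodes several texts into one UTF-16 byte stream with a single leading BOM.
    static byte[] encodeAll(String... texts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Write the Big-Endian BOM (FE FF) exactly once, at the very start.
        out.write(0xFE);
        out.write(0xFF);
        for (String text : texts) {
            // UTF-16BE produces Big-Endian bytes without a BOM.
            byte[] chunk = text.getBytes(StandardCharsets.UTF_16BE);
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeAll("ping", "pong");
        String decoded = new String(encoded, StandardCharsets.UTF_16);
        System.out.println(decoded.equals("pingpong")); // true
    }
}
```

It works, but it is exactly the kind of bookkeeping that OutputStreamWriter already does for us.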

How does UTF-16 encode bytes in Little-Endian order?

You may wonder how to make UTF-16 encode bytes in Little-Endian order. It can't, because Java's UTF-16 charset always writes a Big-Endian BOM followed by Big-Endian data. Hence, to generate a Little-Endian byte stream with a Little-Endian BOM, we have to use x-UTF-16LE-BOM. Note that the x- prefix marks it as a non-standard charset, so it may not be available on every Java runtime.

public static void main(String[] args) {
    String text = "a";
    for (byte b : text.getBytes(Charset.forName("x-UTF-16LE-BOM"))) {
        System.out.format("%02x ", b);
    }
    System.out.println("");
    //Result (in hexadecimal): ff fe 61 00
    //Result (in decimal): -1 -2 97 0
}
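If x-UTF-16LE-BOM is not available, the same bytes can be produced with the standard UTF-16LE charset by putting U+FEFF in front of the text ourselves. The sketch below assumes that is an acceptable substitute; the class LittleEndianBomDemo and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class LittleEndianBomDemo {
    // Prepends U+FEFF to the text; UTF-16LE then encodes it as the bytes FF FE,
    // which is exactly the Little-Endian BOM.
    static byte[] encodeUtf16LeWithBom(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);
    }

    // Formats the bytes as space-separated lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeUtf16LeWithBom("a");
        System.out.println(toHex(encoded)); // ff fe 61 00
        // UTF-16 reads the BOM, switches to Little-Endian, and decodes the rest.
        System.out.println(new String(encoded, StandardCharsets.UTF_16)); // a
    }
}
```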

UTF-16 in other forms

The BOM is not the only way to deal with endianness. The UTF-16BE and UTF-16LE encoding schemes state the endianness of their byte stream in their names, so they do not require a BOM. When encoding, they never write a BOM. When decoding, if a BOM exists, they treat it as a ZERO-WIDTH NON-BREAKING SPACE (U+FEFF) that becomes part of the decoded text.
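A short sketch of both behaviours with UTF-16BE: encoding produces no BOM, and a BOM found while decoding is kept as a character. The class ExplicitEndianDemo and its method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class ExplicitEndianDemo {
    // Encodes with UTF-16BE, which never writes a BOM.
    static byte[] encodeBe(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    // Decodes with UTF-16BE, which does not strip a leading BOM.
    static String decodeBe(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(encodeBe("a").length); // 2 (just 00 61, no BOM)

        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x61};
        String decoded = decodeBe(withBom);
        System.out.println(decoded.length());             // 2
        System.out.println(decoded.charAt(0) == '\uFEFF'); // true
        System.out.println(decoded.equals("a"));          // false
    }
}
```

The decoded text prints as "a", but it carries an extra invisible character at the front, which is easy to miss until an equals check fails.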

Using UTF-16 is a bit tricky, so make sure the encoding scheme you decode with matches the one used for encoding.

Encoded By      Decoded By  Result  Reason
UTF-16          UTF-16BE    OK      The bytes encoded by UTF-16 are Big-Endian; the initial BOM is treated as a ZERO-WIDTH NON-BREAKING SPACE.
UTF-16          UTF-16LE    NO      The bytes encoded by UTF-16 are Big-Endian and cannot be decoded correctly by a Little-Endian encoding.
UTF-16BE        UTF-16      OK      When there is no initial BOM, UTF-16 assumes the byte stream is Big-Endian.
UTF-16BE        UTF-16LE    NO      The endianness used for encoding is the opposite of the one used for decoding.
UTF-16LE        UTF-16      NO      Without an initial BOM, UTF-16 can only decode a Big-Endian byte stream.
x-UTF-16LE-BOM  UTF-16      OK      UTF-16 can interpret the initial BOM.
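
Everything in this section also applies to UTF-32, UTF-32BE, and UTF-32LE, with one difference visible in the first result table: when encoding, Java's UTF-32 charset does not write a BOM.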

BOM in UTF-8

A BOM is not necessary for UTF-8 because UTF-8 has no endianness issue: each UTF-8 code unit is 8 bits, the same size as the computer's memory unit. However, putting an initial BOM in a UTF-8 encoded byte stream is not prohibited. When decoding, Java treats the initial BOM as a ZERO-WIDTH NON-BREAKING SPACE. The Wikipedia page on the byte order mark (see References) gives some idea of why people still use a BOM with UTF-8.
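Because the decoded BOM stays in the text, code that reads UTF-8 files written with a BOM usually has to remove it. Here is a minimal sketch; the class Utf8BomDemo and the helper stripBom are hypothetical names, not part of the Java API.

```java
import java.nio.charset.StandardCharsets;

class Utf8BomDemo {
    // Removes a single leading U+FEFF, if there is one.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of U+FEFF, followed by 61 for "a".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        String decoded = new String(withBom, StandardCharsets.UTF_8);

        System.out.println(decoded.length());              // 2
        System.out.println(stripBom(decoded).equals("a")); // true
    }
}
```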

Related topics:
Character Encoding Terminologies
Unicode Support in Java
Encoding and Decoding
Surrogate Characters Mechanism
Unicode Regular Expression
Characters Normalization
Text Collation

References:
http://en.wikipedia.org/wiki/Endianness
http://en.wikipedia.org/wiki/Byte_order_mark
http://en.wikipedia.org/wiki/UTF-16
http://en.wikipedia.org/wiki/UTF-8
http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html
