How to perform encoding and decoding in Java?
Using Charset class
We use java.nio.charset.Charset to instantiate the encoding scheme that we want to use. It provides the following methods for encoding and decoding.
- Encoding
  - ByteBuffer encode(CharBuffer cb)
  - ByteBuffer encode(String str)
- Decoding
  - CharBuffer decode(ByteBuffer bb)
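As a rough illustration of the methods listed above, the sketch below round-trips a string through Charset.encode and Charset.decode. The class name CharsetRoundTrip and its helper methods are made up for this example; only the Charset, ByteBuffer and CharBuffer calls come from the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class CharsetRoundTrip {

    // Encodes the text and copies out only the bytes the encoder produced.
    static byte[] encode(String text, Charset charset) {
        ByteBuffer buffer = charset.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Decodes the whole byte array back into a String.
    static String decode(byte[] bytes, Charset charset) {
        CharBuffer chars = charset.decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("\u00A9", StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.format("%02x ", b);
        }
        System.out.println();
        System.out.println(decode(bytes, StandardCharsets.UTF_8));
    }
}
```

Copying buffer.remaining() bytes rather than the whole backing array avoids picking up unused capacity that the returned ByteBuffer may carry.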
Using String class
While Charset approaches the task from the perspective of the encoding scheme, java.lang.String is the class to use when we start from the text itself, since it is what most code uses to handle characters. It provides the following API for encoding characters and decoding byte arrays.
- Encoding (methods)
  - byte[] getBytes()
  - byte[] getBytes(Charset charset)
  - byte[] getBytes(String charsetName)
- Decoding (constructors)
  - String(byte[] bytes)
  - String(byte[] bytes, Charset charset)
  - String(byte[] bytes, int offset, int length)
  - String(byte[] bytes, int offset, int length, Charset charset)
  - String(byte[] bytes, int offset, int length, String charsetName)
  - String(byte[] bytes, String charsetName)
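The sketch below shows one way the String API above might be used, passing the charset explicitly because the overloads without a charset argument fall back to the platform's default charset. The class StringEncodingDemo and its method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named for illustration only.
class StringEncodingDemo {

    // Encoding: characters to bytes with an explicit charset.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding: a slice of a byte array back to characters.
    static String fromUtf8(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8("a\u00A9");          // 'a' takes 1 byte, the copyright sign takes 2
        System.out.println(bytes.length);
        System.out.println(fromUtf8(bytes, 1, 2)); // decode only the copyright sign
    }
}
```

The offset and length must fall on the boundaries of a complete encoded sequence; a slice that cuts a multi-byte sequence in half decodes to replacement characters.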
Using CharsetEncoder and CharsetDecoder classes
Furthermore, if you want more control over encoding and decoding, you can use java.nio.charset.CharsetEncoder and java.nio.charset.CharsetDecoder instances, which are created by java.nio.charset.Charset. For example, consider first what happens when String alone is used:

    String copyRight = "\u00A9";
    Charset charset = Charset.forName("US-ASCII");
    byte[] bytes = copyRight.getBytes(charset);
    System.out.println(new String(bytes, charset)); // prints the default replacement "?"
The code sample above uses String to encode and decode a character. However, the chosen encoding scheme (US-ASCII) cannot represent that character, so the encoder falls back to the default replacement, which is the "?" mark. Using CharsetEncoder and CharsetDecoder, we can set the replacement ourselves.

    String copyRight = "\u00A9";
    Charset charset = Charset.forName("US-ASCII");
    CharsetEncoder encoder = charset.newEncoder();
    // When a character is unmappable, trigger the replace action
    encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    // Set the exclamation mark as the replacement
    encoder.replaceWith("!".getBytes(charset));
    ByteBuffer byteBuffer = encoder.encode(CharBuffer.wrap(copyRight.toCharArray()));
    byte[] bytes = new byte[byteBuffer.limit()];
    byteBuffer.get(bytes);
    System.out.println(new String(bytes, charset)); // prints the new replacement "!"
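The decoding side offers the same kind of control. As a minimal sketch, assuming we would rather reject bad input than replace it, a CharsetDecoder can be configured with CodingErrorAction.REPORT so that invalid bytes raise a CharacterCodingException. The class StrictDecoding and the choice to return null on failure are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class StrictDecoding {

    // Returns the decoded text, or null when the bytes are not valid in the given charset.
    static String decodeStrictly(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null; // the input could not be decoded without loss
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xC2, (byte) 0xA9}; // UTF-8 for the copyright sign
        byte[] truncated = {(byte) 0xC2};          // first byte of a 2-byte sequence only
        System.out.println(decodeStrictly(valid, StandardCharsets.UTF_8));
        System.out.println(decodeStrictly(truncated, StandardCharsets.UTF_8));
    }
}
```

Reporting instead of replacing is useful when silent data loss is worse than a failed operation, for example when validating uploaded files.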
What bytes does an encoded character produce?
Different encoding schemes produce different byte sequences for the same character/code point. This also applies to encoding schemes built on the same standard, such as UTF-8 and UTF-16, which are both based on the Unicode standard.

    public static void main(String[] args) {
        printEncodedBytesForOmega(Charset.forName("UTF-8"));
        printEncodedBytesForOmega(Charset.forName("UTF-16BE"));
        printEncodedBytesForOmega(Charset.forName("IBM935"));
        printEncodedBytesForOmega(Charset.forName("Cp949"));
        printEncodedBytesForOmega(Charset.forName("MS932"));
    }

    private static void printEncodedBytesForOmega(Charset charset) {
        byte[] bytes = "Ω".getBytes(charset);
        System.out.format("%s >> ", charset.displayName());
        for (byte b : bytes) {
            System.out.format("%02x ", b);
        }
        System.out.println("");
    }

The code sample above gives the following result:
Encoding Scheme | Encoded Bytes (Hex) |
---|---|
UTF-8 | ce a9 |
UTF-16BE | 03 a9 |
x-IBM935 | 0e 41 78 0f |
x-IBM949 | a5 d8 |
windows-31j | 83 b6 |
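Because the bytes differ between schemes, encoded bytes must be decoded with the same scheme that produced them. Otherwise the decoded text does not match the original:

    public static void main(String[] args) {
        String omega = "Ω";
        byte[] encodedBytes = omega.getBytes(Charset.forName("UTF-8"));
        // Decode with a different scheme than the one used for encoding
        String decodedString = new String(encodedBytes, Charset.forName("UTF-16BE"));
        System.out.format("Decoded string equals to original string: %b,%n"
                + "Decoded string: %s%n", decodedString.equals(omega), decodedString);
        // Decoded string equals to original string: false,
        // Decoded string: 캩
    }
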
The length of the encoded byte sequence, that is, the number of code units, is not fixed. For example, UTF-8 takes 1 to 4 code units (1 byte each) per code point, while UTF-16 takes 1 or 2 code units (2 bytes each) per code point.
Characters | Unicode Name | Code point | UTF-8 Code Units | UTF-16BE Code Units | UTF-32BE Code Units |
---|---|---|---|---|---|
a | Latin Small Letter A | U+0061 | 61 | 0061 | 00000061 |
© | Copyright sign | U+00A9 | C2 A9 | 00A9 | 000000A9 |
₂ | Subscript two | U+2082 | E2 82 82 | 2082 | 00002082 |
𝅘𝅥𝅮 | Musical Symbol Eighth Note | U+1D160 | F0 9D 85 A0 | D834 DD60 | 0001D160 |
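The code unit counts in the table can be checked with a small sketch like the one below. The class CodeUnitCounter and its method names are made up for this example; it relies only on Character.toChars, Character.charCount and String.getBytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class CodeUnitCounter {

    // Number of UTF-8 code units (bytes) needed for one code point.
    static int utf8Units(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of UTF-16 code units (Java chars) needed for one code point.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0061, 0x00A9, 0x2082, 0x1D160};
        for (int cp : codePoints) {
            System.out.format("U+%04X: %d UTF-8 code unit(s), %d UTF-16 code unit(s)%n",
                    cp, utf8Units(cp), utf16Units(cp));
        }
    }
}
```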
Potential pitfall of fixing the byte container size
Because of this, never assume that 1 byte is equivalent to 1 code point. Fixing the byte array size per code point does not work either, since a multi-byte sequence can be split across two reads. The program below shows the problem.

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        String original = "a©₂\uD834\uDD60";
        byte[] encoded = original.getBytes(charset);
        InputStream stream = new ByteArrayInputStream(encoded);
        String decoded = decodeFromStream(stream, charset);
        System.out.format("Original: %s%nDecoded:%s%nOriginal==Decoded:%b%n",
                original, decoded, original.equals(decoded));
    }

    private static String decodeFromStream(InputStream stream, Charset encoding) throws IOException {
        StringBuilder builder = new StringBuilder();
        byte[] buffer = new byte[1]; // Changing the byte array size doesn't help
        while (true) {
            int r = stream.read(buffer);
            if (r < 0) {
                break;
            }
            // Each chunk is decoded in isolation, so partial sequences become "?"
            String data = new String(buffer, 0, r, encoding);
            builder.append(data);
        }
        return builder.toString();
    }

Result:
Original: a©₂𝅘𝅥𝅮
Decoded:a?????????
Original==Decoded:false
Solution
Instead of dealing with the encoded byte stream directly, use java.io.InputStreamReader to decode it. The reader decodes the byte stream and returns characters (UTF-16 code units), so a supplementary code point arrives as its two surrogates in order.

    private static String decodeFromStream(InputStream stream, Charset encoding) throws IOException {
        StringBuilder builder = new StringBuilder();
        InputStreamReader isr = new InputStreamReader(stream, encoding);
        while (true) {
            // read() returns one decoded char (a UTF-16 code unit), not a raw encoded byte.
            int c = isr.read();
            if (c < 0) {
                break;
            }
            builder.append((char) c);
        }
        return builder.toString();
    }

Result:
Original: a©₂𝅘𝅥𝅮
Decoded:a©₂𝅘𝅥𝅮
Original==Decoded:true
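If the bytes really must be read in fixed-size chunks, one possible alternative is to drive a CharsetDecoder directly and carry the undecoded tail of each chunk over to the next read. The sketch below is only an illustration under stated assumptions: the class ChunkedDecoder and its helper drain are invented names, and the input buffer assumes that an incomplete sequence is never longer than 8 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class ChunkedDecoder {

    // Decodes a stream that is read chunkSize bytes at a time (chunkSize >= 1).
    static String decode(InputStream stream, Charset charset, int chunkSize) throws IOException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        byte[] chunk = new byte[chunkSize];
        // Assumption: 8 spare bytes are enough to hold a carried-over partial sequence.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 8);
        CharBuffer out = CharBuffer.allocate(chunkSize + 8);
        StringBuilder builder = new StringBuilder();
        int n;
        while ((n = stream.read(chunk)) >= 0) {
            in.put(chunk, 0, n);   // append after any leftover bytes
            in.flip();
            drain(decoder, in, out, false, builder);
            in.compact();          // keep the undecoded tail for the next round
        }
        in.flip();
        drain(decoder, in, out, true, builder); // signal end of input
        CoderResult result;
        do {
            result = decoder.flush(out);
            out.flip();
            builder.append(out);
            out.clear();
        } while (result.isOverflow());
        return builder.toString();
    }

    // Runs the decoder until it needs more input, moving decoded chars into the builder.
    private static void drain(CharsetDecoder decoder, ByteBuffer in, CharBuffer out,
                              boolean endOfInput, StringBuilder builder) throws IOException {
        while (true) {
            CoderResult result = decoder.decode(in, out, endOfInput);
            out.flip();
            builder.append(out);
            out.clear();
            if (result.isUnderflow()) {
                return;
            }
            if (result.isError()) {
                result.throwException(); // throws a CharacterCodingException
            }
            // Otherwise the output buffer overflowed: loop and continue decoding.
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "a\u00A9\u2082\uD834\uDD60";
        InputStream stream = new ByteArrayInputStream(original.getBytes(StandardCharsets.UTF_8));
        String decoded = decode(stream, StandardCharsets.UTF_8, 1);
        System.out.println(original.equals(decoded));
    }
}
```

For most programs InputStreamReader remains the simpler choice, since it performs this buffering internally.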
Related topics:
Character Encoding Terminologies
Unicode Support in Java
Endianness and Byte Order Mark (BOM)
Surrogate Characters Mechanism
Unicode Regular Expression
Characters Normalization
Text Collation