Sunday, May 10, 2015

Character Encoding Terminologies

Character Encoding Terminologies

Down to the root, the computer could only recognize number in bit sequence. It relies on character encoding scheme to recognizes characters/symbols from human's world. Before we dive into programming with Unicode, it is important to know the character encoding terminologies in order to understand how it works when come to programming in Java.

General Character Encoding

General character encoding overview

Character is a generic term which representing a normal character in any languages, symbol, control character, formatting character, and etc. It is an abstract form which just a piece of textual data without any specific appearance or style. In any general character encoding, each of the characters is encoded into a unique integer value called as Code Point. Once the bonding between character and code point is established, this relationship is never changed and hence called as Encoded Character. Every character encoding scheme can only cope with certain Code Space, which is the range of integer value or total number of code points, depend on the bit-length that they are supporting.

Encoding Bit lengthCode space size
ASCII 7128
WINDOWS-12528256
On the other hand, the code space has determined the maximum number of distinct characters that the corresponding Character Set could contain. The term of Character set is a standard. It defines and contains only certain of characters. There are a lot of character sets out there. For example, Windows has various character sets specifically for different languages such as Arabic (Window 1256), Korean (Windows 949), Western European (Windows 1252) and etc.

CharSet basically is the short form of Coded Character Set, a set of characters which are encoded by specific character encoding scheme. Therefore, charset name most of the time same as character encoding name. Code Page is referring to the same thing, but this term is normally used by computer vendors. For example, UTF-8 has code page numbers 1208 at IBM, 65001 at Microsoft, 4110 at SAP.

What displayed on the screen is called Glyph Image. Font is a collection of glyphs with certain format and style setting. A font type does not necessary to cover all characters from a character set. For example, Arial font is not able to render Japanese characters because it does not contain any glyph for Japanese characters. What is probably rendered on the screen is just a square box. This means the textual data is there but no glyph to depict and render it on the screen. The Japanese characters will be rendered correctly when we use Arial Unicode MS instead.

Unicode Encoding

Unicode encoding


























Universal Character Set (UCS) is an International Standard(ISO/IEC 10646) that defines a standard set of characters that cover all characters from all around the World. It is developed in conjunction with Unicode standard, and hence it's size is determined by Unicode encoding code space. Initially, Unicode was only intended to covers all characters from all modern languages. The Unicode-based encoding scheme at that time was USC-2 (2-byte Universal Character Set), which is 16-bit and it has 65536 code space. However, since Unicode 2.0, UCS-2 was superseded by UTF-16 and Unicode code space is greatly extended to 1,114,112 by using surrogate characters mechanism. In other words, there are over 1 million code points or character placeholders available for characters in USC. USC currently contains nearly one hundred thousand abstract characters and there is still a lot of room for growth.

In term of the code point, Unicode taking the same approach, but it treats the code point as a layer between the character and the encoded value. Code point in Unicode has the notation U+following by hexadecimal numbers. Besides, Unicode also defines properties such as Name,Category, Script, Block, different Binary type and etc for each character. Each character is still mapped to a unique code point. Code point then is encoded by the Unicode-based character encoding scheme. The well-known encoding such as UTF-8, UTF-16, and UTF-32. Different Unicode encoding would encode the same code point into different bytes sequence(represented in hexadecimal format) with variable length of Code Unit. Example as below.

Characters Unicode NameCode pointUTF-8 Code UnitsUTF-16BE Code UnitsUTF-32BE Code Units
aLatin Small Letter AU+006161006100000061
©Copyright signU+00A9C2 A900A9000000a9
Subscript twoU+2082E2 82 82208200002082
𝅘𝅥𝅮Musical Symbol Eighth NoteU+1D160F0 9D 85 A0D834 DD600001D160
A question to wrap up. UTF-16 could encode over 1 million code points, how about UTF-8? It probably could handle half of code space of UTF-16? The answer is No, UTF-8 also having the same code space as UTF-16. I hope you understand why.

No comments: