Sunday, May 10, 2015

Unicode Regular Expression

Regular Expression (Regex) Character Classes

A character class is a short sequence written in a pattern expression that matches one character from a larger set of characters. Java supports several types of regex character classes.

Predefined Character Classes

The predefined character classes are the most commonly used character classes in regex pattern expressions. However, the set of characters they cover is very limited. They are good enough for a simple English-only application, but they do not help if your application requires internationalization.

Predefined character classes

.  Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
// Requires: import java.util.regex.Pattern;
public static void main(String[] args) {
    System.out.println(match("\\w", "~")); //false
    System.out.println(match("\\w", "神")); //false
    System.out.println(match("\\d", "๓")); //false
}

// Returns true only if the whole string matches the pattern.
static boolean match(String pattern, String str) {
    Pattern p = Pattern.compile(pattern);
    return p.matcher(str).matches();
}
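The remark on "." in the table above ("may or may not match line terminators") depends on the DOTALL flag. Below is a minimal sketch of that behavior. The class name and the three-argument match helper are assumptions made for illustration; only java.util.regex.Pattern is a real API here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
final class DotAllSketch {

    // Same idea as match(pattern, str) above, with an extra flags argument.
    static boolean match(String pattern, String str, int flags) {
        return Pattern.compile(pattern, flags).matcher(str).matches();
    }

    public static void main(String[] args) {
        System.out.println(match(".", "a", 0));               // true
        System.out.println(match(".", "\n", 0));              // false: line terminator
        System.out.println(match(".", "\n", Pattern.DOTALL)); // true
    }
}
```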

POSIX Character Classes

Java also supports POSIX (Portable Operating System Interface) character classes. POSIX is a standard for maintaining compatibility between operating systems. In other words, the same classes of characters are recognized by other systems and applications that comply with the POSIX standard. POSIX character classes cover more characters than the predefined character classes, but they stay within the boundary of the ASCII character set, so they are still not enough for handling internationalization. They are written with the "\p{<class_name>}" construct to match characters that belong to the named class, and with the "\P{<class_name>}" construct to match characters that do NOT belong to it. <class_name> is case-sensitive.

POSIX character classes

\p{Lower}  A lower-case alphabetic character: [a-z]
\p{Upper}  An upper-case alphabetic character: [A-Z]
\p{ASCII}  All ASCII: [\x00-\x7F]
\p{Alpha}  An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}  A decimal digit: [0-9]
\p{Alnum}  An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}  Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}  A visible character: [\p{Alnum}\p{Punct}]
\p{Print}  A printable character: [\p{Graph}\x20]
\p{Blank}  A space or a tab: [ \t]
\p{Cntrl}  A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space}  A whitespace character: [ \t\n\x0B\f\r]
System.out.println(match("\\p{Punct}", "~")); //true
System.out.println(match("\\p{Print}", "神")); //false       
System.out.println(match("\\p{Digit}", "๓")); //false
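The negated "\P{...}" form and the ASCII boundary of the POSIX classes can be seen in a small sketch like the one below. The class name is hypothetical, and the match helper simply mirrors the one used earlier in this post.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the match helper mirrors the one used earlier.
final class PosixNegationSketch {

    static boolean match(String pattern, String str) {
        return Pattern.compile(pattern).matcher(str).matches();
    }

    public static void main(String[] args) {
        System.out.println(match("\\P{Digit}", "a"));           // true: not in [0-9]
        System.out.println(match("\\P{Digit}", "7"));           // false
        System.out.println(match("\\P{Digit}", "\u0E53"));      // true: Thai digit three is outside ASCII
        System.out.println(match("\\p{Alpha}+", "Hello"));      // true
        System.out.println(match("\\p{Alpha}+", "H\u00E9llo")); // false: e with acute is not in [a-zA-Z]
    }
}
```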

Unicode Character Classes

Unicode defines a large series of regex character classes that match Unicode characters based on the properties assigned to them. These include Category, Binary, Block, and Script properties. The classes are written with the "\p{}" construct to match characters that have the specified property, and with the "\P{}" construct to match characters that do NOT have it. Since Java complies with the Unicode standard, Java supports these character classes as well.

Unicode Categories Properties Character Classes

There are two levels of category properties: the main categories and their sub-categories. Sub-categories allow us to match more specific characters based on our needs.
  • \p{L}: any kind of letter from any language.
    • \p{Ll}: a lowercase letter that has an uppercase variant.
    • \p{Lu}: an uppercase letter that has a lowercase variant.
    • \p{Lt}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
    • \p{LC}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt). Some other regex flavors write this as \p{L&}, a spelling that Java's Pattern does not accept.
    • \p{Lm}: a special character that is used like a letter.
    • \p{Lo}: a letter or ideograph that does not have lowercase and uppercase variants.
  • \p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    • \p{Mn}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    • \p{Mc}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Me}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • \p{Z}: any kind of whitespace or invisible separator.
    • \p{Zs}: a whitespace character that is invisible, but does take up space.
    • \p{Zl}: line separator character U+2028.
    • \p{Zp}: paragraph separator character U+2029.
  • \p{S}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    • \p{Sm}: any mathematical symbol.
    • \p{Sc}: any currency sign.
    • \p{Sk}: a combining character (mark) as a full character on its own.
    • \p{So}: various symbols that are not math symbols, currency signs, or combining characters.
  • \p{N}: any kind of numeric character in any script.
    • \p{Nd}: a digit zero through nine in any script except ideographic scripts.
    • \p{Nl}: a number that looks like a letter, such as a Roman numeral.
    • \p{No}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
  • \p{P}: any kind of punctuation character.
    • \p{Pd}: any kind of hyphen or dash.
    • \p{Ps}: any kind of opening bracket.
    • \p{Pe}: any kind of closing bracket.
    • \p{Pi}: any kind of opening quote.
    • \p{Pf}: any kind of closing quote.
    • \p{Pc}: a punctuation character such as an underscore that connects words.
    • \p{Po}: any kind of punctuation character that is not a dash, bracket, quote or connector.
  • \p{C}: invisible control characters and unused code points.
    • \p{Cc}: an ASCII 0x00–0x1F or Latin-1 0x80–0x9F control character.
    • \p{Cf}: invisible formatting indicator.
    • \p{Co}: any code point reserved for private use.
    • \p{Cs}: one-half of a surrogate pair in UTF-16 encoding.
    • \p{Cn}: any code point to which no character has been assigned.
System.out.println(match("\\p{S}", "~")); //true
System.out.println(match("\\p{L}", "神")); //true       
System.out.println(match("\\p{N}", "๓")); //true
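The sub-categories listed above narrow a match in the same way. The sketch below is illustrative only: the class name and the stripMarks helper are assumptions, and stripMarks relies on java.text.Normalizer to decompose the text first so that "\p{M}" has combining marks to remove.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
final class CategorySketch {

    static boolean match(String pattern, String str) {
        return Pattern.compile(pattern).matcher(str).matches();
    }

    // Decomposes the text (NFD), then deletes every combining mark (\p{M}).
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(match("\\p{Sm}", "~"));      // true: tilde is a math symbol
        System.out.println(match("\\p{Sc}", "~"));      // false: not a currency sign
        System.out.println(match("\\p{Lo}", "\u795E")); // true: ideograph without case variants
        System.out.println(match("\\p{Nd}", "\u0E53")); // true: Thai digit three
        System.out.println(match("\\p{Lu}", "A"));      // true
        System.out.println(match("\\p{Lu}", "a"));      // false
        System.out.println(stripMarks("caf\u00E9"));    // cafe
    }
}
```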

There are other ways to denote Category property character classes in Java, such as adding the "Is" prefix, as below.

System.out.println(match("\\p{IsS}", "~")); //true
System.out.println(match("\\p{IsL}", "神")); //true
System.out.println(match("\\p{IsN}", "๓")); //true

We can also use the "general_category" keyword, as below.

System.out.println(match("\\p{general_category=S}", "~")); //true
System.out.println(match("\\p{general_category=L}", "神")); //true
System.out.println(match("\\p{general_category=N}", "๓")); //true

Or we can use its short form, the "gc" keyword, as below.

System.out.println(match("\\p{gc=S}", "~")); //true
System.out.println(match("\\p{gc=L}", "神")); //true
System.out.println(match("\\p{gc=N}", "๓")); //true

Unicode Binary Properties Character Classes

The binary properties below are supported in Java (as listed in the Java 7 Pattern JavaDoc). Add the "Is" prefix to the property name to use it in a pattern.
  • Alphabetic
  • Ideographic
  • Letter
  • Lowercase
  • Uppercase
  • Titlecase
  • Punctuation
  • Control
  • White_Space
  • Digit
  • Hex_Digit
  • Noncharacter_Code_Point
  • Assigned
System.out.println(match("\\p{IsAssigned}", "~")); //true
System.out.println(match("\\p{IsIdeographic}", "神")); //true
System.out.println(match("\\p{IsDigit}", "๓")); //true
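To show how the binary properties differ from the ASCII-only classes described earlier, here is a small comparison sketch. The class name is hypothetical, and the sample characters (A with diaeresis, ideographic space) are my own choices for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
final class BinaryPropertySketch {

    static boolean match(String pattern, String str) {
        return Pattern.compile(pattern).matcher(str).matches();
    }

    public static void main(String[] args) {
        String aDiaeresis = "\u00C4";       // LATIN CAPITAL LETTER A WITH DIAERESIS
        String ideographicSpace = "\u3000"; // IDEOGRAPHIC SPACE

        System.out.println(match("\\p{Upper}", aDiaeresis));             // false: ASCII only
        System.out.println(match("\\p{IsUppercase}", aDiaeresis));       // true
        System.out.println(match("\\s", ideographicSpace));              // false: ASCII only
        System.out.println(match("\\p{IsWhite_Space}", ideographicSpace)); // true
    }
}
```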

Unicode Block Properties Character Classes

Characters in the Universal Character Set are arranged in an organized, well-planned way. Every character resides in a specific block, so a block can be used to match the characters that belong to it. The valid block names are defined by java.lang.Character.UnicodeBlock. Add the "In" prefix to the block name.

System.out.println(match("\\p{InBASIC_LATIN}", "~")); //true
System.out.println(match("\\p{InCJK_UNIFIED_IDEOGRAPHS}", "神")); //true
System.out.println(match("\\p{InTHAI}", "๓")); //true

Java also accepts the "block" keyword, as below,

System.out.println(match("\\p{Block=BASIC_LATIN}", "~")); //true
System.out.println(match("\\p{Block=CJK_UNIFIED_IDEOGRAPHS}", "神")); //true
System.out.println(match("\\p{Block=THAI}", "๓")); //true

and the short form "blk" keyword, as below.

System.out.println(match("\\p{Blk=BASIC_LATIN}", "~")); //true
System.out.println(match("\\p{Blk=CJK_UNIFIED_IDEOGRAPHS}", "神")); //true
System.out.println(match("\\p{Blk=THAI}", "๓")); //true
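When you are not sure which block a character belongs to, Character.UnicodeBlock.of can tell you. The sketch below builds the "In" pattern from that lookup. The class name and the blockPattern helper are assumptions for illustration, and the helper assumes the code point actually belongs to a block (UnicodeBlock.of returns null otherwise).

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
final class BlockLookupSketch {

    // Builds the "\p{In<BLOCK>}" pattern for the block of the first code point of s.
    // Assumes the code point is a member of a defined block.
    static String blockPattern(String s) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(s.codePointAt(0));
        return "\\p{In" + block + "}";
    }

    public static void main(String[] args) {
        String[] samples = {"~", "\u795E", "\u0E53"};
        for (String s : samples) {
            String regex = blockPattern(s);
            // Each sample matches the pattern built from its own block.
            System.out.println(regex + " -> " + Pattern.matches(regex, s));
        }
    }
}
```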

Unicode Script Properties Character Classes

Since Java 7, Java supports Unicode Script properties. The valid script names are defined in java.lang.Character.UnicodeScript. Add the "Is" prefix to the script name.

System.out.println(match("\\p{IsCOMMON}", "~")); //true
System.out.println(match("\\p{IsHAN}", "神")); //true
System.out.println(match("\\p{IsTHAI}", "๓")); //true

Java also accepts the "script" keyword, as below,

System.out.println(match("\\p{script=COMMON}", "~")); //true
System.out.println(match("\\p{script=HAN}", "神")); //true
System.out.println(match("\\p{script=THAI}", "๓")); //true

and the short form "sc" keyword, as below.

System.out.println(match("\\p{sc=COMMON}", "~")); //true
System.out.println(match("\\p{sc=HAN}", "神")); //true
System.out.println(match("\\p{sc=THAI}", "๓")); //true
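Scripts and blocks are not the same thing: characters of one script can be spread over several blocks. The sketch below illustrates this with two Latin letters of my own choosing. The class name is hypothetical; Character.UnicodeScript.of is the real lookup method.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
final class ScriptLookupSketch {

    static boolean match(String pattern, String str) {
        return Pattern.compile(pattern).matcher(str).matches();
    }

    public static void main(String[] args) {
        // Look up the script of a code point directly.
        System.out.println(Character.UnicodeScript.of("\u795E".codePointAt(0))); // HAN

        String plainA = "a";
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        // Both letters belong to the Latin script...
        System.out.println(match("\\p{IsLATIN}", plainA));       // true
        System.out.println(match("\\p{IsLATIN}", eAcute));       // true
        // ...but only one of them lives in the Basic Latin block.
        System.out.println(match("\\p{InBASIC_LATIN}", plainA)); // true
        System.out.println(match("\\p{InBASIC_LATIN}", eAcute)); // false
    }
}
```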

Pattern Configuration Flags

java.util.regex.Pattern can be configured by setting flags that enable certain matching behavior. The flags are int constants that can be combined into a bitmask and passed to Pattern.compile. We can specify the flags explicitly, as below.

Pattern.compile("aeiou", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

We can also embed flags directly in the pattern string, provided the flag has an embedded form. The line below has the same effect as the line above.

Pattern.compile("(?iu)aeiou");

The flags that involve internationalization are UNICODE_CASE (?u), CANON_EQ, and UNICODE_CHARACTER_CLASS (?U). CANON_EQ has no embedded form. All three are disabled by default. Enabling them may impose a performance penalty, according to Pattern's JavaDoc.
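A quick way to convince yourself that explicit and embedded flags behave the same is to run both against a non-ASCII letter. This is a minimal sketch; the class name, the helper, and the choice of "é" as the test letter are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
final class FlagSketch {

    static boolean match(String pattern, String str, int flags) {
        return Pattern.compile(pattern, flags).matcher(str).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String upper = "\u00C9"; // LATIN CAPITAL LETTER E WITH ACUTE

        // Explicit flags passed as a bitmask.
        System.out.println(match(lower, upper, Pattern.CASE_INSENSITIVE)); // false: ASCII only
        System.out.println(match(lower, upper,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));         // true

        // The same settings embedded in the pattern string.
        System.out.println(match("(?i)" + lower, upper, 0));               // false
        System.out.println(match("(?iu)" + lower, upper, 0));              // true
    }
}
```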

Enable Unicode version of Predefined and POSIX character classes

Starting from Java 7, Pattern supports a new flag called UNICODE_CHARACTER_CLASS. Specifying this flag extends the Predefined and POSIX character classes to cover Unicode characters, as listed in the table below.

Classes    Matches
\p{Lower}  A lowercase character: \p{IsLowercase}
\p{Upper}  An uppercase character: \p{IsUppercase}
\p{ASCII}  All ASCII: [\x00-\x7F]
\p{Alpha}  An alphabetic character: \p{IsAlphabetic}
\p{Digit}  A decimal digit character: \p{IsDigit}
\p{Alnum}  An alphanumeric character: [\p{IsAlphabetic}\p{IsDigit}]
\p{Punct}  A punctuation character: \p{IsPunctuation}
\p{Graph}  A visible character: [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]
\p{Print}  A printable character: [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]
\p{Blank}  A space or a tab: [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]
\p{Cntrl}  A control character: \p{gc=Cc}
\p{XDigit} A hexadecimal digit: [\p{gc=Nd}\p{IsHex_Digit}]
\p{Space}  A whitespace character: \p{IsWhite_Space}
\d         A digit: \p{IsDigit}
\D         A non-digit: [^\d]
\s         A whitespace character: \p{IsWhite_Space}
\S         A non-whitespace character: [^\s]
\w         A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}]
\W         A non-word character: [^\w]
public static void main(String[] args) {
    System.out.println("Predefined Character Class: ");
    String preDefinedDigitPattern = "\\d";
    test(preDefinedDigitPattern);

    System.out.println("POSIX Character Class: ");
    String posixDigitPattern = "\\p{Digit}";
    test(posixDigitPattern);
}

private static void test(String pattern) {
    System.out.println("\tWithout Pattern.UNICODE_CHARACTER_CLASS setting");
    Pattern p = Pattern.compile(pattern);
    apply(p);

    System.out.println("\tAfter apply Pattern.UNICODE_CHARACTER_CLASS setting");
    p = Pattern.compile(pattern, Pattern.UNICODE_CHARACTER_CLASS);
    apply(p);
}

private static void apply(Pattern p) {
    String normalDigit3 = "3";
    String fullWidthFormDigit3 = "３"; // U+FF13 FULLWIDTH DIGIT THREE

    System.out.format("\t\tString: %s, match: %b%n",
            normalDigit3, p.matcher(normalDigit3).matches()); 
    System.out.format("\t\tString: %s, match: %b%n",
            fullWidthFormDigit3, p.matcher(fullWidthFormDigit3).matches()); 
}

Result:
Predefined Character Class:
    Without Pattern.UNICODE_CHARACTER_CLASS setting
        String: 3, match: true
        String: ３, match: false
    After apply Pattern.UNICODE_CHARACTER_CLASS setting
        String: 3, match: true
        String: ３, match: true
POSIX Character Class:
    Without Pattern.UNICODE_CHARACTER_CLASS setting
        String: 3, match: true
        String: ３, match: false
    After apply Pattern.UNICODE_CHARACTER_CLASS setting
        String: 3, match: true
        String: ３, match: true
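The same flag can be switched on inside the pattern with the embedded form (?U). The sketch below reuses the sample characters from the earlier sections; the class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
final class EmbeddedUnicodeClassSketch {

    static boolean match(String pattern, String str) {
        return Pattern.compile(pattern).matcher(str).matches();
    }

    public static void main(String[] args) {
        String ideograph = "\u795E";  // the CJK ideograph used in the earlier examples
        String thaiThree = "\u0E53";  // THAI DIGIT THREE

        System.out.println(match("\\w", ideograph));              // false: ASCII word characters only
        System.out.println(match("(?U)\\w", ideograph));          // true
        System.out.println(match("\\d", thaiThree));              // false
        System.out.println(match("(?U)\\d", thaiThree));          // true
        System.out.println(match("(?U)\\p{Alpha}", ideograph));   // true
    }
}
```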

Case Insensitive Matching

By default, Pattern matching is case-sensitive. Specifying CASE_INSENSITIVE makes the matching case-insensitive, but only for ASCII characters. To enable case-insensitive matching for Unicode characters, we need to specify UNICODE_CASE together with CASE_INSENSITIVE.

public static void main(String[] args) {

    String pattern = "i";

    System.out.println("Case Sensitive");
    Pattern p = Pattern.compile(pattern);
    test(p);

    System.out.println("ASCII Case In-Sensitive");
    p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
    test(p);

    System.out.println("Unicode Case In-Sensitive");
    p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    test(p);
}

private static void test(Pattern p) {
    String i = "i";
    String capitalI = i.toUpperCase();
    String turkishCapitalI = i.toUpperCase(Locale.forLanguageTag("tr"));

    System.out.format("Input: %s, match: %b%n", 
           i, p.matcher(i).matches());
    System.out.format("Input: %s, match: %b%n",
            capitalI, p.matcher(capitalI).matches());
    System.out.format("Input: %s, match: %b%n",
            turkishCapitalI, p.matcher(turkishCapitalI).matches());
}

Result:
Case Sensitive
Input: i, match: true
Input: I, match: false
Input: İ, match: false
ASCII Case In-Sensitive
Input: i, match: true
Input: I, match: true
Input: İ, match: false
Unicode Case In-Sensitive
Input: i, match: true
Input: I, match: true
Input: İ, match: true

Canonical Equivalent Matching

Character matching becomes tricky because Unicode allows a combining character sequence to represent an abstract character (see Characters Normalization). When the CANON_EQ flag is specified, Normalization Form Canonical Decomposition (NFD) is applied during the matching process.

public static void main(String[] args) {
    String pattern = "\u00E0";

    System.out.println("By default, canonical equivalence does not take into account.");
    Pattern p = Pattern.compile(pattern);
    test(p);

    System.out.println("");

    System.out.println("Enable canonical equivalent flag.");
    p = Pattern.compile(pattern, Pattern.CANON_EQ);
    test(p);
}

private static void test(Pattern p) {
    String aWithGraveAccent = "\u00E0"; // precomposed "à"
    String aWithGraveAccentCombine = "\u0061\u0300"; // "a" followed by a combining grave accent

    System.out.format("Pattern: %s%n", p.pattern());

    Matcher m = p.matcher(aWithGraveAccent);
    System.out.format("match %s: %b%n", aWithGraveAccent, m.matches());

    m = p.matcher(aWithGraveAccentCombine);
    System.out.format("match %s: %b%n", aWithGraveAccentCombine, m.matches());
}

Result:
By default, canonical equivalence does not take into account.
Pattern: à
match à: true
match à: false

Enable canonical equivalent flag.
Pattern: à
match à: true
match à: true
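To see the decomposition itself, outside of any regex, java.text.Normalizer can be applied to the two strings by hand. This sketch only illustrates what NFD does to the text; it is not a description of how Pattern implements CANON_EQ internally, and the class and helper names are assumptions.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of the original article.
final class NfdSketch {

    // Applies Normalization Form Canonical Decomposition to the text.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0"; // single code point: a with grave
        String combining = "a\u0300";  // "a" plus combining grave accent

        System.out.println(precomposed.equals(combining));           // false
        System.out.println(nfd(precomposed).equals(combining));      // true
        System.out.println(nfd(precomposed).equals(nfd(combining))); // true
        System.out.println(nfd(precomposed).length());               // 2
    }
}
```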

If the CANON_EQ flag is enabled but you do not want the pattern character to be affected by NFD, you can escape the backslash of the Unicode escape sequence with a second backslash, so that the regex engine rather than the Java compiler interprets the escape.


public static void main(String[] args) {
    String pattern = "\\u00E0";
    Pattern p = Pattern.compile(pattern, Pattern.CANON_EQ);
    test(p);
}

Result:
Pattern: \u00E0
match à: true
match à: false

Related topics:
Character Encoding Terminologies
Unicode Support in Java
Encoding and Decoding
Endianness and Byte Order Mark (BOM)
Surrogate Characters Mechanism
Characters Normalization
Text Collation

References:
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
http://www.regular-expressions.info/unicode.html
