As of the JDK 7 release, Regular Expression pattern matching has expanded functionality to support Unicode 6.0.
You can match a specific Unicode code point using an escape sequence of the form \uFFFF
, where FFFF
is the hexidecimal value of the code point you want to match. For example, \u6771
matches the Han character for east.
Alternatively, you can specify a code point using Perl-style hex notation, \x{...}
. For example:
String hexPattern = "\x{" + Integer.toHexString(codePoint) + "}";
Each Unicode character, in addition to its value, has certain attributes, or properties. You can match a single character belonging to a particular category with the expression \p{prop}
. You can match a single character not belonging to a particular category with the expression \P{prop}
.
The three supported property types are scripts, blocks, and a "general" category.
To determine if a code point belongs to a specific script, you can either use the script
keyword, or the sc
short form, for example, \p{script=Hiragana}
. Alternatively, you can prefix the script name with the string Is
, such as \p{IsHiragana}
.
Valid script names supported by Pattern
are those accepted by
UnicodeScript.forName
.
A block can be specified using the block
keyword, or the blk
short form, for example, \p{block=Mongolian}
. Alternatively, you can prefix the block name with the string In
, such as \p{InMongolian}
.
Valid block names supported by Pattern
are those accepted by
UnicodeBlock.forName
.
Categories can be specified with optional prefix Is
. For example, IsL
matches the category of Unicode letters. Categories can also be specified by using the general_category
keyword, or the short form gc
. For example, an uppercase letter can be matched using general_category=Lu
or gc=Lu
.
Supported categories are those of
The Unicode Standard in the version specified by the
Character
class.