
9.2. Unicode

For every character, Unicode specifies a unique identification number that remains consistent across applications, languages, and platforms.

With the advent of the Internet, it became obvious that ASCII was insufficient if the whole world were to transfer data from one Web site to another without corrupting it. ASCII defines only 128 (7-bit) characters, and even the extended one-byte character sets built on it hold only 256, which could hardly accommodate languages like Chinese and Japanese, whose writing systems draw on thousands of characters.

The Unicode standard solves the problem by assigning every character a unique number, called a code point, and by defining encodings, such as UTF-8 and UTF-16, in which a character is not limited to one byte. UTF-8, for example, uses from one to four bytes per character, while UTF-16 uses one or two 16-bit units. To remove ambiguity, any given code point always represents the same character, thereby allowing for consistent sorting, searching, displaying, and editing of text. According to the Unicode Consortium,[5] Unicode has the capacity to encode over one million characters, which is sufficient to encompass all the world's written languages. Further, all symbols are treated equally, so that all characters can be accessed without the need for escape sequences or control codes.

[5] The Unicode Consortium is a nonprofit organization founded to develop, extend, and promote use of the Unicode standard. For more information on Unicode and the Unicode Consortium, go to http://www.unicode.org/unicode/standard/whatisunicode.html.
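The difference between a character's number and the bytes used to store it can be seen with a short sketch. It assumes the core Encode module is available; the four code points are arbitrary samples chosen for illustration and do not come from the text above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample code points (illustrative choices): Latin A, Greek capital
# epsilon, a CJK ideograph, and a character beyond the 16-bit range.
my @code_points = (0x41, 0x395, 0x4E2D, 0x1F600);

for my $cp (@code_points) {
    my $char  = chr($cp);                    # one character
    my $utf8  = encode('UTF-8',    $char);   # its UTF-8 bytes
    my $utf16 = encode('UTF-16BE', $char);   # its UTF-16 bytes
    printf "U+%04X  UTF-8: %d byte(s)  UTF-16: %d byte(s)\n",
        $cp, length($utf8), length($utf16);
}
```

Each character keeps a single number no matter which encoding is used; only the count of bytes needed to store it changes.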

9.2.1. Perl and Unicode

The largest change in Perl 5.6 was the addition of UTF-8 Unicode support. Perl can represent strings internally as Unicode, and the relevant built-in functions (length, reverse, sort, tr) work character by character instead of byte by byte. Two pragmas, introduced with that release, control the Unicode settings. In Perl 5.6, the utf8 pragma turns on the Unicode settings and loads the required character tables; in Perl 5.8 and later, its only job is to declare that the program's source code is written in UTF-8. The bytes pragma restores the old byte semantics, treating a string one byte at a time.
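The contrast between character semantics and byte semantics can be sketched as follows. The two-letter Greek string is an arbitrary sample, and bytes::length is used here only to expose the size of Perl's internal representation, not as a recommended way to measure strings.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without switching on byte semantics

# Two Greek letters (sample data), written as hexadecimal character codes.
my $word = "\x{395}\x{3bb}";

my $chars    = length($word);          # counts characters
my $octets   = bytes::length($word);   # counts bytes of the internal form
my $reversed = reverse $word;          # scalar reverse, character by character

print "characters: $chars\n";
print "bytes:      $octets\n";
print "reverse kept each character intact\n"
    if $reversed eq "\x{3bb}\x{395}";
```

Under character semantics the string has length 2, even though it occupies more bytes internally, and reverse swaps whole characters instead of scrambling their bytes.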

When utf8 is turned on, you can specify string literals in Unicode using the \x{N} notation, where N is a hexadecimal character code such as \x{395}.
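As a minimal sketch of the \x{N} notation, the following builds a one-character string from the code shown above and inspects it. Recent Perl releases accept the notation without the pragma, so none is loaded here; the variable names are illustrative.

```perl
use strict;
use warnings;

my $epsilon = "\x{395}";    # GREEK CAPITAL LETTER EPSILON

printf "code point: U+%04X\n", ord($epsilon);
printf "length:     %d\n",     length($epsilon);

# lc() follows the Unicode case mapping for this character.
my $lower = lc $epsilon;
printf "lowercase:  U+%04X\n", ord($lower);

# Declare the output encoding before printing non-ASCII text.
binmode(STDOUT, ':encoding(UTF-8)');
print "$epsilon $lower\n";
```

The binmode call matters only for output: without it, Perl warns about a wide character when the string is printed.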

Perl's Unicode support extends to regular expressions, which can match characters based on Unicode properties, some of which are defined by the Unicode standard and some by Perl. The Perl properties are composites of the standard properties; for example, you can now match any uppercase character in any language with \p{IsUpper}. For more information, go to http://www.perl.com/pub/a/2000/04/whatsnew.html.
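To illustrate matching uppercase characters across scripts, here is a small sketch. The sample characters and their labels are chosen for illustration only.

```perl
use strict;
use warnings;

# Sample characters from three scripts (illustrative choices).
my @samples = (
    [ 'LATIN CAPITAL A',       "A"       ],
    [ 'GREEK CAPITAL EPSILON', "\x{395}" ],
    [ 'CYRILLIC CAPITAL ZHE',  "\x{416}" ],
    [ 'GREEK SMALL EPSILON',   "\x{3b5}" ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    my $verdict = $char =~ /\p{IsUpper}/ ? 'uppercase' : 'not uppercase';
    printf "%-22s %s\n", $name, $verdict;
}
```

A character class such as [A-Z] would recognize only the first sample; the property test recognizes the capital letters in all three scripts.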

Table 9.10 lists Perl's composite character classes. If the p in \p is capitalized, the meaning is negated; so, for example, \p{IsASCII} represents an ASCII character, whereas \P{IsASCII} represents a non-ASCII character.

Table 9.10. utf8 Composite Character Classes
utf8 Property    Meaning
\p{IsASCII}      ASCII character
\p{IsCntrl}      Control character
\p{IsDigit}      A decimal digit (0 through 9 in ASCII; also the digits of other scripts)
\p{IsGraph}      Alphanumeric or punctuation character
\p{IsLower}      Lowercase letter
\p{IsPrint}      Alphanumeric, punctuation character, or space
\p{IsPunct}      Any punctuation character
\p{IsSpace}      Whitespace character
\p{IsUpper}      Uppercase letter
\p{IsWord}       Alphanumeric word character or underscore
\p{IsXDigit}     Any hexadecimal digit
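The following sketch shows how a few of these classes might be used to classify individual characters. The helper classes_of and the sample characters are hypothetical names introduced for illustration, and only a subset of the classes in the table is tested.

```perl
use strict;
use warnings;

# A subset of the composite classes, each paired with a compiled pattern.
my @classes = (
    [ IsDigit => qr/\p{IsDigit}/ ],
    [ IsLower => qr/\p{IsLower}/ ],
    [ IsUpper => qr/\p{IsUpper}/ ],
    [ IsPunct => qr/\p{IsPunct}/ ],
    [ IsSpace => qr/\p{IsSpace}/ ],
    [ IsCntrl => qr/\p{IsCntrl}/ ],
);

# Hypothetical helper: return the names of every class that matches $char.
sub classes_of {
    my ($char) = @_;
    return map { $_->[0] } grep { $char =~ $_->[1] } @classes;
}

for my $char ("7", "q", "\x{395}", "!", "\t") {
    printf "U+%04X: %s\n", ord($char), join(', ', classes_of($char));
}
```

A single character can belong to more than one class; a tab, for instance, is both whitespace and a control character.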


Example 9.52.

1  use utf8;
2  $chr = 11;
3  print "$chr is a digit.\n" if $chr =~ /\p{IsDigit}/;
4  $chr = "junk";
5  print "$chr is not a digit.\n" if $chr =~ /\P{IsDigit}/;
6  print "$chr is not a control character.\n"
       if $chr =~ /\P{IsCntrl}/;

(Output)
3   11 is a digit.
5   junk is not a digit.
6   junk is not a control character.

Explanation

  1. The utf8 pragma is used to turn on the Unicode settings.

  2. Scalar $chr is assigned a number.

  3. The Perl Unicode property IsDigit is used to check for a digit. For ASCII text, this is the same as using [0-9].

  4. Scalar $chr is assigned the string junk.

  5. The \p is now \P, which negates the property so that it matches any character that is not a digit, the same as using [^0-9] for ASCII text. Since junk contains characters that are not digits, the condition is true.

  6. The \P{IsCntrl} property matches any character that is not a control character. Since junk contains such characters, the condition is true.
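The patterns in the example are unanchored, so they succeed as soon as one character in the string qualifies. The sketch below, which uses sample strings chosen for illustration, anchors the patterns to test whole strings and shows where \p{IsDigit} and [0-9] part ways: the property also accepts decimal digits from other scripts.

```perl
use strict;
use warnings;

my @texts = (
    "2024",                             # ASCII digits
    "\x{662}\x{660}\x{662}\x{664}",     # Arabic-Indic digits 2 0 2 4
    "20a4",                             # digits mixed with a letter
);

for my $text (@texts) {
    my $all_unicode = $text =~ /^\p{IsDigit}+\z/ ? 'yes' : 'no';
    my $all_ascii   = $text =~ /^[0-9]+\z/       ? 'yes' : 'no';
    my $non_digit   = $text =~ /\P{IsDigit}/     ? 'yes' : 'no';

    # Print code points rather than raw characters to keep the output ASCII.
    my $shown = join ' ', map { sprintf 'U+%04X', ord } split //, $text;
    print "$shown\n";
    print "  all \\p{IsDigit}: $all_unicode  all [0-9]: $all_ascii",
          "  contains a non-digit: $non_digit\n";
}
```

When only the ten ASCII digits should be accepted, the explicit class [0-9] is the safer choice.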
