unicode - Perl pragma ..
use feature 'say'; # the examples below use say
use unicode 'UTF-8';
use warnings FATAL => 'unicode'; # recommended
my $octets = "\xE2\x98\xBA"; # U+263A WHITE SMILING FACE
my $string = unicode::decode($octets);
say unicode::encoding; # UTF-8
# Lexically scoped encoding
{
    use unicode 'UTF-16';
    say unicode::encoding; # UTF-16
    $octets = unicode::encode($string);
}
# Dynamically scoped encoding
{
    package Foo;
    use unicode;
    sub process {
        my ($class, $string) = @_; # called as a class method
        say "encoding string to " . unicode::encoding;
        return unicode::encode($string);
    }
    package Bar;
    use unicode 'UTF-32';
    sub process {
        my ($class, $string) = @_; # called as a class method
        return Foo->process($string);
    }
}
$octets = Foo->process($string); # UTF-8
$octets = Bar->process($string); # UTF-32
# Explicit encoding
$octets = unicode::encode($string, 'UTF-16');
$string = unicode::decode($octets, 'UTF-16');
# Well-formed?
say "String is a well-formed Unicode string"
    if unicode::valid($string);
say unicode::encode($string);
Returns a decoded representation of $octets in $encoding as a character string.
Returns an encoded representation of $string in $encoding as an octet string.
Returns the canonical encoding name.
Returns a normalized representation of $string in Unicode normalization $form as a character string.
Valid normalization forms are NFC, NFD, NFKD and NFKC.
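To illustrate how the four normalization forms differ, here is a minimal sketch that uses the core Unicode::Normalize module rather than this pragma. The sample string is an assumption chosen for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD);

# Sample input (an assumption for this sketch):
# U+00E9 LATIN SMALL LETTER E WITH ACUTE, U+FB01 LATIN SMALL LIGATURE FI
my $string = "\x{E9}\x{FB01}";

my %normalized = (
    NFC  => NFC($string),
    NFD  => NFD($string),
    NFKC => NFKC($string),
    NFKD => NFKD($string),
);

# Print the code points produced by each form
for my $form (qw(NFC NFD NFKC NFKD)) {
    my $hex = join ' ',
        map { sprintf 'U+%04X', ord($_) } split //, $normalized{$form};
    print "$form: $hex\n";
}
```

The canonical forms (NFC, NFD) keep the ligature intact, while the compatibility forms (NFKC, NFKD) replace it with the letters f and i.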
Determines whether or not the supplied $string is a well-formed Unicode string.
A well-formed Unicode string consists of code points in the ranges U+0000..U+D7FF and U+E000..U+10FFFF, excluding the noncharacters U+nFFFE and U+nFFFF (where n is from 0 to 10 hexadecimal) and U+FDD0..U+FDEF.
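The well-formedness rule described above can be sketched in plain Perl as a range check on each code point. The function name is_well_formed is hypothetical and is not part of this pragma; unicode::valid may well be implemented differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the pragma: returns 1 if every code
# point in $string is a Unicode scalar value that is not a noncharacter,
# and 0 otherwise.
sub is_well_formed {
    my ($string) = @_;
    for my $char (split //, $string) {
        my $cp = ord($char);
        return 0 if $cp > 0x10FFFF;                  # beyond the codespace
        return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;  # surrogate code points
        return 0 if $cp >= 0xFDD0 && $cp <= 0xFDEF;  # noncharacter block
        return 0 if ($cp & 0xFFFE) == 0xFFFE;        # U+nFFFE and U+nFFFF
    }
    return 1;
}

print is_well_formed("\x{263A}")   ? "well-formed\n" : "ill-formed\n";
print is_well_formed(chr(0xFFFE))  ? "well-formed\n" : "ill-formed\n";
print is_well_formed(chr(0xD800))  ? "well-formed\n" : "ill-formed\n";
```

The mask test relies on the fact that the last two code points of every plane end in FFFE and FFFF, so a single bitwise comparison covers all 17 planes.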
- UTF-8
- UTF-16
- UTF-16LE
- UTF-16BE
- UTF-32
- UTF-32LE
- UTF-32BE
- UCS-2
- UCS-2BE
- UCS-2LE
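To give a feel for how these encodings differ at the octet level, here is a small sketch that uses the core Encode module instead of this pragma. It encodes U+263A WHITE SMILING FACE in several of the encodings listed above. The names passed to Encode::encode are Encode's own names and are assumed to correspond to the entries in this list.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "\x{263A}";    # U+263A WHITE SMILING FACE

# Byte-order-explicit encodings only, so no byte order mark is emitted
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    my $octets = encode($encoding, $string);
    my $hex    = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
    printf "%-8s %s\n", $encoding, $hex;
}
```

The same character occupies three octets in UTF-8, two in UTF-16 and four in UTF-32, and the BE and LE variants differ only in octet order.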
- Can't decode %s of type %s
  (W unicode)
- Can't decode a wide character string
  (W unicode)
- Can't decode ill-formed %s octet sequence <%s>
  (W unicode)
- Can't decode incomplete %s code unit <%s>
  (W unicode)
- Can't encode %s of type %s
  (W unicode)
- Can't interpret noncharacter code point U+%.4X as an abstract character
  (W unicode)
- Can't map code point U+%.4X to %s encoding
  (W unicode) Code point U+%.4X can't be represented in the codespace of the %s encoding.
- Can't map surrogate code point U+%.4X to %s encoding
  (W unicode) Surrogate code points are designated only for surrogate code units in the UTF-16 character encoding form. Surrogates consist of code points in the range U+D800 to U+DFFF.
- Can't map noncharacter code point U+%.4X to %s encoding
  (W unicode) Noncharacters are code points that are permanently reserved for internal use and that should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10 hexadecimal) and the values U+FDD0..U+FDEF.
- Can't map restricted code point U-%.8X to %s encoding
  (W unicode) Restricted code points are those in the range U-00110000 to U-7FFFFFFF. See JTC1/SC2/WG2 N 2175, "Proposal to restrict the range of code positions to the values up to U-0010FFFF", and JTC1/SC2/WG2 N 2225, "RESOLUTION M38.6 (Restriction of encoding space)".
- Can't map extended code point %.8X to %s encoding
  (W unicode) Extended code points are those in the range 2^31 to 2^64-1.
- Unknown encoding '%s'
  (F)
- Unknown Unicode normalization form '%s'
  (F)
- Usage: unicode::%s
  (F) Subroutine %s was called with an invalid number of arguments.
- Use of uninitialized value %s
  (W uninitialized) Please see perldiag.
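When the (W unicode) warnings are made fatal, as recommended at the top of this document, callers need to trap them. As a rough analogy that runs without this pragma, the sketch below shows the trap-and-report pattern using the core Encode module with Encode::FB_CROAK. The helper name try_decode is hypothetical, and the error text it reports comes from Encode, not from the list of diagnostics above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $octets as UTF-8. Returns ($string, undef)
# on success and (undef, $error) when the input is ill-formed.
sub try_decode {
    my ($octets) = @_;
    my $copy   = $octets;    # Encode may modify its input when CHECK is set
    my $string = eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
    };
    return defined $string
        ? ($string, undef)
        : (undef, $@ || "unknown error\n");
}

# One well-formed and one ill-formed octet sequence
for my $octets ("\xE2\x98\xBA", "\xE2\x98\xFF") {
    my ($string, $error) = try_decode($octets);
    if (defined $string) {
        printf "decoded %d character(s)\n", length $string;
    }
    else {
        print "decode failed: $error";
    }
}
```

Returning the error alongside the result, rather than leaving it in $@, avoids losing it if another eval runs before the caller inspects it.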