I'm going to ignore the `j`, `J`, `f`, `F`, `d`, `D`, `p`, `P`, `u`, `w` template characters.

I'm running this on a system with x86_64 architecture (little endian) with perl-5.38-2, `sizeof(int) == 4`.
`pack TEMPLATE, LIST` takes `LIST` and packs it into a string according to `TEMPLATE`. E.g. `pack('ac', 'a', 1)` returns `"a\x01"`. There are two template characters here: `a` and `c`. Each template character tells `pack()` what to do with the next argument: `a` takes one character from the first argument (`'a'`) and puts it into the resulting string; `c` takes the next argument (the 8-bit signed integer `1`) and adds `chr(1)` to the resulting string.
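As a minimal sketch of the example above, the result can be inspected character by character. The helper name `hex_dump` is my own and not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): show each character of a string
# as two hex digits, separated by spaces.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $string);
}

# 'a' takes one character from the first argument,
# 'c' takes the next argument as a signed 8-bit integer.
my $packed = pack('ac', 'a', 1);

print length($packed), "\n";      # prints 2
print hex_dump($packed), "\n";    # prints "61 01"
```

The first character is the letter `a` (0x61), the second is `chr(1)`.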
Template characters differ in what values they take:

- `a`, `A`, `Z`, `b`, `B`, `h`, `H` (string template characters) take characters from a string (`pack('a', 'a')`)
- `c`, `C`, `W`, `s`, `S`, `l`, `L`, `q`, `Q`, `i`, `I`, `n`, `N`, `v`, `V`, `U` (integer template characters) take integers (`pack('c', 1)`)
- `x`, `X`, `@` take nothing
- `.` takes an integer (`pack('.', 1)`), but in what it does it's rather similar to the previous group
Template characters can be followed by a repeat count, which tells `pack()` how many values a template character takes:

- string template characters take values from one argument (`pack('a2', 'aa') == 'aa'`), but each next template character takes values from the next argument (`pack('aa', 'a', 'a') == "aa"`)
- for integer template characters each argument is a value (`pack('c2', 1, 1) == "\x01\x01"`)

Some even say that the repeat count is a length (string length) in the case of string template characters.
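A small sketch of the difference, with variable names of my own choosing: a repeat count on a string template character reads more characters from the same argument, while a repeat count on an integer template character consumes more arguments.

```perl
use strict;
use warnings;

# A string template character with a repeat count reads from ONE argument.
my $one_arg = pack('a2', 'ab');          # "ab"

# Repeating the template character consumes one argument per character.
my $two_args = pack('aa', 'a', 'b');     # "ab"

# An integer template character with a repeat count consumes
# one argument per value.
my $ints = pack('c2', 1, 2);             # "\x01\x02"

print $one_arg eq $two_args ? "same\n" : "different\n";    # prints "same"
print length($ints), "\n";                                  # prints 2
```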
Then:

- for `a`, `A`, `Z` a value is a character
- for `b`, `B`, `h`, `H` a value is a string representation of a digit
- `a`, `Z` pad values (if there are not enough characters) with `\x00` (`pack('a2', 'a') == "a\x00"`), `A` pads with spaces (`pack('A2', 'a') == 'a '`)
- `Z` is like `a`, but takes one character less and adds `\x00` (`pack('Z2', 'a') == "a\x00"`)
- in the case of `b`, `B` a digit is `0..1`, for `h`, `H` it's `0..f`
- `b`, `B` fill the resulting string with bits, `h`, `H` with nybbles
- `b`, `h` start with LSB/low nybble (`pack('b', '1') == "\x01"`), `B`, `H` with MSB/high nybble (`pack('B', '1') == "\x80"`)
- `c` takes an integer in the range `-128..127`, `C` in the range `0..255`, `W` in the Unicode range (`0..0x10ffff`); each argument produces one character in the resulting string (`pack('c', 1) == "\x01"`)
- `s` takes an integer in the range `-0x8000..0x7fff` and produces 2 characters in the system's native byte order (`pack('s', 1) == "\x01\x00"` in the case of a little endian system)
- `S` takes an integer in the range `0..0xffff` and produces 2 characters in the system's native byte order
- `l` takes an integer in the range `-0x8000_0000..0x7fff_ffff` and produces 4 characters in the system's native byte order (`pack('l', 1) == "\x01\x00\x00\x00"` in the case of a little endian system)
- `L` takes an integer in the range `0..0xffff_ffff` and produces 4 characters in the system's native byte order
- `q` takes an integer in the range `-0x8000_0000_0000_0000..0x7fff_ffff_ffff_ffff` and produces 8 characters in the system's native byte order (`pack('q', 1) == "\x01\x00\x00\x00\x00\x00\x00\x00"` in the case of a little endian system)
- `Q` takes an integer in the range `0..0xffff_ffff_ffff_ffff` and produces 8 characters in the system's native byte order
- the range and the number of produced characters are system-dependent for `i` and `I`; in my case they're `-0x8000_0000..0x7fff_ffff` with 4 characters (same as `l`) and `0..0xffff_ffff` with 4 characters (same as `L`) respectively
- `n` takes an integer in the range `0..0xffff` and produces 2 characters in the big-endian byte order (`pack('n', 1) == "\x00\x01"`)
- `N` takes an integer in the range `0..0xffff_ffff` and produces 4 characters in the big-endian byte order (`pack('N', 1) == "\x00\x00\x00\x01"`)
- `v` takes an integer in the range `0..0xffff` and produces 2 characters in the little-endian byte order (same as `S` in the case of a little endian system)
- `V` takes an integer in the range `0..0xffff_ffff` and produces 4 characters in the little-endian byte order (same as `L` in the case of a little endian system)
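The padding, bit-order, nybble-order and byte-order rules from the list above can be sketched as follows. The variable names are mine, and only the fixed-order `n`/`v` are used for byte order, so the results should not depend on the machine.

```perl
use strict;
use warnings;

# Padding: a and Z pad with nulls, A pads with spaces.
my $a_pad = pack('a3', 'x');      # "x\x00\x00"
my $A_pad = pack('A3', 'x');      # "x  "
# Z always keeps the last position for the terminating null.
my $Z_cut = pack('Z3', 'xyz');    # "xy\x00"

# Bit order: b fills a byte starting from the least significant bit,
# B starting from the most significant bit.
my $lsb_first = pack('b8', '10000000');    # "\x01"
my $msb_first = pack('B8', '10000000');    # "\x80"

# Nybble order: h puts the first hex digit into the low nybble,
# H puts it into the high nybble.
my $low_first  = pack('h2', '1f');    # "\xf1"
my $high_first = pack('H2', '1f');    # "\x1f"

# Byte order: n is always big-endian, v is always little-endian.
my $big    = pack('n', 0x0102);    # "\x01\x02"
my $little = pack('v', 0x0102);    # "\x02\x01"

# Print every result as a hex string for inspection.
print unpack('H*', $_), "\n"
    for $a_pad, $A_pad, $Z_cut, $lsb_first, $msb_first,
        $low_first, $high_first, $big, $little;
```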
To put it briefly:

- `a`: a character of a null-padded string
- `A`: a character of a space-padded string
- `Z`: a character of a null-terminated string (null-padded)
- `b`: a bit of a string (LSB first)
- `B`: a bit of a string (MSB first)
- `h`: a nybble of a string (low nybble first)
- `H`: a nybble of a string (high nybble first)
- `c`: `signed char` (8-bit)
- `C`: `unsigned char` (8-bit)
- `W`: a Unicode character
- `s`: `signed short` (16-bit)
- `S`: `unsigned short` (16-bit)
- `l`: `signed long` (32-bit)
- `L`: `unsigned long` (32-bit)
- `q`: `signed quad` (64-bit)
- `Q`: `unsigned quad` (64-bit)
- `i`: `signed int` (native)
- `I`: `unsigned int` (native)
- `n`: `unsigned short` (16-bit, big endian)
- `N`: `unsigned long` (32-bit, big endian)
- `v`: `unsigned short` (16-bit, little endian)
- `V`: `unsigned long` (32-bit, little endian)
- `U`: UTF-8 representation of a Unicode code point
By default `pack()` operates in `C0` (character) mode. In this mode values are added to the resulting string as characters:

- `pack('a', 'a') == 'a'`
- `pack('A', 'a') == 'a'`
- `pack('Z2', 'a') == "a\x00"`
- `pack('b8', '11111111') == "\xff"` (a value is added as soon as there's a byte)
- `pack('B8', '11111111') == "\xff"`
- `pack('h2', '11') == "\x11"`
- `pack('H2', '11') == "\x11"`
- `pack('c', 1) == "\x01"`
- `pack('C', 1) == "\x01"`
- `pack('W', 1) == "\x01"`
- `pack('s', 1) == "\x01\x00"` (each byte is added as a separate character)
- `pack('S', 1) == "\x01\x00"`
- `pack('l', 1) == "\x01\x00\x00\x00"`
- `pack('L', 1) == "\x01\x00\x00\x00"`
- `pack('q', 1) == "\x01\x00\x00\x00\x00\x00\x00\x00"`
- `pack('Q', 1) == "\x01\x00\x00\x00\x00\x00\x00\x00"`
- `pack('i', 1) == "\x01\x00\x00\x00"` (`i == l` in my case)
- `pack('I', 1) == "\x01\x00\x00\x00"` (`I == L` in my case)
- `pack('n', 1) == "\x00\x01"`
- `pack('N', 1) == "\x00\x00\x00\x01"`
- `pack('v', 1) == "\x01\x00"` (`v == S` in the case of a little endian system)
- `pack('V', 1) == "\x01\x00\x00\x00"` (`V == L` in the case of a little endian system)
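Several of the results above depend on the machine. A hedged sketch for checking your own system: it compares the native `S` with the fixed-order `v` and `n`, and the length produced by `i` with `intsize` from the core `Config` module. The variable names are my own.

```perl
use strict;
use warnings;
use Config;

# Size of a native int according to the perl build configuration.
my $int_size = $Config{intsize};

# Number of characters that i actually produces on this system.
my $i_length = length pack('i', 1);

# Detect the native byte order by comparing S with the fixed-order
# template characters v (little-endian) and n (big-endian).
my $native = pack('S', 1);
my $order = $native eq pack('v', 1) ? 'little endian'
          : $native eq pack('n', 1) ? 'big endian'
          :                           'unknown';

print "int size: $int_size, i produces $i_length characters\n";
print "native byte order: $order\n";
```

On the system described at the top (x86_64, `sizeof(int) == 4`) this should report 4 characters and little endian.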
In `U0` (UTF-8 byte) mode, which is turned on with, well, `U0` (`pack('U0...', ...)`), values are added to a sequence of bytes. Before returning from `pack()` the sequence of bytes is reinterpreted as, or simply becomes, the resulting string (or so it looks). As such, the resulting sequence of bytes should be valid UTF-8. Also, in this mode the ranges of `a`, `A`, `Z` and `W` are reduced to `0..0xff`:

- `pack('U0a2', "\xdf\xbf") == "\x{7ff}"` (`"\xdf\xbf"` is the UTF-8 representation of `"\x{7ff}"`)
- `pack('U0A2', "\xdf\xbf") == "\x{7ff}"`
- `pack('U0Z3', "\xdf\xbf") == "\x{7ff}\x00"`
- `pack('U0b16', '1111' . '1011' . '1111' . '1101') == "\x{7ff}"` (what `b` takes in hex is `'fbfd'`)
- `pack('U0B16', '1101' . '1111' . '1011' . '1111') == "\x{7ff}"` (what `B` takes in hex is `'dfbf'`)
- `pack('U0h4', 'fdfb') == "\x{7ff}"`
- `pack('U0H4', 'dfbf') == "\x{7ff}"`
- `pack('U0c2', 0xdf - 0x100, 0xbf - 0x100) == "\x{7ff}"` (the `c` range is `-0x80..0x7f`, so we need to adjust the values)
- `pack('U0C2', 0xdf, 0xbf) == "\x{7ff}"`
- `pack('U0W2', 0xdf, 0xbf) == "\x{7ff}"`
- `pack('U0s', 0xbfdf - 0x10000) == "\x{7ff}"`
- `pack('U0S', 0xbfdf) == "\x{7ff}"`
- `pack('U0l', 0xbfdf) == "\x{7ff}\x00\x00"`
- `pack('U0L', 0xbfdf) == "\x{7ff}\x00\x00"`
- `pack('U0q', 0xbfdf) == "\x{7ff}\x00\x00\x00\x00\x00\x00"`
- `pack('U0Q', 0xbfdf) == "\x{7ff}\x00\x00\x00\x00\x00\x00"`
- `pack('U0i', 0xbfdf) == "\x{7ff}\x00\x00"` (`i == l` in my case)
- `pack('U0I', 0xbfdf) == "\x{7ff}\x00\x00"` (`I == L` in my case)
- `pack('U0n', 0xdfbf) == "\x{7ff}"`
- `pack('U0N', 0xdfbf) == "\x00\x00\x{7ff}"`
- `pack('U0v', 0xbfdf) == "\x{7ff}"` (`v == S` in the case of a little endian system)
- `pack('U0V', 0xbfdf) == "\x{7ff}\x00\x00"` (`V == L` in the case of a little endian system)
- `pack('U0U', 0x7ff) == "\x{7ff}"` (`U` adds to the sequence of bytes the UTF-8 representation of its argument)

Do note that the sequence of bytes doesn't have to be valid UTF-8 at any intermediate step (`pack('U0aXac', "\x80", "\xdf", 0xbf - 0x100) == "\x{7ff}"`; `X` erases the last byte).
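To make the difference between the two modes visible, here is a short sketch that packs the same two values with and without `U0` and then looks at the length and the first code point of each result. The variable names are mine.

```perl
use strict;
use warnings;

# U0 mode: the two values are the UTF-8 bytes of a single character.
my $u0 = pack('U0C2', 0xdf, 0xbf);

# C0 mode (the default): each value becomes a character of its own.
my $c0 = pack('C2', 0xdf, 0xbf);

printf "U0: length %d, first code point 0x%x\n", length($u0), ord($u0);
# prints "U0: length 1, first code point 0x7ff"
printf "C0: length %d, first code point 0x%x\n", length($c0), ord($c0);
# prints "C0: length 2, first code point 0xdf"
```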
In addition to turning on the `U0` mode explicitly, it's turned on implicitly when `TEMPLATE` starts with `U` (`pack('Ua2', 0x7ff, "\xdf\xbf") == "\x{7ff}\x{7ff}"`). In `C0` mode `U` produces the UTF-8 representation of its argument (`pack('C0U', 0x7ff) == "\xdf\xbf"`, `pack('aU', "\x80", 0x7ff) == "\x80\xdf\xbf"`). You can always switch the mode midway explicitly (`pack('...U0...C0...', ...)`).

Or in other words, generally `W` and `U` take a code point and produce a character (`pack('W', 1) == "\x01"`, `pack('U', 1) == "\x01"`). But in `U0` mode `W` takes the UTF-8 representation and produces a character (`pack('U0W2', 0xdf, 0xbf) == "\x{7ff}"`), and in `C0` mode `U` takes a code point and produces its UTF-8 representation (`pack('C0U', 0x7ff) == "\xdf\xbf"`).
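A sketch of how `U` behaves depending on the mode, using the built-in `utf8::decode` to show that the `C0` result is just the UTF-8 encoding of the `U0` result. The variable names are my own.

```perl
use strict;
use warnings;

# TEMPLATE starts with U, so U0 mode is turned on implicitly:
# the result is one character.
my $u_implicit = pack('U', 0x7ff);      # "\x{7ff}"

# C0 forces character mode: the UTF-8 bytes become two characters.
my $u_in_c0 = pack('C0U', 0x7ff);       # "\xdf\xbf"

# W takes a code point and produces one character in the default mode.
my $w_default = pack('W', 0x7ff);       # "\x{7ff}"

# Decoding the C0 result as UTF-8 gives back the single character.
my $decoded = $u_in_c0;
utf8::decode($decoded);

printf "%d %d %d\n", length($u_implicit), length($u_in_c0), length($w_default);
# prints "1 2 1"
print $decoded eq $u_implicit ? "round trip ok\n" : "mismatch\n";
# prints "round trip ok"
```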
`x` produces a null (`pack('x') == "\x00"`).

`X` takes a step back, removing the characters in the process (`pack('aX', 'a') == ''`).

`@` moves the current position in the resulting string, truncating or null-filling it in the process (`pack('a@0', 'a') == ''`, `pack('@1') == "\x00"`). The repeat count is an absolute position counted from the beginning of the resulting string.

`.` is like `@`, except that it takes the position from an argument rather than from a repeat count (`pack('a.', 'a', 0) == ''`, `pack('.', 1) == "\x00"`).

Template characters can be grouped with parentheses. In this case the positions given to `@`/`.` are counted from the start of the innermost group (`pack('a(a@0)', 'a', 'a') == 'a'`).
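The positioning template characters can be sketched together like this. The examples are slight variations on the ones in the text, and the variable names are mine.

```perl
use strict;
use warnings;

my $null = pack('x');                  # "\x00": a single null

# X steps back over the last character, removing it.
my $back = pack('a2X', 'ab');          # "a"

# @ with a position before the current one truncates ...
my $truncated = pack('a3@1', 'abc');   # "a"
# ... and with a position after the current one null-fills.
my $filled = pack('a@3', 'a');         # "a\x00\x00"

# . does the same, but reads the position from the argument list.
my $by_arg = pack('a3.', 'abc', 1);    # "a"

# Inside a group, @ counts from the start of the group:
# 'bc' is cut back to one character after the group start.
my $in_group = pack('a(a2@1)', 'a', 'bc');    # "ab"

# Print every result as a hex string for inspection.
print unpack('H*', $_), "\n"
    for $null, $back, $truncated, $filled, $by_arg, $in_group;
```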
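To add to the brief list:

- `x`: a null
- `X`: a step back (removes a character)
- `@`: move to the absolute position given by the repeat count (truncating or null-filling)
- `.`: like `@`, but the position is taken from an argument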
$ docker run --rm -itv "$PWD":/host alpine:3.20
/ # apk add perl perl-test2-suite perl-utils
/ # for f in host/.*.pl host/*.pl; do prove "$f" || break; done
perlpacktut
pack
unpack
Pack/Unpack Tutorial
What follows are my experiments: