When parsing things like binary file format headers, you generally need to know three things for each data field:
- how many, and what kind of, bytes to gobble
(e.g. input stream -->Buf:0x<0b 00>
) - how to unpack those bytes into a number, or other low-level type
(e.g.Buf:0x<0b 00>
-->11
) - how to transform that number into meaningful data
(e.g.11
-->"v1.1"
)
Of course, those three things are not necessarily orthogonal - for example the unpack rule often already implies the number of bytes to gobble, and the boundary between the 'unpack' and 'transform' steps is fuzzy. Also, constraints become applicable during the 'gobble' step, but are often more conveniently expressed in terms of the post-unpack value.
The traditional Perl way to do such parsing, is using imperative
while
/if
/else
logic that gobbles one or more fields with read()
, unpacks
them with unpack()
, and transforms them with custom Perl code. This approach
is flexible, but often tedious.
Enter grammars. Perl 6 grammars support all the parsing logic you could hope for, but unfortunately they operate only on opaque Unicode string characters - not on bytes. The design documents speculate that operating on the byte level will at some point be supported, but jnthn++ clarified that this feature won't make it into the first stable release of Perl 6.
Nonetheless, I think it is a good time to start thinking about this now, in particular with these goals:
-
Determine what the "optionally operate on the byte level" feature would actually entail, and if the current Perl 6 regex & grammar stuff is flexible enough to allow adding this feature in a future Perl 6.x release without major upheaval.
-
Determine what kind of additional features and syntactic sugar would be needed beyond that bare-bones support, to minimize redundancy between the gobble + unpack + transform steps and to make grammar-based binary parsing pleasant enough for actual use.
-
Determine how much of the above really needs Perl 6 core additions, and how much of it could be provided by modules. (Or at least, whether modules could provide alternative interim solutions.)
This post is meant to help start that discovery process.
- Table of Contents
An ICO (Windows icon) file can contain multiple bitmap images which portray the same logical icon at different sizes / bit depths. The ICO file header has an entry for each contained image (specifying the meta-data needed to select and decode it); the actual bitmap data of the images starts after the file header.
Suppose that we want to read the header of such a file, and represent each of its image entries as an object of the following class:
class ICO::ImageEntry {
has $.width; # between 1 and 256
has $.heigth; # between 1 and 256
has $.palette-size; # between 1 and 255, or undefined
has $.color-planes; # between 1 and 4
has $.bit-depth; # 1, 4, 8, 16, 24, 32, or undefined
has $.data-size;
has $.data-offset;
}
We shall do this by parsing the binary file using a grammar:
my Buf $raw = slurp "someicon.ico", :bin;
my ICO::ImageEntry @entries = ICO::Header::Grammar.subparse($raw).ast;
But what would that ICO::Header::Grammar
grammar look like?
If the speculated :bytes
modifier
were implemented (details below), one could probably use that together with the
built-in Buf.unpack
method to write the grammar like this:
grammar ICO::Header::Grammar {
token TOP {
:bytes
\x00 \x00 \x01 \x00 # signature
<image-count>
<image-entry> ** { $<image_count>.made }
{ make $<image-entry>».made; }s
}
token image-count { :bytes . . { make $/.Buf.unpack('S<') } }
token image-entry {
:bytes
<image-width> <image-height>
<palette-size> [ \x00 | \xff ] <color-planes> <bit-depth>
<data-size> <data-offset>
{ make ICO::ImageEntry.new: |$/.hash.deepmap(*.made); }
}
token image-width {
:bytes .
{ make $/.Buf.unpack('C') || 256 }
}
token image-height {
:bytes .
{ make $/.Buf.unpack('C') || 256 }
}
token palette-size {
:bytes .
{ make $/.Buf.unpack('C') || Any }
}
token color-planes {
:bytes <[\x00..\x04]> \x00
{ make $/.Buf.unpack('S<') || 1 }
}
token bit-depth {
:bytes [\x01 | \x04 | \x08 | \x10 | \x18 | \x20] \x00
{ make $/.Buf.unpack('S<') || Any }
}
token data-size {
:bytes <!before \x00 \x00 \x00 \x00> . . . .
{ make $/.Buf.unpack('L<') }
}
token data-offset {
:bytes . . . .
{ make $/.Buf.unpack('L<') }
}
}
Note: I made it a strictly-validating grammar, to demonstrate how various value constraints can be expressed.
Imagine a Grammar::Binary role (possibly provided by a CPAN module) which you can mix into your grammar, to provide a little more convenience. I'll expand on what exactly it would do in a dedicated section below; for now, a demonstration using a rewritten version of the grammar from the previous section:
grammar ICO::Header::Grammar does Grammar::Binary {
token TOP {
\x00 \x00 \x01 \x00 # signature
<image-count=.uint16l(1..*)>
<image-entry> ** { $<image_count>.made }
{ make $<image-entry>».made; }
}
token image-entry {
$<width> = <.uint8(1..255, 0=>256)>
$<height> = <.uint8(1..255, 0=>256)>
$<palette-size> = <.uint8(1..255, 0=>Nil)>
[ \x00 | \xff ]
$<color-planes> = <.uint16l(0=>1, 1..4)>
$<bit-depth> = <.uint16l(0=>Nil, 1, 4, 8, 16, 24, 32)>
$<data-size> = <.uint32l(1..*)>
$<data-offset> = <.uint32l>
{ make ICO::ImageEntry.new: |$/.hash.deepmap(*.made); }
}
}
Definitely shorter, and I think also easier to read and maintain.
ZIP archives are structured quite differently from ICO icons. The central index of contained files is near the end of the stream, not at the beginning, and within the stream each contained file is directly prefixed by a "local file header" which could be represented by an object of the following class:
class ZIP::LocalHeader {
has $.format-version; # e.g. "1.0"
has $.file-attr-type; # e.g. ZIP::AttrType::MS-DOS
has $.flags; # e.g. [?«<0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0>]
has $.compression; # e.g. ZIP::Compr::Deflate
has $.modification-date; # e.g. DateTime.new(...)
has $.crc-32; # e.g. Buf.new(...)
has $.compressed-size; # e.g. 283618
has $.uncompressed-size; # e.g. 357323
has $.filename; # e.g. "report.pdf"
has $.extra-header; # e.g. Buf.new(...)
}
Unlike in ICO::ImageEntry
, the fields aren't all numbers: They're a mix of
numbers, strings, buffers, a date, an array, and enum types.
For the sake of simplicity, let's say we only want to parse the first embedded file's local header, which is found right at the start of the zip archive:
my Buf $raw = slurp "archive.zip", :bin;
my ZIP::LocalHeader $firstheader = ZIP::LocalHeader::Grammar.subparse($raw).ast;
Assuming the same Grammar::Binary role as before, the grammar could look like this:
enum ZIP::AttrType <
MS-DOS Amiga OpenVMS UNIX VM_CMS
Atari_ST OS_2 Macintosh Z-System CP_M
Windows_NTFS MVS VSE Acorn Risc VFAT
alt_MVS BeOS Tandem OS_400 OS_X
>;
enum ZIP::Compr (
Uncompressed => 0, Shrink => 1, Reduce1 => 2, Reduce2 => 3,
Reduce3 => 4, Reduce4 => 5, Implode => 6, Tokenize => 7,
Deflate => 8, Deflate64 => 9, ImplodeX => 10, Bzip2 => 12,
Lzma => 14, Terse => 18, Lz77 => 19, Wavpack => 97,
Ppmd => 98
);
grammar ZIP::LocalHeader::Grammar does Grammar::Binary {
token TOP {
PK \x03 \x04 # signature
<format-version=.uint8(Any() => { "{Int $_/10}.{$_ % 10}" })>
<file-attr-type=.uint8(Any() => ZIP::AttrType>
<flags=.bitfield16l)>
<compression=.uint16l(Any() => ZIP::Compr>
<modification-date=.uint32l(Any() =>
{ DateTime.new(year => vec($_, 25, 7),
month => vec($_, 21, 4),
day => vec($_, 16, 5),
hour => vec($_, 11, 5),
minute => vec($_, 5, 6),
second => vec($_, 0, 5)) }
)>
<crc-32=.uint32l>
<compressed-size=.uint32l>
<uncompressed-size=.uint32l>
<_filename-size=.uint16l>
<_extra-header-size=.uint16l>
<filename=.str(:bytes($<_filename-size>.made),
:enc($<flags>.made.[11] ?? 'utf8' !! 'cp437'))>
<extra-header=.buf(:bytes($<_extra-header-size>.made))>
{ make ZIP::LocalHeader.new:
|$/.hash.map({ next if .key ~~ /^_/; .key => .value.made }) }
}
}
#| From the binary representation of a given number, return a "substring"
#| of bits as a new number, e.g. vec(150, 1, 4) returns 11:
#| 150 = 0b10010110
#| ^^^^ = 0b1011 = 11
sub vec ($number, $rindex, $bits) {
$number +> $rindex +& (2 ** $bits - 1)
}
Unlike the ICO grammar, this one is mostly about unpacking, and does not do any validating except for matching the signature at the start. Both kinds of use-cases are common, so it's good to have an example of each.
Instead of make
'ing all result values directly in the grammar, sometimes
it's better to factor out any post-processing beyond the immediate unpacking
step into a separate action class. (In fact some people prefer to always do
that).
The question is, how to access the unpacked values in that case? One could do:
grammar ...::Grammar ... {
...
token foo { <uint16(...)> }
}
class ...::Actions {
...
method foo ($/) { make some-transformation( $<uint16>.made ) }
}
But that's cumbersome, requires the action method to know which unpack type the grammar method used, and is inefficient due to the extra named capture that needs to be created and passed around.
It could be written more elegantly if there were a way for the foo
token to
specify that (as far as $/ is concerned) it wants to be <uint16(...)>
, and
not just contain that as a subrule. Off the top of my head, here are some
possibilities for new syntax which could be allowed for this:
$/ = <.uint16(...)>
-- like$<foo> = <.uint16(...)>
in the parent token<goto uint16(...)>
-- akin togoto &sub
in Perl 5<=uint16(...)>
<as uint16(...)>
I think I like the fourth syntax best, so assuming that existed, the grammar and
action class could be written like this (using the same enums and sub vec
as
above):
grammar ZIP::LocalHeader::Grammar does Grammar::Binary {
token TOP {
PK \x03 \x04 # signature
<format-version>
<file-attr-type>
<flags>
<compression>
<modification-date>
<crc-32>
<compressed-size>
<uncompressed-size>
<_filename-size>
<_extra-header-size>
<filename( $<_filename-size>.made, $<flags>.made.[11] )>
<extra-header( $<_extra-header-size>.made )>
}
token format-version { <as uint8> }
token file-attr-type { <as uint8> }
token flags { <as bitfield16l> }
token compression { <as uint16l> }
token modification-date { <as uint32l> }
token crc-32 { <as uint32l> }
token compressed-size { <as uint32l> }
token uncompressed-size { <as uint32l> }
token _filename-size { <as uint16l> }
token _extra-header-size { <as uint16l> }
token filename ($bytes, $flag11) {
<as str(:$bytes, :enc($flag11 ?? 'utf8' !! 'cp437'))>
}
token extra-header ($bytes) {
<as buf(:$bytes)>
}
}
class ZIP::LocalHeader::Actions {
method TOP ($/) {
make ZIP::LocalHeader.new:
|$/.hash.map({ next if .key ~~ /^_/; .key => .value.made })
}
method format-version ($_) {
.make: "{ Int .made / 10 }.{ .made % 10 }"
}
method file-attr-type ($_) {
.make: ZIP::AttrType[.made]
}
method compression ($_) {
.make: ZIP::Compr[.made]
}
method modification-date ($/) {
.make: DateTime.new(year => vec(.made, 25, 7),
month => vec(.made, 21, 4),
day => vec(.made, 16, 5),
hour => vec(.made, 11, 5),
minute => vec(.made, 5, 6),
second => vec(.made, 0, 5))
}
}
Alternatively, maybe there needs to be an way to propose subrule aliases to
proper tokens, so that e.g. the format-version
subrule can be specified via a
simple alias of the form <format-version=.uint8>
inside token TOP
, but still
cause ZIP::LocalHeader::Actions.format-version
to fire.
S05 doesn't go into too much detail on the :bytes modifier that would allow regexes to operate on bytes instead of characters.
I assume that when it is activated for a regex or grammar, it will have the following consequences:
Grammar.parse
andRegex.ACCEPTS
need to be passed aBuf
/Blob
instead of aStr
.
matches a single byte- character literals operate on bytes:
- hex character literals match the bytes with the corresponding unsigned int
values, i.e.
\x00
matches the null byte - ASCII character literals match the bytes with the corresponding ASCII
representation, i.e.
z
matches the same byte as\x7A
- non-ASCII Unicode character literals throw an error when encountered (compile-time, if possible)
- hex character literals match the bytes with the corresponding unsigned int
values, i.e.
- character classes operate on bytes too, e.g. <-[\x00]> matches any byte but the null byte
- The resulting Match objects provide access to matched fragments via .Buf instead of via .Str
Letting a grammar compose this role would:
- Enable the
:bytes
modifier for all the grammar's tokens (as a convenience). - Provide several grammar methods for parsing and unpacking (in one go) various common binary data types.
Which methods to include should be subject to proper research and deliberation; for now, here's a preliminary list of a bunch of methods that seem sensible:
# Cross-platform fixed size integers:
token int8 (**@rule) #= 1 byte, as a signed integer
token int16l (**@rule) #= 2 bytes, as a signed integer (little-endian)
token int16b (**@rule) #= 2 bytes, as a signed integer (big-endian)
token int32l (**@rule) #= 4 bytes, as a signed integer (little-endian)
token int32b (**@rule) #= 4 bytes, as a signed integer (big-endian)
token uint8 (**@rule) #= 1 byte, as an unsigned integer
token uint16l (**@rule) #= 2 bytes, as an unsigned integer (little-endian)
token uint16b (**@rule) #= 2 bytes, as an unsigned integer (big-endian)
token uint32l (**@rule) #= 4 bytes, as an unsigned integer (little-endian)
token uint32b (**@rule) #= 4 bytes, as an unsigned integer (big-endian)
# Stringy values:
token cstr (**@rule, :enc) #= null-terminated string
token str (**@rule, :bytes!, :enc, :trim) #= fixed-sized string
token buf (**@rule, :bytes!) #= fixed-sized buffer
The **@rule
parameter can be used to specify the allowed values (for the
purposes of matching) via one or more smart-matchable templates, e.g.
<uint16l(10..50, 127, * %% 2)>
would only match byte pairs whose
little-endian unsigned integer interpretation is either between 10 and 50,
or exactly 127, or an even number.
As a special case, when Pair
values are passed to **@rule
, only the key
is used as matcher: the value is instead used to transform unpacked values which
matched that particular matcher:
example | behavior |
---|---|
<.uint8(1..255, 0=>256)> |
any byte is matched; it is returned as its unpacked value, except if it's a null byte in which case it is returned as number 256. |
<.uint8(0=>256, Any)> |
same as previous |
<.uint8(0..19 => MyEnumType)> |
only a byte in the range <[\x00..\x13]> is matched; its unpacked value is treated as an index to look up an enum member which to return. |
The following transformation types would be allowed:
Numeric
: use as a direct replacement valueCallable
: pass unpacked value as parameter, and use return valueAssociative
: look up a member using the unpacked value as keyPositional
: look up a member using the unpacked value as index- an enum type: look up a member using the unpacked value as index
Performance may be a concern here; instead of actually invoking smart-match on the unpacked values at runtime, maybe those methods actually want to be macros instead, so that they can transform simple cases (like Range matchers) into grammar code at compile time:
user writes | macro generates |
---|---|
<foo=.uint16l(1..*)> |
|
<foo=.uint16l(0..300)> |
$<foo> = ( [ . \x00 | <[\x00..\x2c]> \x01 ] { make $/.Buf.unpack("S<") } ) |
Of course, all of this is just idle speculation. Maybe an altogether different approach to providing convenience functionality for binary grammars, ends up being preferable.
For maximum convenience in simple unpacking use-cases that don't need a whole
grammar, a quote-like operator could be provided (Maybe called
u//
?), which would behave like m//
except that it:
- Automatically enables
:bytes
. - Automatically makes the
Grammar::Binary
methods available. - Collects any unpacked values into a positional list, and returns that instead of a Match object.
So for example this unpacking operation from the Perl 5 docs:
my @registers = unpack 'v2 (vXXCC)5 v5', $stackframe;
...could be written more readably (albeit also more verbosely) like this:
my @registers = $stackframe ~~ u/ <uint16l> ** 2
[ <uint16l> && <uint8> <uint8> ] ** 5
<uint16l> ** 5 /;
@skids: That does sound like an interesting approach.
Having a declarative way to describe binary structures that could be used for both reading and writing, would definitely be cool.
However, I doubt it could replace the generic grammar-based binary parsing support completely, in particular with respect to the following:
How would the "transform" step be handled? I.e. what would turn the uint16 value
11
into the version stringv1.1
, or the uint32 value904635223
into the datetime2007-02-24, 14:33:06
? I think it's nice to keep the information how to interpret a value, together with the information how to pack/unpack it.Do you think a data type for TLV fields could handle slightly more unusual cases such as the ZIP header shown above, where the a variable-length field is not immediately preceded by its length:
A Grammar can easily handle that, but a rigid data-structure based solution would have trouble including all possible cases of "non-standard" layouts and complexities.
What if, instead of merely unpacking a structure whose precise offset and layout in the input buffer is known, you actually need to search for it in the input buffer, or parse an incompletely known structure, or parse one of multiple different possible structures? In that case, you'd still want a regex/grammar right?
Also, the needs of NativeCall are somewhat different from the needs of parsing binary formats, e.g. data structures tend to be subject to memory-aligned in the former case but not the latter. Your proposed "compact structure" classes would have to be able to handle both cases I guess?
PS: One more thing: Do you envision those "compact structure" classes to be a copy of the data from the input stream, or actually references into it? Think of a case like wanting to modify some header fields of a binary file - would that involve parsing the header into a "struct" class, modifying it, and then manually splicing the struct back into the larger data buffer - or would it update automatically?