This is designed to be a working proposal. Comments/corrections/suggestions are welcome, as the first draft was written fairly hastily. I'm working on doing a rough implementation to play around with, beginning with the Binex proposal, which can be found here. It is not currently a full implementation of this proposal, but progressing rapidly.
Grammars in Raku are awesome, and allow for some truly amazing text parsing.
Unfortunately, they are less than ideal for binary files,
and there is no way to have them support matching objects, both of which would be very useful (image being able to pattern match on an AST!)
This requires writing complex and/or error prone workaround code.
Create an easy-to-use, Raku-ish way to handle grammars that are binary or objecty.
Practically speaking, these are two separate proposals, and will likely involve different optimizations,
but are treated together so that their end-user solutions are as similar as posisble, e.g., saying Grammar is binary
or Grammar is objecty
and then modifying the interpretation of the tokens to a regex-like slang.
A basic proposal binary grammar would look something like this:
grammar UTF-8 is binary[8] {
token TOP { <byte-order-mark>? <utf-8-glyph>* }
token byte-order-mark { xEF xBB xFF }
proto token utf-8-glyph { * }
token utf-8-glyph:single { b0....... }
token utf-8-glyph:double { b110..... b10...... }
token utf-8-glyph:triple { b1110.... b10...... b10...... }
token utf-8-glyph:quadruple { b11110... b10...... b10...... b10......}
proto token utf-8-stream { <byte-order-mark> <utf-8-glyph> * }
}
Where x00
represents a byte in hexademical, o0000
in octal, b00000000
etc. For simplicity's sake, each byte should be written out in full.
Because some grammar definitions may benefit from it, while the default unit would be a byte, it might be useful to base the grammar not on a byte by byte sequence, but rather words of 16, 32, or 64 bits, enabled via parameterization (is binary[16]).
In such cases, an underscore may delineate groups but is otherwise ignored, e.g. 0xffff_ffff
for a 32 bit hex value, although 0xf_f_f_f_f_f_f_f
would be theoretically valid too).
In a binary grammar, strings are considered invalid, either bare or quoted, although they could included via a method that returns a Blob
(similar to a method that returns a string).
Alternatives can be given like in regular Regex, using |
(LTM) or ||
(short circuit).
For character classes, I see two useful ideas:
<[ x00 .. x1f ]>
would match values from 0 to 31.<[ b.......1 ]>
would match odd numbers.<[ b.......1 b00000000]>
would match odd numbers or 0.
The middle one, of course, would seem to be pointless given a bare b.......1
would be valid, but when used as a negator, it could be a fair bit more powerful, where < +[x80 .. xff] -[b.......1]>
would represent all odd upper ASCII values. I think it would be optimal and not particularly complex to allow a construction like o00.0 .. o04.8
and treating it similar to the string range, e.g., 00.0, 00.1 … 00.8, 01.0, 01.2
with the dot preserved as a wildcard in all. An optimization stage can try to determine if there's a compact representation < +[x80 .. xff] -[b.......1]>
becomes b1..._...1
, and if not, fall back to a sequential test.
For use in inline situations, all of the //
syntax would be available but adding on as an option :bin
:
- match:
m:bin:options/find/
- substition:
s:bin:options/find/replace/
- substition (nondestructive):
S:bin:options/find/replace/
- transliteration:
tr:bin:options/swap/swap/options
- transliteration (nondestructive):
TR:bin:options/swap/swap/
One issue that seems odd, but with real world use, would be to allow captures/tokens betwixt bytes/words. In the aforementioned Zelda 3 article, the format would effectively be for us:
grammar is binary[8] {
token TOP { <chunk>+? <end> }
token end { xFF }
token chunk { <header> <data: +$<header><length> > }
token header { b..._..... }
token data($count) { x.. ** {$count} }
}
The catch, however, is how to handle the splitting up of header into the command (first three bits) and the length (latter five bits). I'm not sure what the best syntax to use here would be. No doubt there are other formats where a sub byte item might even be repeated. In this particular case, a work around could be to say
grammar is binary[8] {
token TOP { <chunk>+? <end> }
token end { xFF }
token chunk {
my $*cmd;
my $*length;
<header>
<data: $*cmd, $*length>
}
token header {
b..._..... {
$*cmd = +$¢ +> 5;
$*length = +$¢ +& 31 + 1; # length 0 is 1
}
}
enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4);
multi token data( Copy , $count) { x.. ** {$count} }
multi token data(ByteRept | ByteIncl, $count) { x.. }
multi token data(WordRept | CopyExst, $count) { x.. ** 2 }
}
While that would work, it seems inelegant (and making it impossible to handle a token that ends in the middle of a byte/word). Instead, we'll provide an additional option of X
and Z
, where X
means “bit I don't care about, shove it out of the way” and Z
means “bit I don't care about, but want it zeroed out”.
The &
/&&
conjunctions are not commonly used in string-based grammars, but this could be a great place to use them with regularity. Because you could do (at least if needing to split a single byte):
grammar is binary[8] {
token TOP { <chunk>+? <end> }
token end { xFF }
token chunk {
[<cmd> && <length>]
<data: $<cmd>.head, $<length>.head>
}
token cmd { b..._XXXXX }
token length { bZZZ_..... }
enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4);
multi token data( Copy , $count) { x.. ** {$count} }
multi token data(ByteRept | ByteIncl, $count) { x.. }
multi token data(WordRept | CopyExst, $count) { x.. ** 2 }
}
The open question with this approach is what the match value of should be. To reduce the problem:
token a { <x> & <y> }
token b { <x> <x> }
token c { b....XXXX <x> }
token x { b....XXXX }
token y { bZZZZ.... }
When matching b11110001
on token a
, we'd want x
to blobify to b00001111
, and y
to b00000001
. But what would we want to a
to blobify to? We have three options: the original match (b11110001
), and either of the two captures (b00001111
or b00000001
). The answer might seem obvious to just use the original match, but someone might want to do something like in token b, and when matching b11110001 b10100011
expect b
to blobify to b00001111 b00001010
.
Some off-the-top-of-my-head potential solutions, without regard for complexity of implementation and no particular order:
- Only modify literals within a given token
Tokenc
above would blobify r-shifting the first byte, but leaving the second in place, but blobifying$<c><x>
would reveal the modification specified inx
- Create two different methods of blobifying.
One would return a match directly, and the other would return the modified value (probably the direct match as default). The problem of tokena
would remain, though, as there would now be two modified values, and if there were a second junction, at least four, etc., with no clear way to distinguish them. - Scrap the idea entirely
I don't really like this one, but I s'pose it's one solution. - Only certain tokens allow
Z
orX
This could be done via a traitis scoured
or with a different declarator. Those tokens would gain the ability to useZ
andX
in their definitions, but lose the ability to use&
,&&
(|
and||
would not be affected, since they match one item). Those special tokens that include other special tokens will use the modified values in place, since the lack of&
operands means we can guarantee no overlapped values. To use the match values, use a regular token, or as a one-off option, perhaps the syntax<,foo>
which is currently otherwise invalid.
My test implementation doesn't yet handle the operators, so I've not had to deal with the question too much yet, but it's looming.
Perhaps in a later version of the standard (because of the complexity of the code to support it and no doubt speed implications), an optional trait "is maligned" (because O(fun) names) could be added later to allow for non-full-byte/word tokens, without compromising previous code.
The idea for the object grammar came to me when I was processing some part-of-speech tagged text. Each word was an object whose class looks something like this (simplified for this document).
class Word {
has $.word;
has $.lexeme;
has $.part-of-speech;
has $.number;
has $.gender;
has $.tense;
has $.person;
}
For matching with objects, I think usurping the character class syntax, and hacking it a bit would provide a nice, generally clear syntax to allow for matching on types or attributes/values.
grammar ObjexMatcherSyntax {
rule TOP { '<' ~ '>' <match-container>+ }
rule match-container { '[' ~ ']' <match> }
rule match { <type>* ':' <arguments> }
rule type { <sign> <typename> }
token sign { '-' || '+'? }
}
Arguments would follow standard Raku syntax, with the following interpretations:
- Positional arguments are smartmatched against the object (e.g.
<[Int: 1]>
would match an Int value of 1, and<[Int: * > 5, * %% 2]>
would match all even Int values over 5 (6,8,10, etc, but not 1 or 7). - Attribute arguments are similar smartmatched against the object's attribute. So
<[Rat: :denominator(1)]>
would match only whole number Rats, and<[Rat: :denominator(1,2,4 ... *)]>
would match any power-of-two denominator because smart matching a list checks to see if it contains the element.
It may be that adding the +/-
syntax for the type is overkill, and it would be better to keep with only additives, using the pipe |
that's used elsewhere in Raku (after all, if someone really wanted, they could define a subset that explicitly handled more complex types). That would greatly simplify the syntax. Thoughts?
Maybe it's just for my initial use case, but I feel like the typical use case for an Objex would want quicker access to the values/attributes of matched objects. Maybe that's just me though. But it definitely presents a different usecase over strings. Rarely, if ever, do we care about the distinction between a character (single element) and a string (sequence) because Raku doesn't distinguish them. But when dealing with objects, such a distinction IS suddenly important as character : object :: string : list. For this reason, I think it might be a good idea to add an additional declarator to an Objex, which would be simply object
(surprisingly and luckily, this is not used at all in the Raku spec!). The contents of the object
would be identically to the selector described above (just without the arrow brackets, and only require brackets if more than one selector). Thus the custom declarators of an grammar is objecty
would be:
objex
: backtracking, Match contains a List/Seqrule
/token
: synonymous in our case, Match contains a List/Seqobject
: Match contains an object directly.
I suppose it's possible to avoid adding new declarators and just say rule = sequence, token = one off object, but a new concept deserves a new declarator to avoid confusion. Using this idea, assuming I wanted to identify a sequence as a valid noun + adjective sequence, I might do the following:
grammar ModifiedNoun is objecty {
token TOP {
<noun> # the base noun
<adj-list: # followed by adjectives that
$<noun>.gender, match the noun's gender and
$<noun>.number> match the noun's number
}
token adj-list($g = *, $n = *) {
[
<adj: $g, $n>+ # any number of adjectives that agree
<list-coordinator> # if there's a list, need an and/or at the end.
]
<adj: $g, $n> # an agreeing adjective
}
object noun { Word: :part-of-speech<noun> }
object coordinator {
Word:
:part-of-speech<coordinator>
:lexeme('y'|'o') # only want and/or
}
object adj($g = *, $n = *) {
Word:
:part-of-speech<adjective>
:gender($g) # default of Whatever matches all
:number($n)
}
}
Without an object
option, the TOP and noun tokens would be a bit messier:
grammar ModifiedNoun {
token TOP {
<noun>
<adj-list:
$<noun>[0].gender,
$<noun>[0].number
>
}
token noun {
<[Word: :part-of-speech<noun>]>
}
which works, I guess, but just isn't as clean.
For use in inline situations, all of the //
syntax would be available but adding on as an option :obj
:
- match:
m:obj:options/find/
- substition:
s:obj:options/find/replace/
- substition (nondestructive):
S:obj:options/find/replace/
- transliteration:
tr:obj:options/swap/swap/
- transliteration (nondestructive):
TR:obj:options/swap/swap/
- Updated April 17th to discuss the class of the
&
operator with theX
andZ
values, and fixed a few other typos (/foo/bar/options
isn't Raku, duh) - Updated April 9th to integrate bgills' excellent suggestions on
X
andZ
and inline naming, and fixed typos
I poke around undocumented binaries as a hobby, mostly in the retrocomputing space. So, I think I have some useful thoughts and feedback. I have looked high and low for something that does what you're trying to do without much real luck, so this proposal is very much of interest to me. All that said, I'm very new to Raku and that's just going to show in some of my comments. I'm still wrapping my head around grammars, so please forgive my obvious ignorance on that front.
I was on the fence about giving feedback because "hobby level" understanding is not the same thing as professional understanding, but here goes...
@alabamenhu for the Inline Binex, you probably want to support bitwise AND, OR, XOR, and NOT along with arithmetic shifts. All of these could potentially be chained. I realize this can be done later once the struct is extracted and placed into a Raku-native data structure. But, I expect it to be especially convenient to do right in the binex in many cases. I think spotted a post where someone may have mentioned this already.
An example use-case might be extracting data with this grammar, placing it into a data structure. Then, reusing the bytes that were just extracted and performing a rolling xor to compute a simple checksum to verify the integrity of the data against some other struct in the file containing the expected value.
As for the split bytes situation, all I can say is there are a lot of sub-byte data structures out there. So, yes, it's a good idea to express bytes as b00000000 instead of not representing them to the programmer and making him fish for them. Two specific sub-byte data structure examples that come to mind as reasonably common are:
I really look forward to seeing your solution and hope the end result is elegant. Would the comment about the
m/../
selector work for these cases?Questions:
A modern example of where a non-standard character set might show up is desktop publishing files. Sometimes the Unicode Private Use Area is used to represent alternate glyphs of a specific character. An older example would be game cartridges that have completely bonkers character sets that don't conform to any convention.
Some other thoughts:
I realize you probably can't make this grammar bidirectional from the same definition. (I think you'd run into the Halting problem trying to do that — but don't quote me.) That said, some facility to go from Raku native structures directly to a binary struct via a grammar would be amazing. Even the well-known Kaitai Struct only extracts, it can't be used to write back to a binary blob. Is the trip from Raku native struct back to binary via grammar on your proverbial radar?
This is going to seem farther afield, but hear me out.... Is there an analogous grammar (in use or proposed) for handling digital signals? The reason I ask is lots of binary data isn't just baked into an EEPROM lots of binary data is flying over the air in some sort of digital encoding scheme. I think being extract meaningful data up the stack from encoded waveform to binary frame/packet/blob and then extract useful data all by chaining grammars would be mind-bending and extremely useful in a lot of applications.
Very cool project. I hope at least some of my comments were useful.