This is designed to be a working proposal. Comments/corrections/suggestions are welcome, as the first draft was written fairly hastily. I'm working on a rough implementation to play around with, beginning with the Binex proposal, which can be found here. It is not yet a full implementation of this proposal, but it is progressing rapidly.
Grammars in Raku are awesome, and allow for some truly amazing text parsing. Unfortunately, they are less than ideal for binary files, and there is no way to have them support matching objects, both of which would be very useful (imagine being able to pattern match on an AST!). Working around either limitation requires complex and/or error-prone code.
Create an easy-to-use, Raku-ish way to handle grammars that are binary or objecty.
Practically speaking, these are two separate proposals, and they will likely involve different optimizations, but they are treated together here so that their end-user solutions are as similar as possible: e.g., saying a grammar is binary or is objecty, and then modifying the interpretation of the tokens to a regex-like slang.
A basic binary grammar under this proposal would look something like this:

    grammar UTF-8 is binary[8] {
        token TOP { <byte-order-mark>? <utf-8-glyph>* }
        token byte-order-mark { xEF xBB xBF }
        proto token utf-8-glyph {*}
        token utf-8-glyph:single    { b0....... }
        token utf-8-glyph:double    { b110..... b10...... }
        token utf-8-glyph:triple    { b1110.... b10...... b10...... }
        token utf-8-glyph:quadruple { b11110... b10...... b10...... b10...... }
        token utf-8-stream { <byte-order-mark> <utf-8-glyph>* }
    }
Here x00 represents a byte in hexadecimal, o0000 one in octal, b00000000 one in binary, etc. For simplicity's sake, each byte should be written out in full.
Because some grammar definitions may benefit from it, while the default unit would be a byte, it might be useful to base the grammar not on a byte-by-byte sequence but rather on words of 16, 32, or 64 bits, enabled via parameterization (is binary[16]). In such cases, an underscore may delineate groups but is otherwise ignored, e.g. xffff_ffff for a 32-bit hex value (although xf_f_f_f_f_f_f_f would be theoretically valid too).
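For instance, a word-parameterized grammar might look like this (purely illustrative; neither the grammar nor its tokens exist anywhere, and the names are made up):

    grammar SomeWordFormat is binary[16] {
        token TOP   { <magic> <data>* }
        token magic { xCAFE }                # one atom = one 16-bit word, four hex digits
        token data  { b........_......00 }   # any word whose low two bits are clear
    }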
In a binary grammar, strings, whether bare or quoted, are considered invalid, although they could be included via a method that returns a Blob (similar to a method that returns a string in regular grammars). Alternatives can be given as in regular regexes, using | (LTM) or || (short circuit).
For character classes, I see two useful ideas:

- <[ x00 .. x1f ]> would match values from 0 to 31.
- <[ b.......1 ]> would match odd numbers, and <[ b.......1 b00000000 ]> would match odd numbers or 0.

The latter, of course, would seem to be pointless, given that a bare b.......1 would be valid, but when used as a negator it could be a fair bit more powerful: < +[x80 .. xff] -[b.......0]> would represent all odd upper ASCII values. I think it would be optimal, and not particularly complex, to allow a construction like o00.0 .. o04.7 and treat it similarly to a string range, e.g., 00.0, 00.1 … 00.7, 01.0, 01.1, and so on, with the dot preserved as a wildcard throughout. An optimization stage can try to determine whether there's a compact representation (< +[x80 .. xff] -[b.......0]> becomes b1..._...1) and, if not, fall back to a sequential test.
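To sketch what that optimization stage might do (nothing here is implemented, and compact-mask is a made-up name): a set of byte values collapses to a single mask pattern exactly when the values agree on their fixed bits and fill every combination of the varying ones.

    # Hypothetical helper: returns the fixed bits and a care-mask if the set
    # of byte values has a single pattern like b1..._...1, otherwise Nil.
    sub compact-mask(@values) {
        my @v    = @values.unique;
        my $and  = [+&] @v;                        # bits set in every value
        my $or   = [+|] @v;                        # bits set in at least one value
        my $wild = $and +^ $or;                    # bits that vary across the set
        my $bits = $wild.base(2).comb('1').elems;  # number of wildcard bits
        return Nil unless @v.elems == 2 ** $bits;  # must fill the wildcard space
        %( mask => 0xFF +^ $wild, bits => $and )   # match: (byte +& mask) == bits
    }

For the odd upper values, compact-mask((0x81, 0x83 ... 0xFF)) would yield mask x81 and bits x81, i.e. b1..._...1.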
For use in inline situations, all of the // syntax would be available, adding :bin as an option:

- match: m:bin:options/find/
- substitution: s:bin:options/find/replace/
- substitution (nondestructive): S:bin:options/find/replace/
- transliteration: tr:bin:options/swap/swap/
- transliteration (nondestructive): TR:bin:options/swap/swap/
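As a quick illustration of the intent (hypothetical, since :bin doesn't exist yet), matching the byte order mark from the grammar above inline:

    my $buf = Blob.new(0xEF, 0xBB, 0xBF, 0x68, 0x69);
    if $buf ~~ m:bin/ xEF xBB xBF / {
        say 'starts with a UTF-8 byte order mark';
    }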
One requirement that may seem odd, but has real-world use, would be to allow captures/tokens within bytes/words. In the aforementioned Zelda 3 article, the format would effectively be, for us:
    grammar is binary[8] {
        token TOP { <chunk>+? <end> }
        token end { xFF }
        token chunk { <header> <data: +$<header><length> > }
        token header { b..._..... }
        token data($count) { x.. ** {$count} }
    }
The catch, however, is how to handle splitting header into the command (first three bits) and the length (latter five bits). I'm not sure what the best syntax to use here would be. No doubt there are other formats where a sub-byte item might even be repeated. In this particular case, a workaround could be to say:
    grammar is binary[8] {
        token TOP { <chunk>+? <end> }
        token end { xFF }
        token chunk {
            :my $*cmd;
            :my $*length;
            <header>
            <data: $*cmd, $*length>
        }
        token header {
            b..._..... {
                $*cmd    = +$¢ +> 5;
                $*length = (+$¢ +& 31) + 1; # a length field of 0 means 1
            }
        }

        enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4 );
        multi token data(Copy,                $count) { x.. ** {$count} }
        multi token data(ByteRept | ByteIncr, $count) { x.. }
        multi token data(WordRept | CopyExst, $count) { x.. ** 2 }
    }
While that would work, it seems inelegant (and it makes it impossible to handle a token that ends in the middle of a byte/word). Instead, we'll provide an additional option of X and Z, where X means “bit I don't care about, shove it out of the way” and Z means “bit I don't care about, but want it zeroed out”.
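In terms of today's integer operators, the intended results for the header byte above would be computed like this (just pinning down the semantics, not an implementation):

    my $byte = 0b101_10100;        # cmd field 0b101, length field 0b10100
    say $byte +> 5;                # X: low five bits shoved away -> 0b101
    say $byte +& 0b000_11111;      # Z: top three bits zeroed     -> 0b10100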
The & and && conjunctions are not commonly used in string-based grammars, but this could be a great place to use them with regularity, because (at least when needing to split a single byte) you could do:
    grammar is binary[8] {
        token TOP { <chunk>+? <end> }
        token end { xFF }
        token chunk {
            [ <cmd> && <length> ]
            <data: $<cmd>.head, $<length>.head>
        }
        token cmd    { b..._XXXXX }
        token length { bZZZ_..... }

        enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4 );
        multi token data(Copy,                $count) { x.. ** {$count} }
        multi token data(ByteRept | ByteIncr, $count) { x.. }
        multi token data(WordRept | CopyExst, $count) { x.. ** 2 }
    }
The open question with this approach is what the match value should be. To reduce the problem:
    token a { <x> & <y> }
    token b { <x> <x> }
    token c { b....XXXX <x> }
    token x { b....XXXX }
    token y { bZZZZ.... }
When matching b11110001 against token a, we'd want x to blobify to b00001111, and y to b00000001. But what would we want a to blobify to? We have three options: the original match (b11110001), or either of the two captures (b00001111 or b00000001). The answer might seem obvious: just use the original match. But someone might want to do something like token b, and when matching b11110001 b10100011, expect b to blobify to b00001111 b00001010.
Some off-the-top-of-my-head potential solutions, without regard for complexity of implementation and in no particular order:

- Only modify literals within a given token. Token c above would blobify r-shifting the first byte but leaving the second in place; blobifying $<c><x> would reveal the modification specified in x.
- Create two different methods of blobifying. One would return the match directly, and the other would return the modified value (probably with the direct match as the default). The problem of token a would remain, though, as there would now be two modified values, and if there were a second junction, at least four, etc., with no clear way to distinguish them.
- Scrap the idea entirely. I don't really like this one, but I s'pose it's one solution.
- Only certain tokens allow Z or X. This could be done via a trait (is scoured) or with a different declarator. Those tokens would gain the ability to use Z and X in their definitions, but lose the ability to use & and && (| and || would not be affected, since they match one item). Special tokens that include other special tokens would use the modified values in place, since the lack of & operands means we can guarantee no overlapped values. To use the match values, use a regular token, or, as a one-off option, perhaps the syntax <,foo>, which is currently otherwise invalid.
My test implementation doesn't yet handle the operators, so I've not had to deal with the question too much yet, but it's looming.
Perhaps in a later version of the standard (because of the complexity of the code to support it, and the no-doubt speed implications), an optional trait "is maligned" (because O(fun) names) could be added to allow for non-full-byte/word tokens without compromising previous code.
The idea for the object grammar came to me when I was processing some part-of-speech-tagged text. Each word was an object whose class looks something like this (simplified for this document):
    class Word {
        has $.word;
        has $.lexeme;
        has $.part-of-speech;
        has $.number;
        has $.gender;
        has $.tense;
        has $.person;
    }
For matching on objects, I think usurping the character-class syntax and hacking it a bit would provide a nice, generally clear syntax for matching on types or attributes/values:
    grammar ObjexMatcherSyntax {
        rule TOP             { '<' ~ '>' <match-container>+ }
        rule match-container { '[' ~ ']' <match> }
        rule match           { <type>* ':' <arguments> }
        rule type            { <sign> <typename> }
        token sign           { '-' || '+'? }
    }
Arguments would follow standard Raku syntax, with the following interpretations:

- Positional arguments are smartmatched against the object: e.g., <[Int: 1]> would match an Int with the value 1, and <[Int: * > 5, * %% 2]> would match all even Int values over 5 (6, 8, 10, etc., but not 1 or 7).
- Attribute arguments are similarly smartmatched against the object's attribute. So <[Rat: :denominator(1)]> would match only whole-number Rats, and <[Rat: :denominator(1,2,4 ... *)]> would match any power-of-two denominator, because smartmatching against a list checks whether it contains the element.
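For reference, these interpretations lean on smartmatch behavior that already exists in today's Raku:

    say 6 ~~ * %% 2;               # True: a WhateverCode gets invoked
    say 6 ~~ * > 5;                # True
    say (3/4) ~~ Rat;              # True: type smartmatch
    say (3/4).denominator ~~ 4;    # True: what :denominator(4) would check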
It may be that adding the +/- syntax for the type is overkill, and it would be better to stick with only additives, using the pipe | that's used elsewhere in Raku (after all, if someone really wanted, they could define a subset that explicitly handled more complex types). That would greatly simplify the syntax. Thoughts?
Maybe it's just my initial use case, but I feel like the typical use of an Objex would want quicker access to the values/attributes of matched objects. Maybe that's just me, though. But it definitely presents a different use case from strings. Rarely, if ever, do we care about the distinction between a character (single element) and a string (sequence), because Raku doesn't distinguish them. But when dealing with objects, such a distinction IS suddenly important: character : object :: string : list. For this reason, I think it might be a good idea to add an additional declarator to an Objex, which would simply be object (surprisingly and luckily, this word is not used at all in the Raku spec!). The contents of an object would be identical to the selector described above (just without the arrow brackets, and only requiring square brackets if there's more than one selector). Thus the custom declarators of a grammar that is objecty would be:
- objex: backtracking; the Match contains a List/Seq
- rule/token: synonymous in our case; the Match contains a List/Seq
- object: the Match contains an object directly
I suppose it's possible to avoid adding new declarators and just say rule = sequence, token = one-off object, but a new concept deserves a new declarator to avoid confusion. Using this idea, if I wanted to identify a sequence as a valid noun + adjective sequence, I might do the following:
    grammar ModifiedNoun is objecty {
        token TOP {
            <noun>              # the base noun
            <adj-list:          # followed by adjectives that
                $<noun>.gender, #   match the noun's gender and
                $<noun>.number> #   match the noun's number
        }
        token adj-list($g = *, $n = *) {
            [
                <adj: $g, $n>+  # any number of adjectives that agree
                <coordinator>   # if there's a list, need an and/or at the end
            ]?
            <adj: $g, $n>       # an agreeing adjective
        }
        object noun { Word: :part-of-speech<noun> }
        object coordinator {
            Word:
            :part-of-speech<coordinator>
            :lexeme('y'|'o')    # only want and/or
        }
        object adj($g = *, $n = *) {
            Word:
            :part-of-speech<adjective>
            :gender($g)         # default of Whatever matches all
            :number($n)
        }
    }
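Usage would presumably mean handing the grammar a list of objects instead of a string (a hypothetical API, nothing implemented; the sample words are Spanish, to match the 'y'/'o' coordinators above):

    my @words = Word.new(:word<gatos>,  :part-of-speech<noun>,      :gender<m>, :number<pl>),
                Word.new(:word<negros>, :part-of-speech<adjective>, :gender<m>, :number<pl>);
    say ModifiedNoun.parse(@words);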
Without an object option, the TOP and noun tokens would be a bit messier:
    grammar ModifiedNoun {
        token TOP {
            <noun>
            <adj-list:
                $<noun>[0].gender,
                $<noun>[0].number
            >
        }
        token noun {
            <[Word: :part-of-speech<noun>]>
        }
    }
which works, I guess, but just isn't as clean.
For use in inline situations, all of the // syntax would be available, adding :obj as an option:

- match: m:obj:options/find/
- substitution: s:obj:options/find/replace/
- substitution (nondestructive): S:obj:options/find/replace/
- transliteration: tr:obj:options/swap/swap/
- transliteration (nondestructive): TR:obj:options/swap/swap/
- Updated April 17th to discuss the clash of the & operator with the X and Z values, and fixed a few other typos (/foo/bar/options isn't Raku, duh)
- Updated April 9th to integrate bgills' excellent suggestions on X and Z and inline naming, and fixed typos
@gitjeff2 Thank you for taking a look at it! I actually do more work with object pattern matching (the whole impetus for this project came from wanting regex-like matching on a language corpus), but I decided to work on Binex first because I felt it would be more useful to people than Objex, so any insight from people who work or have worked with binary stuff is really appreciated. Sorry for the long answers below, but you had lots of interesting stuff/questions.
This is definitely an interesting use case. What would you imagine the interface looking like? Right now, the way of doing that type of stuff would be an actions class (assuming a grammar that defines a token a and a token b), along the lines of this sketch (the .Blob blobify method here is my assumption, nothing is implemented):
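    class BitwiseActions {
        method TOP ($/) { make $<a>.made ~& $<b>.made }  # combine the sub-results
        method a   ($/) { make $/.Blob }                 # .Blob is assumed, not real yet
        method b   ($/) { make $/.Blob }
    }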
Here ~& is Raku's blobby binary AND (this is how it works in regex; I haven't implemented actions just yet). But that does mean that it's not as cleanly available: you'd need to reference $<foo>.made instead of just $<foo>.
That can be done without an actions object in several different ways: by using a code block in the token, as in the first sketch below, or by using a dynamic variable that would have been defined in the enclosing token, as in the second:
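    # Sketch one (hypothetical): an inline code block that makes the value.
    token combined {
        <a> <b>
        { make $<a> ~& $<b> }   # computed here rather than in an actions class
    }

or with the dynamic variable:

    # Sketch two (hypothetical): a dynamic var declared in the enclosing token.
    token combined {
        :my $*LEFT;
        <a> { $*LEFT = ~$<a> }
        <b>                     # $*LEFT stays visible here and in action methods
    }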
The latter is probably the better one, because the dynamic variable can still be accessed in the method call. Using make will thwart anyone writing an actions class. I wonder, though, if there could be a better interface (or, honestly, if there should be one, to avoid kitchen-sinking it). Perhaps in addition to make to push data to the AST, there could also be a scour to generate the scoured data. The big problem, of course, is that any modification of data in place takes a performance penalty, because we have to allocate, calculate, and hold on to an extra blob. Dynamic variables are a bit more lightweight here.

There are actually much better ways to handle this type of situation. You can define a dynamic var $*checksum and, at each place where it needs to be recomputed, do something like this (a sketch; checksum-of is a stand-in for whatever the format actually requires):
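    token data-chunk {
        <header> <payload>
        { $*checksum = checksum-of(~$<payload>) }   # checksum-of is hypothetical
    }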
If the checksum is only being done on the data chunk, it could even be inlined directly in data-chunk. But if you can't check it until you're finished parsing, it might be better to delay doing the checksum until you have the full AST.

Changing how a chunk is parsed based on a bit can actually already be done easily (if you peek around in Raku's own grammar, a similar technique gets used a lot). A sketch of the technique, with illustrative token names:
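    token chunk {
        :my $*FLAG;
        <header> { $*FLAG = +$<header> +> 7 }               # stash the high bit
        [ <?{ $*FLAG }> <extended-body> || <simple-body> ]  # predicate picks the body
    }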
That's one way. Here's another that will separately capture the flag and the (absolute) value. It's wordier, but perhaps more self-documenting, and it expands better to other situations. Using the && overlap and X/Z scouring from above, it might look like:
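    # Sketch only: the flag bit and the magnitude as separate overlapped tokens.
    token value { [ <flag> && <abs> ] }
    token flag  { b.XXXXXXX }   # just the top bit, the rest shoved away
    token abs   { bZ....... }   # the magnitude, with the flag bit zeroed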
When aligned to bytes as these are, they could still be handled easily without needing to go to a sub-byte level; for instance:
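    # Sketch: whole-byte ranges distinguish the two cases.
    token value {
        | $<neg>=[ <[ x80 .. xFF ]> ]   # high bit set
        | $<pos>=[ <[ x00 .. x7F ]> ]   # high bit clear
    }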
Or alternatively, something like the following might make processing a bit easier, though it wouldn't be strictly necessary:
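    # Sketch: proto-token variants keep each case self-documenting.
    proto token value {*}
    token value:negative { <[ x80 .. xFF ]> }
    token value:positive { <[ x00 .. x7F ]> }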
Since it runs on just the binary data, it won't care. This is more of a concern for regular regex, which handles private-use characters without problem (they're still part of the Unicode standard, just defined as glyphs without defined properties). What I would do is, after processing it (assuming it, like UTF-8 for instance, needs to be handled in some interesting way at the byte level), handle the encoding in an actions object. That way you could say FormatGrammar.parse($string-blob, :actions(FormatActions)).made, where your TOP token makes a Unicode-encoded value.

There actually has been some interesting work done on this by someone (I can't find the link) with regular regex, using some sort of Raku magickery. If in Raku we defined a standard format for serialization, then there's no reason we couldn't at least parse it. I would definitely leave that to module space, though. I believe someone already did something similar for JSON, so there's a model to work from (I mean, you can already .raku an object to get Raku code that, in theory, should recreate the object if you EVAL it).

I don't have any experience working with that, so I fear any speculation would be idle and not helpful. If the data you get comes in chunks, you could probably use it on each chunk, and let the result of that dictate how future chunks are processed (by setting a flag, etc.). @jnthn has been doing some stuff with audio processing lately, so maybe he could chime in.
Also, welcome to Raku! I think you'll find it a language that's legitimately fun to code in. One thing I'd definitely say is to check out doing some grammars and try writing accompanying actions classes for them (or make a separate new actions class for an extant grammar). It sounds a bit like you might be doing what I did early on and trying to put too much action into the grammar (I tried writing a brainfuck interpreter that would run inside of the grammar itself. It doesn't work, because tokens might fail while their side effects stick around. Making an AST in an actions class afterwards instead made everything work better and cleaner). If you have any questions, definitely check out #raku on freenode; it might take a bit before someone sees the question, but someone will always eventually get around to answering.