This is designed to be a working proposal. Comments, corrections, and suggestions are welcome, as the first draft was written fairly hastily. I'm working on a rough implementation to play around with, beginning with the Binex proposal, which can be found here. It is not yet a full implementation of this proposal, but it is progressing rapidly.
Grammars in Raku are awesome, and allow for some truly amazing text parsing.
Unfortunately, they are less than ideal for binary files,
and there is no way to have them support matching objects, both of which would be very useful (imagine being able to pattern match on an AST!).
Today, either task requires writing complex and/or error-prone workaround code.
Create an easy-to-use, Raku-ish way to handle grammars that are binary or objecty.
Practically speaking, these are two separate proposals, and will likely involve different optimizations,
but they are treated together so that their end-user solutions are as similar as possible, e.g., saying `Grammar is binary` or `Grammar is objecty` and then modifying the interpretation of the tokens to a regex-like slang.
A basic proposed binary grammar would look something like this:
grammar UTF-8 is binary[8] {
token TOP { <byte-order-mark>? <utf-8-glyph>* }
token byte-order-mark { xEF xBB xBF }
proto token utf-8-glyph { * }
token utf-8-glyph:single { b0....... }
token utf-8-glyph:double { b110..... b10...... }
token utf-8-glyph:triple { b1110.... b10...... b10...... }
token utf-8-glyph:quadruple { b11110... b10...... b10...... b10......}
proto token utf-8-stream { <byte-order-mark> <utf-8-glyph> * }
}
Here `x00` represents a byte in hexadecimal, `o0000` in octal, and `b00000000` in binary. For simplicity's sake, each byte should be written out in full.
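To make the wildcard notation concrete, here is a rough sketch in ordinary Raku (not the proposed slang) of how a pattern such as `b110.....` could be reduced to a mask and a value. The helper names `compile-bits` and `matches-byte` are made up for illustration, and only the 8-bit binary form is handled.

```raku
# Hypothetical helper: turn a proposed binary literal such as 'b110.....'
# into a (mask, value) pair. A byte matches when (byte +& mask) == value.
sub compile-bits(Str $pattern --> List) {
    # Drop the leading 'b' and any grouping underscores.
    my $bits = $pattern.subst(/^b/, '').subst('_', '', :g);
    die "expected 8 bits, got {$bits.chars}" unless $bits.chars == 8;

    # Fixed bits (0 or 1) are tested; wildcard bits (.) are ignored.
    my $mask  = $bits.trans('01.' => '110').parse-base(2);
    my $value = $bits.trans('.' => '0').parse-base(2);
    ($mask, $value)
}

sub matches-byte(Int $byte, Str $pattern --> Bool) {
    my ($mask, $value) = compile-bits($pattern);
    ($byte +& $mask) == $value
}

say matches-byte(0xC3, 'b110.....');   # True
say matches-byte(0xA9, 'b10......');   # True
say matches-byte(0xE2, 'b110.....');   # False
```

The two bytes `0xC3 0xA9` are the UTF-8 encoding of "é", so they would satisfy the two-byte glyph token in the grammar above.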
The default unit would be a byte, but some grammar definitions may benefit from being based not on a byte-by-byte sequence but on words of 16, 32, or 64 bits, enabled via parameterization (`is binary[16]`).
In such cases, an underscore may delineate groups but is otherwise ignored, e.g. `0xffff_ffff` for a 32-bit hex value (although `0xf_f_f_f_f_f_f_f` would theoretically be valid too).
In a binary grammar, strings are considered invalid, either bare or quoted, although they could be included via a method that returns a `Blob` (similar to a method that returns a string).
Alternatives can be given as in regular regexes, using `|` (LTM) or `||` (short circuit).
For character classes, I see two useful ideas:
- `<[ x00 .. x1f ]>` would match values from 0 to 31.
- `<[ b.......1 ]>` would match odd numbers.
- `<[ b.......1 b00000000 ]>` would match odd numbers or 0.
The middle one, of course, would seem pointless given that a bare `b.......1` would be valid, but when used as a negator it could be a fair bit more powerful: `< +[x80 .. xff] -[b.......1]>` would represent all even values with the high bit set (the odd ones having been subtracted). I think it would be optimal and not particularly complex to allow a construction like `o00.0 .. o04.7` and treat it similarly to a string range, e.g., `00.0, 00.1 … 00.7, 01.0, 01.1`, with the dot preserved as a wildcard in all. An optimization stage can try to determine whether there's a compact representation (`< +[x80 .. xff] -[b.......1]>` becomes `b1..._...0`) and, if not, fall back to a sequential test.
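One way such an optimization stage might decide whether a class collapses to a single wildcard pattern is sketched below in ordinary Raku. The function name `compact-pattern` is hypothetical, and the approach (compare the set size with the number of values the candidate pattern would cover) is just one possible check, not something the proposal prescribes.

```raku
# Hypothetical optimization check: can a set of byte values be written as a
# single wildcard pattern? Returns the pattern, or Nil if it cannot.
sub compact-pattern(@values) {
    my @set     = @values.unique;
    my $always  = [+&] @set;          # bits set in every value
    my $ever    = [+|] @set;          # bits set in at least one value
    my $varying = $always +^ $ever;   # bits that differ between values
    my $free    = $varying.base(2).comb('1').elems;

    # A pattern with $free wildcard bits matches exactly 2 ** $free values,
    # so the set is representable only if it has exactly that many members.
    return Nil unless @set.elems == 2 ** $free;

    'b' ~ (7 ... 0).map(-> $bit {
        $varying +& (1 +< $bit) ?? '.' !! (($always +> $bit) +& 1)
    }).join
}

# Members of +[x80 .. xff] -[b.......1], computed directly.
my @class = (0x80 .. 0xff).grep(* %% 2);
say compact-pattern(@class);                 # b1......0
say compact-pattern([0x00, 0x03]).defined;   # False
```

When the function returns `Nil`, the matcher would fall back to testing the members one by one.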
For use in inline situations, all of the `//` syntax would be available, with `:bin` added as an option:
- match: `m:bin:options/find/`
- substitution: `s:bin:options/find/replace/`
- substitution (nondestructive): `S:bin:options/find/replace/`
- transliteration: `tr:bin:options/swap/swap/`
- transliteration (nondestructive): `TR:bin:options/swap/swap/`
One issue that seems odd, but has real-world use, is allowing captures/tokens betwixt bytes/words. In the aforementioned Zelda 3 article, the format would effectively be, for us:
grammar is binary[8] {
token TOP { <chunk>+? <end> }
token end { xFF }
token chunk { <header> <data: +$<header><length> > }
token header { b..._..... }
token data($count) { x.. ** {$count} }
}
The catch, however, is how to handle splitting the header into the command (first three bits) and the length (latter five bits). I'm not sure what the best syntax to use here would be. No doubt there are other formats where a sub-byte item might even be repeated. In this particular case, a workaround could be to say
grammar is binary[8] {
token TOP { <chunk>+? <end> }
token end { xFF }
token chunk {
:my $*cmd;
:my $*length;
<header>
<data: $*cmd, $*length>
}
token header {
b..._..... {
$*cmd = +$¢ +> 5;
$*length = +$¢ +& 31 + 1; # length 0 is 1
}
}
enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4);
multi token data( Copy , $count) { x.. ** {$count} }
multi token data(ByteRept | ByteIncr, $count) { x.. }
multi token data(WordRept | CopyExst, $count) { x.. ** 2 }
}
While that would work, it seems inelegant (and makes it impossible to handle a token that ends in the middle of a byte/word). Instead, we'll provide the additional options `X` and `Z`, where `X` means “bit I don't care about, shove it out of the way” and `Z` means “bit I don't care about, but want it zeroed out”.
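As a rough illustration of what `X` and `Z` would do to a matched byte, here is a sketch in ordinary Raku. The name `blobify-byte` is invented for this example, and it assumes that `X` only appears as a trailing run (so "shoving out of the way" is a right shift); the proposal does not pin that down.

```raku
# Hypothetical helper: compute the value a byte would "blobify" to under a
# pattern that may contain X (shift out) and Z (zero out) markers.
# Assumption: X bits form a trailing run, so they are removed by shifting right.
sub blobify-byte(Int $byte, Str $pattern --> Int) {
    my $bits = $pattern.subst(/^b/, '').subst('_', '', :g);

    # Keep every position except those marked X or Z.
    my $keep  = $bits.trans('XZ01.' => '00111').parse-base(2);
    my $kept  = $byte +& $keep;

    # Count the trailing X markers and shift them off the end.
    my $shift = ($bits ~~ / X+ $ /) ?? $/.chars !! 0;
    $kept +> $shift
}

my $header = 0b010_00110;                       # command 2, raw length 6
say blobify-byte($header, 'b..._XXXXX');        # 2
say blobify-byte($header, 'bZZZ_.....');        # 6
say blobify-byte(0b11110001, 'b....XXXX').fmt('%08b');   # 00001111
say blobify-byte(0b11110001, 'bZZZZ....').fmt('%08b');   # 00000001
```

The first two calls mirror the `+> 5` and `+& 31` arithmetic of the workaround grammar, minus the `+ 1` adjustment to the length.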
The `&`/`&&` conjunctions are not commonly used in string-based grammars, but this could be a great place to use them with regularity, because you could do the following (at least when needing to split a single byte):
grammar is binary[8] {
token TOP { <chunk>+? <end> }
token end { xFF }
token chunk {
[<cmd> && <length>]
<data: $<cmd>.head, $<length>.head>
}
token cmd { b..._XXXXX }
token length { bZZZ_..... }
enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4);
multi token data( Copy , $count) { x.. ** {$count} }
multi token data(ByteRept | ByteIncr, $count) { x.. }
multi token data(WordRept | CopyExst, $count) { x.. ** 2 }
}
The open question with this approach is what the match value of the conjunction should be. To reduce the problem:
token a { <x> & <y> }
token b { <x> <x> }
token c { b....XXXX <x> }
token x { b....XXXX }
token y { bZZZZ.... }
When matching `b11110001` on token `a`, we'd want `x` to blobify to `b00001111`, and `y` to `b00000001`. But what would we want `a` to blobify to? We have three options: the original match (`b11110001`), or either of the two captures (`b00001111` or `b00000001`). The answer might seem obvious (just use the original match), but someone might want to do something like token `b` and, when matching `b11110001 b10100011`, expect `b` to blobify to `b00001111 b00001010`.
Some off-the-top-of-my-head potential solutions, without regard for complexity of implementation and in no particular order:
- Only modify literals within a given token
  Token `c` above would blobify by right-shifting the first byte but leaving the second in place, while blobifying `$<c><x>` would reveal the modification specified in `x`.
- Create two different methods of blobifying.
  One would return the match directly, and the other would return the modified value (probably with the direct match as default). The problem of token `a` would remain, though, as there would now be two modified values, and if there were a second junction, at least four, etc., with no clear way to distinguish them.
- Scrap the idea entirely
  I don't really like this one, but I s'pose it's one solution.
- Only certain tokens allow `Z` or `X`
  This could be done via a trait `is scoured` or with a different declarator. Those tokens would gain the ability to use `Z` and `X` in their definitions, but lose the ability to use `&` and `&&` (`|` and `||` would not be affected, since they match one item). Those special tokens that include other special tokens will use the modified values in place, since the lack of `&` operands means we can guarantee no overlapped values. To use the match values, use a regular token, or as a one-off option, perhaps the syntax `<,foo>`, which is currently otherwise invalid.
My test implementation doesn't yet handle the operators, so I've not had to deal with the question too much yet, but it's looming.
Perhaps in a later version of the standard (because of the complexity of the code to support it and the no-doubt speed implications), an optional trait "is maligned" (because O(fun) names) could be added to allow for non-full-byte/word tokens without compromising previous code.
The idea for the object grammar came to me when I was processing some part-of-speech tagged text. Each word was an object whose class looks something like this (simplified for this document).
class Word {
has $.word;
has $.lexeme;
has $.part-of-speech;
has $.number;
has $.gender;
has $.tense;
has $.person;
}
For matching with objects, I think usurping the character class syntax, and hacking it a bit would provide a nice, generally clear syntax to allow for matching on types or attributes/values.
grammar ObjexMatcherSyntax {
rule TOP { '<' ~ '>' <match-container>+ }
rule match-container { '[' ~ ']' <match> }
rule match { <type>* ':' <arguments> }
rule type { <sign> <typename> }
token sign { '-' || '+'? }
}
Arguments would follow standard Raku syntax, with the following interpretations:
- Positional arguments are smartmatched against the object, e.g. `<[Int: 1]>` would match an Int value of 1, and `<[Int: * > 5, * %% 2]>` would match all even Int values over 5 (6, 8, 10, etc., but not 1 or 7).
- Attribute arguments are similarly smartmatched against the object's attribute. So `<[Rat: :denominator(1)]>` would match only whole-number Rats, and `<[Rat: :denominator(1,2,4 ... *)]>` would match any power-of-two denominator, provided that a list argument is treated as a set of acceptable values (plain Raku smartmatching against a List compares the lists themselves rather than testing membership, so the slang would need to define this).
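The two interpretations above can be approximated in ordinary Raku as follows. This is only a sketch: `objex-match` is a hypothetical helper, not part of the proposal, and the power-of-two case spells out membership explicitly with `(elem)` on a finite list instead of relying on any list-smartmatch behaviour.

```raku
# Hypothetical helper: does $obj satisfy a selector such as <[Type: args]>?
#   - the object must be of the given type,
#   - every positional argument is smartmatched against the object,
#   - every named argument is smartmatched against the attribute
#     (accessor method) of the same name.
sub objex-match($obj, Mu $type, *@positional, *%attributes --> Bool) {
    return False unless $obj ~~ $type;
    for @positional -> $matcher {
        return False unless $obj ~~ $matcher;
    }
    for %attributes.kv -> $name, $matcher {
        return False unless $obj."$name"() ~~ $matcher;
    }
    True
}

say objex-match(8, Int, * > 5, * %% 2);                # True
say objex-match(7, Int, * > 5, * %% 2);                # False
say objex-match(3/1, Rat, :denominator(1));            # True
say objex-match(1/2, Rat, :denominator(1));            # False

# Membership in a finite list of powers of two, written as an explicit block.
my &pow2 = -> $d { $d (elem) (1, 2, 4, 8, 16) };
say objex-match(3/4, Rat, :denominator(&pow2));        # True
say objex-match(1/3, Rat, :denominator(&pow2));        # False
```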
It may be that adding the `+`/`-` syntax for the type is overkill, and it would be better to stick with only additives, using the pipe `|` that's used elsewhere in Raku (after all, if someone really wanted, they could define a subset that explicitly handled more complex types). That would greatly simplify the syntax. Thoughts?
Maybe it's just my initial use case, but I feel like the typical use case for an Objex would want quicker access to the values/attributes of matched objects. It definitely presents a different use case from strings. Rarely, if ever, do we care about the distinction between a character (single element) and a string (sequence), because Raku doesn't distinguish them. But when dealing with objects, such a distinction IS suddenly important, as character : object :: string : list. For this reason, I think it might be a good idea to add an additional declarator to an Objex, which would simply be `object` (surprisingly and luckily, this is not used at all in the Raku spec!). The contents of the `object` would be identical to the selector described above (just without the angle brackets, and only requiring square brackets if there is more than one selector). Thus the custom declarators of a `grammar is objecty` would be:
- `objex`: backtracking, Match contains a List/Seq
- `rule`/`token`: synonymous in our case, Match contains a List/Seq
- `object`: Match contains an object directly.
I suppose it's possible to avoid adding new declarators and just say rule = sequence, token = one off object, but a new concept deserves a new declarator to avoid confusion. Using this idea, assuming I wanted to identify a sequence as a valid noun + adjective sequence, I might do the following:
grammar ModifiedNoun is objecty {
token TOP {
<noun> # the base noun
<adj-list: # followed by adjectives that
$<noun>.gender, # match the noun's gender and
$<noun>.number> # match the noun's number
}
token adj-list($g = *, $n = *) {
[
<adj: $g, $n>+ # any number of adjectives that agree
<coordinator> # if there's a list, need an and/or at the end.
]?
<adj: $g, $n> # an agreeing adjective
}
object noun { Word: :part-of-speech<noun> }
object coordinator {
Word:
:part-of-speech<coordinator>
:lexeme('y'|'o') # only want and/or
}
object adj($g = *, $n = *) {
Word:
:part-of-speech<adjective>
:gender($g) # default of Whatever matches all
:number($n)
}
}
Without an `object` option, the TOP and noun tokens would be a bit messier:
grammar ModifiedNoun {
token TOP {
<noun>
<adj-list:
$<noun>[0].gender,
$<noun>[0].number
>
}
token noun {
<[Word: :part-of-speech<noun>]>
}
# ... (remaining tokens omitted)
}
which works, I guess, but just isn't as clean.
For use in inline situations, all of the `//` syntax would be available, with `:obj` added as an option:
- match: `m:obj:options/find/`
- substitution: `s:obj:options/find/replace/`
- substitution (nondestructive): `S:obj:options/find/replace/`
- transliteration: `tr:obj:options/swap/swap/`
- transliteration (nondestructive): `TR:obj:options/swap/swap/`
- Updated April 17th to discuss the clash of the `&` operator with the `X` and `Z` values, and fixed a few other typos (`/foo/bar/options` isn't Raku, duh)
- Updated April 9th to integrate bgills' excellent suggestions on `X` and `Z` and inline naming, and fixed typos
Ha, no worries. I put this on the back burner until RakuAST comes out. I really appreciate the detailed examples — they will give me plenty of food for thought.
Thankfully, I'm happy to take time with this. I'd rather get it right than do it fast.