RFC: A more Perl6-esque "unpack"
================================
This is an idea for an "unpack" replacement. The basic reasoning behind it is
that number encodings and string encodings needn't be treated all that
differently. Instead of passing the name of a string encoding, you can pass
a native type object. When decoding things of determinable lengths, any number
of types can be given.
A variable-length thing without a length indication can only be passed at the
end.
Decode according to a template:
    $blob.decode( [ ... ] )
Decode a string:
    my $s = $blob.decode("utf8");
    # actually short for: $blob.decode([ ::Inf => "utf8" ])
Decode a natively encoded numeric value:
    my $i = $blob.decode(uint16);
Decode a natively encoded numeric value, and a string:
    my ($n, $s) = $blob.decode([ num, "latin1" ]);
This doesn't work:
    my ($s, $i) = $blob.decode([ "latin1", uint16 ]);  # FAILS
    # Can't determine string length!
Force endianness for a single value:
    my $i = $blob.decode([ :big(uint32) ]);
Set default endianness for the rest of the template:
    my @i = $blob.decode([ :big, uint32, uint16, uint8 ]);
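For comparison, recent Rakudo releases already let you choose the endianness per read through the read-uint16 / read-uint32 family of Blob methods. The sketch below spells out by hand what the :big template above would presumably do; the offset bookkeeping is exactly what a template would hide. It is an illustration with made-up sample bytes, not the proposed decode API.

```raku
# Sketch only: hand-rolled equivalent of
#   $blob.decode([ :big, uint32, uint16, uint8 ])
# using Blob read methods that exist in Rakudo today.
my $blob = Blob.new(0x00, 0x00, 0x01, 0x02,   # uint32, big-endian
                    0x03, 0x04,               # uint16, big-endian
                    0x05);                    # uint8

my $a = $blob.read-uint32(0, BigEndian);   # bytes 0..3
my $b = $blob.read-uint16(4, BigEndian);   # bytes 4..5
my $c = $blob.read-uint8(6);               # byte 6, endianness is irrelevant

say ($a, $b, $c);   # (258 772 5)

# The same four bytes read little-endian give a different number.
say $blob.read-uint32(0, LittleEndian);    # 33619968
```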
Decode two byte-length-prefixed blobs:
    my ($blob1, $blob2) = $blob.decode([ ::uint32 => Blob, ::uint32 => Blob ]);
or:
    my ($blob1, $blob2) = $blob.decode([ (::uint32 => Blob) xx 2 ]);
Decode any number of byte-length-prefixed blobs:
    my @blobs = $blob.decode([ ::Inf => [ ::uint32 => Blob ] ]);
Decode any number of byte-length-prefixed strings:
    my @strings = $blob.decode([ ::Inf => [ ::uint32 => "Windows-1252" ] ]);
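A rough sketch of what the open-ended ::Inf template could do under the hood: read a length, slice off that many bytes, and repeat until the data runs out. The helper decode-prefixed is made up for illustration and is not part of the proposal. It assumes little-endian uint32 prefixes, since the RFC does not fix a default endianness.

```raku
# Sketch only: what [ ::Inf => [ ::uint32 => Blob ] ] could expand to.
# Assumption: little-endian uint32 byte-length prefixes.
sub decode-prefixed(Blob $blob --> List) {
    my @chunks;
    my $pos = 0;
    while $pos < $blob.elems {
        my $len = $blob.read-uint32($pos, LittleEndian);
        $pos += 4;
        die "length prefix runs past end of data"
            if $pos + $len > $blob.elems;
        @chunks.push: $blob.subbuf($pos, $len);
        $pos += $len;
    }
    @chunks.List;
}

my $data = Blob.new(2, 0, 0, 0, 72, 105,      # length 2, "Hi"
                    3, 0, 0, 0, 97, 98, 99);  # length 3, "abc"

my @blobs   = decode-prefixed($data);
my @strings = @blobs.map(*.decode("latin1"));
say @strings;   # [Hi abc]
```

Decoding to strings is then just a map over the resulting blobs, which is the step the "Windows-1252" template entry would fold in.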
A list of same-typed things, with a counter prefix (as opposed to byte length):
    my @i = $blob.decode([ :elems(uint8) => uint32 ]);
A sub-template with a typed byte length prefix:
    [ ::uint32 => [ int32, uint16, "latin1" ] ]
A list of same-typed things, with a BYTE length prefix:
    [ ::uint32 => uint32 ]
Skipping a byte with Nil (when packing (encoding), Nil becomes \0):
    [ int, int, int, Nil, int, int ]
User-defined number encoding in the mix:
    my ($command, $param) = $blob.decode([ :big, uint8, MQTT::Length => Blob ]);
    if $command == 0x30 {
        my ($topic, $message) = $param.decode([:big,
            ::uint16 => "utf8",
            Blob
        ]);
    }
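To make the "user-defined number encoding" concrete: MQTT's remaining-length field is a variable-length integer, 7 data bits per byte, least significant group first, with the high bit flagging that another byte follows. How a type like MQTT::Length would hook into decode is not specified here, so the sketch below is only a standalone, hypothetical helper (decode-mqtt-length) that returns the value and the number of bytes it consumed.

```raku
# Sketch only: the variable-length integer MQTT uses for "remaining length".
# decode-mqtt-length is a hypothetical helper, not part of the RFC.
sub decode-mqtt-length(Blob $blob, Int $start = 0) {
    my $value  = 0;
    my $factor = 1;
    my $pos    = $start;
    loop {
        die "truncated length field" if $pos >= $blob.elems;
        die "length field longer than 4 bytes" if $pos - $start >= 4;
        my $byte = $blob[$pos++];
        $value  += ($byte +& 0x7F) * $factor;   # low 7 bits carry data
        $factor *= 128;
        last unless $byte +& 0x80;              # high bit means "more follows"
    }
    return $value, $pos - $start;
}

# 0xC1 0x02 encodes 65 + 2 * 128 = 321.
my ($len, $used) = decode-mqtt-length(Blob.new(0x30, 0xC1, 0x02), 1);
say "$len $used";   # 321 2
```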
Note that:
* The KEY of a pair is part of the template, but NOT of the actual data returned
  by decode. This holds true for length prefixes (key is a type object) and for
  hints like :big and :little (key is a string).
* Pairs can nest like this:
      :big(uint16) => Blob
      :elems(:big(uint16)) => uint64
* The compiler will eat pairs, thinking they're named arguments. This is why
  templates are arrays.
Things that P5's unpack does, that this proposal does not cover:
* Hexadecimal, binary, or uuencoded strings. These are actually string
  encodings, and should be implemented as such. (p5 <b B h H u U>)
* Absolute position based extraction ('@' and '.' in p5's pack). Don't know if
  this is actually ever used, or how it even works.
* Pointers to strings.
* Null-terminated strings. Just have a Nil in there.
Juerd <[email protected]>
It'd really be nice to have terminator-specified decoding work somehow. For example, a "latin1" or "utf8" entry could be understood (or configured) to stop at the first null byte it finds.
It took me a while, but I understand this and it makes some degree of sense, especially since, as designed, it will fit into the already existing implementation.
Here's what took me some time to understand. Using the sub-template example above, [ ::uint32 => [ int32, uint16, "latin1" ] ], applied twice:
    8, 0, 0, 0,   # uint32 length 0x08
    1, 0, 0, 0,   # int32
    2, 0,         # uint16
    65, 66,       # "latin1" (can only be 2 chars here, since it all needs to fit in 0x08 bytes)
    9, 0, 0, 0,   # uint32 length 0x09
    3, 1, 0, 0,   # int32
    2, 2,         # uint16
    67, 68, 69    # "latin1" (again, can only be 3 characters, because the length prefix is 9 bytes)
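To check that reading of the bytes, here is the same listing decoded by hand with the Blob methods Rakudo has today. This is only a sketch of the semantics as I understand them, not the proposed API, and it assumes little-endian values (which is what "8, 0, 0, 0" meaning 8 implies).

```raku
# Sketch only: decoding the byte listing above by hand.
# Assumption: all numeric fields are little-endian.
my $blob = Blob.new(
    8, 0, 0, 0,  1, 0, 0, 0,  2, 0,  65, 66,
    9, 0, 0, 0,  3, 1, 0, 0,  2, 2,  67, 68, 69,
);

my @records;
my $pos = 0;
while $pos < $blob.elems {
    my $len = $blob.read-uint32($pos, LittleEndian);   # byte-length prefix
    my $sub = $blob.subbuf($pos + 4, $len);            # the sub-template's bytes
    @records.push: (
        $sub.read-int32(0, LittleEndian),
        $sub.read-uint16(4, LittleEndian),
        $sub.subbuf(6).decode("latin1"),               # whatever bytes remain
    );
    $pos += 4 + $len;
}

say @records;   # [(1 2 AB) (259 514 CDE)]
```

The trailing string gets its length implicitly: it is whatever is left of the prefixed chunk once the fixed-size fields have been read.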
How do you suggest handling things like headers, though? Where the string length is known in advance, it seems remiss not to cover that case in this design. Here's a suggestion that follows what you have already thought up: extend the byte-prefix notation to accept a static length:
    my $b = Buf.new(65, 66, 67, 68);
    my @i = $b.decode( 4 => "latin1" );
    my @expected = ("ABCD");
    is-deeply @i, @expected,
        "extracting a sub-template with a byte length prefix";
Is there a particular reason why you think something like this is unnecessary?
Yes. I don't know which endianness should be the default, though. Let's ask Gulliver when he returns from his travels...