
@Juerd
Last active April 25, 2016 20:30
RFC: A more Perl6-esque "unpack"
================================
This is an idea for an "unpack" replacement. The basic reasoning behind it is
that number encodings and string encodings needn't be treated all that
differently. Instead of passing the name of a string encoding, you can pass
a native type object. When decoding things of determinable length, any number
of types can be given.
A variable-length thing without a length indication can only appear at the
end of the template.
Decode according to a template:
$blob.decode( [ ... ] )
Decode a string:
my $s = $blob.decode("utf8")
# actually short for: $blob.decode([ ::Inf => "utf8" ])
Decode a natively encoded numeric value:
my $i = $blob.decode(uint16);
Decode a natively encoded numeric value, and a string:
my ($n, $s) = $blob.decode([ num, "latin1" ]);
This doesn't work:
my ($s, $i) = $blob.decode([ "latin1", uint16 ]); # FAILS
# Can't determine string length!
Force endianness for a single value:
my $i = $blob.decode([ :big(uint32) ]);
Set default endianness for the rest of the template:
my @i = $blob.decode([ :big, uint32, uint16, uint8 ]);
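To make the two endianness hints concrete, here is a minimal sketch that uses only the existing `Blob.read-*` methods and the `Endian` enum of a reasonably recent Rakudo, not the proposed `decode`. The variable names and byte values are made up for illustration.

```raku
# Existing Rakudo API only; the template-based decode() above is a proposal.
my $blob = Blob.new(0x12, 0x34, 0x56, 0x78);

# The same four bytes give different values under each byte order.
say $blob.read-uint32(0, BigEndian).base(16);     # 12345678
say $blob.read-uint32(0, LittleEndian).base(16);  # 78563412

# Hand-written equivalent of the template [ :big, uint32, uint16, uint8 ]:
my $packed = Blob.new(0, 0, 0, 1,  0, 2,  3);
my @i = $packed.read-uint32(0, BigEndian),
        $packed.read-uint16(4, BigEndian),
        $packed.read-uint8(6);
say @i;  # [1 2 3]
```

The offsets 0, 4 and 6 are exactly the bookkeeping that a template would take over.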
Decode two byte-length-prefixed blobs:
my ($blob1, $blob2) = $blob.decode([ ::uint32 => Blob, ::uint32 => Blob ]);
or:
my ($blob1, $blob2) = $blob.decode([ (::uint32 => Blob) xx 2 ]);
Decode any number of byte-length-prefixed blobs:
my @blobs = $blob.decode([ ::Inf => [ ::uint32 => Blob ] ]);
Decode any number of byte-length-prefixed strings:
my @strings = $blob.decode([ ::Inf => [ ::uint32 => "Windows-1252" ] ]);
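What the `::Inf => [ ::uint32 => ... ]` templates above would have to do can be sketched by hand with existing Rakudo methods. The helper name `length-prefixed`, the little-endian default, and the use of "utf8" in place of "Windows-1252" are all assumptions made for this illustration.

```raku
# Hypothetical helper, not part of the proposal: split a blob made of
# uint32-length-prefixed chunks into a list of Blobs.
sub length-prefixed(Blob $blob, Endian $endian = LittleEndian) {
    my @chunks;
    my $pos = 0;
    while $pos < $blob.bytes {
        my $len = $blob.read-uint32($pos, $endian);  # the length prefix
        $pos += 4;
        @chunks.push: $blob.subbuf($pos, $len);      # the payload
        $pos += $len;
    }
    return @chunks;
}

# Two chunks: "Hi" (2 bytes) and "foo" (3 bytes).
my $blob = Blob.new(2, 0, 0, 0,  72, 105,   3, 0, 0, 0,  102, 111, 111);
my @strings = length-prefixed($blob).map(*.decode('utf8'));
say @strings;  # [Hi foo]
```

Note that the length prefixes steer the loop but never show up in the result, which matches the rule that pair keys are part of the template and not of the returned data.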
A list of identically typed things, with an element-count prefix (as opposed to a byte length):
my @i = $blob.decode([ :elems(uint8) => uint32 ]);
A sub-template with a typed byte length prefix:
[ ::uint32 => [ int32, uint16, "latin1" ] ]
A list of identically typed things, with a BYTE length prefix:
[ ::uint32 => uint32 ]
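The difference between an element-count prefix and a byte-length prefix is easiest to see on the wire. The sketch below decodes the same two uint32 values both ways by hand; it assumes little-endian data and uses only existing `Blob.read-*` methods, with invented variable names.

```raku
# Both blobs carry the uint32 values 1 and 2, little-endian.
my $counted = Blob.new(2,  1, 0, 0, 0,  2, 0, 0, 0);            # uint8 count = 2
my $sized   = Blob.new(8, 0, 0, 0,  1, 0, 0, 0,  2, 0, 0, 0);   # uint32 byte length = 8

# :elems(uint8) => uint32 : read N, then read N elements.
my $n = $counted.read-uint8(0);
my @a = (^$n).map: { $counted.read-uint32(1 + 4 * $_, LittleEndian) };

# ::uint32 => uint32 : read L, then read elements until L bytes are used up.
my $len = $sized.read-uint32(0, LittleEndian);
my @b = (^($len div 4)).map: { $sized.read-uint32(4 + 4 * $_, LittleEndian) };

say @a;  # [1 2]
say @b;  # [1 2]
```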
Skipping a byte with Nil (when packing (encoding), Nil becomes \0):
[ int, int, int, Nil, int, int ]
User-defined number encoding in the mix:
my ($command, $param) = $blob.decode([ :big, uint8, MQTT::Length => Blob ]);
if $command == 0x30 {
    my ($topic, $message) = $param.decode([:big,
        ::uint16 => "utf8",
        Blob
    ]);
}
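The proposal does not say what interface a user-defined type such as `MQTT::Length` would have to implement. As a rough sketch of the decoding logic such a type would need, here is MQTT's variable-length integer (7 payload bits per byte, high bit set when another byte follows) as a stand-alone sub. The name `mqtt-length` and its return shape are assumptions for this example.

```raku
# Hypothetical stand-alone decoder; returns the decoded value and the
# number of bytes the encoding occupied.
sub mqtt-length(Blob $blob, Int $start = 0) {
    my $value      = 0;
    my $multiplier = 1;
    my $pos        = $start;
    loop {
        die "truncated length"  if $pos >= $blob.bytes;
        die "length too long"   if $pos - $start >= 4;   # MQTT allows at most 4 bytes
        my $byte = $blob[$pos++];
        $value += ($byte +& 0x7F) * $multiplier;         # low 7 bits are payload
        last unless $byte +& 0x80;                       # high bit: more bytes follow
        $multiplier *= 128;
    }
    return $value, $pos - $start;
}

my ($len, $used) = mqtt-length(Blob.new(0xC1, 0x02));
say $len;   # 321
say $used;  # 2
```

Returning the consumed byte count alongside the value is one way a custom length type could tell the surrounding template where the payload starts.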
Note that:
* The KEY of a pair is part of the template, but NOT of the actual data returned
by decode. This holds true for length prefixes (key is a type object) and for
hints like :big and :little (key is a string).
* Pairs can nest like this:
    :big(uint16) => Blob
    :elems(:big(uint16)) => uint64
* The compiler will eat pairs, thinking they're named arguments. This is why
templates are arrays.
Things that P5's unpack does, that this proposal does not cover:
* Hexadecimal, binary, or uuencoded strings. These are actually string
encodings, and should be implemented as such. (p5 <b B h H u U>)
* Absolute position based extraction ('@' and '.' in p5's pack). Don't know if
this is actually ever used, or how it even works.
* Pointers to strings.
* Null-terminated strings. Just have a Nil in there.
Juerd <[email protected]>
@Juerd

Juerd commented Jan 6, 2016

Yes. I don't know which endianness should be the default, though. Let's ask Gulliver when he returns from his travels...

@timo

timo commented Apr 25, 2016

It'd really be nice to have terminator-specified decoding work somehow. For example, "latin1" or "utf8" could be understood, or configured, to stop at the first null byte it finds.
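As a rough illustration of what stopping at the first null byte would involve, here is a hand-written sketch with existing Rakudo methods. It is not proposed syntax, and the variable names are invented.

```raku
# "Hi", a null terminator, then one unrelated trailing byte.
my $blob = Blob.new(72, 105, 0, 42);

# Index of the first null byte, or the end of the blob if there is none.
my $end = $blob.list.first(* == 0, :k) // $blob.bytes;

say $blob.subbuf(0, $end).decode('utf8');  # Hi
```

A template-aware version would also need to report that `$end + 1` bytes were consumed, so that decoding can continue after the terminator.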

@Xliff

Xliff commented Apr 25, 2016

It took me a while, but I understand this and it makes some degree of sense. Especially since, as designed, it will fit into the already existing implementation.

Here's what took me some time to understand. Using the sub-template example above, [ ::uint32 => [ int32, uint16, "latin1" ] ], applied to two consecutive records:

      8, 0, 0, 0,     # uint32 length 0x08
      1, 0, 0, 0,     # int32
      2, 0,           # uint16
      65, 66,         # "latin1" (can only be 2 chars here, since it all needs to fit in 0x08 bytes)
      9, 0, 0, 0,     # uint32 length 0x09
      3, 1, 0, 0,     # int32
      2, 2,           # uint16
      67, 68, 69      # "latin1" (again, can only be 3 chars, because the length prefix is 9 bytes)
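To check that reading of the layout, the bytes can be decoded by hand with existing `Blob` methods. This sketch assumes little-endian values, since the RFC leaves the default byte order open, and spells the encoding as 'iso-8859-1'.

```raku
my $blob = Blob.new(
    8, 0, 0, 0,   1, 0, 0, 0,   2, 0,   65, 66,
    9, 0, 0, 0,   3, 1, 0, 0,   2, 2,   67, 68, 69,
);

my @records;
my $pos = 0;
while $pos < $blob.bytes {
    my $len    = $blob.read-uint32($pos, LittleEndian);  # byte-length prefix
    my $record = $blob.subbuf($pos + 4, $len);           # the sub-template's bytes
    my $i   = $record.read-int32(0, LittleEndian);
    my $u   = $record.read-uint16(4, LittleEndian);
    my $str = $record.subbuf(6).decode('iso-8859-1');    # whatever bytes are left
    @records.push: "$i $u $str";
    $pos += 4 + $len;
}
.say for @records;
# 1 2 AB
# 259 514 CDE
```

The string has no length of its own; it simply takes the bytes that remain once the fixed-size fields have been read from the length-delimited record.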

How do you suggest handling things like headers, though? Where the string length is known in advance, it seems remiss not to support that in this design. Here's a suggestion that follows what you have already thought up: extend the byte-length-prefix notation to accept a static length:

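use Test;   # provides is-deeply

my $b = Buf.new(65, 66, 67, 68);
# Proposed syntax: a literal 4 as the key means "exactly four bytes".
my @i = $b.decode( 4 => "latin1" );
my @expected = ("ABCD");
is-deeply @i, @expected,
        "extracting a sub-template with a byte length prefix";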

Is there a particular reason why you think something like this is unnecessary?
