- msgpack provides users with a mechanism to deserialize/serialize string objects transparently without causing incompatibility
- msgpack gives code layered on top of it a mechanism to define custom types without changing the msgpack specification
- msgpack stays compatible and does not affect existing code even if new types are added to msgpack in the future
- Add "Extension" type
- Add FixExt, ext 8, ext 16, ext 32 formats
- Define Binary type as a part of the Extension type (Extension tag=0)
- Applications can define custom types using the "Extension" type
- Deserializers with "binary_extension" enabled let users distinguish byte arrays from strings
- Deserializers without "binary_extension" offer full compatibility with existing data and implementations
- The Raw type remains defined as a type representing ambiguous data
Programs fall into three groups:
- weak-string code: programs where the distinction between strings and byte arrays is ambiguous (from the serializer's point of view)
  - programs written in languages which don't have types to distinguish strings from byte arrays (PHP, C++, Erlang, OCaml, etc.)
  - programs written in languages which use flags or other additional information to mark strings, where that information is not used in some common use cases (e.g. Perl)
- statically-typed strong-string code: programs where the distinction between strings and byte arrays is clear (from the serializer's point of view) and type information is given to deserializers (deserializers can use the type information as a schema)
- dynamically-typed strong-string code: programs where the distinction between strings and byte arrays is clear but type information is not given to deserializers, so programs expect restored objects to already have the expected types
It's unrealistic to expect msgpack implementations to distinguish strings from byte arrays in weak-string code, because programmers would need extra work to set markers meaning "this is a string" on all strings, or markers meaning "this is a byte array" on all byte arrays.
Validating every string before storing it would also hurt performance significantly. Thus, even if msgpack had a type to represent UTF-8 strings, deserializers couldn't assume that it always contains valid UTF-8.
On the other hand, in object exchanges where the deserializer is written in a dynamically-typed strong-string language, data stored as a string must be transparently restored as a string, and data serialized as a byte array must be restored as a byte array. This requirement exists only in this case.
Applications have two options for the above problem:
- ambiguity-tolerant behavior: deserializers accept data which doesn't clearly distinguish strings from byte arrays. These deserializers don't require serializers to distinguish them clearly. (This is the same as current msgpack.)
- ambiguity-strict behavior: deserializers assume that data clearly distinguishes strings from byte arrays. These deserializers require serializers to distinguish them clearly.
I assume that ambiguity-tolerant behavior and ambiguity-strict behavior don't coexist:
- If at least one deserializer works with ambiguity-strict behavior, strings and byte arrays have to be clearly distinguished in all data
- Otherwise, all deserializers have to work with ambiguity-tolerant behavior
Deserializers working with ambiguity-strict behavior assume that the Raw type contains only valid UTF-8 strings, and that byte arrays are stored using the Binary type (newly added as part of the Extension type).
The above limitation provides the following advantages:
- We don't have to add both String and Binary types. Thus msgpack can store strings in fewer bytes
- Applications which don't use byte arrays at all don't have to worry about the ambiguity between strings and byte arrays
  - This assumes that applications which don't use byte arrays outnumber applications which don't use strings
- We can keep msgpack's type system simple
On the other hand, it brings the following disadvantages:
- It's difficult to switch the behavior of deserializers to ambiguity-strict behavior
  - Users need to keep using ambiguity-tolerant behavior, or change code and convert all data at the same time
Note: other approaches may not be able to solve this disadvantage either.
- Extension type: a tuple of a byte array and an integer called Extension tag
- Binary: part of the Extension type (Extension tag=0). Binary represents byte arrays
- Raw: UTF-8 encoded strings, or byte arrays
  - Applications may agree that Raw represents only valid UTF-8 encoded strings (ambiguity-strict behavior)
  - Those applications assume that Raw represents only valid UTF-8 encoded strings, and that byte arrays are stored using the Binary type
In applications that assume the Raw type represents only valid UTF-8 strings, how deserializers handle Raw data containing byte sequences that are invalid as UTF-8 is implementation-dependent.
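To make the data model above concrete, here is a minimal C sketch of the Extension type as a pair of a tag and a byte array, with Binary as the tag=0 case. The names `ext_value`, `EXT_TAG_BINARY`, and `ext_is_binary` are illustrative assumptions and not part of any msgpack API. Treating the tag as an unsigned byte is also an assumption, since the proposal only says it is a 1-byte integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the proposal defines only the concepts. */
enum { EXT_TAG_BINARY = 0 };  /* Binary is the Extension type with tag 0 */

typedef struct {
    uint8_t        tag;   /* Extension tag (assumed unsigned here) */
    const uint8_t *data;  /* payload bytes, not owned by this struct */
    size_t         len;   /* payload length in bytes */
} ext_value;

/* Returns 1 when the Extension value is a Binary (tag 0), else 0. */
int ext_is_binary(const ext_value *v)
{
    assert(v != NULL);
    return v->tag == EXT_TAG_BINARY;
}
```

Any other tag value is left to applications, which is what lets custom types be added without touching the specification.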
0xc0 11000000 nil (Nil type)
0xc1 11000001 (never used)
0xc2 11000010 false (Boolean type)
0xc3 11000011 true (Boolean type)
0xc4 11000100 FixExt 4 (Extension type 4byte) // new
0xc5 11000101 FixExt 5 (Extension type 5byte) // new
0xc6 11000110 FixExt 6 (Extension type 6byte) // new
0xc7 11000111 FixExt 7 (Extension type 7byte) // new
0xc8 11001000 FixExt 8 (Extension type 8byte) // new
0xc9 11001001 ext 8 (Extension type 8bit) // new
...
0xd4 11010100 FixExt 0 (Extension type 0byte) // new
0xd5 11010101 FixExt 1 (Extension type 1byte) // new
0xd6 11010110 FixExt 2 (Extension type 2byte) // new
0xd7 11010111 FixExt 3 (Extension type 3byte) // new
0xd8 11011000 ext 16 (Extension type 16bit) // new
0xd9 11011001 ext 32 (Extension type 32bit) // new
0xda 11011010 raw 16 (Raw type 16bit)
0xdb 11011011 raw 32 (Raw type 32bit)
0xdc 11011100 array 16 (Array type 16bit)
0xdd 11011101 array 32 (Array type 32bit)
0xde 11011110 map 16 (Map type 16bit)
0xdf 11011111 map 32 (Map type 32bit)
Format of the Extension type:
FixExt 1
+--------+--------+--------+
|  0xd5  |  0xTT  |XXXXXXXX|
+--------+--------+--------+
=> 1 byte of application-specific object
ext 8
+--------+--------+--------+--------
|  0xc9  |  0xTT  |XXXXXXXX|...N bytes
+--------+--------+--------+--------
=> XXXXXXXX (=N) bytes of application-specific object
Where "0xTT" means a 1-byte integer which represents an Extension tag.
The Binary type is part of the Extension type and uses 0 as its Extension tag.
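The layouts above can be sketched as a small encoder. This is only an illustration under stated assumptions: the function name `ext_encode_small` and its buffer-based signature are hypothetical, the tag is treated as an unsigned byte, and payloads longer than 255 bytes (ext 16 and ext 32, whose layouts are not shown here) are simply rejected. It also assumes a FixExt format is preferred whenever the payload is 8 bytes or shorter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes one Extension value into out (capacity cap).
 * Returns the number of bytes written, or 0 if the value does not fit
 * or is longer than ext 8 can express. */
size_t ext_encode_small(uint8_t *out, size_t cap, uint8_t tag,
                        const uint8_t *data, size_t len)
{
    size_t pos = 0;
    assert(out != NULL);
    assert(data != NULL || len == 0);

    if (len <= 8) {
        if (cap < 2 + len) return 0;
        /* FixExt 0..3 -> 0xd4..0xd7, FixExt 4..8 -> 0xc4..0xc8 */
        out[pos++] = (uint8_t)(len < 4 ? (0xd4 | len) : (0xc0 | len));
        out[pos++] = tag;              /* 0xTT */
    } else if (len <= 0xff) {
        if (cap < 3 + len) return 0;
        out[pos++] = 0xc9;             /* ext 8 */
        out[pos++] = tag;              /* 0xTT */
        out[pos++] = (uint8_t)len;     /* XXXXXXXX = N */
    } else {
        return 0;                      /* ext 16 / ext 32 not covered */
    }
    if (len > 0) memcpy(out + pos, data, len);
    return pos + len;
}
```

For example, a 1-byte payload `0xab` with tag `0x05` would be written as the three bytes `0xd5 0x05 0xab`, matching the FixExt 1 diagram.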
Implementations of serializers and deserializers should offer applications a "binary_extension" option so that applications can choose ambiguity-strict or ambiguity-tolerant behavior.
- Serializers:
  - If binary_extension=false, serializers store byte arrays using the Raw type
  - If binary_extension=true, serializers store byte arrays (which are clearly distinguished from strings) using the Binary type (Extension type with tag=0)
  - In weak-string code, serializers use the Raw type to store strings or byte arrays regardless of the binary_extension option
  - For languages which don't have types to distinguish strings from byte arrays, msgpack implementations provide users with a way to set markers on byte arrays (such as a wrapper class)
- Deserializers:
  - If binary_extension=false, deserializers don't validate UTF-8 at all when restoring the Raw type. If the language's string objects can't hold invalid byte sequences, deserializers don't restore the Raw type into the string type. (ambiguity-tolerant behavior)
  - If binary_extension=false, deserializers may restore the Binary type and the Raw type into the same type
  - If binary_extension=true, deserializers restore the Raw type into a string object. (ambiguity-strict behavior)
  - If binary_extension=true, deserializers may validate UTF-8 when restoring the Raw type. How deserializers handle strings that contain byte sequences invalid as UTF-8 is implementation-dependent; here are some examples:
    - it returns an instance of a special class which has a field to hold the original byte sequence
    - it calls a registered callback function and returns the value returned by the function
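The deserializer rules above can be sketched in C as follows. This is a sketch under assumptions, not a prescribed implementation: the names `raw_kind`, `utf8_is_valid`, and `classify_raw` are hypothetical, the validator is a strict one (it rejects overlong encodings, surrogates, and code points above U+10FFFF), and the caller is expected to decide what to do with `RAW_INVALID_UTF8`, for instance by wrapping the bytes or invoking a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    RAW_AMBIGUOUS,    /* binary_extension=false: no validation performed */
    RAW_STRING,       /* binary_extension=true and the bytes are valid UTF-8 */
    RAW_INVALID_UTF8  /* binary_extension=true but validation failed */
} raw_kind;

/* Returns 1 if s[0..n) is well-formed UTF-8, else 0. */
int utf8_is_valid(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t c = s[i];
        size_t need;       /* number of continuation bytes expected */
        uint32_t cp, min;  /* decoded code point and smallest legal value */
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xe0) == 0xc0) { need = 1; cp = c & 0x1f; min = 0x80; }
        else if ((c & 0xf0) == 0xe0) { need = 2; cp = c & 0x0f; min = 0x800; }
        else if ((c & 0xf8) == 0xf0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                       /* stray continuation or bad lead byte */
        if (n - i <= need) return 0;         /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            uint8_t cc = s[i + k];
            if ((cc & 0xc0) != 0x80) return 0;
            cp = (cp << 6) | (cc & 0x3f);
        }
        if (cp < min) return 0;                      /* overlong encoding */
        if (cp >= 0xd800 && cp <= 0xdfff) return 0;  /* surrogate */
        if (cp > 0x10ffff) return 0;                 /* beyond Unicode range */
        i += need + 1;
    }
    return 1;
}

/* Decides how a Raw payload should be restored under the given option. */
raw_kind classify_raw(int binary_extension, const uint8_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    if (!binary_extension)
        return RAW_AMBIGUOUS;   /* ambiguity-tolerant: never validate */
    return utf8_is_valid(s, n) ? RAW_STRING : RAW_INVALID_UTF8;
}
```

Validation is optional in the proposal ("may validate"), so an implementation could equally skip the check and always report a string when binary_extension=true.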
If an implementation enables binary_extension=true by default, this should be clearly stated in its documentation. Typical msgpack implementations would enable binary_extension=true by default.
If new types are added to msgpack in the future, they would be implemented as follows (using a Time type as an example):
- Serializers:
  - If time_extension=true, serializers automatically use the Time type (which is part of the Extension type) to store time objects
  - If time_extension=false, serializers don't automatically use the Time type
- Deserializers:
  - If time_extension=true, deserializers restore the Time type into a time object
  - If time_extension=false, deserializers restore the object into a tuple of an integer and a byte array
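The deserializer side of this rule amounts to a per-tag dispatch with a generic fallback, which might look like the sketch below. Everything named here is hypothetical: the proposal does not assign a tag to the Time type, so `EXT_TAG_TIME_HYPOTHETICAL` is a placeholder value, and `decoder_options`, `restored_form`, and `choose_restored_form` are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder tag; the proposal does not assign one to the Time type. */
enum { EXT_TAG_TIME_HYPOTHETICAL = 1 };

typedef enum {
    RESTORED_TIME,      /* restore into a native time object */
    RESTORED_EXT_TUPLE  /* restore into a generic (integer tag, byte array) tuple */
} restored_form;

typedef struct {
    int time_extension; /* nonzero: restore the Time type natively */
} decoder_options;

/* Picks how an Extension value with the given tag should be restored. */
restored_form choose_restored_form(const decoder_options *opt, uint8_t tag)
{
    assert(opt != NULL);
    if (opt->time_extension && tag == EXT_TAG_TIME_HYPOTHETICAL)
        return RESTORED_TIME;
    /* Unknown tags, and known tags whose option is off, stay generic. */
    return RESTORED_EXT_TUPLE;
}
```

Because the fallback is always the generic tuple, data written with a newer type remains readable by code that has not opted in to it.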
Wrapper libraries for msgpack can define custom types using the Extension type without affecting the msgpack specification.
The MessagePack specification without the Extension type is named "Basic Profile." Under this profile, applications are required to use ambiguity-tolerant behavior.
Data which follows the Basic Profile represents the same data regardless of applications' choices. Thus users can keep data loosely coupled from applications.
Note: existing msgpack implementations can be assumed to support only the Basic Profile.
The MessagePack specification with the Extension type is named "Application Profile." Applications can choose ambiguity-strict behavior or ambiguity-tolerant behavior.
Applications can define application-specific types using the Extension type.
Note: This is one possible topic for future discussion.
- type of the keys of maps must be Raw
- keys of maps must be sorted by bytes
- objects must be stored using the smallest format
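The key-ordering rule above could be implemented with a comparator along these lines. This is a sketch under assumptions: "sorted by bytes" is read here as lexicographic comparison of unsigned bytes, with the shorter key first when one key is a prefix of the other, which the text does not spell out. The names `raw_key`, `raw_key_cmp`, and `sort_raw_keys` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const uint8_t *bytes;  /* Raw key payload */
    size_t         len;    /* payload length in bytes */
} raw_key;

/* qsort comparator: bytewise order, shorter key first on a shared prefix. */
int raw_key_cmp(const void *a, const void *b)
{
    const raw_key *x = (const raw_key *)a;
    const raw_key *y = (const raw_key *)b;
    size_t m = x->len < y->len ? x->len : y->len;
    int c = m ? memcmp(x->bytes, y->bytes, m) : 0;  /* memcmp compares as unsigned char */
    if (c != 0) return c;
    return (x->len > y->len) - (x->len < y->len);
}

/* Sorts the keys of a map in place before serialization. */
void sort_raw_keys(raw_key *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    if (n > 1) qsort(keys, n, sizeof keys[0], raw_key_cmp);
}
```

A real serializer would sort key/value pairs together; only the keys are shown here to keep the comparison rule in focus.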
- In a minor release, deserializers support the Extension type with tag=0 (Binary type) and return it as the same type as the Raw type
- In a major release, deserializers and serializers support the binary_extension option
  - If binary_extension is enabled by default, this should be stated in the documentation
- In a major release, deserializers support the Extension type and return an object of a custom class (or something similar) which represents a tuple of an integer and a byte array
- In a major release, serializers support the Extension type and store objects of that custom class (or something similar) using the Extension type
These format-byte assignments keep the implementation of serializers and deserializers simple.
We can optimize the implementation of deserializers as follows:
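int length;
switch(b) {
case 0xc4: case 0xc5: case 0xc6: case 0xc7: case 0xc8:
    // FixExt 4..8: the low 4 bits hold the length
    length = b & 0x0f;
    goto fixext;
case 0xd4: case 0xd5: case 0xd6: case 0xd7:
    // FixExt 0..3: the low 2 bits hold the length
    length = b & 0x03;
fixext:
    // …
    break;
}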
or:
if((0xc4 <= b && b <= 0xc8) || (0xd4 <= b && b <= 0xd7)) {
    // bit 4 is set only for 0xd4..0xd7; shifted down to bit 2, the XOR removes the extra 4
    length = (b & 0x0f) ^ ((b & 0x10) >> 2);
    // …
}
We can optimize the implementation of serializers as follows:
if(length < 4) {
    // FixExt 0..3 -> 0xd4..0xd7
    int b = 0xd4 | length;
    // …
} else if(length <= 8) {
    // FixExt 4..8 -> 0xc4..0xc8
    int b = 0xc0 | length;
    // …
} else {
    …
}
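As a self-contained check of the FixExt assignments, the sketch below pairs the header choice with the branch-free length decode and verifies that they are inverses for every length from 0 to 8. The function names `fixext_header`, `fixext_length`, and `fixext_roundtrip_ok` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Header byte for a FixExt payload of 0..8 bytes. */
uint8_t fixext_header(unsigned length)
{
    assert(length <= 8);
    /* 0..3 -> 0xd4..0xd7, 4..8 -> 0xc4..0xc8 */
    return (uint8_t)(length < 4 ? (0xd4 | length) : (0xc0 | length));
}

/* Payload length for a FixExt header byte, or -1 if b is not FixExt. */
int fixext_length(uint8_t b)
{
    if ((0xc4 <= b && b <= 0xc8) || (0xd4 <= b && b <= 0xd7))
        return (b & 0x0f) ^ ((b & 0x10) >> 2);
    return -1;
}

/* Returns 1 if decode(encode(n)) == n for every n in 0..8, else 0. */
int fixext_roundtrip_ok(void)
{
    for (unsigned n = 0; n <= 8; n++)
        if (fixext_length(fixext_header(n)) != (int)n)
            return 0;
    return 1;
}
```

Note that length 4 has to take the 0xc0 branch: 0xd4 already has bit 2 set, so `0xd4 | 4` would collide with the FixExt 0 header.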