file format :
base : RIFX (formtype : "KESF" ("Kitten Engine Serialization Format"))
chunks:
OBJ_ (with LETTER "o" ) {
// serialized object data
}
HEAD {
// header data (TBD)
U32 flags {
compressed? // TBD
hasOBJ? // valid files must have at least 1 of these 3 set
hasTypes?
hasStrings? // not required, but throw error if no string table exists and a string is encoded as reference
}
// it is perfectly valid to encode :
// database[n].kesf : serialized objects only
// typedefs.kesf : typedefs only
// strings.kesf : strings only
// and then load all three into the parser
// a good reason to do this would be having multiple serialized object files with the same typedefs and string table
}
STR_ {
// strings table (names for sure, literals?)
// String format : length(varint)-prefixed UTF-8
}
TDEF {
// datastruct definitions table
}
META { // ?
// meta-data (such as comments, creation tool, etc...)
// TBD , not required
}
object serialization format :
Byte type { // all objects start with a 1-byte type
u4 dataBaseType
u4 dataSubType
}
basetype = {
number = 0 // ints, floats
string = 1
array = 2
object = 3
datastruct = 4 // required type because datastructs are stored as a raw bytestream
}
subtypeNumber = {
u8 = 0
i8 = 1
u16 = 2
i16 = 3
u32 = 4
i32 = 5
float32 = 6
float64 = 7
u64 = 8 // JS doesn't natively support 64-bit integers (needs BigInt)
i64 = 9
dec64 = 10 // not natively supported anywhere???
BCD = 11 // may not implement these...
varint = 12
float16 = 13
u24 = 14
i24 = 15
}
subtypeString = {
flag isLiteral // non-zero : string data follows, zero : reference into string array follows
}
subTypeArray = {
number = 0
string = 1
dynamic = 2 // most common
datastruct = 3
numberfixed = 4 // numbers of a known type (preferred if possible since repeating number typeID is wasteful)
}
subTypeObject = {
none = 0 // objects are dynamic
}
subTypeDataStruct = {
// structure varies by implementation and is specified in the datastruct itself
// to achieve max efficiency, we'll convert this to a bitfield
flags = {
embedName // if 1, include the name reference
embedOrdinal // if 1, include the ordinal ID, takes priority since it's the fastest
hasTypedef // if 1, we expect a name or ordinal. if zero, we assume an external implementation handles parsing
sizeFixed // arrays only, we expect the typedef or an external implementation to supply the length
}
}
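illustrative sketch (TypeScript, not part of the spec) : packing/unpacking the 1-byte type, assuming the base type occupies the high nibble and the subtype the low nibble, which is what the example bytes further down imply (0x11 = literal string, 0x21 = string array, 0x30 = object).
// sketch : pack/unpack the 1-byte type, base type assumed in the high nibble
enum BaseType { Number = 0, String = 1, Array = 2, Object = 3, DataStruct = 4 }

function packType(base: BaseType, sub: number): number {
  return ((base & 0x0f) << 4) | (sub & 0x0f);
}

function unpackType(byte: number): { base: BaseType; sub: number } {
  return { base: (byte >> 4) & 0x0f, sub: byte & 0x0f };
}

// e.g. packType(BaseType.String, 1) === 0x11 (literal string, as seen in the examples below)
//      packType(BaseType.Array, 1)  === 0x21 (array of strings)
//      packType(BaseType.Object, 0) === 0x30 (dynamic object)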
format of serialized items :
number = {
[typeID] // an array with type : numberFixed, does not need this
number
}
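illustrative sketch (not part of the spec) : writing one serialized number. big-endian is assumed to match the RIFX container, only a few subtypes are covered, and packType/BaseType come from the sketch above.
// sketch : write [typeID] + value for a few number subtypes (big-endian assumed, matching RIFX)
// returns the offset just past the written bytes
function writeNumber(view: DataView, offset: number, sub: number, value: number, withTypeID = true): number {
  if (withTypeID) view.setUint8(offset++, packType(BaseType.Number, sub));
  switch (sub) {
    case 0: view.setUint8(offset, value);           return offset + 1; // u8
    case 2: view.setUint16(offset, value, false);   return offset + 2; // u16
    case 4: view.setUint32(offset, value, false);   return offset + 4; // u32
    case 6: view.setFloat32(offset, value, false);  return offset + 4; // float32
    case 7: view.setFloat64(offset, value, false);  return offset + 8; // float64
    default: throw new Error("subtype not covered by this sketch");
  }
}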
string = {
[typeID] // if in array, do not encode this
reference or literal, depending on "subtype"
}
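illustrative sketch (not part of the spec) : the string item. the varint is assumed to be the usual LEB128-style unsigned varint since the draft doesn't pin one down, and packType/BaseType come from the sketch above.
// sketch : LEB128-style unsigned varint (an assumption, the draft only says "varint")
function writeVarint(out: number[], value: number): void {
  do {
    let byte = value & 0x7f;
    value >>>= 7;
    if (value !== 0) byte |= 0x80;    // continuation bit
    out.push(byte);
  } while (value !== 0);
}

// sketch : literal = varint byteLength + UTF-8 bytes, reference = varint index into the STR_ table
function writeString(out: number[], value: string | number, withTypeID = true): void {
  const isLiteral = typeof value === "string";
  if (withTypeID) out.push(packType(BaseType.String, isLiteral ? 1 : 0));
  if (typeof value === "string") {
    const bytes = new TextEncoder().encode(value);
    writeVarint(out, bytes.length);
    out.push(...bytes);
  } else {
    writeVarint(out, value);          // reference : index into the string table
  }
}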
array = {
typeID
U32 length
switch(subtype) {
0 :
<length> serialized numbers // maximally inefficient method of encoding number array
1 :
<length> serialized strings
U8Array(Math.ceil(length / 8)) literalOrReferenceFlags
// can skip the string typeID since it's explicitly declared here
2 :
<length> serialized things // since it's anything, all items require their type ID , maximally inefficient, try to avoid this
3 :
U8 flags = {
embedTypedefNames // if 1, each datastruct includes its type name; if zero, we declare it here
hasTypeDef
// if 1, there actually is an included typedef and we should parse this as an object
// if zero, we assume an external implementation decodes the data. useful for hiding your "secrets" and embedding files.
// also the most efficient way to handle [de]serialization
// it is even perfectly valid to include the typedefs in the file, but not use them
useOrdinal // if 1, we use the typedef at a provided index in the types array
}
[String typedefName] // name of the typedef to use for this datastruct
[U32 ordinalID]
<length> datastructs // this is typically the most efficient method of storing objects, assuming a common format exists, ex. a map file in some game
4 :
numberTYPEID
<length> numbers of a type:TYPEID, this is the optimal way to encode an array of numbers because we don't need to specify a typeID per-number
}
}
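illustrative sketch (not part of the spec) : case 4 (numberFixed), the preferred encoding for a homogeneous number array. pushU32 is a placeholder helper; writeNumber/packType come from the sketches above.
// sketch : Array subtype 4 (numberFixed) - one array typeID, one number typeID, then raw values
function pushU32(out: number[], value: number): void {
  out.push((value >>> 24) & 0xff, (value >>> 16) & 0xff, (value >>> 8) & 0xff, value & 0xff);
}

function writeFixedNumberArray(out: number[], values: number[], numberSub: number): void {
  out.push(packType(BaseType.Array, 4));             // 0x24 : array, subtype numberFixed
  pushU32(out, values.length);                       // U32 length, big-endian
  out.push(packType(BaseType.Number, numberSub));    // numberTYPEID, written exactly once
  const scratch = new DataView(new ArrayBuffer(8));
  for (const v of values) {
    const used = writeNumber(scratch, 0, numberSub, v, false); // no per-value typeID
    for (let i = 0; i < used; i++) out.push(scratch.getUint8(i));
  }
}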
object = {
typeID
U32 byteLength
Byte[] data(byteLength) = {
Array [ // no element count is stored; on read, stop once byteLength bytes have been consumed; on write, byteLength is the total emitted after all properties are serialized
{
String name
serializedData value
}
]
}
}
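illustrative sketch (not part of the spec) : the dynamic object case. writeValue is a hypothetical dispatcher over the base types; packType/pushU32/writeString come from the sketches above.
// sketch : object = typeID + U32 byteLength + (name, value) pairs until byteLength is consumed
declare function writeValue(out: number[], value: unknown): void; // hypothetical dispatcher over the base types

function writeObject(out: number[], obj: Record<string, unknown>): void {
  const body: number[] = [];
  for (const [name, value] of Object.entries(obj)) {
    writeString(body, name);                 // property name (literal here, or a string-table reference)
    writeValue(body, value);                 // number / string / array / object / datastruct
  }
  out.push(packType(BaseType.Object, 0));    // 0x30
  pushU32(out, body.length);                 // byteLength tells the reader where the object ends
  out.push(...body);
}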
datastruct = {
typeID
[String typedefName] // in an array, we can and should [try to] skip encoding the type IDs per-entry
[U32 typedefOrdinal] // takes priority over name since it's faster
[u32 byteLength] // if it has a known size, we don't need this
Byte[] structData
}
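the draft lists the datastruct flag names but not their bit positions; assuming MSB-first in the listed order is consistent with the example bytes further down (0b{0110 0000} = embedOrdinal | hasTypedef) :
// sketch : assumed bit positions for the datastruct header flags, MSB-first in the order listed earlier
const DS_EMBED_NAME    = 0x80;
const DS_EMBED_ORDINAL = 0x40;
const DS_HAS_TYPEDEF   = 0x20;
const DS_SIZE_FIXED    = 0x10;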
// type definitions are essentially the object format again, but instead of the value, they encode the primitive type and the name
datastruct typedef format = {
// this being the actual type definition, name is mandatory
String name
{implicit U32 ordinalID} = the index of this item in the typedef array
U32 byteLength
U32 structFixedSize // can be zero, if non-zero and we're reading from an array : parser should assume encoded struct data is this size
Byte[] data(byteLength) = {
Array [ // no element count is stored; on read, stop once byteLength bytes have been consumed; on write, byteLength is the total emitted after all properties are serialized
{
typeID type
String name
}
]
}
}
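illustrative sketch (not part of the spec) : the reflective read path, i.e. decoding a struct by walking its typedef field-by-field. the TypeDef/TypeField shapes and readValue are hypothetical.
// sketch : reflective decode of a struct by walking its typedef fields in order
interface TypeField { typeID: number; name: string }
interface TypeDef   { name: string; structFixedSize: number; fields: TypeField[] }

declare function readValue(view: DataView, offset: number, typeID: number): { value: unknown; offset: number }; // hypothetical per-item decoder

function readStruct(view: DataView, offset: number, def: TypeDef): { value: Record<string, unknown>; offset: number } {
  const value: Record<string, unknown> = {};
  for (const field of def.fields) {
    const r = readValue(view, offset, field.typeID);  // decode one item of the declared primitive type
    value[field.name] = r.value;
    offset = r.offset;
  }
  return { value, offset };
}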
examples :
given JSON :
{
"tuna" : "fish",
"catnames" : ["garfield","felix","nolegs"],
"a" : {
"b" : {
"c": 0,
"d": 1,
"e": 2,
"f": 3,
}
}
} // 148 chars, prob 148 bytes in this case.
return serialized object (represented as string with comments denoting bytes and sizes) :
Object { // 30 <LL LL LL LL> (5)
Prop {type = String , name = "tuna", value = "fish"}, // 11 04 <T- U- N- A-> 11 04 <F- I- S- H-> (12)
Prop {type = Array<String>, name = "catnames" , value = ["garfield","felix","nolegs"]},
// 11 08 <c- a- t- n- a- m- e- s-> 21 00 00 00 03 (15)
0b{1110 0000} (1)
[
08 <g- a- r- f- i- e- l- d->, (9)
05 <f- e- l- i- x->, (6)
06 <n- o- l- e- g- s->, (7)
]
Prop {
type = Object,
name = "a", // 11 01 <a-> (3)
value = Object { // 30 <LL LL LL LL> (5)
Prop {
type = Object,
name ="b", // 11 01 <b-> (3)
value = { // 30 <LL LL LL LL> (5)
Prop {type = U8 , name = "c", value = 0}, // 11 01 <c-> 00 00 (5)
Prop {type = U8 , name = "d", value = 1}, // 11 01 <d-> 00 01 (5)
Prop {type = U8 , name = "e", value = 2}, // 11 01 <e-> 00 03 (5)
Prop {type = U8 , name = "f", value = 3}, // 11 01 <f-> 00 04 (5)
}
}
}
}
} // all total : 91 bytes, marginally smaller even in this absurd case, assuming no math errors
more realistic example, a color palette :
given JSON :
{
"colors" : [
{"r":0,"g":0,"b":0,"a":0}, ... // we'll say there's 256 of these
]
}
return types:
Array (2) [ // 00 00 00 02 (4)
Typedef {
name = "Palette" // 11 07 <P- a- l- e- t- t- e-> (9)
id = 0 // implicit (0)
length // <LL LL LL LL> (4)
structSize = 0 // 00 00 00 00 (4)
def = {
TypeProp {
// we also include a length for array def, if zero, length is supplied in the encoded struct
type = Array<TypeDef(ord:1 = "Color")>(0) // 23 0b{0110 0000} 00 00 00 01 , 00 00 00 00 (10)
name = "colors" // 11 06 <c- o- l- o- r- s-> (8)
}
}
},
Typedef {
name = "Color" // 11 05 <C- o- l- o- r-> (7)
id = 1 // implicit (0)
length // <LL LL LL LL> (4)
structSize = 4 // 00 00 00 04 (4)
def = {
TypeProp {
type = U8 // 00 (1)
name = "r" // 11 <r-> (2)
}
TypeProp {
type = U8 // 00 (1)
name = "g" // 11 <g-> (2)
}
TypeProp {
type = U8 // 00 (1)
name = "b" // 11 <b-> (2)
}
TypeProp {
type = U8 // 00 (1)
name = "a" // 11 <a-> (2)
}
}
}
] // all total 70 bytes (again, math!)
return serialized object :
DataStruct { // 40 0b{0110 0000} (2) , has ordinal, has typedef
ordinal = 0 // 00 00 00 00 (4)
length // <LL LL LL LL> (4)
// 00 00 01 00 (4) , we infer the property is named "colors" from typedef[0] : "Palette" and is type Array<Color>, all we need to "define" the array is its length
Array<Color> colors(256) : [
struct Color = { // the typedef also explicitly declares that we have a data struct with a known type and fixed size
U8 r ,
U8 g ,
U8 b ,
U8 a
} , ... // 4 bytes apiece times 256 entries is 1024 bytes , we don't need any of the struct declaration overhead because the typedefs handled this for us as well
]
} // all total 1038 bytes
combined, 1108 bytes... many times smaller than the JSON equivalent
also smaller than the serialized object equivalent
another noteworthy optimization is that this entire palette can simply be encoded as a U8[](1024)
because color palettes are such a basic structure (an array of 32-bit RGB[A] values, often 256 total), the overhead to define them is simply not necessary
care should always be given to the most optimal way to encode your data
it is missing the point to exclusively use serialized objects
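to illustrate the raw-U8[] shortcut above, a sketch (not part of the spec) that packs/unpacks a 256-entry RGBA palette as 1024 plain bytes :
// sketch : a 256-entry RGBA palette as 1024 raw bytes, no per-entry or per-struct overhead
interface Color { r: number; g: number; b: number; a: number }

function packPalette(colors: Color[]): Uint8Array {
  const bytes = new Uint8Array(colors.length * 4);
  colors.forEach((c, i) => bytes.set([c.r, c.g, c.b, c.a], i * 4));
  return bytes;
}

function unpackPalette(bytes: Uint8Array): Color[] {
  const colors: Color[] = [];
  for (let i = 0; i < bytes.length; i += 4) {
    colors.push({ r: bytes[i], g: bytes[i + 1], b: bytes[i + 2], a: bytes[i + 3] });
  }
  return colors;
}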
basic rules would be as follows :
1] single declaration object : serialized object
2] array of objects with a common format : array of datastructs + typedefs (if needed)
3] array of integer numbers : array of [U]Ints of appropriate size + format for your data
4] multiple files with a common format [containing objects of a common format] : n files with arrays of datastructs + 1 file with typedefs
5] repeated strings : string table + string reference declaration
so on, use common sense
it also is strongly advised to code-gen encoders/decoders for datastructs
object declaration and typedefs are extremely reflective by nature
datastructs are especially bad because most real-world examples will likely contain nested structs
this means recursively calling the typedef encoder/decoder, which is even more detrimental to performance than reflectively reading/writing an object
the most efficient use of typedefs is for documentation/debugging purposes or authortime tool[chain]s
runtime production code should use generated parsing code or JIT generate/compile the most optimized parser possible
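for illustration, a hypothetical code-generated decoder for the "Color" typedef from the palette example, the kind of thing the generated/JIT path would emit instead of walking the typedef reflectively :
// sketch : what generated code for the "Color" typedef could look like - no reflection, no typedef lookups at runtime
function decodeColor(view: DataView, offset: number): { r: number; g: number; b: number; a: number } {
  return {
    r: view.getUint8(offset),
    g: view.getUint8(offset + 1),
    b: view.getUint8(offset + 2),
    a: view.getUint8(offset + 3),
  };
}

function decodePalette(view: DataView, offset: number, count: number) {
  const colors = [];
  for (let i = 0; i < count; i++) colors.push(decodeColor(view, offset + i * 4));
  return colors;
}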
the performance penalties otherwise may be severe, especially with larger and more complex files
that said, just because you don't use the typedef in production, doesn't mean you shouldn't include it
that is solely at your discretion. OS/FOSS-oriented folk would prefer to include type information :P
also, treat typedefs with the same level of importance as source code
it is reasonable to assume that you will change data formats through the development/update cycle
if you lose your type definitions, you're gonna have a bad time.
the file format overhead is not required, but is encouraged.
any "valid"/compliant implementation of the object reader/writer should not care about it
it is however quite useful for organizing stuff
and, it doesn't consume a whole lot of bytes
rationale :
"but...":
JSON ? :
no types; size and bandwidth consumption bloat in proportion to the complexity/amount of data (a common problem all text-based serializers have).
can trade-off some performance to run compression over the data
BSON ? :
JSON, but binary
KESF can be used exclusively in its object serialization format to be "yet another BSON", but this is not how I intended it to be used
messagepack ? :
a step in the right direction, yes... but I still feel its main focus was on being another BSON
same as for BSON : yes, KESF can more or less act as a substitute, but that is not its intended use
due to my zealous focus on declaring types and sub-types, there are situations where messagepack probably produces [marginally] smaller files
most likely, however, you misused it for that to happen
kaitai structs ? :
uses a text-based serializer with a name that says enough
while the side effect of KESF is it can define file formats up to a certain extent, it was designed first and foremost to serialize high-level objects for games and software, compactly
plain old binary ? :
go right ahead, can't get any smaller and more optimized than that.
just remember that it doesn't typically have the same ease of use
in the most optimal conditions [aggressive use of typedefs and codegen], production KESF code/files are essentially plain old binary
"that one engine!" :
they each do their own thing, some better than others
in a rare win for Unity [IMO], type information is stored in a dedicated section/file rather than constantly re-specified every time an object is encoded
my decision to create a type tree/table is pure coincidence, done in complete ignorance of the fact that Unity uses a similar system, which I learned about several months later
"goals?" :
1] create a serialization format that was reasonably small without compression while still being capable of containing all the information needed to re-construct a high-level object
2] enforce typing as loose or as strict as an individual developer/team wishes to impose on themselves
3] bridge the gap between "simple text format anyone with notepad can edit" and "small binary format optimized for your computer processor to read", without a paywall/license and tons of complexity as has traditionally been the case
4] self-documenting binary blob without the insanity and bloat traditionally associated with it
5] allow for the possibility to embed arbitrary binary data in a serialized object with minimal overhead and especially no use of hex/base64/etc... text encodings
6] a file format which clearly identifies itself, declares what it contains, and where those contents are, compactly. RIFX was chosen for the following reasons :
6a] it's a chunk-container format using 4CCs to identify its chunks. it also specifies an additional "form type" field which should help software discriminate files it isn't meant to parse
6aa] the only other known RIFX implementation is Macromedia/Adobe Director/Shockwave. it seems likely that Macromedia designed* this format.
"designed" loosely speaking; this is just another instance in a series of blatant plagiarizations of Amiga's IFF format. the "evolution" of RIFX naturally would be IFF (Amiga) > RIFF (Microsoft) > RIFX (Macromedia[?]) > KESF (myself); with each additional party apparently contributing less significant change to the chosen base format. (and meaning that I copied it as-is)
6b] it uses big-endian byte order which is more intuitive to HUMAN PROGRAMMERS, unlike its own base format
6c] by default, it has practically no meaningful overhead. all chunks have a 4CC (ASCII string with exactly 4 characters, no length or null terminator; this also means it can be read/written as a U32) ID and a Uint32 length. the main chunk also has the aforementioned (6a) 4CC formtype. 12 bytes of file header information and 8 bytes per section header is pretty decent.
7] as an extension of 6, create a data/file format and system which can be integrated into any workflow/engine/framework/app/etc... with relative ease. even a novice programmer should be able to figure this out.
"the name?" :
see my game library/engine, "Kitten Engine" : once I have an implementation/workflow ready, this will be its default data serialization format, although using it won't be strictly enforced
WARNING :
in its current state, this is still a draft and likely contains errors/oversights
author/copyright : https://github.com/Brian151/