heroseh/enc.md

Encoder/Decoder System

.enc File

Data Types

Basic Types

bool
unsigned integers
- char
- u8
- u16
- u32
- u64
signed integers
- s8
- s16
- s32
- s64
floating point
- f32
- f64

Basic types can also have minimum and maximum values specified when used within a struct type. The values are only clamped when using the generated set functions or when encoding or decoding the data as cson.

Example:

struct Example {
    u16(100, 300)   field_a;
    s16(-100, 300)  field_b;
    f32(0.f, 100.f) field_c;
};

Here is the example of the code generated version of that struct in C:

typedef struct Example Example;
struct Example {
    u16 field_a;
    s16 field_b;
    f32 field_c;
};

#define example_field_a_set(obj, value) ((obj).field_a = clamp_u16(value, 100, 300))
#define example_field_b_set(obj, value) ((obj).field_b = clamp_s16(value, -100, 300))
#define example_field_c_set(obj, value) ((obj).field_c = clamp_f32(value, 0.f, 100.f))
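
The set macros above call clamp_u16, clamp_s16 and clamp_f32, which are not shown in this document. Here is a minimal sketch of what such helpers might look like, assuming the usual fixed-width typedefs. The helper bodies are an assumption, and the real implementations may differ:

```c
#include <assert.h>
#include <stdint.h>

// Assumed typedefs; this document uses the names but does not define them.
typedef uint16_t u16;
typedef int16_t  s16;
typedef float    f32;

// Hypothetical clamp helpers: limit `value` to the inclusive range [min, max].
static inline u16 clamp_u16(u16 value, u16 min, u16 max) {
    assert(min <= max);
    return value < min ? min : (value > max ? max : value);
}

static inline s16 clamp_s16(s16 value, s16 min, s16 max) {
    assert(min <= max);
    return value < min ? min : (value > max ? max : value);
}

static inline f32 clamp_f32(f32 value, f32 min, f32 max) {
    assert(min <= max);
    return value < min ? min : (value > max ? max : value);
}

typedef struct Example Example;
struct Example {
    u16 field_a;
    s16 field_b;
    f32 field_c;
};

// Same shape as the generated setter for field_a shown above.
#define example_field_a_set(obj, value) ((obj).field_a = clamp_u16(value, 100, 300))
```

Writing to obj.field_a directly bypasses the clamp, which matches the note that values are only clamped through the set functions or the cson de/encoder.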

Vector Types

u8x2
u16x2
u32x2
u64x2
s8x2
s16x2
s32x2
s64x2
f32x2
f64x2
u8x3
u16x3
u32x3
u64x3
s8x3
s16x3
s32x3
s64x3
f32x3
f64x3
u8x4
u16x4
u32x4
u64x4
s8x4
s16x4
s32x4
s64x4
f32x4
f64x4

Misc Intrinsic Types

Quat
Matrix3d
Transform3d

Struct Type

As you can see by this example, it is very similar to C but has some additions and omissions.

struct Example {
    u32 field;
    u32 array[10];
    u32 count;
    u16 vla[count];
};

VLA Type

Struct types support Variable Length Arrays (VLAs) when the array count uses a field name instead of an integer. The VLA field will be replaced by a u32 relative byte offset from this field's memory address. The array is then accessed through the code generated function.

Here is the example of the code generated version of that struct in C:

typedef struct Example Example;
struct Example {
    u32 field;
    u32 array[10];
    u32 count;
    u32 vla_byte_offset;
};

u16* example_vla(Example* example) {
    return (u16*)(((u8*)example) + example->vla_byte_offset);
}

u16* example_vla_alloc_vla(Example* obj, BinaryWriter* w, u32 count) {
    obj->count = count;
    binary_writer_alloc_vla(Example, w, count, &obj->vla_byte_offset);
    return example_vla(obj);
}
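
BinaryWriter and binary_writer_alloc_vla are not defined in this document, so here is a rough sketch of the idea using a hypothetical stand-in called SketchWriter. It bump allocates the root object and then the VLA elements in one heap buffer, and stores the offset measured from the start of the struct, as the example_vla accessor above does. All sketch_ names are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;

typedef struct Example Example;
struct Example {
    u32 field;
    u32 array[10];
    u32 count;
    u32 vla_byte_offset;
};

// Accessor in the same form as the generated one shown above.
static u16* example_vla(Example* example) {
    return (u16*)(((u8*)example) + example->vla_byte_offset);
}

// Hypothetical stand-in for BinaryWriter: a heap buffer with a bump cursor.
typedef struct SketchWriter {
    u8* bytes;
    u32 cap;
    u32 used;
} SketchWriter;

static bool sketch_writer_init(SketchWriter* w, u32 cap) {
    w->bytes = calloc(1, cap); // zeroed, suitably aligned for any type
    w->cap = cap;
    w->used = 0;
    return w->bytes != NULL;
}

// Reserves `size` bytes at the next position aligned to `align` (a power of two).
static u32 sketch_writer_bump(SketchWriter* w, u32 size, u32 align) {
    u32 start = (w->used + (align - 1)) & ~(align - 1);
    assert(start + size <= w->cap);
    w->used = start + size;
    return start;
}

// Places the root object first, then the VLA elements directly after it.
static Example* sketch_build(SketchWriter* w, u32 count) {
    u32 root_pos = sketch_writer_bump(w, (u32)sizeof(Example), (u32)_Alignof(Example));
    Example* obj = (Example*)(w->bytes + root_pos);
    obj->count = count;
    u32 vla_pos = sketch_writer_bump(w, count * (u32)sizeof(u16), (u32)_Alignof(u16));
    obj->vla_byte_offset = vla_pos - root_pos;
    return obj;
}
```

Because only a relative offset is stored, the whole buffer can be written to disk and loaded at a different address without fixing up any pointers.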

String Types

String types are just like C, where they are a null-terminated array of char. This can either be a fixed length array or a VLA.

struct Example {
    char name[32];
    u32 description_size;
    char description[description_size];
};

Packed Type

Packed types only work on struct types and they act like supercharged bitfields.

We want to be able to express de/encoding integers, floats and enums down into a smaller series of bits. The packed type allows you to express how many bits you want to pack down, followed by a min and max value. This will allow the system to generate un/packing functions for your structure type automatically.

Example:

struct Example #bitfield(u32) {
    u32                  field_a;
    u16(:8, 100, 300)    field_b;
    f32(:21, 0.f, 100.f) field_c;
    Enum(:3)             field_d;
};

field_a is just a regular field. field_b is a u16 that gets packed down into 8 bits within the range of 100 to 300 inclusive. field_c is an f32 that gets packed down into 21 bits within the range of 0.f to 100.f. field_d is an enum that is packed down into 3 bits with a range from its min to its max value. More bits give more precision, and a smaller range gives more precision for the same number of bits.

#bitfield(T) means that the bitfield's word size and alignment will be those of type T, where T is a basic type. Bitfields can span word boundaries at the cost of un/packing performance. If #bitfield(T) is not specified then T defaults to u32.

Here is the example of the code generated version of that struct in C:

typedef struct Example Example;
struct Example {
    u32 field_a;
    u32 bitfields0;
};

#define example_field_b(obj) ...
#define example_field_b_set(obj, value) ...
#define example_field_c(obj) ...
#define example_field_c_set(obj, value) ...
#define example_field_d(obj) ...
#define example_field_d_set(obj, value) ...
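
The bodies of the generated macros are elided above. To give a feel for what packing a float into a few bits involves, here is a minimal sketch that assumes a simple linear quantization and a caller-chosen bit position inside the word. The sketch_ functions, the rounding rule and the bit layout are all assumptions, not the code enc-gen actually generates:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;
typedef float    f32;

// Hypothetical: quantize `value` from [min, max] into an unsigned code of `bits` bits.
static u32 sketch_quantize_f32(f32 value, u32 bits, f32 min, f32 max) {
    assert(bits >= 1 && bits <= 24 && min < max);
    if (value < min) value = min; // clamp into range first
    if (value > max) value = max;
    u32 max_code = (1u << bits) - 1u;
    double t = ((double)value - min) / ((double)max - min); // 0..1
    return (u32)(t * max_code + 0.5);                        // round to nearest code
}

// Hypothetical: map a code back to the nearest representable value in [min, max].
static f32 sketch_dequantize_f32(u32 code, u32 bits, f32 min, f32 max) {
    assert(bits >= 1 && bits <= 24 && min < max);
    u32 max_code = (1u << bits) - 1u;
    return (f32)(min + ((double)code / max_code) * ((double)max - min));
}

// Stores the low `bits` bits of `code` at bit position `shift` inside `word`.
static u32 sketch_bits_set(u32 word, u32 shift, u32 bits, u32 code) {
    assert(bits >= 1 && bits < 32 && shift + bits <= 32);
    u32 mask = ((1u << bits) - 1u) << shift;
    return (word & ~mask) | ((code << shift) & mask);
}

// Loads `bits` bits starting at bit position `shift` from `word`.
static u32 sketch_bits_get(u32 word, u32 shift, u32 bits) {
    assert(bits >= 1 && bits < 32 && shift + bits <= 32);
    return (word >> shift) & ((1u << bits) - 1u);
}
```

With 21 bits over the range 0.f to 100.f, neighbouring codes are about 0.00005 apart, which is the sense in which more bits or a smaller range give more precision.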

Union Type

As you can see by this example, it is very similar to C. Like in C, unions are untagged.

union Example {
    u32 integer;
    f32 float_;
};

Enum Type

As you can see by this example, it is very similar to C. We also support the C23 underlying type specifier, but we generate this with a typedef as we don't use C23.

// defined in .enc file
enum EntType : s16 {
    ENT_TYPE_PERSON = 0,
    ENT_TYPE_FROG = 1,
    ENT_TYPE_DRAGON = -3,
    ENT_TYPE_PIG,
    ENT_TYPE_MOUSE,
};

// generated C code
typedef s16 EntType;
enum EntType {
    ENT_TYPE_PERSON = 0,
    ENT_TYPE_FROG = 1,
    ENT_TYPE_DRAGON = -3,
    ENT_TYPE_PIG = -2,
    ENT_TYPE_MOUSE = -1,
};

Constexpr

constexpr is supported for basic types only, so you can create arrays and packed types that are defined by constants. Because C doesn't support constexpr until C23, we currently export it as a #define.

// defined in .enc file
constexpr u32 ENT_NAME_CAP = 32;
constexpr f32 PACKED_MIN = 0.f;
constexpr f32 PACKED_MAX = 128.f;
struct Ent #bitfield(u8) {
    char name[ENT_NAME_CAP];
    f32 packed(u8, PACKED_MIN, PACKED_MAX);
};

// generated C code
#define ENT_NAME_CAP 32
#define PACKED_MIN 0.f
#define PACKED_MAX 128.f
struct Ent {
    char name[ENT_NAME_CAP];
    u8 packed;
};

Target

Targets are ways of specifying how you wish to use the encoder system with this .enc file. Each enabled target can enable features in the encoder system as well as disable features that are not compatible with that target.

Targets cannot be added to or removed from an .enc file. You specify them at the top of the file.

This set of target modes means we want a versioned file, possibly upgraded over time, that can be copied to the gpu and used there:

#target file/gpu binary

This set of target modes means we want a versioned file, possibly upgraded over time, that can be sent over the network:

#target file/net binary

This target mode means we want to un/pack data in gpu memory:

#target gpu binary

Target: file

We have an editor where we make custom assets and save these out to disk. We also would like to add more features and make changes without losing assets that have already been made. We would also like to be able to use text files most of the time in development so we can visualise the data better. And to be able to ship binary files of assets for speed & size reasons.

When you use the file target, a single .enc file represents a single file type. So you must specify the single root struct type with the #root directive like so:

// defined in .enc file
struct Example #root {
    u32 field;
    u32 array[10];
    u32 count;
    u32 vla_byte_offset;
};

// generated C code:
struct Example {
    EncBinHeaderV0 header;
    u32 field;
    u32 array[10];
    u32 count;
    u32 vla_byte_offset;
};

The generated code also contains the header field with the magic number and more that you can read about later in the docs.

Every file needs a magic number for ensuring that the correct file format is being processed. The magic number is always a 4-byte hexadecimal number. You specify the #magic directive below the #target directive:

#target file binary
#magic 0x454E434F // ENCO
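
The // ENCO comment works because each byte of 0x454E434F is an ASCII character code ('E' is 0x45, 'N' is 0x4E, 'C' is 0x43, 'O' is 0x4F). A small sketch of that relationship, using hypothetical helpers that are not part of enc-gen. How the u32 is laid out on disk is not covered here:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

// Hypothetical helper: builds a 4-byte magic number from four ASCII
// characters, with the first character in the most significant byte.
#define SKETCH_MAGIC(a, b, c, d) \
    (((u32)(a) << 24) | ((u32)(b) << 16) | ((u32)(c) << 8) | (u32)(d))

static_assert(SKETCH_MAGIC('E', 'N', 'C', 'O') == 0x454E434Fu,
              "ENCO spells the magic number used in the example");

// Hypothetical helper: writes the four characters of `magic` into `out`,
// followed by a null terminator, so it can be printed in error messages.
static void sketch_magic_to_text(u32 magic, char out[5]) {
    out[0] = (char)((magic >> 24) & 0xFF);
    out[1] = (char)((magic >> 16) & 0xFF);
    out[2] = (char)((magic >> 8) & 0xFF);
    out[3] = (char)(magic & 0xFF);
    out[4] = '\0';
}
```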

Target: net

When writing network packets you benefit a lot from packing your structures down before sending them over the wire. This will help you send less data and reduce the amount of data that goes missing and has to be resent.

You also gain access to versioning so you can communicate correctly with other clients on the same version.

When reading/writing packets from the packet buffer, the user will cast the byte stream into the code generated struct and manually process the packet themselves.

Target: cpu

When you write data infrequently but read the same data many times, perhaps across many threads, you might benefit from packing your data structures to become more cache friendly.

Target: gpu

On the GPU you usually read the same data across different CUs and this uses up space in caches. Memory bandwidth is also the first major problem to solve. So packing data down as small as possible is usually the way to go for a lot of GPU algorithms.

Manually writing un/packing code is error prone and takes time, and this is the main problem the encoder system solves for the gpu use cases.

Target Features

Features are aspects of the encoder system that change depending on the targets.

Each feature can be:

ENABLED : feature is enabled by the target unless another target says it is unsupported
UNCHANGED : feature is supported by this target but this target does not enable the feature
UNSUPPORTED : feature is not supported by the target and completely disables the feature
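
The three states combine across all targets named in the #target directive. Here is a minimal sketch of that rule as described above, with hypothetical names. enc-gen's own implementation is not shown in this document:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical mirror of the three feature states listed above.
typedef enum SketchFeatureState {
    SKETCH_UNCHANGED,
    SKETCH_ENABLED,
    SKETCH_UNSUPPORTED,
} SketchFeatureState;

// A feature ends up enabled when at least one target enables it and no
// target marks it as unsupported.
static bool sketch_feature_is_enabled(const SketchFeatureState* states, int count) {
    assert(states != NULL || count == 0);
    bool enabled = false;
    for (int i = 0; i < count; i += 1) {
        if (states[i] == SKETCH_UNSUPPORTED) return false; // one veto disables it
        if (states[i] == SKETCH_ENABLED) enabled = true;
    }
    return enabled;
}
```

Using the VLAs table that follows, #target file alone resolves to enabled, while #target file/gpu resolves to disabled because gpu reports UNSUPPORTED.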

VLAs

file: ENABLED
net : UNCHANGED
gpu : UNSUPPORTED
cpu : UNCHANGED

Variable Length Arrays inside structs are only enabled by the target file since they work by encoding a u32 relative offset from where the field is.

For files this is very easy to achieve since we just bump allocate arrays after the root object. It is not enabled by the net target mode as the packet size is only ~64K, so other techniques just work out better. For gpu, buffers are single typed, so supporting it would only really work if we used an 'untyped' u32 buffer where we reconstruct everything, and I don't see a use case for this at this time.

Text (CSON)

file: ENABLED
net : UNCHANGED
gpu : UNCHANGED
cpu : UNCHANGED

CSON, aka the C-like JSON, only really has a use when serialising out to a file.

When using text encoding, unions will need to be decorated with #tag & #key directives and the field where the union is used needs a #tag directive. This is so that the generated de/encoder can know what type it is expecting the union to be. This is not a problem for binary files, since it just copies the raw bytes.

enum EntType {
    ENT_TYPE_DRAGON,
    ENT_TYPE_PIG,
};

struct DragonEnc {
    u32 something;
};

struct PigEnc {
    u32 something;
};

union EntData #tag EntType {
    DragonEnc dragon #key ENT_TYPE_DRAGON;
    PigEnc    pig #key ENT_TYPE_PIG;
};

struct EntEnc {
    u32(:24, 0, 16777215) some_value;
    EntType(:8)           ent_type;
    EntData               data #tag ent_type;
};
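
To illustrate why the tag is needed, here is a simplified sketch of how a text encoder could pick the union member from the tag field. It leaves some_value and ent_type unpacked for brevity, and sketch_ent_data_member is a hypothetical function, not something enc-gen is documented to generate:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;

typedef enum EntType { ENT_TYPE_DRAGON, ENT_TYPE_PIG } EntType;

typedef struct DragonEnc { u32 something; } DragonEnc;
typedef struct PigEnc    { u32 something; } PigEnc;

typedef union EntData {
    DragonEnc dragon; // #key ENT_TYPE_DRAGON
    PigEnc    pig;    // #key ENT_TYPE_PIG
} EntData;

// Simplified: the packed fields are stored as plain fields in this sketch.
typedef struct EntEnc {
    u32     some_value;
    EntType ent_type;
    EntData data;     // #tag ent_type
} EntEnc;

// Hypothetical: returns the name of the union member that the tag selects,
// or NULL when the tag has no matching #key.
static const char* sketch_ent_data_member(const EntEnc* ent) {
    assert(ent != NULL);
    switch (ent->ent_type) {
        case ENT_TYPE_DRAGON: return "dragon";
        case ENT_TYPE_PIG:    return "pig";
    }
    return NULL;
}
```

A binary encoder never needs this lookup because it copies the union's bytes as they are.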

#hex

struct & union fields that are integers can specify that they should be encoded as hexadecimal when being encoded as text (cson):

struct Example {
    u32 mask #hex;
};

Default Values

When reading from a text file, some values might not be present. To solve this you can either do nothing and let it use the default value for that type, or specify default values like so:

struct Example {
    u32 field0 = 123;
    u32 field1; // initialized to 0 
};

struct Example2 {
    Example field; // will use the default values of Example
};

Default values are just parsed as strings so there is no error checking done as part of enc-gen. The string is pasted in as the text decoding code is generated.

Specified Default Values are only supported for:

Basic Types
Enum Types
Vector Types
Packed Types

Unspecified Default Value:

Basic Types are zero initialized
Enums are zero initialized
Packed Types are zero initialized
Vectors are zero initialized
Quat (quaternion) is identity initialized
Matrix3d is identity initialized
Transform3d is identity initialized
structs follow their field default initializers
unions follow the type of the field that the tag specifies
arrays follow their base type default initializers

#removed

When using text encoding, you are not allowed to remove types or fields from your .enc file. Instead you can use the #removed directive on type declarations, fields and enum values like so:

struct Pig #removed {
    u32 age;
};

enum EntType {
    ENT_TYPE_DRAGON,
    ENT_TYPE_PIG #removed,
};

All this does is append __REMOVED to the identifiers in code generation. The types, fields and values will be kept around so that they can still be used in the upgrade step later on if you change your mind and want to preserve the data.

The benefits of using this are:

the change in identifiers will break code so you can find all uses of it when the field is #removed
when you put the file into #dev mode, it will ask you to actually delete all of your #removed types, fields and values
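
Based on the description above, the generated C for the #removed example would look roughly like the sketch below. The exact output of enc-gen is an assumption here, and only the __REMOVED suffix comes from this document:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

// Assumed shape of the generated code: the type keeps its layout but its
// name gains the __REMOVED suffix, so existing uses of Pig stop compiling.
typedef struct Pig__REMOVED Pig__REMOVED;
struct Pig__REMOVED {
    u32 age;
};

// The removed enum value keeps its numeric value under the new name.
enum EntType {
    ENT_TYPE_DRAGON,
    ENT_TYPE_PIG__REMOVED,
};

static_assert(ENT_TYPE_PIG__REMOVED == 1, "removed values keep their number");
```

Any code that still refers to Pig or ENT_TYPE_PIG fails to compile, which is how the rename helps you find every remaining use.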

Versioning

file: ENABLED
net : ENABLED
gpu : UNCHANGED
cpu : UNCHANGED

The file target needs versioning so we can preserve assets. The net target needs versioning to ensure other clients are on the same version so they can communicate. The gpu & cpu targets alone purely work with types that are released in binary code, so no versioning is needed at all.

Version Upgrade

file: ENABLED
net : UNCHANGED
gpu : UNCHANGED
cpu : UNCHANGED

The main purpose for the file target is assets that we either edit in the editor or a baked asset we load from disk. We need to be able to edit these types easily while developing new features but also provide a way for us to upgrade old assets to the new version. net will only need versioning to ensure other clients are on the same version. gpu & cpu targets are purely designed with what is released in binary code, so no versioning is needed at all.

Packed

file: ENABLED
net : ENABLED
gpu : UNSUPPORTED
cpu : ENABLED

For the net target mode, you really benefit from packing your structs into as small a size as possible. All modern compilers support a "packed" struct that removes all padding between fields by aligning each field to 1 byte. This comes at a performance cost when reading & writing to the struct as fields are no longer aligned.

Here is an example of using packed struct:

// defined in .enc file
struct Example #packed {
    u8  field0; // spans byte 0
    u32 field1; // spans bytes 1..=4
};

// the same struct without #packed, for comparison
struct Example {
    u8  field0; // spans byte 0
    u32 field1; // spans bytes 4..=7 (bytes 1..=3 are padding)
};
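
How enc-gen emits #packed in C is not shown here. One common way to get this layout is #pragma pack, which GCC, Clang and MSVC all accept. The sketch below uses it to show the effect on offsets and size, with hypothetical struct names and assuming a platform where u32 is normally aligned to 4 bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  u8;
typedef uint32_t u32;

// Packed: every field is aligned to 1 byte, so there is no padding.
#pragma pack(push, 1)
typedef struct SketchPacked {
    u8  field0;
    u32 field1;
} SketchPacked;
#pragma pack(pop)

// Default layout: field1 is moved up to the next 4-byte boundary.
typedef struct SketchUnpacked {
    u8  field0;
    u32 field1;
} SketchUnpacked;

static_assert(offsetof(SketchPacked, field1) == 1, "field1 follows field0 directly");
static_assert(sizeof(SketchPacked) == 5, "1 + 4 bytes, no padding");
static_assert(offsetof(SketchUnpacked, field1) == 4, "3 bytes of padding before field1");
static_assert(sizeof(SketchUnpacked) == 8, "padded to a multiple of 4 bytes");
```

The packed version saves 3 bytes per struct, at the cost of an unaligned access to field1.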

NON-4 Byte Basic Types

file: ENABLED
net : ENABLED
gpu : UNSUPPORTED
cpu : ENABLED

Since we want to support a wider range of GPU hardware, we are restricted to the basic types u32, s32 and f32. Bitfields will always be a u32 word and vector types have the same basic type restrictions.

De/Encoding

Binary data is de/encoded in memory in place using the data structures generated by enc-gen that were defined in your .enc files. This means you can directly operate on the data structures by casting the opaque array of bytes into the structure you wish to de/encode.

For the file target mode, this array of opaque bytes will be your root object type, defined with the #root directive in your .enc file. All VLA data will be stored directly after the root object.

After running enc-gen, at the bottom of the generated header file you will find a bunch of functions to help with decoding & encoding both text and binary files.

When encoding Binary files, you will use the BinaryWriter to write out your binary file in memory. Then when you are finished you can call a generated function to verify before saving it to disk. Checking the checksum on a 50MB file takes ~50ms so this might be a thing you want to skip or only do once.

bool binary_verify_no_checksum_V0EditorMaterial(CoreString file_path, ByteView mem, char error_message[static ENC_MESSAGE_SIZE]);
bool binary_verify_with_checksum_V0EditorMaterial(CoreString file_path, ByteView mem, char error_message[static ENC_MESSAGE_SIZE]);

When decoding Binary files, you should run one of the verify functions above to make sure your data has been read in properly. If your format has any upgrades, an upgrade function will be generated for you to use. It will need a BinaryWriter just in case there are any upgrades that need to run.

T* binary_upgrade_T(BinaryWriter* w, EncBinHeaderV0* v, char error_message[static ENC_MESSAGE_SIZE]);

When encoding Text (cson) files, you first encode a binary file in memory. Then you call the cson generated function and this will encode the binary data as text before saving it out to disk.

CoreString cson_encode_mem_T(T* obj, CoreLinearAlctor* alctor, char error_message[static ENC_MESSAGE_SIZE]);
bool cson_encode_asset_T(T* obj, CoreString asset_path, CoreLinearAlctor* alctor, char error_message[static ENC_MESSAGE_SIZE]);

When decoding the Text (cson) files, you will need to pass in a BinaryWriter so the cson can be decoded into a binary file in memory.

T* cson_decode_mem_T(BinaryWriter* w, CoreString file_path, CoreString cson, char error_message[static ENC_MESSAGE_SIZE]);
T* cson_decode_asset_T(BinaryWriter* w, CoreString asset_path, char error_message[static ENC_MESSAGE_SIZE]);

De/Encoder Formats

A .enc file needs to choose to be text and/or binary in the target directive itself. This can only be set once & cannot change.

Data types are only exported as text (cson). This should be used for things like configuration files where you would like the user to be able to edit these in the shipped version of the game:

#target file/... text

Data types are only exported as binary. This is useful for data that simply isn't well suited for text, like mesh & pixel data. When used with the versioning feature, it should only be used for types that do not change much, as with binary data you will need to up the version every time you want to make changes to the data layout:

#target ... binary

Data types are exported as text (cson) in development and released in binary. This works extremely well when your data types change a lot and if they are represented well in text form:

#target file/... text/binary

Versioning

The versioning feature only applies when using the file or net targets.

Updating Types For Text Files

When data is encoded as text files (cson) you can do the following changes without updating the version:

add struct/union fields
add enum values
add type
change struct/union fields order
change enum values
change basic type min or max values
change packed type bits_count, min or max values
change enum underlying type
change default values
change #bitfield(T)
add/remove #packed

But the following will require updating the version:

remove type
change a struct/union field type
rename/remove a struct/union field
rename/remove an enum value
change #tag or #key names
change #root type

Updating Types For Binary Data

When data is encoded as binary you can do the following changes without updating the version:

nothing

But the following will require updating the version:

add/remove struct/union fields
add/remove enum values
add/remove type
rename a struct/union field
rename an enum value
change a struct/union field type
change basic type min or max values
change packed type bits_count, min or max values
change enum underlying type
change #bitfield(T)
add/remove #packed
change #root type

How Do I Do Versioning?

When you need to make breaking changes to your data structures, you will need to place your .enc file into #dev mode:

#dev upgrade | ups the file version number and you will upgrade from past versions
#dev noupgrade | ups the file version number and all past data will be discarded and started from scratch
#dev amend | keeps the same file version number, but you must make sure you revert your assets made with this version

While you are in #dev mode, you can change around between them. noupgrade only deletes past data structures when exiting dev mode.

#dev mode is configured like so:

#target ...
#magic ...
#dev ...

For files that export to a text file or (exclusively) binary data, the workflow is quite simple. If you need to make changes that require you to up the version, you put the file into #dev mode, make all of the changes you need, then remove the #dev directive when you are done. The version will go up by 1 when entering #dev mode.

When you support encoding to both text files and binary files, it works mostly the same, but when you leave #dev mode, the version number will be increased again by 1. You can think of it as text files being encoded with even version numbers and binary files being encoded with odd version numbers. This is so that when we make changes to the development text file, we will not be making changes to the binary file data types.

While you are in #dev mode, you are going to be changing the data types and invalidating any files you have saved out while in #dev mode. To fix this, files saved out in dev mode will be suffixed with -devm-HASH where HASH is the hash of all the information about the data type itself. This will prevent loading any invalid data and allow you to retest the upgrade path from the previous version by just deleting the dev files or changing the type itself.

The magic & version numbers are encoded at the top of the .cson file as integers like so:

#magic ...
#version ...

For a binary file we will have the following header:

struct EncBinHeaderV0 {
    u32 magic;
    u32 enc_version;
    u32 file_version;
    u32 header_size;
    u64 data_types_hash;
    u64 file_size;
    u64 checksum;
};

Each field has a purpose:

magic : ensures the file is the correct file format
enc_version : exists so we can make changes to the EncBinHeader itself in the future if we need to
file_version : exists so we can avoid loading newer data and upgrade from previous versions
header_size : allows us to skip the header to where the root object starts, independent of enc_version
data_types_hash : ensures that the version wasn't changed on a different branch. This is not yet supported
file_size : gives you the full size of the file in bytes. Useful after you have written a file into memory using the BinaryWriter, as you can retrieve the file size from the header
checksum : gives us a way to tell our users that the asset is corrupted. We still need to properly handle invalid data when reading from encoded files for security purposes

The cson file does not need the checksum as the parser will tell us if it is corrupted.
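
To make the role of these fields concrete, here is a rough sketch of the kind of checks a loader could run on the header before touching the rest of the file. It is not the generated binary_verify function. sketch_header_check and its result codes are hypothetical, and the checksum and data types hash are left out because their algorithms are not described here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

typedef struct EncBinHeaderV0 {
    u32 magic;
    u32 enc_version;
    u32 file_version;
    u32 header_size;
    u64 data_types_hash;
    u64 file_size;
    u64 checksum;
} EncBinHeaderV0;

// Hypothetical result codes for this sketch.
typedef enum SketchHeaderResult {
    SKETCH_HEADER_OK,
    SKETCH_HEADER_TOO_SMALL, // not enough bytes to hold a header at all
    SKETCH_HEADER_BAD_MAGIC, // some other file format
    SKETCH_HEADER_TOO_NEW,   // written by a newer version than we understand
    SKETCH_HEADER_BAD_SIZE,  // header_size or file_size disagree with the buffer
} SketchHeaderResult;

// Checks the header fields against the in-memory buffer of `mem_size` bytes.
static SketchHeaderResult sketch_header_check(const EncBinHeaderV0* h, u64 mem_size,
                                              u32 expected_magic, u32 current_file_version) {
    assert(h != NULL);
    if (mem_size < sizeof(EncBinHeaderV0)) return SKETCH_HEADER_TOO_SMALL;
    if (h->magic != expected_magic) return SKETCH_HEADER_BAD_MAGIC;
    if (h->file_version > current_file_version) return SKETCH_HEADER_TOO_NEW;
    if (h->header_size < sizeof(EncBinHeaderV0)) return SKETCH_HEADER_BAD_SIZE;
    if (h->header_size > mem_size || h->file_size != mem_size) return SKETCH_HEADER_BAD_SIZE;
    return SKETCH_HEADER_OK;
}
```

A file_version lower than the current one is not an error in this sketch, since that is the case the upgrade path exists for.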

Every generated data type will be prefixed with the version number when you target file. This is so we can support upgrading using the data types directly:

// defined in .enc file
struct Example {
    u32 field0;
    u32 field1;
};

// generated struct name
typedef struct V0Example V0Example;
struct V0Example {
    u32 field0;
};

// generated struct name
typedef struct V1Example V1Example;
struct V1Example {
    u32 field0;
    u32 field1;
};

Upgrading

When the game tries to decode a file and the version is older than the current one, the upgrade function created by the user takes the in-memory binary representation of the file and emits the new version into a new buffer. This new version will then be saved out in its place and then the standard manual processing of the binary file continues.
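
As an illustration, an upgrade step between the V0Example and V1Example types shown earlier could look like the sketch below. The function name and signature are assumptions. The real flow goes through a BinaryWriter and the generated upgrade function, whose user hook is not documented here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;

typedef struct V0Example V0Example;
struct V0Example {
    u32 field0;
};

typedef struct V1Example V1Example;
struct V1Example {
    u32 field0;
    u32 field1;
};

// Hypothetical user-written upgrade step: carries over the fields that
// still exist and gives the field that is new in version 1 a starting value.
static void sketch_upgrade_example_v0_to_v1(const V0Example* old_obj, V1Example* new_obj) {
    assert(old_obj != NULL && new_obj != NULL);
    new_obj->field0 = old_obj->field0;
    new_obj->field1 = 0; // new in version 1, no old data to copy
}
```

Because both versions of the type exist in the generated code, the upgrade can be written as ordinary field-by-field C rather than byte offsets.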

.encdb File

When the file or net target is being used for an .enc file, the versioning feature will be enabled. This is achieved by enc-gen with a database file that is auto generated. The .encdb looks identical to the .enc file but values are explicit and some extra directives exist.

It contains the declaration so that validation can be performed and errors can be reported if the user makes changes that are not allowed. Then when #dev mode is enabled and the user makes edits, the database file will be updated accordingly.

When you have the upgrade feature enabled by having the file target, the database will also preserve all past types from previous versions of the file, unless #dev noupgrade was used, which completely removes all past versions.

Upgrades cause the types to be declared multiple times, once for each version they exist in. The #version ... directive acts as a marker for where that version's types begin. The data types hash is encoded in the #version directive so we can check if any of the past versions have been modified at all. This could also be used to support branching in the future by merging branched versions.

#version 0 0 15d44abec7e2cf3a

... // types are declared here

#version 0 1 65a93cc548ea89e8

	This is the encoder system I have built for my Game W/O Engine project.

	Have any questions? Best place is the discord: https://discord.gg/FUsK4z97C9

	Want to play around with the system? It is available for Windows & Linux by supporting the project: https://discord.gg/dHrFdYSGjU