Skip to content

Instantly share code, notes, and snippets.

@simbo1905
Last active April 24, 2025 05:58
Show Gist options
  • Save simbo1905/1bbc1e355b00f4130d5ccded5a370d86 to your computer and use it in GitHub Desktop.
Save simbo1905/1bbc1e355b00f4130d5ccded5a370d86 to your computer and use it in GitHub Desktop.
so Claud, you got this, right?
in our implimentation of `static <T> Pickler<?> createPicklerForSealedTrait(Class<T> sealedClass)` when ever we see a
record we write out classname of the records so that we can resolve the correct pickler to deseralize it. we have
tests of complex trees where we have the same type of node many times. that bloats the final format. we want to be
future proof to new records being added to the sealed trait between serialization and deserialization. so we cannot
use a fixed map of permittted types to bytes to make the format very compact. yet there is no reason to write out a
classname to the buffer twice. instead when we currently write out a classname we can memorize the offset in the
bytebuffer where we are about to write the size of the classname then the bytes. this can be recorded in a map of
class-to-offset called classNameToOffset. then when we are about to write out a new record we can check the keys in
the map. if we have not yet written out the classname to the current buffer we can write out as normal. yet if we have
written out the classname we can write out a special marker. we can write out the `~ classNameToOffset( clazz )` which
is a negative number. then when we deserialize before we do anything we can note down the current position in the
bytebuffer. now when we read back the length as as an int if it is positive number we do what we do now and read out
the classname. we put that into a map of offset-to-className computing the offset as the position just before we started
the read minus the original offset in the buffer where we read the class. if we are only doing a serde of a single record
we never use this information. yet if we are had written the same record types more than once we will read in a byte that
is a negative number. this is a marker that refers to a classname we have seen before. so we take the `~` of it to get the prior offset relative to the beginning where we saw the same classname before. now we can lookup the classname from the map
to get the pickler. now i mentioned createPicklerForSealedTrait yet we have the same sort of behaviour with
`static <R extends Record> Pickler<R> createPicklerForRecord(Class<R> recordClass)` if we have nested records. that will
also pickle the className of inner records. so we can use the same technique. in this manner we will get a much more
compact binary representation for object graphs of records or when we are writing may messages in a protocol that are
permitted classes in the same sealed trait family. now we should note that at the moment or picklers should be threadsafe.
this is because they just flatten into and out of a bytebuffer. what we cannot do when we write the new code is use a
member variable in the current instance of the pickler to memorize the offsets to/from classNames. we have to instantiate
the maps at entry points to seralize/deserialze and pass them down the call chain so that they are added to and read from
by the current thread as local variables passed down the thread callstack. now we do not want to bloat our interfaces of
our code. we can keep the curent public api of our Pickler. we can use static methods that are passed the maps and the
bytebuffer that do the work. now i have given you an extremely long explaination. DO NOT WRITE THE IMPLIMENTATION as it
will be too long. just write me down some plain text paragraphes that summarise what i am saying. then list of the steps
as plain text instructions of what we need to do. do not write the actual code. just name the methods that will be changed, added/removed using only their signatures. this aim is that you are writing a prompt to a weaker coding LLM so that it you
are the architect and it is the junior programmer.
# Binary Format Optimization for Record Serialization
You're proposing a class name deduplication optimization to reduce the size of serialized data when the same record types appear multiple times in an object graph or message stream.
## Summary of the Approach
- Currently, each time we serialize a record, we write its full class name to the buffer
- This is inefficient when the same record type appears multiple times
- The optimization would track already-written class names and their positions in the buffer
- For subsequent occurrences, write a reference (negative number) instead of repeating the class name
- During deserialization, maintain a mapping of positions to class names for lookups
## Implementation Steps
1. Create thread-local maps for tracking:
- During serialization: Map<Class<?>, Integer> to store class name to buffer offset
- During deserialization: Map<Integer, String> to store offset to class name
2. Modify serialization process:
- Before writing a class name, check if already written
- If new, write normally and store position in map
- If seen before, write negative number reference to previous position
3. Modify deserialization process:
- When reading class name length:
- If positive, read class name normally and add to offset map
- If negative, decode as reference to previous position and lookup class name
4. Methods to modify:
- `serialize(T object, ByteBuffer buffer)` in sealed trait pickler
- `deserialize(ByteBuffer buffer)` in sealed trait pickler
- `createPicklerForSealedTrait(Class<T> sealedClass)`
- `createPicklerForRecord(Class<R> recordClass)`
5. New helper methods needed:
- `private static void writeClassName(Class<?> clazz, ByteBuffer buffer, Map<Class<?>, Integer> classNameToOffset)`
- `private static String readClassName(ByteBuffer buffer, Map<Integer, String> offsetToClassName, int initialPosition)`
6. Ensure thread-safety by never storing state in the pickler instances themselves
This approach maintains the current API while significantly reducing the size of serialized data when working with complex object graphs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment