Skip to content

Instantly share code, notes, and snippets.

@creachadair
Last active January 15, 2024 17:49
Show Gist options
  • Save creachadair/359e9133939b38a8ed02c4200f726df0 to your computer and use it in GitHub Desktop.
Save creachadair/359e9133939b38a8ed02c4200f726df0 to your computer and use it in GitHub Desktop.
A self-framing packed binary format for unsigned integer lengths

Self-Framing Packed Binary Lengths

We describe a variable-width binary encoding format for unsigned integer lengths that handles values up to 1073741823 $(2^{30} - 1)$ and does not require complex bit manipulation to pack and unpack.

In summary: The input is encoded as little-endian, with the excess length of the encoding packed into the lowest-order 2 bits of the value. This makes the encoding self-framing, as the decoder can consume the first byte and immediately know how many additional bytes must be read to recover the original value. The only operations required are integer addition, multiplication, and division/remainder by powers of 2 (specificially, 4 and 256).

The algorithm below is described for 32-bit unsigned integers, but it can be extended trivially to 64-bit integers by using the lowest-order 3 bits instead of 2, or contracted to 16-bit integers using the lowest-order 1 bit.

A Go implementation

Encoding

  • Input: An unsigned 32-bit integer $0 ≤ n < 1073741824$.
  • Output: An array of 1 to 4 bytes encoding $n$.

Procedure:

  1. Let $v_1 = n\cdot 4$.
  2. Let $v_2 = n + s$ where
    • $s = 0$ if $n < 64$, else
    • $s = 1$ if $n < 16384$, else
    • $s = 2$ if $n < 4194304$, else
    • $s = 3$.
  3. Encode $v_2$ into an array $B$ of 4 bytes in little-endian order (least-significant byte first).
  4. Return he first $s+1$ bytes of $B$ as output.

Notes: The value in (1) cannot overflow because $n < 2^{30}$. Step (2) can also be used in isolation to precompute the encoded size of $n$ which is $s+1$ bytes.

Decoding

  • Input: An array $B$ of bytes.
  • Output: The value $n$ encoded by the prefix of $B$ and the length of the encoding; or else an error.

Procedure:

  1. If $|B| = 0$ ($B$ is empty), report an error ("empty input").
  2. Let $s = B[0]\ \%\ 4$.
  3. If $|B| < s+1$, report an error ("truncated encoding").
  4. Let $v = 0$
  5. For each $i \in (s, s-1, \ldots, 0)$, set $v \leftarrow (v\cdot 256) + B[i]$.
  6. Return $n = v\ //\ 4$ (the encoded value) and $nb = s+1$ (the length of the encoding).

Notes: The symbol $\%$ denotes the integer remainder operation, and $//$ denotes integer division discarding remainder, so that $a\ //\ b \equiv \lfloor a / b \rfloor$.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment