@webmaster128
Last active November 5, 2023 21:35
Reducing size of CosmWasm storage key encoding

Length-prefixed key namespacing, version 2

In https://github.com/CosmWasm/cosmwasm/blob/main/docs/STORAGE_KEYS.md you can see the standard storage key encoding used in CosmWasm from 2020 to 2023. In the meantime we figured out that using short keys is important for the performance of the underlying database. How much it affects performance, I don't know. But the more I learn about how those databases and their caching work, the more I'm convinced this is the right direction.

CosmWasm developers love performance. It allows them to build things that are hard or impossible to build in other systems. We want to make it easy for our developers to reach peak performance.

The encoding

Version 1 of length-prefixed key namespacing, as initially discussed in webmaster128/key-namespacing, has a few inefficiencies:

  1. Each key component supports a length of up to 65535 bytes (64 KiB), which is way too much to be efficient at scale. We should discourage such large keys.
  2. The 2-byte encoding of the length means that every component takes component.len() + 2 bytes, which is a lot of overhead for short components.

This is particularly wasteful if you try to build a storage-efficient data model. See e.g.

// Map from ID to address
const FASTMAP: Map<u16, Addr> = Map::new("f");

in which an example key looks like this

  b"\x00\x01f\x00\x02\xA1\xB2"
//  ^^^^^^^^                  two byte length
//          ^                 one byte "f"
//           ^^^^^^^^         two byte length
//                   ^^^^^^^^ two bytes 41394u16

making it 7 bytes to encode "f"/41394. This overhead will be present in every single database entry.
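
For illustration, here is a minimal sketch of how such a key could be assembled under the current scheme. The function name encode_v1 and its signature are mine, not the cw-storage-plus API, but it mirrors the example above by prefixing every component with its length as a 2-byte big-endian integer.

// Sketch of the version 1 encoding: every component gets a 2-byte
// big-endian length prefix, matching the byte layout shown above.
fn encode_v1(components: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for &component in components {
        let len = u16::try_from(component.len()).expect("component exceeds 65535 bytes");
        out.extend_from_slice(&len.to_be_bytes()); // 2-byte length
        out.extend_from_slice(component);
    }
    out
}

// For the example above, encoding the components b"f" and b"\xA1\xB2"
// produces b"\x00\x01f\x00\x02\xA1\xB2" (7 bytes).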

The first obvious improvement would be reducing the length encoding from two bytes to one, which limits the max length of a component to 255 bytes. This would bring our 7-byte key down to 5 bytes. But usually we don't need keys that long. A varint could be used to get a short encoding for the most common small values. But why not go further and write 1-byte components without any length prefix at all? The first ~half of the one-byte range can be reserved for those 1-byte values, and the other half can hold component lengths.

To accommodate 1-byte keys as well as a length value for larger keys, we use the following schema (decoding):

Next byte b (u8)   Interpretation
0 - 126            len: 1; value: set to b
127 - 249          len: b - 127; value: read len more bytes
250                len: next byte + 127; value: read len more bytes
251                len: next byte + 127 + 1 * 256; value: read len more bytes
252                len: next byte + 127 + 2 * 256; value: read len more bytes
253                len: next byte + 127 + 3 * 256; value: read len more bytes
254                len: next byte + 127 + 4 * 256; value: read len more bytes
255                len: next byte + 127 + 5 * 256; value: read len more bytes

This allows key components from 0 bytes to 1662 bytes but favours the encoding of 1-byte components with values from NUL (0x00) to "~" (0x7E), including all printable ASCII characters. This allows enumerated top-level keys of up to 127 elements to be stored in the most efficient way.
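
To make the table concrete, here is a minimal decoder sketch for the scheme above; the function name and signature are illustrative, not an existing API.

// Decode one key component according to the table above.
// Returns the component and the remaining input, or None on truncated input.
fn decode_component(input: &[u8]) -> Option<(Vec<u8>, &[u8])> {
    let (&b, rest) = input.split_first()?;
    // 0-126: the byte is itself a complete 1-byte component
    if b <= 126 {
        return Some((vec![b], rest));
    }
    let (len, rest) = if b <= 249 {
        // 127-249: single length byte, length is b - 127
        ((b - 127) as usize, rest)
    } else {
        // 250-255: length is next byte + 127 + (b - 250) * 256
        let (&next, rest) = rest.split_first()?;
        (next as usize + 127 + (b as usize - 250) * 256, rest)
    };
    if rest.len() < len {
        return None;
    }
    Some((rest[..len].to_vec(), &rest[len..]))
}

// Example: decoding b"f\x81\xA1\xB2" first yields the 1-byte component b"f"
// (byte 102 <= 126), then the length byte 0x81 (= 129 = 2 + 127) announces a
// 2-byte component, yielding b"\xA1\xB2".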

The above example would then be encoded as 4 bytes only:

   b"f\x02\xA1\xB2"
//  ^             one byte "f"
//   ^^^^         one byte length
//       ^^^^^^^^ two bytes 41394u16

The length can be encoded as 0, 1 or 2 bytes:

Key component                  Length encoding
length 1 and value <= 126u8    0 bytes
length <= 122                  1 byte
length <= 1662                 2 bytes
length > 1662                  unsupported
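
The matching encoder side can be sketched like this (illustrative names, not an existing API), emitting 0, 1, or 2 length bytes as summarized above.

// Append one key component to `out` using the proposed encoding.
fn encode_component(out: &mut Vec<u8>, component: &[u8]) {
    match component.len() {
        // single byte with value 0-126: written as-is, no length prefix
        1 if component[0] <= 126 => out.push(component[0]),
        // lengths 0-122: one length byte in the range 127-249
        len if len <= 122 => {
            out.push(127 + len as u8);
            out.extend_from_slice(component);
        }
        // lengths 127-1662: two length bytes, the first in the range 250-255
        len if (127..=1662).contains(&len) => {
            let extra = len - 127;
            out.push(250 + (extra / 256) as u8);
            out.push((extra % 256) as u8);
            out.extend_from_slice(component);
        }
        // lengths 123-126 have no representation in the table above, and
        // anything longer than 1662 bytes is unsupported
        _ => panic!("unsupported key component length"),
    }
}

// Encoding the example: "f" needs no prefix, the 2-byte u16 gets length byte
// 127 + 2 = 129 (0x81), resulting in b"f\x81\xA1\xB2" (4 bytes).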

Benefits

  • Slightly shorter keys on average
  • Much shorter keys in optimized scenarios

Drawbacks

  • Slightly more complicated design, as one byte can be either a value or a length

Neutral

  • Reduces the max component length from 64 KiB to 1662 bytes. This can be limiting, but if you hit this limit you are probably doing something wrong anyway. Longer keys are possible if spread across multiple key components.

Further remarks

  1. The above encoding is never longer than the previous one; there is no case in which a key becomes larger. This is primarily due to the length restriction.
  2. The encoding can be implemented using the same API as today, leading to no inconvenience for users.
  3. Top-key enum solutions as shown in https://github.com/noislabs/nois-contracts/pull/268/files can be streamlined by functions or macros (see the sketch below). This gives you a convenient way to get 1-byte key components without the risk of collisions.
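
A hedged sketch of what such a helper could look like, assuming a central enum whose variants are single printable ASCII bytes; the names and the as_str helper are hypothetical, not the API from the linked PR.

// Hypothetical central registry of 1-byte top-level key components.
// Each variant is a value in the 0-126 range, so it needs no length prefix,
// and keeping them in one enum avoids accidental namespace collisions.
#[repr(u8)]
pub enum TopKey {
    Config = b'c',
    FastMap = b'f',
    Participants = b'p',
}

impl TopKey {
    pub const fn as_str(&self) -> &'static str {
        match self {
            TopKey::Config => "c",
            TopKey::FastMap => "f",
            TopKey::Participants => "p",
        }
    }
}

// Usage with a cw-storage-plus style constructor taking a &str namespace:
// const FASTMAP: Map<u16, Addr> = Map::new(TopKey::FastMap.as_str());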