Skip to content

Instantly share code, notes, and snippets.

@deltheil
Last active December 29, 2015 19:29
Show Gist options
  • Save deltheil/7718095 to your computer and use it in GitHub Desktop.
Save deltheil/7718095 to your computer and use it in GitHub Desktop.
A (somewhat) detailed description of Tokyo Cabinet B+tree database.

Tokyo Cabinet B+ tree database

Overview

The Tokyo Cabinet B+tree database (a.k.a TCBDB) is implemented on-top of the hash database (TCHDB) that acts as the persistence layer.

Note: the same occurs with Tokyo table database (TCTDB) which is also a wrapper around the TCHDB (the table database can be thought as a document oriented database with querying capabilities).

The TCBDB features (among other things):

  • an in-memory caching layer (for internal and leaves nodes) that can be tuned,
  • an user-defined key comparison function (lexical by default),
  • various iteration schemes via a convenient cursor,
  • record duplication.

Note: record duplication is particularly convenient. See this gist for a concrete illustration, or redisk that uses it to implement Redis list data structure.

Technical Description

A B+ tree DB is made of:

  • Leaves:

    • uint32_t lmemb (a.k.a leaf member): controls the maximum number of records that can be stored into a leaf, i.e. the leaf's page capacity
      • minimum value: 4 (see BDBMINLMEMB)
      • default value: 128 (see BDBDEFLMEMB)
    • uint32_t lsmax: the maximum size of each leaf, i.e. the maximum amount of data that can be recorded into the page
      • default value: 16kb (see BDBDEFLSMAX)
      • minimum value: 0.5kb (see BDBMINLSMAX)
  • Nodes:

    • uint32_t nmemb (a.k.a node member): maximum number of members in each node, i.e. the indices list maximum capacity
      • minimum value: 4 (see BDBMINLMEMB)
      • default value: 256 (see BDBDEFLMEMB)
  • A cache for in-memory storage:

    • TCMAP *leafc: an LRU cache for the leaves
      • uint32_t lcnum: the maximum number of cached leaves
      • default value: 1024 (see BDBDEFLCNUM)
      • minimum value: 64 (see BDBLEVELMAX)
    • TCMAP *nodec: an LRU cache for the internal nodes
      • uint32_t ncnum: the maximum number of cached nodes
      • default value: 512 (see BDBDEFNCNUM)
      • minimum value: 64 (see BDBLEVELMAX)
  • An history array for visited nodes:

    • uint64_t *hist: the history array of visited nodes of size 64 (max level of B+ tree)
    • int hnum: number of element of the history array
  • A comparison function for keys ordering:

    • TCCMP cmp
  • An internal key/value database (a.k.a hash database) for persistent storage:

    • TCHDB *hdb: the underlying Hash DB (both internal and leaves nodes have an unique ID and are stored persistently into this internal DB)
      • bnum: the bucket number of the underlying Hash DB
        • default value: 32749 (see BDBDEFBNUM)
      • xmsiz: the memory mapped size of the underlying Hash DB
        • default value: 0 (see tcbdbnew: tchdbsetxmsiz(bdb->hdb, 0))
      • apow: default alignment power of the underlying Hash DB
      • fpow: default free block pool power of the underlying Hash DB

Behavior vs. records insertion

As more and more (key, value) pairs are inserted, more and more records are added to the leaf's page.

This leads to a leaf divide:

  • if the number of records is higher than lmemb (max records per leaf)
  • if the leaf size is higher than lsmax (max data size per leaf)

Bucket Number Tuning

The bucket number should be computed according to:

  • the total number of records to be stored (or expected if not known),
  • the maximum number of records per leaf.

e.g. for rnum = 1M (key, value)-s total number of records and lmemb = 256 records per leaf, this requires rnum / lmemb leaves to be stored into the underlying database, hence the bucket number should be a multiple, e.g. bnum = 2 x rnum / lmemb.

Using a large multiple (e.g. 8 times) to obtain a large bucket number is recommended to lower the risk of hash values collisions within the underlying DB.

In-Memory Cache Tuning

Increasing the cache size capacity through lcnum and ncnum parameters greatly speeds-up the querying since for each cached leaf/node it allows to load it from the in-memory storage avoiding 2 costly operations:

  • leaf lookup, i.e. search the leaf object corresponding to the input key (see tcbdbsearchleaf)
  • leaf load, i.e. retrieve the leaf by querying the internal DB (see tcbdbleafload)

In other words, to speed-up the runtime querying, it is recommended to choose lcnum and ncnum parameters as high as possible.

If not all nodes can be kept in-memory, one must remember that the internal cache works in an LRU fashion, i.e. the most frequently requested records are kept in-memory.

Useful Links

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment