Riak Source Code Reading #27: Riak KV backend API and storage

The Riak KV backend store API

behaviour_info(callbacks) ->
    [
     {api_version,0},
     {capabilities, 1},  % (State)
     {capabilities, 2},  % (Bucket, State)
     {start,2},          % (Partition, Config)
     {stop,1},           % (State)
     {get,3},            % (Bucket, Key, State)
     {put,5},            % (Bucket, Key, IndexSpecs, Val, State)
     {delete,4},         % (Bucket, Key, IndexSpecs, State)
     {drop,1},           % (State)
     {fold_buckets,4},   % (FoldBucketsFun, Acc, Opts, State),
                         %   FoldBucketsFun(Bucket, Acc)
     {fold_keys,4},      % (FoldKeysFun, Acc, Opts, State),
                         %   FoldKeysFun(Bucket, Key, Acc)
     {fold_objects,4},   % (FoldObjectsFun, Acc, Opts, State),
                         %   FoldObjectsFun(Bucket, Key, Object, Acc)
     {is_empty,1},       % (State)
     {status,1},         % (State)
     {callback,3}];      % (Ref, Msg, State) ->
behaviour_info(_Other) ->
    undefined.

A very simple version of the 'yessir' backend

First, look at this simple version of the 'yessir' back end.

The 'yessir' back end is only used for debugging and for performance analysis.
The 'yessir' back end does not store anything. It always lies.
The intent is to be very simple and fast.

Early version of riak_kv_yessir_backend.erl, June, 2012

Being fast /= being short & simple code.
- Today's version of riak_kv_yessir_backend.erl is much more complicated.
- ... but it is faster and it has more features.

Let's look at the 'memory' backend

This code is more complicated than the 'yessir' backend. Why?

It actually has a real memory.
It actually supports Riak KV vnode fold operations
- fold_buckets, fold_keys, fold_objects
- syncronous fold, asyncronous fold
- The fold API looks very simple & easy to use...
- The fold API internal plumbing is ... not pretty.
It supports Riak KV's 2I ("secondary indexes")
It supports age limits ("time to live" or "TTL")
It supports memory size accounting (instead of using ETS counters)
It supports maximum memory limit:
- If limit is exceeded, then delete oldest object(s)

Here is the version of today's develop branch of the 'memory' backend: https://github.com/basho/riak_kv/blob/8ebfeb6700b451772e779c5f3c5601ca1ab4ec92/src/riak_kv_memory_backend.erl

Note #1 about folding

Very nice, pure, and easy:

lists:foldl(fun(_X, Count) -> Count + 1 end,
            0,
            [a, b, c, d, e, f, 42.0, "yo, kewl"]).

The backend and vnode folding funcs are not pure and easy. (sadpanda)

Note #2 about folding: general patterns & themes

General pattern: filter

filter_wrapper_fun(Item, InnerFun, Acc) ->
    case is_this_item_invisible(Item) of
        true ->
            Acc;
        false ->
            InnerFun(Item, Acc)
    end.

General pattern: transform

transform_wrapper_fun(Item, InnerFun, Acc) ->
    RealItem = transform_somehow(Item),
    InnerFun(RealItem, Acc).

General pattern: transform from riak_kv_vnode.erl

handle_command(?FOLD_REQ{foldfun=FoldFun, acc0=Acc0,
                         forwardable=_Forwardable, opts=Opts}, Sender, State) ->
    FoldWrapper = fun(Bucket, Key, Value, Acc) ->
                          FoldFun({Bucket, Key}, Value, Acc)
                  end,

Now, let's look at the 'memory' backend's fold functions: buckets, keys, objects, ...

What is a backend "folder" function?

A backend "folder" is a function that implements a backend's fold.
- ... Or like lists:foldl() implements a lists fold.
- ... Just like ets:fold() implements the ETS fold.
General folder pattern: stop the fold early

perhaps_stop_early_wrapper_fun(Item, InnerFun, Acc) ->
    try
        InnerFun(RealItem, Acc)
    catch
        stop_fold ->
            Acc
    end.

%%% ...

my_fold_function({Key, _Val}=BKey, InnerFun, Acc) ->
    if
       Key >= <<"n">> ->    % Avoid keys >= "n..."
           throw stop_fold;
       true ->
           InnerFun(BKey, Acc)
    end.

Bitcask, the Riak KV backend

Source code: https://github.com/basho/riak_kv/blob/develop/src/riak_kv_bitcask_backend.erl

This is the glue code between Riak KV and the Bitcask app.

Some additional things that the Bitcask backend does that the memory backend does not:

Merge operations, to reclaim unused/"dead" disk space.
Check for data upgrade/file format conversions

Bitcask, the app

See also: the original Bitcask design summary, April 2010.

This design doc is still mostly correct ... many small details have changed in the last 4 years, though.

Part 1: the "keydir" is stored in RAM always.
- When a bitcask is opened, the keydir is read from disk -> RAM.
- You must have enough RAM to store all keys.
Part 2: the data on disk.
- Key + value data is stored on disk

Bitcask, part 1: the keydir in RAM

The keydir is a key/value store
- key: same as the Bitcask key
- value: where on disk to find the Bitcask value
  - file_id #, value size, value offset, timestamp
The keydir is a hash table
- Key order is not preserved
- Key iteration is done via fold
- Severe limits on the # of simultaneous folds
The keydir is implemented as a NIF
- NIF <-> C code

Bitcask, part 2: data files on disk

Bitcask files are append-only.
To delete or replace data, new records are appended to the active file.
"Dead data" lives on disk for a period of time...
... until "merge" can copy "alive data" to a new data file.
- For each copied old key + value, update the RAM keydir
- When all "alive data" is copied, then delete the old data file.
Only two files may be open at one time:
- The active file: new writes only
- (maybe) The merge file: old data copied from old files

slfritchie/scr27.md