kellabyte

HoloFast: a kernel-assisted eBPF fast path for accelerating Accord consensus

Core idea

Inspired by Electrode, HoloFast is an Accord acceleration layer for HoloStore’s consensus library, implemented partly as a small fixed-frame eBPF protocol running in the Linux kernel. It lets the kernel take one outgoing Accord message and fan it out to multiple replicas, then watch replies and notify Rust only when enough replicas have responded for quorum. The important Accord decisions stay in Rust; eBPF only handles repetitive packet work like fan-out, fan-in, duplicate filtering, and steering.

Electrode saw up to 128.4% higher throughput and 41.7% lower latency by moving repetitive fan-out and quorum-waiting work into eBPF; HoloFast applies that idea to HoloStore’s Accord path.

1. Background: what Electrode proved and why it matters

Bifrost First Run

Bifrost is a DuckDB extension that bridges the OLTP (TursoDB) and OLAP (DuckDB) worlds through a single ACID transaction layer by storing OLAP-native Parquet segments in TursoDB.

Bifrost was written by Codex with approximately 24h of compute time, and we completed milestone 3 of the project plan. Bifrost contains zero human-written code, as an experiment in what is possible. The decisions along the way, however, have been heavily human-directed.

The project plan was created from my (@kellabyte) requirements via Codex. This document is written only by me with no help from AI.

Query Routing

Bifrost supports 2 ways of routing queries.

Auto: Bifrost detects if the query would benefit from an OLAP orientation. A query to table events would route to the events_oltp table.

name	start	end	duration	reads	writes	cpu
foo0	2024-10-03 14:45:38.773000-04:00	2024-10-03 14:45:38.773000-04:00	1026	124	0	0
foo1	2024-10-03 14:45:46.380000-04:00	2024-10-03 14:45:46.380000-04:00	85	0	0	0
foo2	2024-10-03 14:45:46.380000-04:00	2024-10-03 14:45:46.380000-04:00	195	2	0	0
foo3	2024-10-03 14:45:46.380000-04:00	2024-10-03 14:45:46.380000-04:00	19	0	0	0
foo4	2024-10-03 14:45:46.380000-04:00	2024-10-03 14:45:46.380000-04:00	136	2	0	0
foo5	2024-10-03 14:45:46.383000-04:00	2024-10-03 14:45:46.383000-04:00	16	0	0	0
foo6	2024-10-03 14:45:46.383000-04:00	2024-10-03 14:45:46.383000-04:00	118	0	0	0
foo7	2024-10-03 14:45:54.440000-04:00	2024-10-03 14:45:54.440000-04:00	44	0	0	0
foo8	2024-10-03 14:45:54.440000-04:00	2024-10-03 14:45:54.440000-04:00	287	3	0	0

History

For a long time I've been really impacted by the ease of use Cassandra and CockroachDB bring to operating a data store at scale. While these systems have very different tradeoffs, what they have in common is how easy it is to deploy and operate a cluster. I have experience with them at cluster sizes in the dozens, hundreds, or even thousands of nodes, and compared to some other clustered technologies they get you far pretty fast. They have sane defaults that provide scale and high availability to people who wouldn't always understand how to achieve it with more complex systems. People can get pretty far before they have to become experts. When you start needing more extreme usage you will need to become an expert in the system, just like with any other piece of infrastructure. But what I really love about these systems is that they make geo-aware data placement, data replication, and data movement a breeze most of the time, and potentially simplify GDPR concerns.

Several years ago the great [Andy Gross](ht

If I run 20 haywire processes using tcmalloc, haywire reaches 6.3 million requests/second.

killall hello_world; for i in `seq 20`; do LD_PRELOAD="./lib/gperftools/.libs/libtcmalloc.so" ./build/hello_world --balancer reuseport & done

perf top
   7.42%  hello_world              [.] http_request_buffer_pin
   6.94%  hello_world              [.] http_request_buffer_reassign_pin
   6.63%  hello_world              [.] http_parser_execute
   6.23%  libtcmalloc.so.4.3.0     [.] tc_deletearray_nothrow
   4.94%  hello_world              [.] http_request_buffer_locate

3 million requests/second

./build/hello_world --threads 20 --balancer reuseport
perf top

   9.76%  hello_world              [.] http_parser_execute
   7.85%  libc-2.21.so             [.] malloc
   4.50%  libc-2.21.so             [.] free
   3.43%  libc-2.21.so             [.] __libc_calloc

FSQual run on Linux subsystem for Windows

fsqual - file system qualification tool for asynchronous I/O
https://github.com/avikivity/fsqual

./fsqual
context switch per appending io (iodepth 1): 0 (GOOD)
context switch per appending io (iodepth 3): 0 (GOOD)
context switch per appending io (iodepth 3): 0 (GOOD)

Max throughput benchmark

Max throughput in master

./bin/wrk/wrk --script ./pipelined_get.lua --latency -d 5m -t 40 -c 760 http://server:8000 -- 32
Running 5m test @ http://server:8000
  40 threads and 760 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.16ms    6.70ms 223.68ms   93.20%
    Req/Sec   114.20k    28.04k  410.56k    72.96%

Why does this work on Rubular.com but not in Ruby code? What am I doing wrong?

Source string

Running 1s test @ http://192.168.0.2:8000\n  8 threads and 256 connections\n  Thread Stats   Avg      Stdev     Max   +/- Stdev\n    Latency     2.08ms    3.91ms  61.86ms   92.89%\n    Req/Sec    23.80k    10.45k   60.63k    74.70%\n  Latency Distribution\n     50%    1.10ms\n     75%    1.59ms\n     90%    4.03ms\n     99%   22.04ms\n  197510 requests in 1.10s, 29.76MB read\nRequests/sec: 179561.40\nTransfer/sec:     27.06MB

Ruby regex

output.match("Requests\/sec: (.*)\\n")
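One possible way to sidestep the question, offered as a sketch rather than a diagnosis: capture only the numeric characters after the label, so the match never depends on how the trailing newline is represented in the string. The method name requests_per_sec and the shortened sample string are assumptions made for illustration; they are not from the original snippet.

```ruby
# Sample wrk output, shortened from the source string above. Real newlines are
# assumed here, but the pattern below does not depend on them.
output = "  197510 requests in 1.10s, 29.76MB read\n" \
         "Requests/sec: 179561.40\n" \
         "Transfer/sec:     27.06MB"

# Hypothetical helper: returns the Requests/sec figure as a Float,
# or nil when the label is not present in the text.
def requests_per_sec(text)
  # %r{...} avoids having to escape the "/" in "Requests/sec".
  # [\d.]+ stops at the first character that is not a digit or a dot,
  # so no trailing "\n" is needed to end the capture.
  match = %r{Requests/sec:\s+([\d.]+)}.match(text)
  match && match[1].to_f
end

puts requests_per_sec(output)
```

Because the capture group is limited to digits and dots, the same pattern also works if the string holds a literal backslash followed by n instead of a real newline, which is one way a pattern ending in \n can fail to match.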

	BenchmarkCRC32/1-8 100000000 15.80 ns/op 63.46 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/2-8 100000000 15.80 ns/op 126.58 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/4-8 100000000 15.90 ns/op 250.87 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/8-8 100000000 16.00 ns/op 498.77 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/32-8 100000000 18.40 ns/op 1738.69 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/64-8 100000000 21.00 ns/op 3053.22 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/128-8 50000000 26.50 ns/op 4823.16 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/256-8 50000000 38.80 ns/op 6596.60 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/512-8 30000000 51.00 ns/op 10037.68 MB/s 0 B/op 0 allocs/op
	BenchmarkCRC32/1024-8 20000000 84.90 ns/op 12055.07 MB/s