Modern hardware trends have made it feasible to combine large in-memory buffers with fast SSD storage to achieve both high performance and scalability. DRAM capacity growth has slowed and DRAM remains expensive, while SSDs have become very fast (multi-GB/s read throughput) and much cheaper per byte (p29-neumann-cidr20.pdf). For example, a 2 TB NVMe SSD can stream ~3.5 GB/s and costs a few hundred dollars, whereas 2 TB of DRAM costs tens of thousands (p29-neumann-cidr20.pdf). As a result, pure in-memory DBMSs, while offering top performance, are increasingly uneconomical and limited in capacity (p29-neumann-cidr20.pdf). A more cost-effective approach is to use a sizable RAM buffer for the hot working set and SSDs for cold data, yielding near in-memory performance at a fraction of the cost (p29-neumann-cidr20.pdf). We “wholeheartedly agree with this notion” (p29-neumann-cidr20.pdf) and present Umbra, a database system designed to deliver genuine in-memory performance for cached data while transparently scaling to SSD storage when data exceeds RAM (p29-neumann-cidr20.pdf).
Umbra is the successor to the HyPer in-memory system and removes HyPer’s main limitation: data no longer has to fit entirely in main memory (p29-neumann-cidr20.pdf). Crucially, Umbra achieves this without sacrificing performance when the working set does fit in RAM (p29-neumann-cidr20.pdf). Umbra retains key design aspects of HyPer (e.g. compiled query execution) but diverges in many low-level mechanisms necessary for efficient external-memory operation (p29-neumann-cidr20.pdf). In particular, Umbra introduces a novel SSD-conscious buffer manager with variable-size pages that provides disk-based storage with negligible overhead, rivaling an in-memory system for cached data (p29-neumann-cidr20.pdf). This addresses a major shortcoming of traditional disk-based systems: fixed-size pages simplify the buffer pool but make it difficult to store large objects (long strings, large binary data, etc.) without complex upper-layer logic (p29-neumann-cidr20.pdf). By contrast, Umbra’s variable-page design can accommodate large objects in one piece, avoiding the expensive object fragmentation and assembly mechanisms required in other systems (p29-neumann-cidr20.pdf).
In summary, Umbra’s contributions include a high-performance, variable-page buffer manager (Section 2) (p29-neumann-cidr20.pdf), along with enhancements to other DBMS components (string storage, statistics, query execution) to gracefully handle data larger than memory (Section 3) (p29-neumann-cidr20.pdf). We demonstrate through experiments that Umbra matches or exceeds pure in-memory systems on in-memory workloads and handles out-of-memory cases efficiently (Section 4). We also compare Umbra’s design to related systems like HyPer and LeanStore, and conclude with insights for building the next generation of high-performance disk-based systems.
Prior research showed that conventional buffer managers are a major bottleneck on modern hardware (p29-neumann-cidr20.pdf). Our team’s recent LeanStore project addressed many of these inefficiencies, achieving performance close to an in-memory system when the working set fits in RAM (p29-neumann-cidr20.pdf). Umbra builds on the fundamental ideas of LeanStore (highly scalable, low-overhead buffering) while extending them to support variable-size pages (p29-neumann-cidr20.pdf). This section details Umbra’s buffer manager design, which allows flexible page sizes and exploits operating system virtual memory features to eliminate fragmentation and overhead. We describe the memory management scheme, pointer swizzling for fast address translation, an optimistic concurrency mechanism (versioned latches), and how relations are stored and accessed via the buffer pool. We also touch on how recovery (ARIES logging) is adapted to variable page sizes.
Umbra organizes pages into size classes, each class doubling the page size of the previous (exponentially growing sizes) (p29-neumann-cidr20.pdf). Size class 0 is the smallest page (chosen as 64 KiB in Umbra) (p29-neumann-cidr20.pdf). Class 1 is 128 KiB, class 2 is 256 KiB, and so on, up to potentially the entire buffer pool size (p29-neumann-cidr20.pdf). In practice, the largest pages remain much smaller than the buffer pool, but this design permits very large pages if needed. All pages, regardless of size, reside in a single unified buffer pool of configurable total size (p29-neumann-cidr20.pdf). By default, Umbra allocates up to half of physical memory for the buffer pool, leaving the rest for query execution memory (p29-neumann-cidr20.pdf). Unlike earlier variable-page systems (e.g. an old ADABAS variant [18]), Umbra does not require statically partitioning memory per page size class (p29-neumann-cidr20.pdf). The buffer pool memory is shared among all page sizes, avoiding the need to guess memory proportions per class in advance (p29-neumann-cidr20.pdf). Externally, the buffer manager’s interface looks like a traditional one: components pin pages (loading from disk if not already in memory) and unpin them when done, allowing eviction if needed (p29-neumann-cidr20.pdf). Internally, however, Umbra’s memory management is novel.
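To make the scheme concrete, here is a minimal sketch of the size-class arithmetic described above, assuming the 64 KiB base size and doubling per class; the helper names are ours, not Umbra’s actual code:

```cpp
#include <cstddef>

// Sketch of the exponential size-class scheme: class 0 is 64 KiB,
// and every subsequent class doubles the page size.
constexpr std::size_t kMinPageSize = 64 * 1024;

constexpr std::size_t pageSizeOfClass(unsigned sizeClass) {
    return kMinPageSize << sizeClass;   // 64 KiB, 128 KiB, 256 KiB, ...
}

// Smallest size class whose pages can hold `bytes` bytes of payload.
constexpr unsigned sizeClassFor(std::size_t bytes) {
    unsigned cls = 0;
    while (pageSizeOfClass(cls) < bytes) ++cls;
    return cls;
}

static_assert(pageSizeOfClass(0) == 64 * 1024);
static_assert(pageSizeOfClass(2) == 256 * 1024);
static_assert(sizeClassFor(100 * 1024) == 1);   // needs a 128 KiB page
```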
The chief challenge with multiple page sizes in one pool is external fragmentation – large pages could break up the memory area and leave unusable gaps. Umbra sidesteps this by exploiting the flexible mapping between virtual and physical memory provided by the OS (p29-neumann-cidr20.pdf). Modern operating systems maintain a page table that translates a process’s virtual addresses to physical RAM addresses on demand (p29-neumann-cidr20.pdf). This mechanism allows contiguous virtual memory to be backed by non-contiguous physical pages and even allows reserving virtual address ranges without immediately consuming physical memory (p29-neumann-cidr20.pdf). Umbra leverages these features to create a buffer pool that is contiguously addressed in each size class but does not waste physical memory.
Specifically, for each page size class, Umbra reserves a large virtual memory region via the `mmap()` system call (p29-neumann-cidr20.pdf). Each region is as big as the entire buffer pool (in theory), ensuring enough address space to hold all buffer pages of that class (p29-neumann-cidr20.pdf). We use a private anonymous mapping, so no physical memory is allocated up front – the kernel simply marks that virtual address range as reserved (no backing) (p29-neumann-cidr20.pdf). We then partition each region into chunks equal to the page size for that class (p29-neumann-cidr20.pdf). For each chunk, the buffer manager creates a buffer frame descriptor containing a pointer to that chunk’s start address (p29-neumann-cidr20.pdf). These frame pointers serve as stable home addresses for pages: a given page (by page ID and size class) always maps to a fixed virtual address in its class’s region (p29-neumann-cidr20.pdf). As a result, there is no virtual address fragmentation within a size class – pages of class i live in disjoint, pre-reserved address ranges. The backing physical memory for those addresses will be allocated on demand when pages are loaded, and may be fragmented, but that doesn’t affect our ability to use contiguous virtual addresses.
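The following sketch illustrates one way such per-class regions could be reserved with a private anonymous mapping. The structure and names (`SizeClassRegion`, `reserveRegion`) are ours, and the `MAP_NORESERVE` flag is an assumption rather than something stated in the paper:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <stdexcept>

// Sketch: reserve one large anonymous region per size class and carve it
// into fixed "home addresses" for buffer frames.
struct SizeClassRegion {
    std::byte* base = nullptr;   // start of the reserved virtual region
    std::size_t pageSize = 0;    // page size of this class
    std::size_t numFrames = 0;   // bufferPoolSize / pageSize

    std::byte* frameAddress(std::size_t frameIndex) const {
        return base + frameIndex * pageSize;
    }
};

SizeClassRegion reserveRegion(std::size_t pageSize, std::size_t bufferPoolSize) {
    // MAP_PRIVATE | MAP_ANONYMOUS: reserve address space only; physical
    // memory is allocated lazily when the pages are actually touched.
    // MAP_NORESERVE avoids reserving swap for the (mostly unused) region.
    void* base = mmap(nullptr, bufferPoolSize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        throw std::runtime_error("mmap failed");
    return SizeClassRegion{static_cast<std::byte*>(base), pageSize,
                           bufferPoolSize / pageSize};
}
```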
When the system needs to load a page from disk, it finds an available buffer frame for that page’s size class and issues a `pread()` call to read the page data directly into the frame’s fixed virtual address (p29-neumann-cidr20.pdf). At that moment, the OS page-faults the virtual addresses and populates them with physical RAM, bringing the data in (standard demand paging into the anonymous mapping). Conversely, when a dirty page is evicted, Umbra first writes it back to storage with `pwrite()` (ensuring the disk copy is up-to-date), then frees its physical memory by advising the kernel that we no longer need those addresses (p29-neumann-cidr20.pdf). We use the `madvise(..., MADV_DONTNEED)` call to flag the frame’s pages as discardable (p29-neumann-cidr20.pdf). This immediately frees the physical memory (allowing the OS to reuse it) but crucially does not alter the virtual address mapping (p29-neumann-cidr20.pdf). The frame’s address remains reserved and will now act as a zero-filled page if accessed while unloaded (on Linux, reads from a `MADV_DONTNEED` region of an anonymous mapping return zero bytes until repopulated) (p29-neumann-cidr20.pdf). This behavior is important for our optimistic concurrency (discussed later) and ensures that using large virtual regions never causes the process’s resident set to exceed the configured buffer pool size (p29-neumann-cidr20.pdf). Because our `mmap` mappings are anonymous (not file-backed), using `madvise` has essentially no overhead beyond marking pages unused (p29-neumann-cidr20.pdf).
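A simplified sketch of the load and evict paths, using the `pread`/`pwrite`/`madvise` calls described above; error handling and file-offset bookkeeping are reduced to a minimum, and this is not Umbra’s actual code:

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <stdexcept>

// Load a page into its fixed frame address; reading into the anonymous
// mapping faults in physical memory on demand.
void loadPage(int fd, std::byte* frameAddr, std::size_t pageSize, off_t fileOffset) {
    if (pread(fd, frameAddr, pageSize, fileOffset) != static_cast<ssize_t>(pageSize))
        throw std::runtime_error("pread failed");
}

// Evict a page: write it back if dirty, then release the physical memory
// but keep the virtual mapping intact; subsequent reads of this range
// observe zero bytes until the page is reloaded.
void evictPage(int fd, std::byte* frameAddr, std::size_t pageSize, off_t fileOffset,
               bool dirty) {
    if (dirty && pwrite(fd, frameAddr, pageSize, fileOffset) != static_cast<ssize_t>(pageSize))
        throw std::runtime_error("pwrite failed");
    if (madvise(frameAddr, pageSize, MADV_DONTNEED) != 0)
        throw std::runtime_error("madvise failed");
}
```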
Umbra currently treats the storage as a standard block device (e.g. an SSD presented by the OS), using `pread`/`pwrite` for simplicity (p29-neumann-cidr20.pdf). This design imposes no constraints on the underlying hardware and relies on the OS for I/O scheduling and caching. It would be possible to interface more directly with advanced storage (like open-channel SSDs that expose their internal parallelism) for further optimization (p29-neumann-cidr20.pdf), but that is left for future work. At runtime, the buffer manager tracks the total size of pages currently loaded in RAM and evicts pages when needed to keep usage under the buffer pool limit (p29-neumann-cidr20.pdf). Umbra adopts LeanStore’s replacement policy: pages are speculatively unpinned as soon as possible and placed into a FIFO “cooling” queue rather than being evicted immediately (p29-neumann-cidr20.pdf). If a page is accessed again while in the cooling stage, it can be re-pinned cheaply (it hasn’t actually been removed yet) (p29-neumann-cidr20.pdf). Pages that reach the end of the queue without being accessed are finally evicted to free memory. This acts similarly to an LRU with second chance, but is implemented with simple FIFO queues for scalability (p29-neumann-cidr20.pdf).
Overall, this memory management scheme lets Umbra reap the benefits of variable-size pages without fragmentation or high complexity. We effectively use the OS’s virtual memory system to create dynamic “slots” for pages of different sizes, eliminating external fragmentation entirely (p29-neumann-cidr20.pdf). The overhead for mapping and unmapping pages is minimal, and we rely on the OS to handle physical allocation optimally. With basic replacement in place, Umbra’s buffer manager achieves its goal: supporting huge objects and mixed page sizes with little runtime cost. Next, we discuss additional optimizations (inherited and extended from LeanStore) that remove bottlenecks in address translation and synchronization.
When pages can reside on disk or in memory, references to pages cannot be direct pointers all the time; they often must be indirect via a page identifier (PID) that can be looked up to find the page’s memory location. A naive global mapping table from PIDs to memory addresses would become a concurrency hotspot on modern many-core systems (p29-neumann-cidr20.pdf). Instead, Umbra employs pointer swizzling, a decentralized address translation technique adapted from LeanStore (p29-neumann-cidr20.pdf). In Umbra, any reference to a page (within an index node, for example) is stored in a 64-bit field called a swip (swizzled pointer) (p29-neumann-cidr20.pdf). This 64-bit value encodes either a direct memory address if the target page is currently in RAM, or the page’s disk identifier if it is not (p29-neumann-cidr20.pdf). A swizzled swip means it holds a pointer, and an unswizzled swip means it holds a PID.
We use pointer tagging to distinguish these cases with minimal overhead. Specifically, Umbra reserves the least significant bit of the 64-bit value as a flag (p29-neumann-cidr20.pdf). All aligned pointers have their LSB = 0 (because addresses of 8-byte-aligned blocks end in binary `...000`), so we denote LSB = 0 as the “swizzled” state (p29-neumann-cidr20.pdf). If a swip is unswizzled (pointing to a disk page), we set this bit to 1 and use the remaining upper bits to store the page’s identity (p29-neumann-cidr20.pdf). In the current format, 6 bits store the page’s size class and 57 bits store the page number within that class (p29-neumann-cidr20.pdf). This provides enough address space for 2^57 pages of each size class. With this encoding (shown in Figure 2 of the paper), the buffer manager can interpret a swip by checking one bit: if the LSB is 0, the value is a direct pointer to a memory-resident page; if the LSB is 1, the value encodes (size class, page number) and the buffer manager will load that page into memory on demand (p29-neumann-cidr20.pdf). No centralized lookup table is required – the information needed to fetch the page from disk is contained in the swip itself (p29-neumann-cidr20.pdf). The translation to a memory pointer is just a conditional branch and a bit-mask operation, making swizzling extremely fast (p29-neumann-cidr20.pdf).
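A hedged sketch of this encoding is shown below; the field widths follow the text, but the exact placement of the size-class and page-number bits within the upper 63 bits is our assumption:

```cpp
#include <cstdint>
#include <cassert>

// Swip sketch: LSB = 0 means "swizzled" (in-memory pointer),
// LSB = 1 means "unswizzled" (on-disk page identity).
struct Swip {
    std::uint64_t value;

    bool isSwizzled() const { return (value & 1u) == 0; }

    void* pointer() const {                        // valid only if swizzled
        return reinterpret_cast<void*>(static_cast<std::uintptr_t>(value));
    }
    unsigned sizeClass() const {                   // valid only if unswizzled
        return static_cast<unsigned>((value >> 1) & 0x3Fu);   // 6 bits
    }
    std::uint64_t pageNumber() const {             // valid only if unswizzled
        return value >> 7;                         // remaining 57 bits
    }

    static Swip fromPointer(void* p) {
        auto raw = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
        assert((raw & 1u) == 0 && "pointers must be at least 2-byte aligned");
        return Swip{raw};
    }
    static Swip fromPid(unsigned sizeClass, std::uint64_t pageNumber) {
        return Swip{(pageNumber << 7) |
                    (static_cast<std::uint64_t>(sizeClass) << 1) | 1u};
    }
};
```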
This decentralized scheme does complicate some operations, particularly page eviction. If a page is evicted from memory, any swip that was pointing to it must be “unswizzled” (converted back to a PID) so that future accesses know it’s on disk. In a system where many references to the same page could exist (e.g. a page could be pointed to by multiple parent nodes or sibling links), finding and updating all those swips would be hard or impossible to do correctly and efficiently (p29-neumann-cidr20.pdf). Umbra’s solution is to constrain how pages can be referenced: all buffer-managed data structures must form a tree (or hierarchy) such that each page is referenced by exactly one owning swip (p29-neumann-cidr20.pdf). In other words, a page has a single parent pointer (except the root which is known by higher-level context), and we disallow any structure where multiple pointers refer to the same page. For example, in Umbra’s B+-trees, leaf nodes do not have pointers to their siblings (contrast with some classic B+-tree designs) because a sibling link would create a second reference to that leaf (p29-neumann-cidr20.pdf). By ensuring a unique parent reference, when a page is evicted we know exactly one location to update (the parent’s swip) to unswizzle it. This makes pointer swizzling manageable and consistent. Umbra sacrifices some convenience (like quick lateral traversal via sibling pointers) for this simplicity, but as we will see, we can still implement efficient scans without sibling pointers using a latch-coupling technique (p29-neumann-cidr20.pdf).
With pointer swizzling, Umbra avoids a global mapping structure entirely. Each page pointer in memory is updated in place to a PID on eviction and vice versa on load, with atomic operations. The check for a swizzled vs. unswizzled pointer is just one bit check on access. This design, borrowed from LeanStore, distributes the address translation work and removes a major scalability bottleneck (p29-neumann-cidr20.pdf).
Another challenge for high-performance concurrency is the overhead of latches (lightweight locks) for synchronizing access to pages in the buffer pool. Traditional latch coupling or heavy mutex use can serialize threads and hurt scalability on many-core systems (p29-neumann-cidr20.pdf). LeanStore addressed this with optimistic latching, allowing readers to access pages without taking a heavy lock and only validating afterward (p29-neumann-cidr20.pdf). Umbra adopts and extends this approach by using versioned latches on each buffer frame (p29-neumann-cidr20.pdf).
A versioned latch is essentially a small data structure (64 bits in our implementation) that encodes both lock state and a version counter (p29-neumann-cidr20.pdf). Umbra’s latch layout dedicates 5 bits for state (lock mode and counters) and 59 bits for a version number (p29-neumann-cidr20.pdf) (see Figure 3). The state bits indicate if the latch is free or locked in either exclusive, shared, or optimistic mode, and how many holders if shared (p29-neumann-cidr20.pdf). The 59-bit version is incremented whenever the page is modified, and it is used to validate optimistic reads (p29-neumann-cidr20.pdf). All updates to the latch (locking/unlocking) use atomic instructions to ensure thread safety (p29-neumann-cidr20.pdf).
Umbra’s latching modes work as follows:
- Exclusive mode: This is a traditional write lock. A thread acquires exclusive access by atomically setting the latch’s state to a special code (we use state value 1 to denote exclusive) when the latch is currently unlocked (p29-neumann-cidr20.pdf). Only one thread can hold the latch exclusively at a time. While a page’s latch is in exclusive mode, no other thread can read or write the page. Any operation that modifies a page (inserting a tuple into a page, splitting a node, etc.) requires first acquiring the page’s latch in exclusive mode to avoid data races (p29-neumann-cidr20.pdf). When the thread is finished modifying, it releases the exclusive latch by resetting the state to unlocked (0) and increments the version counter if the page’s content was changed (p29-neumann-cidr20.pdf). This version bump signals to other threads that any prior optimistic read may have seen stale data.
- Shared mode: This is a read lock that multiple threads can hold concurrently. A latch can be acquired in shared mode if it is currently unlocked or already in shared mode (not exclusive) (p29-neumann-cidr20.pdf). We encode the number of shared holders in the state bits, using a bias where state value n+1 indicates n threads holding the latch in shared mode (p29-neumann-cidr20.pdf). For example, state = 2 means one shared holder, state = 3 means two holders, and so on (the value 1 is reserved for exclusive mode). The 5-bit field allows up to 30 concurrent shared holders directly; if more threads need to share, Umbra falls back to an overflow counter (another 64-bit counter) to track the rest (p29-neumann-cidr20.pdf) – but in practice, that many simultaneous readers on one page is rare. While a page is locked in shared mode, threads can read its content but must not modify it, and no exclusive lock can be taken until all readers release the latch (p29-neumann-cidr20.pdf). Importantly, a shared latch pins the page in memory: Umbra treats a page with any shared latch as in use, so it will not evict that page until the last reader is done (p29-neumann-cidr20.pdf). This guarantees that a reader won’t have the page swapped out from under it. When each reader is finished, it atomically decrements the count in the state bits. The last shared reader resets the state to 0 (unlocked) (p29-neumann-cidr20.pdf).
- Optimistic mode: This is the lightweight mode for reading without formal locking. Any number of threads may hold an optimistic latch on a page concurrently, even together with shared readers (or when the latch is free) (p29-neumann-cidr20.pdf). Acquiring an optimistic latch simply means reading the current version number and remembering it; the latch’s state is not changed, so this does not block other threads at all (p29-neumann-cidr20.pdf). Essentially, a thread says “I will read the page, but I promise to double-check later whether someone modified it.” While in optimistic mode, a thread may read the page’s data, but it must be prepared for the read to be invalidated. If another thread takes the latch exclusively and modifies the page, it will increment the version. Therefore, when an optimistic reader is done, it must validate by re-reading the latch’s version and comparing it to the version it saved (p29-neumann-cidr20.pdf). If the version is unchanged, no exclusive writer interfered and the read was consistent. If the version changed, a write occurred; the optimistic reader must retry its operation (e.g. re-traverse the B-tree from the root or reacquire the page) (p29-neumann-cidr20.pdf). In practice, a validation failure means the operator restarts its logic for that page, locating it again via a fresh pointer after the update. A minimal sketch of this validate-and-retry pattern follows this list.
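Below is a minimal sketch of the optimistic-read protocol on a versioned latch. The word layout is simplified to a single lock bit plus a version (not Umbra’s 5 + 59-bit encoding), and the helper names are ours:

```cpp
#include <atomic>
#include <cstdint>

// Simplified versioned latch: bit 0 = exclusively locked, remaining bits = version.
// Optimistic readers never write to the latch word.
struct VersionedLatch {
    std::atomic<std::uint64_t> word{0};

    // Observe the current version, waiting while a writer holds the latch.
    std::uint64_t lockOptimistic() const {
        std::uint64_t v;
        do { v = word.load(std::memory_order_acquire); } while (v & 1u);
        return v;
    }
    // True if no writer intervened since `version` was observed.
    bool validate(std::uint64_t version) const {
        return word.load(std::memory_order_acquire) == version;
    }
    void lockExclusive() {
        std::uint64_t expected;
        do {
            expected = word.load(std::memory_order_relaxed) & ~1ull;   // unlocked value
        } while (!word.compare_exchange_weak(expected, expected | 1u,
                                             std::memory_order_acquire));
    }
    void unlockExclusive(bool modified) {
        // Bump the version on modification so optimistic readers retry.
        std::uint64_t v = word.load(std::memory_order_relaxed) & ~1ull;
        word.store(modified ? v + 2 : v, std::memory_order_release);
    }
};

// Typical usage: read, then validate; retry on failure.
template <typename ReadFn>
auto readOptimistically(const VersionedLatch& latch, ReadFn read) {
    for (;;) {
        std::uint64_t v = latch.lockOptimistic();
        auto result = read();          // may observe torn or zeroed data
        if (latch.validate(v)) return result;
    }
}
```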
Optimistic latching greatly reduces synchronization cost because reading threads don’t perform atomic operations on the latch at all during the bulk of their work – they only do an atomic read at the start and a read at the end to check the version. There is no contention induced for multiple readers; they can all read in parallel even on the same cache line without causing writes (p29-neumann-cidr20.pdf). Writers only contend when acquiring exclusive, as expected.
Umbra’s versioned latch design has a couple of important nuances. First, it even tolerates the case where a page is evicted while being read optimistically. Because of our virtual memory trick, the page’s virtual address is still valid; if the page was evicted (`MADV_DONTNEED` applied), any attempted memory access will just return zero data from a global “zero page” (p29-neumann-cidr20.pdf). The optimistic reader will find garbage (zeros) instead of real data, but crucially it will notice the version change when validating (since evicting a page requires exclusive access, which counts as a modification) and then retry. In other words, the system guarantees no segmentation fault or illegal memory access even if an optimistic read races with eviction – the read might just see an empty page and then retry. This allows Umbra to use optimistic reads freely without complex pin/unpin logic; it is even legal for a page to be unloaded while being read optimistically (p29-neumann-cidr20.pdf), because the address remains valid and mapped to zero-fill.
Second, LeanStore only implemented optimistic mode for readers and did not have a true shared-read mode (p29-neumann-cidr20.pdf). Umbra extends this by supporting shared latches for readers as well (p29-neumann-cidr20.pdf). The rationale is that our query operators generally are oblivious to page boundaries. A sequential scan or join in Umbra doesn’t explicitly check if it crossed to a new page; it just follows pointers in the B+-tree or storage. If we only had optimistic latches, every operator would need to include logic to validate after processing each page, and re-fetch the page if it was evicted mid-operation (p29-neumann-cidr20.pdf). This would complicate the execution engine and could lead to many retries if, say, a long scan runs concurrently with inserts that cause evictions. By allowing readers to take a shared latch on pages they are scanning, Umbra ensures those pages stay in memory throughout the scan and the operator code doesn’t need extra validation checks on every tuple access (p29-neumann-cidr20.pdf). In read-heavy analytics (OLAP) with occasional writes, shared mode also prevents repeated validation failures – once a reader has a shared lock, a writer will wait rather than invalidating the reader repeatedly (p29-neumann-cidr20.pdf). Thus, Umbra’s concurrency control on the buffer pool is a mix of optimistic (for traversals and short-term reads) and shared (for longer scans or when simplicity is needed), combined with exclusive for writes. This approach retains LeanStore’s high read concurrency but is more robust for long scans and simplifies upper-layer operation logic.
Building on the buffer manager’s facilities, Umbra stores tables (relations) in a paged format backed by B+-trees. Every tuple inserted into a table is assigned a synthetic 8-byte tuple identifier (TID), which serves as the key in the primary B+-tree for that relation (p29-neumann-cidr20.pdf). These TIDs are generated such that they increase strictly monotonically over time (p29-neumann-cidr20.pdf) (for instance, a table could have an auto-incrementing surrogate key). This design has a profound effect: inserts always occur at the logical end of the index, and we never split existing pages when inserting new tuples (p29-neumann-cidr20.pdf). Instead of splitting, Umbra allows pages to fill completely; when no space is left in the current leaf, we allocate a new page (a new leaf node) for the next tuple. Likewise, inner nodes (index nodes) do not split – they are created only when a new range of keys is needed beyond the capacity of existing nodes. By using synthetic, ever-increasing keys, Umbra effectively transforms arbitrary insertion patterns into append-like behavior at the B+-tree level, avoiding the costly page splits that plague B-trees under random insert patterns (p29-neumann-cidr20.pdf) (p29-neumann-cidr20.pdf). This is a key innovation that lets us leverage variable-size pages efficiently.
All inner nodes of the B+-tree (which direct the search to the right leaf) are fixed to the smallest page size (64 KiB) in Umbra (p29-neumann-cidr20.pdf). This yields a very high fan-out (e.g. a 64 KiB inner node with 8-byte keys and 8-byte child pointers holds several thousand entries) (p29-neumann-cidr20.pdf), minimizing tree depth. Because we never split inner nodes (only add new ones when necessary), the tree grows by adding new root levels or branches rather than splitting existing levels. Leaf nodes, which store the actual tuples, are allocated in whatever page size is needed to hold the data. In most cases, a single 64 KiB leaf can hold many tuples (especially if tuples are small), so the majority of leaves are 64 KiB (p29-neumann-cidr20.pdf). However, if a tuple is large (for example, it has a very large string or blob), Umbra will allocate a larger leaf page to fit that tuple rather than splitting it across multiple pages (p29-neumann-cidr20.pdf). It chooses the smallest page size that can accommodate the entire tuple. Because typical tuples easily fit in 64 KiB, splits are rare and most data sits in standard-size pages. This approach gives a good balance between point query performance (small pages mean less unnecessary data per access) and range scan throughput (large tuples or groups of tuples can reside contiguously in larger pages) (p29-neumann-cidr20.pdf). Essentially, Umbra avoids ever fragmenting a single tuple across pages, and keeps the structure compact for cache efficiency.
Within each leaf page, Umbra currently uses a PAX layout (p29-neumann-cidr20.pdf) (Partition Attributes Across). This means the page’s fixed-length fields from all tuples are stored in arrays (columns) at the beginning of the page, and any variable-length fields (like strings) are stored in a contiguous area at the end of the page (p29-neumann-cidr20.pdf). PAX is beneficial for CPU cache performance because iterating over a single column’s values in a page is cache-friendly (p29-neumann-cidr20.pdf). When inserting into a page, if needed, Umbra will compact or defragment the variable area to make room (since deletions or updates might leave gaps) (p29-neumann-cidr20.pdf). While PAX is great for in-memory processing, it has a downside for disk-resident data: if a query only needs a few columns of a table, the buffer manager still must load the entire page with all columns (p29-neumann-cidr20.pdf). This is fine for hot data but not optimal for cold, disk-resident data where I/O is precious. Umbra’s designers acknowledge this and plan to integrate alternative page layouts for colder data segments (p29-neumann-cidr20.pdf). For example, DataBlocks (groups of columns stored separately) might be used for cold data so that only needed columns are read from disk (p29-neumann-cidr20.pdf). The storage layout could thus adapt: hot pages use PAX for speed, cold pages use a more columnar layout to save I/O.
Concurrent access to relations (the B+-trees) is coordinated by optimistic latch coupling, heavily leveraging the versioned latches described earlier (p29-neumann-cidr20.pdf). Umbra uses a form of lock coupling (latch coupling) when traversing the tree: threads acquire latches on parent and child nodes in a disciplined order to maintain safety. Specifically, writers (inserts/deletes) acquire exclusive latches on nodes as needed – but Umbra’s append-only insertion means a writer usually only takes exclusive on the leaf it’s inserting into (or a new leaf and possibly a new inner node) without splitting existing nodes (p29-neumann-cidr20.pdf). In the rare case of adding a new node, the parent might be latched exclusive briefly to link the new child. Readers use shared latches when they reach a leaf node to read tuples, or when they need to load a child from disk (to pin it in memory) (p29-neumann-cidr20.pdf). During the tree traversal through inner nodes, Umbra employs optimistic latching: it will navigate down the tree acquiring optimistic latches (essentially reading the version) on each node, and only convert to a shared latch at the leaf or an exclusive latch if it needs to modify something (p29-neumann-cidr20.pdf). Each step of the traversal validates that the parent’s version hasn’t changed before proceeding down to the child. This is classic optimistic coupling: validate that the node you came from wasn’t modified (which would invalidate your route) (p29-neumann-cidr20.pdf).
One twist in Umbra’s B+-trees is the absence of sibling pointers, as mentioned. Because each page has only one parent reference (no multiple incoming pointers), leaf pages don’t know about their neighbors. To perform a full table scan (visit all leaf pages in order), Umbra cannot simply follow sibling links. Instead, Umbra keeps an optimistic latch on the parent node of the current leaf during a scan (p29-neumann-cidr20.pdf). When a thread finishes processing a leaf, it uses the still-optimistically-held parent to find the next child (leaf) in sequence (p29-neumann-cidr20.pdf). Essentially, the parent’s array of child pointers is used to get the next leaf. This works as long as the parent wasn’t modified; the optimistic latch on the parent is validated to ensure it’s still current (p29-neumann-cidr20.pdf). If it is valid, moving to the next leaf is just an array index increment and an access via the parent’s swip. If some concurrent operation split the parent or otherwise changed it (version change), the scan thread would detect that and perhaps re-traverse from a higher level to find where it left off. In practice, for large scans (typical OLAP), concurrent structural modifications are rare, and this method allows near-sequential leaf visiting without additional pointers. Thus, Umbra can scan a tree fairly efficiently even without sibling links (p29-neumann-cidr20.pdf).
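The following sketch illustrates the idea of stepping to the next leaf through the optimistically latched parent. It reuses the `VersionedLatch` and `Swip` types from the earlier sketches; the structures are hypothetical, and a real implementation would also re-traverse from higher levels when the parent itself is exhausted or modified:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical handle on the parent node, held optimistically during a scan.
struct ParentHandle {
    const VersionedLatch* latch;        // the parent node's versioned latch
    std::uint64_t version;              // version observed when the parent was reached
    const std::vector<Swip>* children;  // the parent's child swips, in key order
};

// Returns the swip of the next leaf, or std::nullopt if the parent is exhausted
// or was modified concurrently; in either case the scan re-locates its position
// by traversing down from a higher level again.
std::optional<Swip> nextLeaf(const ParentHandle& parent, std::size_t& childIndex) {
    if (childIndex + 1 >= parent.children->size())
        return std::nullopt;                           // no more children under this parent
    Swip next = (*parent.children)[childIndex + 1];
    if (!parent.latch->validate(parent.version))
        return std::nullopt;                           // parent changed: retry from above
    ++childIndex;
    return next;
}
```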
Overall, Umbra’s storage engine (buffer-managed relations) marries the B+-tree indexing benefits with the simplicity of append-only growth and flexible page sizes. By never splitting a page in the middle and by choosing page size per tuple, Umbra avoids many corner-case costs (like overflow chains or large object handling). The combination of optimistic and shared latches gives high concurrency. This architecture is a blend of ideas from in-memory systems (dense packing, sequential writes) and traditional storage (ARIES logging, etc.), tailored for SSD performance.
For transaction durability and crash recovery, Umbra relies on the established ARIES recovery algorithm (Write-Ahead Logging with redo/undo) (p29-neumann-cidr20.pdf). ARIES is largely compatible with our buffer manager even though we allow variable page sizes – log sequence numbers (LSNs) and redo/undo logic work per page as usual, and ARIES can handle the fact that different pages may have different sizes (p29-neumann-cidr20.pdf). However, one subtle issue arises with reusing disk space across pages of different sizes, which Umbra must handle carefully.
The problem is easiest explained by example: suppose we had a 128 KiB data page on disk (size class 1) occupying a certain region of the database file (p29-neumann-cidr20.pdf). Now assume that at runtime, that 128 KiB page is loaded into memory and then the tuples it contained are deleted, freeing that page. The system might reuse that freed space to store, say, two 64 KiB pages (size class 0) – perhaps we insert new data that happens to need only 64 KiB pages, and it finds that free 128 KiB slot and divides it into two smaller pages (p29-neumann-cidr20.pdf). Everything is fine in memory, and WAL records are generated for the deletion and the two insertions (with their page allocations). But imagine a crash occurs after the log is written for these actions but before the new 64 KiB pages are flushed to disk (p29-neumann-cidr20.pdf). During recovery, ARIES will see log records indicating that two new 64 KiB pages should exist in that file space. It will, at some point, attempt to redo operations on those pages, or at least retrieve their LSNs to decide whether they need redo. If ARIES goes to read the second 64 KiB page from disk at that location, it will actually be reading bytes of the old 128 KiB page (since we crashed before writing the new data) (p29-neumann-cidr20.pdf). Those bytes could contain what looks like a valid LSN or data from the deleted page, which could confuse recovery. Essentially, ARIES might interpret stale data as part of the new page, violating WAL consistency.
To avoid such scenarios, Umbra never reuses freed disk space for a page of a different size (p29-neumann-cidr20.pdf). If a 128 KiB page is freed (deleted), that 128 KiB slot on disk can only be reallocated to another 128 KiB page in the future. Space is managed per size class – effectively, we maintain separate free lists for each page size. This way, a crash recovery will never find a smaller “partial” page where a larger page used to be, eliminating the chance of reading mixed-up page contents (p29-neumann-cidr20.pdf). The trade-off is some internal fragmentation on disk (we might have free space in large chunks that can’t be used for smaller pages until perhaps the system is restarted or space is reorganized offline), but given typical workloads and large contiguous files on SSD, this is an acceptable price for correctness. ARIES can then recover Umbra’s database safely: it will never encounter an LSN belonging to an old page when expecting a different page size’s LSN.
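As an illustration of this constraint, the on-disk free-space bookkeeping can simply keep a separate free list per size class, along the following lines (a sketch with hypothetical names, not Umbra’s actual allocator):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Free-space manager that never reuses a freed slot for a different page size.
class DiskSpaceManager {
    std::unordered_map<unsigned, std::vector<std::uint64_t>> freeSlots_;  // size class -> free page numbers
    std::unordered_map<unsigned, std::uint64_t> nextPage_;                // size class -> next fresh page number

public:
    std::uint64_t allocate(unsigned sizeClass) {
        auto& free = freeSlots_[sizeClass];
        if (!free.empty()) {                 // reuse only within the same size class
            std::uint64_t pid = free.back();
            free.pop_back();
            return pid;
        }
        return nextPage_[sizeClass]++;       // otherwise grow this class's area
    }
    void release(unsigned sizeClass, std::uint64_t pageNumber) {
        freeSlots_[sizeClass].push_back(pageNumber);   // never migrates across classes
    }
};
```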
Aside from that, Umbra’s recovery operates like a standard ARIES-based system with write-ahead logging, checkpointing, and so on, which the paper doesn’t detail but references the ARIES paper (p29-neumann-cidr20.pdf). The key point is that Umbra proved ARIES can seamlessly support varying page sizes as long as we constrain space reuse appropriately (p29-neumann-cidr20.pdf). Thus, Umbra remains fully transactional with robust crash recovery, just like its in-memory predecessor HyPer (which also used ARIES for durability).
While the buffer manager is the centerpiece enabling Umbra’s in-memory performance with disk-based data, several other system components required adaptation or redesign compared to the pure in-memory HyPer (p29-neumann-cidr20.pdf). In this section, we discuss three important areas of change: string data handling, statistics collection, and the query compilation/execution model. Each of these was adjusted to better suit a disk-based, large-memory scenario (where data may be evicted from RAM), while still maintaining performance.
Strings and other large variable-length data types pose a challenge in a disk-based system because their values might be large and not always memory-resident. Umbra introduces a new string storage format to efficiently manage strings whether they are in memory or on disk. In Umbra, each string attribute is stored in two parts: a fixed-size 16-byte header, and a variable-size body that contains the actual string characters (p29-neumann-cidr20.pdf). The 16-byte header lives alongside other fixed-length attributes in the normal tuple storage (for example, in the fixed-field area of a PAX page) (p29-neumann-cidr20.pdf). The variable-length body is stored in a contiguous buffer, typically at the end of the page (so growing or shrinking the string body doesn’t disturb fixed fields) (p29-neumann-cidr20.pdf). Because Umbra’s buffer manager supports large pages, a very long string can be stored entirely within one page’s body section – Umbra does not need to fragment a long string across multiple pages or store overflow pointers as some systems do (p29-neumann-cidr20.pdf). It can simply allocate a larger page for a tuple that contains a very long string, keeping the whole string in one piece.
The 16-byte string header contains metadata and possibly a small string fragment depending on the string’s length (p29-neumann-cidr20.pdf). The layout is as follows (see Figure 4 in the paper): the first 4 bytes of the header store the length of the string (an unsigned 32-bit length, allowing strings up to 2^32 − 1 bytes) (p29-neumann-cidr20.pdf). The remaining 12 bytes are used in two ways. If the string is “short” (defined as 12 bytes or fewer), Umbra uses those 12 bytes to store the actual string content in-place in the header (p29-neumann-cidr20.pdf). This means short strings incur no extra pointer indirection or separate storage – they are retrieved directly from the header. This optimization is valuable because many string fields (e.g. names, codes) are short and can be handled like fixed-size data for speed. If the string is longer than 12 bytes, Umbra stores the string’s content out-of-line in the variable-size body (which could be elsewhere on the page or even outside the page), and in that case the header’s remaining 12 bytes are used differently: they hold either a pointer or an offset that locates the string’s body (p29-neumann-cidr20.pdf). Specifically, if the string’s body is stored within the same database page, Umbra stores an offset from the page start to the beginning of the string data (p29-neumann-cidr20.pdf). If the string’s body is stored somewhere else (e.g. in a separate heap or scratch area, which can happen for certain transient strings), then the header stores a full 64-bit pointer to that location (p29-neumann-cidr20.pdf). In either case, two bits of this pointer/offset are used to encode the storage class (discussed below). Additionally, when a string is out-of-line (long string), Umbra uses 4 of the header bytes (which are not needed for content in this case) to store a prefix of the string’s data – specifically, the first 4 characters (p29-neumann-cidr20.pdf). This acts as a small cache to speed up comparisons: for instance, when comparing two long strings, Umbra can compare the 4-byte prefixes first and avoid fetching the full string data unless the prefixes match (p29-neumann-cidr20.pdf). This trick helps short-circuit inequality checks in sorting or hashing.
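A possible C++ rendering of this 16-byte header is sketched below; the field order follows the description above, while the exact byte layout and the handling of the storage-class tag bits are assumptions:

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the 16-byte string header (not Umbra's definition).
struct StringHeader {
    std::uint32_t length;            // total string length in bytes
    char prefixOrInline[4];          // first 4 content bytes (short and long strings)
    union {
        char inlineRest[8];          // short string (<= 12 bytes): remaining content
        std::uint64_t ptrOrOffset;   // long string: pointer or in-page offset to the
                                     // body, with storage-class tag bits in the low bits
    };

    static constexpr std::uint32_t kInlineLimit = 12;
    bool isInline() const { return length <= kInlineLimit; }
};
static_assert(sizeof(StringHeader) == 16, "header must stay 16 bytes");

// Prefix shortcut for comparisons: if the lengths or the first four content
// bytes differ, the strings cannot be equal and the body need not be fetched.
// (Assumes unused header bytes are zero-initialized for very short strings.)
inline bool definitelyNotEqual(const StringHeader& a, const StringHeader& b) {
    return a.length != b.length ||
           std::memcmp(a.prefixOrInline, b.prefixOrInline, 4) != 0;
}
```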
A key complication in a disk-based system is that an out-of-line string’s body might not always be in memory. The pointer or offset in the header could dangle if the page holding the string body is evicted from memory (p29-neumann-cidr20.pdf). In an in-memory system like HyPer, a pointer to a string was always valid for the duration of a query, because nothing ever got paged out. In Umbra, however, we must be careful: a query could load a string from a page, then that page might be evicted (or the tuple deleted) before the query finishes, invalidating the pointer. To handle this, Umbra categorizes out-of-line strings into three storage classes that define their lifetime and validity (p29-neumann-cidr20.pdf). We encode the storage class in 2 bits of the pointer/offset value (taking advantage of alignment or high bits). The three classes are:
- Persistent strings: These are strings that remain valid for the entire uptime of the database (or until explicitly deleted). For example, a string constant in a query or a globally cached dictionary value might be considered persistent. If a string is marked persistent, the pointer/offset in its header will always remain valid; Umbra ensures that this memory is not reclaimed or reused for something else while the database is running (p29-neumann-cidr20.pdf). (If it is on a page, that page will not be evicted; if it is in a global area, it will not be freed.) Persistent is the longest-lived storage class.
- Transient strings: These are strings tied to the current unit of work (e.g. the current query or transaction) (p29-neumann-cidr20.pdf). They remain valid during query execution but may become invalid afterward. In Umbra, any string that comes from a relation (table) is treated as transient (p29-neumann-cidr20.pdf). This is because the page backing that string could be evicted or changed after the query, so the pointer cannot be held long-term. If the query needs to keep such a string (e.g. to materialize an output or to hash it for a join that outlives the page access), Umbra copies the string out to safe storage (such as a query-local buffer) to ensure it stays valid (p29-neumann-cidr20.pdf). Essentially, transient means “this pointer is fine for now, but don’t trust it after the current operation; copy it if you need it later.” Most strings from table data fall into this category because of possible eviction.
- Temporary strings: These are strings that are constructed during query execution and exist only as intermediate results (p29-neumann-cidr20.pdf). For example, the result of applying the `UPPER()` function to a string, or the concatenation of two strings in a SELECT, produces a new string at runtime. Umbra labels such values as temporary. They can be stored in query-managed memory (not necessarily in a database page) and can live as long as needed within the query, but ultimately they must be garbage-collected (freed) when they are no longer needed (p29-neumann-cidr20.pdf). Temporary strings essentially represent the typical “string values” of an expression evaluation engine, which are freed after processing.
By encoding these classes in the pointer/offset, Umbra’s execution engine knows how to treat the string. For instance, if it sees a transient string that is about to leave the context of its page (say, being added to a hash table for a join result), it will allocate space and copy the string to make it persistent or at least ensure a valid reference. Temporary strings might be moved to a persistent area if needed for output, or simply freed when done. Persistent strings need no special handling. This classification ensures that even if a page is evicted, any string that was being actively used either had been copied (if transient) or was known to be persistent (hence page wouldn’t be evicted) or was temporary (which we managed separately). In short, Umbra’s string handling provides in-page storage for efficiency, avoids page-crossing for long strings (thanks to big pages), and maintains correctness by tracking string lifetime. This is a significant improvement over HyPer’s approach, which could assume pointers were always valid; Umbra must do a bit more work but keeps it manageable and fast.
Accurate statistics are vital for the query optimizer to choose good execution plans. Umbra gathers extensive runtime statistics, but doing so in a disk-based context required rethinking how statistics are maintained and updated. Two key types of stats Umbra focuses on are random samples of tables and cardinality estimates via sketches.
Umbra continuously maintains a random sample of each relation for use in query optimization (p29-neumann-cidr20.pdf). In a pure in-memory system like HyPer, one approach was to periodically sample the table (perhaps at certain intervals or on certain updates). However, in a disk-based system, scanning a large table to produce a sample on-demand is prohibitively slow – you can’t afford to read an entire table from SSD just to refresh stats frequently (p29-neumann-cidr20.pdf). Even in HyPer, computing a fresh sample can be expensive, so HyPer would only update samples periodically, which meant the samples could become stale if the data changed a lot between updates (p29-neumann-cidr20.pdf). Stale stats can mislead the optimizer. Umbra’s solution is to implement a scalable online reservoir sampling algorithm (p29-neumann-cidr20.pdf) that updates the sample continuously as data is modified. Citing the paper, “Umbra tracks a number of statistics... First of all, a random sample of each relation is maintained...” and to do so efficiently, “Umbra… implement[s] a scalable online reservoir sampling algorithm that we recently developed [2]” (p29-neumann-cidr20.pdf). Reservoir sampling allows maintaining a true random sample in one pass and updating it as new data comes in, without scanning the whole table for each update. The referenced work [2] presumably describes a multi-core friendly algorithm for this. The result is that the optimizer always has an up-to-date random sample of each table with minimal overhead during inserts/updates (p29-neumann-cidr20.pdf). This sample can be used to estimate things like value distributions, selectivities, etc., in a way much more accurate than simplistic uniform assumptions.
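For illustration, the classic single-threaded reservoir-sampling algorithm (Algorithm R) is sketched below; the scalable multi-threaded algorithm from [2] that Umbra actually uses is considerably more involved and is not reproduced here:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Textbook reservoir sampling: after processing n tuples, every tuple is in
// the sample with probability capacity / n.
template <typename Tuple>
class ReservoirSample {
    std::vector<Tuple> sample_;
    std::size_t capacity_;
    std::uint64_t seen_ = 0;
    std::mt19937_64 rng_{42};

public:
    explicit ReservoirSample(std::size_t capacity) : capacity_(capacity) {}

    // Called for every inserted tuple; no rescan of the table is ever needed.
    void insert(const Tuple& t) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(t);
        } else {
            std::uniform_int_distribution<std::uint64_t> dist(0, seen_ - 1);
            if (std::uint64_t j = dist(rng_); j < capacity_) sample_[j] = t;
        }
    }
    const std::vector<Tuple>& sample() const { return sample_; }
};
```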
In addition to samples, Umbra maintains updateable HyperLogLog (HLL) sketches for each column (p29-neumann-cidr20.pdf). HyperLogLog is a probabilistic data structure that estimates the number of distinct values (cardinality) in a multiset, using very little memory. Umbra’s HLL sketches are maintained incrementally as data changes (there is prior work [6] by the authors on making HLL updateable) (p29-neumann-cidr20.pdf). According to the paper, these sketches “can provide almost perfect cardinality estimates on individual columns with moderate overhead” (p29-neumann-cidr20.pdf). This means that for any given column (attribute), Umbra’s optimizer can get a very accurate estimate of how many distinct values that column has, even as the data changes, without scanning the whole column. Furthermore, Umbra combines the use of HLL and sampling to get better multi-column and predicate selectivity estimates (p29-neumann-cidr20.pdf). For example, if you need the joint distribution of two columns (for a join or a multi-column predicate), a combination of per-column sketches and the sample can be used to derive an estimate (p29-neumann-cidr20.pdf). The authors have prior work on this topic (Every row counts [6]) that likely fed into Umbra.
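To make the idea concrete, here is a minimal insert-only HyperLogLog sketch (using the GCC/Clang `__builtin_clzll` intrinsic); the updateable variant from [6] that Umbra maintains under data modifications is not reproduced here:

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Minimal HyperLogLog with 2^12 registers; estimates the number of distinct
// hashed values inserted so far.
class HyperLogLog {
    static constexpr unsigned kP = 12;                 // 4096 registers
    std::array<std::uint8_t, 1u << kP> registers_{};

public:
    void insert(std::uint64_t hash) {
        std::uint32_t idx = static_cast<std::uint32_t>(hash >> (64 - kP));  // register index
        std::uint64_t rest = hash << kP;                                    // remaining bits
        std::uint8_t rank = rest == 0
            ? static_cast<std::uint8_t>(64 - kP + 1)
            : static_cast<std::uint8_t>(__builtin_clzll(rest) + 1);         // leading-zero rank
        if (rank > registers_[idx]) registers_[idx] = rank;
    }

    double estimate() const {
        const double m = 1u << kP;
        double sum = 0;
        for (auto r : registers_) sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^-r
        double alpha = 0.7213 / (1.0 + 1.079 / m);                                // bias correction
        return alpha * m * m / sum;                                               // raw HLL estimate
    }
};
```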
Together, the always-updated sample and sketches mean Umbra’s query optimizer has fresh and accurate statistics at its disposal without needing expensive stats recalculation jobs. This is particularly important in a disk-based system: you want to avoid large I/O just to maintain stats. Umbra’s approach ensures stats are piggy-backed on data modifications (which are happening anyway) and kept accurate. This helps Umbra choose better query plans than systems with stale or coarse stats. In the experiments, they note that MonetDB sometimes picks extremely bad plans due to poor stats, whereas Umbra (and HyPer) generally do much better (p29-neumann-cidr20.pdf).
Another statistic mentioned is that Umbra even maintains “numerical statistics [that] can estimate derived values of aggregates” (from the Umbra website) – possibly some linear regression or histogram for things like average values, etc., though the paper explicitly highlights sampling and HLL. In any case, Umbra’s optimizer inherits HyPer’s sophisticated costing but with improved runtime stats.
Umbra uses a compiling query execution engine, like HyPer, meaning that SQL queries are translated to machine code at runtime for execution (p29-neumann-cidr20.pdf). However, Umbra’s execution engine is more flexible and adaptive than HyPer’s, reflecting lessons learned and a desire to reduce compilation overhead. The high-level flow is: a logical query plan (from the SQL optimizer) is turned into a physical plan, which Umbra then dynamically translates into code for execution. The differences lie in how that code is represented and generated.
First, Umbra represents the physical plan as a modular state machine of pipelines and steps, rather than one monolithic piece of code (p29-neumann-cidr20.pdf). In HyPer, once the optimizer chose a plan, the code generator would produce one function per query (or per pipeline) that was a straight-line piece of C++/LLVM IR implementing the entire operation end-to-end (p29-neumann-cidr20.pdf). Umbra breaks the work into smaller units. It identifies pipelines in the plan – a pipeline is a sequence of operators ending in a blocking point (e.g. a pipeline ends when results must be materialized or when a hash join finishes building). Umbra further subdivides each pipeline into steps, which are smaller tasks that can be executed either by a single thread or in parallel by multiple threads (p29-neumann-cidr20.pdf). Essentially, each step is a piece of the pipeline state machine. For example, consider the query `SELECT count(*), s_nationkey FROM supplier GROUP BY s_nationkey`. The plan might have a pipeline that scans the `supplier` table and builds a hash table of groups, and a second pipeline that reads the hash table and produces the output. Umbra would generate something like the steps depicted in Figure 5 (p29-neumann-cidr20.pdf):
- In Pipeline 1: (Step 1) thread-local setup (e.g. initialize a thread-local hash table partition), (Step 2) thread-local aggregation (scan a portion of `supplier` and build local groups), (Step 3) allocate a global hash table, (Step 4) merge all thread-local hash tables into the global one, (Step 5) thread-local cleanup. These steps encompass the first pipeline of scanning and aggregating. Steps 1, 2, and 5 are run by each worker thread in parallel (marked as parallel steps), whereas Steps 3 and 4 are likely single-threaded (one thread allocates the global structure, one thread merges – the merge could also be parallelized, but the figure suggests a sequence).
- In Pipeline 2: (Step 6) thread-local setup (perhaps each thread prepares to scan a portion of the global hash table), (Step 7) each thread scans its part of the global hash table and writes out results, (Step 8) thread-local cleanup. Steps 6 and 8 are per-thread, and Step 7 is the main parallel scan of the groups.
The figure illustrates these pipelines and steps with arrows indicating transitions and which steps are parallel (p29-neumann-cidr20.pdf). The key point is that Umbra generates code on a per-step basis, not one giant function for the whole query. Each step can be thought of as a small piece of the query execution with a clear start and end state.
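One possible (hypothetical) representation of pipelines and steps is sketched below; Umbra generates IR for each step rather than using `std::function`, so this only illustrates the structure, with the merge step kept single-threaded as described above:

```cpp
#include <functional>
#include <string>
#include <vector>

// A step is a small unit of work with a parallelism flag; the callable is a
// stand-in for the code Umbra generates for the step.
struct Step {
    std::string name;
    bool parallel;                               // run by all workers vs. a single thread
    std::function<void(unsigned threadId)> run;
};

struct Pipeline {
    std::vector<Step> steps;                     // executed in order; parallel steps are
};                                               // fanned out to all worker threads

// Pipeline 1 of the example query, mirroring the step list above.
Pipeline buildAggregationPipeline() {
    return Pipeline{{
        {"thread-local setup",         true,  [](unsigned) { /* init local hash table */ }},
        {"thread-local aggregation",   true,  [](unsigned) { /* scan morsels, group locally */ }},
        {"allocate global hash table", false, [](unsigned) { /* size the global table */ }},
        {"merge local tables",         false, [](unsigned) { /* merge thread-local groups */ }},
        {"thread-local cleanup",       true,  [](unsigned) { /* free local state */ }},
    }};
}
```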
Next, Umbra uses a custom Intermediate Representation (IR) to generate code for each step, instead of directly emitting, say, x86 assembly or LLVM IR for the entire query. The Umbra IR is a lightweight, tailor-made set of instructions that capture the operations the step needs to perform (p29-neumann-cidr20.pdf). It is designed to be close to LLVM IR in semantics but much simpler, avoiding the general-purpose overhead that LLVM incurs (p29-neumann-cidr20.pdf). In HyPer, the code generator emitted LLVM IR and then invoked the LLVM compiler to optimize and JIT compile it. While effective, that approach can incur significant overhead: LLVM might run many optimization passes, and function compilation in LLVM isn’t trivial cost. Umbra’s IR, by being simpler and generated directly from the plan, streamlines the initial code generation.
However, Umbra does not always immediately compile its IR down to machine code (p29-neumann-cidr20.pdf). Instead, Umbra employs an adaptive execution strategy (as described in an ICDE 2018 paper by A. Kohn et al. [10]) to balance compilation time against execution time. When a query is first deployed, each step’s IR is interpreted using a built-in virtual machine (VM) interpreter (p29-neumann-cidr20.pdf). Umbra translates the IR instructions of the step into a compact bytecode and executes them with a fast interpreter. This has a very low latency: even very short queries can start running immediately without waiting for a JIT compile. As the query runs, Umbra monitors the performance of parallel steps – for example, it might count how many iterations or tuples each step is processing, or how long it’s taking (p29-neumann-cidr20.pdf). If a certain step is executed many times (e.g. scanning millions of tuples) or for a long duration, the system decides that the step is “hot” enough to justify the overhead of optimization and compilation (p29-neumann-cidr20.pdf). At that point, Umbra will compile that step’s IR into optimized machine code. It does this by converting its IR into LLVM IR (which is straightforward since Umbra IR was designed to closely mirror a subset of LLVM IR) (p29-neumann-cidr20.pdf), and then invoking the LLVM just-in-time compiler for that code (p29-neumann-cidr20.pdf). The resulting native code for the step is then executed for the remaining iterations of that step.
This adaptive JIT compilation gives Umbra the best of both worlds: minimal startup overhead (since many queries will run entirely interpreted if they are fast anyway), and near-C++ performance for long-running queries or heavy portions of queries (since those will get JIT-compiled on the fly) (p29-neumann-cidr20.pdf). The system strives to “optimize the tradeoff between compilation and execution time for each individual step” (p29-neumann-cidr20.pdf). Notably, if a query is very short or if a step doesn’t process much data, Umbra might never compile it at all, saving the compile time completely. If a step turns out to be a bottleneck, Umbra spends the effort to optimize it. This contrasts with HyPer, which always paid the cost of full compilation for every query plan, even if the query was trivial or the plan did very little work. The authors observed cases where HyPer spent significantly more time compiling a query than executing it (especially on simple queries), which is wasted effort (p29-neumann-cidr20.pdf). Umbra avoids that by defaulting to interpretation and only compiling where beneficial.
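The decision logic might look roughly like the following sketch; the threshold, counters, and hook names are hypothetical and only illustrate the interpret-first, compile-when-hot policy:

```cpp
#include <atomic>
#include <cstdint>

// Interpret a step's bytecode until it has processed enough tuples to justify
// compiling it to machine code; afterwards run the compiled version.
class AdaptiveStepExecutor {
    std::atomic<std::uint64_t> tuplesProcessed_{0};
    std::atomic<bool> compiled_{false};
    static constexpr std::uint64_t kCompileThreshold = 1'000'000;  // hypothetical

public:
    void processMorsel(std::uint64_t morselSize) {
        if (compiled_.load(std::memory_order_acquire)) {
            runCompiledStep(morselSize);                // optimized native code
            return;
        }
        std::uint64_t total = tuplesProcessed_.fetch_add(morselSize) + morselSize;
        if (total >= kCompileThreshold && !compiled_.exchange(true)) {
            compileStepToMachineCode();                 // Umbra IR -> LLVM IR -> native code
            runCompiledStep(morselSize);
            return;
        }
        interpretStepBytecode(morselSize);              // low-latency VM interpretation
    }

private:
    void interpretStepBytecode(std::uint64_t) { /* ... */ }
    void runCompiledStep(std::uint64_t) { /* ... */ }
    void compileStepToMachineCode() { /* ... */ }
};
```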
In summary, Umbra’s execution engine is pipeline-aware and adaptive. It introduces an IR and a VM to get running quickly, and an on-demand compiler to accelerate heavy workloads (p29-neumann-cidr20.pdf). This is often referred to as “adaptive execution of compiled queries” (p29-neumann-cidr20.pdf). The resulting performance is that Umbra can handle both simple and complex queries efficiently: simple queries don’t suffer huge compile overhead, and complex or long-running queries still run at top speed after the initial warm-up. The experiments confirm that Umbra’s approach outperforms HyPer on workloads with many small queries due to this adaptivity (p29-neumann-cidr20.pdf).
Another aspect of Umbra’s execution model is that it naturally supports parallel execution (utilizing multiple cores). The modular “steps” correspond well to a morsel-driven parallel model (each thread can work on a step for a partition of data). Umbra doesn’t need a separate vectorized engine or interpreter for parallelism – the same IR/VM approach works with multi-threading. The steps marked as parallel can be run by all worker threads, and Umbra coordinates at pipeline boundaries or with barrier steps (like the merge step). This is similar to HyPer’s morsel-parallel execution, but Umbra’s finer granularity might allow more dynamic scheduling or better load balancing for long queries.
SSD-aware architecture: Throughout Umbra’s design, there is an implicit assumption and optimization for modern SSD characteristics. Unlike spinning disks, SSDs handle random reads remarkably well (no seek penalty, just some latency that can be overcome with parallel requests). Umbra’s buffer manager and execution engine take advantage of this by allowing a high degree of parallelism in page access. In the large scan microbenchmark (see Experiments), Umbra’s query threads issue many parallel reads that end up hitting the SSD in a somewhat random pattern, yet the aggregated throughput is very high (over 1 GB/s) (p29-neumann-cidr20.pdf). Umbra does not enforce sequential scan order at the device level – threads can fetch whatever pages they need next, and the OS/SSD will handle it. This is fine on SSD and actually keeps the device busy with multiple outstanding reads. Additionally, Umbra’s use of `mmap`/`pread` means it is okay with the OS performing read-ahead or caching as needed, but since access is often pattern-less in OLAP, Umbra explicitly issues reads page by page in parallel. The replacement policy is also SSD-tuned: the “cooling” (FIFO) queue means a page, once used, isn’t immediately thrown out (which could cause churn of loading and evicting the same page repeatedly). It lingers in memory a bit in case of reuse, but if not reused, it eventually is evicted. This avoids inefficient thrashing, which is important even on SSD (too many writes or reads still cost performance). Moreover, Umbra does not attempt aggressive write aggregation – it relies on ARIES WAL for correctness and the OS for actual writing.
Umbra’s architecture is ready to scale with multiple SSDs as well. If you need more I/O throughput than a single NVMe can provide, you can place parts of the database on multiple SSDs. The introduction notes that with multiple SSDs you can get “excellent read bandwidths at a fraction of the cost of a pure DRAM solution” (p29-neumann-cidr20.pdf). In the experiments, they remark that the primary bottleneck for out-of-memory workloads is storage bandwidth, and that adding SSDs would allow Umbra to “rival the performance of a pure in-memory system even on workloads that do not fit into main memory alone” (p29-neumann-cidr20.pdf). This is a crucial point: Umbra can saturate the I/O available, so if you provision enough I/O (e.g., an array of SSDs), you can linearly increase the throughput of queries that go to disk. In other words, Umbra’s performance in the I/O-bound regime is only limited by hardware, not by software overhead.
Finally, Umbra’s compiled execution engine still benefits from many of HyPer’s innovations (like efficient data-centric code, SIMD usage, etc.), so when data is in memory it performs like a top-tier in-memory system. When data is on SSD, Umbra performs the necessary I/O but does so asynchronously and in parallel where possible, minimizing stalls. It also prepares for future hardware: the mention of open-channel SSDs (p29-neumann-cidr20.pdf) hints that Umbra’s design could leverage devices where the software can schedule flash reads more directly (perhaps scheduling around GC on SSD, etc.). So, Umbra’s architecture is SSD-aware in the sense that it assumes fast, parallel, random I/O is available and builds a system that can exploit it fully.
The Umbra paper provides an experimental evaluation with two parts: (1) a system comparison against HyPer and MonetDB to demonstrate Umbra’s performance on analytical workloads, and (2) microbenchmarks to isolate the impact of specific components like the buffer manager and I/O behavior.
For the comparative benchmarks, the authors use the Join Order Benchmark (JOB), a set of complex analytical queries over a snapshot of the IMDB dataset, and the standard TPC-H suite at scale factor 10 (p29-neumann-cidr20.pdf). They compare Umbra with HyPer (v0.6, the last research version of HyPer) and MonetDB (a well-known column store, v11.33) on the same hardware (p29-neumann-cidr20.pdf). The hardware is an 8-core/16-thread Intel Core i7-7820X with 64 GiB RAM and a Samsung 960 EVO NVMe SSD (500 GB) (p29-neumann-cidr20.pdf). Importantly, these tests were run with warm caches – each query was executed five times and the fastest run was reported (p29-neumann-cidr20.pdf). The comparison therefore measures raw query processing performance, with the data in memory or cached by the second run. This isolates the overhead of the DBMS itself rather than disk delays, and is a fair measure of the efficiency of Umbra’s techniques, since Umbra should behave like an in-memory system when caches are warm.
Umbra vs. HyPer: Umbra performs excellently compared to its in-memory predecessor HyPer (p29-neumann-cidr20.pdf). On JOB, Umbra was about 3.0× faster in the geometric mean of query runtimes (p29-neumann-cidr20.pdf); on TPC-H SF10, about 1.8× faster on average (p29-neumann-cidr20.pdf). These are significant speedups, and a major reason is Umbra’s adaptive compilation strategy (p29-neumann-cidr20.pdf). HyPer always compiles every query into optimized machine code, which is great for long queries, but for many short or moderate queries the compilation overhead is never paid back by the execution speedup. The paper notes that HyPer “actually spends far more time on query compilation than on query execution” for some queries – up to 29× more time compiling than running in extreme cases (p29-neumann-cidr20.pdf). Umbra avoids this wasted work by interpreting cheap queries or query parts, so those queries run much faster end-to-end. Essentially, Umbra does not pay a 100 ms compilation cost for a query that only takes 10 ms to run; HyPer would, making it far slower on that query. This effect is particularly pronounced in JOB, which has many queries with complex joins but comparatively little data to scan, so Umbra’s ability to skip unnecessary compilation yields a big win. For TPC-H, where each query does more heavy lifting, the gap is smaller, but Umbra still leads.
To be clear, if we ignore compilation time and look only at the actual query execution time (processing tuples), Umbra and HyPer are in the same ballpark. The authors note that, comparing raw execution times on warm data, Umbra’s times stay within roughly ±30% of HyPer’s on JOB and within roughly ±10% on TPC-H (p29-neumann-cidr20.pdf). On some queries Umbra is slightly slower, on others slightly faster, largely due to different plan choices or minor code-generation differences (p29-neumann-cidr20.pdf). In a few cases Umbra picked a worse join order than HyPer (despite having better statistics) because errors in HyPer’s optimizer happened to cancel out (p29-neumann-cidr20.pdf); there were also cases the other way around. The main takeaway: Umbra achieves per-query execution performance comparable to HyPer’s (no significant regression from adding a buffer manager and B+-tree access), and thanks to its better compilation policy it actually wins in end-to-end time for many queries. Achieving parity with HyPer’s in-memory execution was a design goal, and these results show Umbra met it.
Umbra vs. MonetDB: MonetDB is a disk-based column store known for good performance on analytical SQL, but it relies on an interpreted vectorized engine and an older query optimizer. Umbra significantly outperformed MonetDB on these benchmarks: the paper reports geometric-mean speedups of 4.6× on JOB and 2.3× on TPC-H in Umbra’s favor (p29-neumann-cidr20.pdf). Several factors contribute to MonetDB’s slower performance. Its query optimizer and MAL plan generation introduce substantial overhead for complex queries, especially on the join-order benchmark (p29-neumann-cidr20.pdf), and it occasionally makes very poor planning decisions (suboptimal join orders or methods) that lead to drastically slow executions on some queries (p29-neumann-cidr20.pdf). Even ignoring those outliers, MonetDB’s vectorized execution is not as efficient per tuple as Umbra’s or HyPer’s compiled code (p29-neumann-cidr20.pdf). In essence, Umbra’s modern code-generation approach and accurate statistics produce better plans and faster execution, so Umbra is markedly faster than MonetDB across the board. This is not surprising, since MonetDB is an older system; the point is to show that Umbra is not just competitive with HyPer but also a strong contender against other disk-based systems.
Figure 6 in the paper (not reproduced here) presents these comparisons as boxplots of per-query speedups (p29-neumann-cidr20.pdf). The median speedup of Umbra over HyPer is above 2×, and over MonetDB it is even higher, with some per-query variability. Importantly, none of the HyPer-vs-Umbra differences stem from having to fetch data from disk (caches were warm); they reflect purely software overheads – demonstrating that Umbra’s architectural changes did not slow it down and in fact improved overall performance.
In the second part of the evaluation, the authors design microbenchmarks to isolate how much overhead Umbra’s new mechanisms introduce and how well Umbra utilizes I/O.
Buffer Manager Overhead: To measure if the buffer manager or paged storage imposes a penalty compared to a pure in-memory approach, they run the same analytical queries under three storage configurations (p29-neumann-cidr20.pdf):
- Normal Umbra: with the buffer manager, eviction enabled (default behavior).
- No eviction: a variant of Umbra where they disable page eviction, effectively meaning once a page is loaded it stays in memory for the query. This removes the need for some overhead like materializing transient strings (since no page will disappear mid-query) (p29-neumann-cidr20.pdf).
- Flat memory (no buffer manager): an alternative engine implementation where relations are simply stored in flat memory-mapped arrays, bypassing the buffer manager and B+-trees entirely (p29-neumann-cidr20.pdf). In this mode, there’s no pin/unpin, no pointer swizzling, and no index lookup – a scan just iterates over an array of tuples. This is as close to “in-memory baseline” as one can get.
They run all JOB and TPC-H queries under modes (2) and (3) and compare execution times to mode (1), with all data in memory. They exclude optimization and compilation time from these comparisons to focus purely on execution speed (p29-neumann-cidr20.pdf) (the plans are the same in all modes). The result: very little difference. Disabling eviction changed query times by under 2% on average (p29-neumann-cidr20.pdf). Completely bypassing the buffer manager (and the B+-tree indirection) made at most a 6% average difference (p29-neumann-cidr20.pdf). In some cases the flat storage was slightly faster, in others Umbra was actually slightly faster (likely due to better cache behavior or alignment in the B+-tree pages) (p29-neumann-cidr20.pdf). The maximum observed change for any single query was about 30% (p29-neumann-cidr20.pdf), still modest, and such cases were outliers. The conclusion is stated clearly: “the buffer manager incurs negligible costs in Umbra.” (p29-neumann-cidr20.pdf) Even with an extra layer of pointer translation and tree traversal, Umbra’s execution is essentially as fast as if the data were laid out flat in memory. This validates the design choices (swizzling, versioned latches, etc.) as low-overhead on modern hardware, and it means that the complexity of variable pages and a B+-tree storage engine did not unduly slow down data access.
I/O Throughput and SSD Utilization: Next, they show that Umbra’s buffer manager can drive the SSD at full speed when needed. They construct a microbenchmark in which a large relation (250 million 8-byte floating-point values, about 2 GB of data) is stored on SSD (p29-neumann-cidr20.pdf). They then execute a query that computes the sum of all those values (essentially a full-scan aggregation) with cold caches, meaning neither the OS nor Umbra has the data in memory initially (p29-neumann-cidr20.pdf). They compare two scenarios: (a) reading the data through the buffer manager (normal Umbra), and (b) bypassing the buffer manager by memory-mapping the file and letting the OS page data in as the query reads it (the flat-memory scenario) (p29-neumann-cidr20.pdf). In case (b), Umbra relies on Linux’s demand paging of the memory-mapped file to bring data in (which typically involves page faults and kernel read-ahead). In case (a), Umbra explicitly issues pread() calls for each page via the buffer manager.
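The two read paths being compared can be sketched as follows. This is a toy, single-threaded reduction of the benchmark (the file name, page size, and minimal error handling are placeholder assumptions), contrasting OS demand paging via mmap with explicit pread() calls per page.

```cpp
// Toy scan that sums 8-byte floats from a file via (b) mmap and (a) pread.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>
#include <iostream>
#include <vector>

// (b) Map the file and let the kernel page data in on demand (page faults).
static double sumViaMmap(int fd, size_t bytes) {
    void* mapped = mmap(nullptr, bytes, PROT_READ, MAP_SHARED, fd, 0);
    if (mapped == MAP_FAILED) return 0.0;
    auto* data = static_cast<const double*>(mapped);
    double sum = 0;
    for (size_t i = 0; i < bytes / sizeof(double); ++i) sum += data[i];
    munmap(mapped, bytes);
    return sum;
}

// (a) Issue an explicit pread() per fixed-size chunk, as a buffer manager
// would do for each page it loads.
static double sumViaPread(int fd, size_t bytes, size_t pageSize = 64 * 1024) {
    std::vector<char> buf(pageSize);
    double sum = 0;
    for (size_t off = 0; off < bytes; off += pageSize) {
        ssize_t n = pread(fd, buf.data(), pageSize, static_cast<off_t>(off));
        for (ssize_t i = 0; i + (ssize_t)sizeof(double) <= n; i += sizeof(double)) {
            double v;
            std::memcpy(&v, buf.data() + i, sizeof(double));
            sum += v;
        }
    }
    return sum;
}

int main() {
    int fd = open("floats.bin", O_RDONLY);   // placeholder data file
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;
    std::cout << sumViaMmap(fd, st.st_size) << "\n";
    std::cout << sumViaPread(fd, st.st_size) << "\n";
    close(fd);
}
```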
The outcome: Umbra achieves a scan rate of 1.15 GiB/s with OS-controlled paging and 1.13 GiB/s with its own buffer manager driving the I/O (p29-neumann-cidr20.pdf). These numbers are essentially the same (within measurement noise), and 1.15 GiB/s is close to the maximum random-read throughput of the Samsung 960 EVO: the drive is considerably faster for large sequential reads, but a workload of many concurrent, effectively random page-sized reads lands well below that peak. The important insight is that Umbra’s explicit I/O does not bottleneck or fall behind OS paging; the buffer manager fully utilizes the available I/O bandwidth (p29-neumann-cidr20.pdf). The scan in Umbra is highly parallel (multiple threads each read their partition of the data with no enforced order), resulting in effectively random I/O, but the SSD handles it well and Umbra keeps all threads busy. The paper notes that read accesses happen “essentially at random due to the highly parallelized nature of query execution in Umbra which imposes no inter-thread constraints on the order pages are processed” (p29-neumann-cidr20.pdf). This highlights the SSD-aware approach: threads do not wait in line; they all fetch different pages concurrently, turning the workload into a stream of random reads that the SSD serves in parallel.
To double-check, they also ran some benchmarks with artificially small buffer pool sizes to force Umbra to do actual I/O during the JOB and TPC-H queries, and observed that Umbra becomes I/O-bound and saturates the device similarly in those cases (p29-neumann-cidr20.pdf). No internal locking or algorithm in Umbra prevented it from scaling to the device throughput.
The conclusion from these microbenchmarks is twofold: (1) Umbra’s overheads (pointer swizzling, latching, etc.) add virtually nothing to query time when data is in memory, and (2) when data is not in memory, Umbra can drive modern storage to its limits, so performance is bound by hardware (disk speed) rather than software inefficiency (p29-neumann-cidr20.pdf). They explicitly state that the primary bottleneck is the storage throughput, and that by putting multiple SSDs in a system, one could linearly increase throughput, allowing Umbra to approach pure-memory performance even for larger-than-memory workloads (p29-neumann-cidr20.pdf). This validates the initial premise: with SSDs being so fast nowadays, a well-designed system can utilize them such that the gap between in-memory and disk-based execution is narrowed to a minimum.
Umbra is built upon a rich history of database system research. As described, it is essentially a re-engineering of our earlier HyPer system to remove the main-memory limitation (p29-neumann-cidr20.pdf). In that sense, Umbra aligns with arguments made by Dave Lomet and others that pure in-memory systems, while very fast, are not cost-effective or scalable for all use cases (p29-neumann-cidr20.pdf). Lomet’s piece [15] cited in the paper calls out the economic imbalance of memory vs storage, which Umbra directly addresses by introducing a disk-based architecture without losing much performance. Umbra’s philosophy is to combine the best of in-memory and disk-based designs, which relates to prior academic debates on memory-centric systems (e.g., Hekaton (p29-neumann-cidr20.pdf), SAP HANA (p29-neumann-cidr20.pdf)) versus hybrid systems.
The buffer manager in Umbra is heavily inspired by LeanStore (p29-neumann-cidr20.pdf). LeanStore (V. Leis et al., 2018 [14]) was a research prototype of a modern, scalable storage engine which introduced techniques like pointer swizzling in a hierarchical structure and optimistic latching – both of which Umbra uses. Umbra’s buffer manager shares many properties with LeanStore (e.g., the cooling technique for eviction, the latch design) (p29-neumann-cidr20.pdf). The paper defers to the LeanStore publication for a detailed survey of related work on buffer management and storage management in databases (p29-neumann-cidr20.pdf). Key relevant works likely include the classic literature on B+-trees and buffer replacement strategies, as well as modern variants like MMAP-based storage or OS direct paging. LeanStore itself was an evolution of these ideas optimized for many cores and NVMe, showing that a storage manager can be almost as fast as in-memory. Umbra takes LeanStore’s concepts and integrates them deeply into a complete DBMS with code generation.
One distinguishing feature of Umbra is its variable-size pages. According to the paper, the only other system that clearly supported variable page sizes in the buffer pool was a buffer manager for the ADABAS database system decades ago (p29-neumann-cidr20.pdf). That design only allowed two different page sizes and required segregating them into separate pools (with memory pre-allocated for each) (p29-neumann-cidr20.pdf). This static partitioning is inflexible – you have to guess how many large-page versus small-page objects you will have, and if guessed wrong, one pool could overflow while space in the other is free (p29-neumann-cidr20.pdf). Umbra’s approach is much more flexible, allowing many size classes and dynamically using memory wherever needed. By using OS virtual memory tricks, Umbra doesn’t have to dedicate fixed memory to each size class (p29-neumann-cidr20.pdf). To our knowledge (and the authors’ knowledge), Umbra is unique in this capability among high-performance systems. Most other systems stick to a single page size (often 4 KB or 8 KB) and handle larger objects at a higher layer (overflow pages, large object containers, etc.), which adds complexity and overhead. Umbra essentially pushes that complexity down into the OS/hardware level (which is efficient at handling memory fragmentation) and simplifies the DB logic.
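A minimal sketch of the kind of virtual-memory reservation this refers to, assuming Linux mmap semantics: each size class reserves a large anonymous region with MAP_NORESERVE (address space only), physical memory is allocated lazily on first touch, and eviction returns it to the OS with madvise(MADV_DONTNEED) while keeping the address range reserved. The layout, sizes, and names are illustrative, not Umbra’s implementation.

```cpp
// Per-size-class virtual address reservation with lazy physical backing.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

struct SizeClass {
    std::byte* base = nullptr;  // start of the reserved virtual region
    size_t pageSize = 0;        // page size of this class (e.g. 64 KiB, 1 MiB, ...)
    size_t capacity = 0;        // number of page slots in the region
    size_t next = 0;            // next never-used slot (free list omitted for brevity)

    bool reserve(size_t pageSz, size_t slots) {
        pageSize = pageSz;
        capacity = slots;
        // MAP_NORESERVE: reserve address space only; physical memory is
        // allocated lazily when a slot is actually written (or read into).
        void* p = mmap(nullptr, pageSz * slots, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) return false;
        base = static_cast<std::byte*>(p);
        return true;
    }
    std::byte* allocatePage() {            // hand out a fresh slot
        return next < capacity ? base + (next++) * pageSize : nullptr;
    }
    void releasePage(std::byte* page) {    // drop physical memory, keep the address
        madvise(page, pageSize, MADV_DONTNEED);
    }
};

int main() {
    SizeClass small, large;
    if (!small.reserve(64 * 1024, 1u << 20)) return 1;  // 64 KiB pages, 64 GiB of address space
    if (!large.reserve(1u << 20, 1u << 16)) return 1;    // 1 MiB pages, 64 GiB of address space
    std::byte* p = small.allocatePage();
    p[0] = std::byte{42};                  // first touch backs the slot with RAM
    small.releasePage(p);                  // eviction returns the RAM to the OS
    std::printf("reserved two size classes without dedicating physical memory\n");
}
```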
In terms of compiled query execution and adaptive JIT, Umbra builds on prior work such as HyPer’s original code-generation approach (p29-neumann-cidr20.pdf) and the later work on adaptive execution by Kohn et al. (ICDE 2018, [10]) (p29-neumann-cidr20.pdf). Other systems have explored adaptive or tiered execution as well: e.g., lightweight query execution in SQL Server, adaptive compilation in HyPer’s commercial successor, or managed runtimes such as the JVM that interpret first and JIT-compile hot code later. The idea of breaking queries into subcomponents (as Umbra’s pipelines and steps do) has connections to Volcano-style iterators and vectorized execution segments, but Umbra’s twist is doing it in a compiled context with runtime decisions. The adaptive-execution paper by Kohn, Leis, and Neumann [10] is directly cited as the source of this idea (p29-neumann-cidr20.pdf).
Also related is work on cache-conscious storage layouts such as PAX and Data Blocks. Umbra’s use of a PAX layout for in-memory efficiency draws on the PAX paper by Ailamaki et al. [1] (p29-neumann-cidr20.pdf). The mention of Data Blocks for cold data refers to research by H. Lang et al. on storing data in a compressed, more columnar format for I/O efficiency (p29-neumann-cidr20.pdf) (reference [11] is likely that Data Blocks paper on compressed storage for hybrid OLTP/OLAP workloads). Umbra aims to incorporate such ideas in future work on cold-storage management.
In summary, Umbra synthesizes ideas from in-memory systems (HyPer, Hekaton), storage managers (LeanStore, classical DB buffers), and adaptive execution. The related work section emphasizes that the concept of a high-performance disk-based system that behaves like an in-memory system is timely and that Umbra’s specific contributions (especially the variable-size page manager) fill a gap in existing designs (p29-neumann-cidr20.pdf). By citing ADABAS and LeanStore, they position Umbra as more flexible and modern. Umbra essentially contributes to the lineage of research that argues for architecture-aware database design – using OS capabilities and hardware trends (fast SSD, lots of virtual memory, many cores) to rethink how a disk-based DBMS should be built.
The CIDR 2020 paper presented Umbra, a new database system that evolves the pure in-memory HyPer into an SSD-based system without losing performance (p29-neumann-cidr20.pdf). The key takeaway is that by using a novel low-overhead buffer manager with variable-size pages, Umbra can achieve in-memory speeds when the workload fits in RAM, and gracefully transitions to disk I/O when data grows larger than memory (p29-neumann-cidr20.pdf). The design choices – from using large virtual address reservations to pointer swizzling, optimized latching, and adaptive compilation – all serve the goal of maintaining efficiency in the common case and avoiding catastrophic slowdowns in the uncommon case. The paper demonstrates that Umbra indeed hits this goal: the experiments showed it matching or beating an in-memory system on performance and effectively saturating SSD bandwidth when needed, with negligible internal overhead.
In conclusion, Umbra represents a “novel breed of high-performance data caching systems.” (p29-neumann-cidr20.pdf) It bridges the gap between in-memory and disk-based database architectures. The insights from Umbra’s design can guide future systems: for example, how to leverage OS virtual memory for flexible storage management, how to combine compilation and interpretation for efficiency, and how to structure on-disk data structures (like B-trees) to work harmoniously with such a system. As hardware evolves (NVRAM, faster SSDs, etc.), Umbra’s approach of treating I/O as an extendable layer of memory (with the OS and hardware helping to blur the line) is likely to become standard in database system design. Umbra shows that you can have your cake and eat it too: enjoy the blazing speed of in-memory processing and the scalability of disk-based storage in one system (p29-neumann-cidr20.pdf). The system is fully implemented, and ongoing work (since 2020) has likely further improved on these ideas, but the paper’s contributions – especially the variable-size page buffer manager – stand as a significant innovation in database storage engines.
Ultimately, Umbra demonstrates that with careful engineering, a disk-based DBMS can “achieve in-memory performance if the entire working set fits into RAM, while at the same time fully utiliz[ing] the available I/O bandwidth if data has to be spilled to disk.” (p29-neumann-cidr20.pdf) This is a compelling result that moves the field beyond the dichotomy of in-memory vs. on-disk systems, suggesting that the future is hybrid and adaptive. The lessons from Umbra are useful not only for building new systems but also for adapting existing in-memory databases to handle larger-than-memory datasets efficiently. The authors invite us to see Umbra as a guideline for designing such high-performance caching systems going forward (p29-neumann-cidr20.pdf), and indeed many of these concepts are finding their way into modern database engines.