@dt
Created May 11, 2026 23:13
Per-SSTable MVCC Upper Time Bound: Design Evaluation

Context

PCR cutover today uses RevertRange RPCs that iterate every key in the cluster (modulo TBI block skipping) to find and tombstone writes newer than time T. This is O(changed data) in both reads and writes. The idea: attach an MVCC upper time bound to each SST via a single manifest edit, instantly masking all keys > T from all iterators. No loading blocks, no writing tombstones, no compaction of those tombstones — just one manifest write per store.

A secondary use case — cluster fork — motivates designing the feature to work on SSTs from live traffic (with intents), not just PCR's intent-free AddSSTable data.

Correctness Argument

Preconditions

  1. T < QueryResolvedTimestamp(span): The caller picks T below the resolved timestamp for the tenant's key span. The resolved timestamp guarantees that all intents at or below T have been fully resolved — lock table keys removed, provisional values either promoted to their commit timestamp or deleted.

  2. Memtable flush after picking T: All memtables on all stores are flushed before applying the SST-level bound. This ensures intent resolution writes (which were applied via Raft before the resolved timestamp could reach T) are materialized in SSTs, not stranded in memtables.

  3. MVCCUpperBound = T applied to every SST overlapping the span: Applied uniformly via a single VersionEdit per store. SSTs straddling the span boundary are first virtualized, with the bound applied only to the virtual SST covering the tenant's portion.

What the filter does

The MVCCUpperBound is a per-SST transform (like HideObsoletePoints or SyntheticSuffix) that masks point keys whose MVCC timestamp suffix is > T. It operates at two levels:

  • Block level: Existing MVCCTimeInterval block properties track [min_walltime, max_walltime] per block. Entire blocks where min_walltime > T.WallTime are skipped without loading.
  • Row level: In the cockroachKeySeeker, mvccWallTimes.At(row) and mvccLogical.At(row) are compared directly against T — two uint64 comparisons, no key materialization. Rows exceeding T are skipped.

Keys with non-MVCC suffixes (lock table keys with 17-byte version, unversioned meta keys) are never filtered — the suffix type check excludes them.
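The row-level check described above can be sketched as follows. This is a minimal model, not the Pebble implementation: the type and function names are hypothetical, and the suffix lengths (8/12/13 bytes for MVCC timestamps, 17 for lock table versions) follow the classification stated in this document. A wall-time-only suffix carries no logical component, which the sketch treats as logical 0.

```go
package main

import "fmt"

// Bound is a hypothetical decoded form of the MVCC upper time bound T.
type Bound struct {
	WallTime uint64
	Logical  uint32
}

// rowVisible reports whether a row with the given suffix length and decoded
// MVCC timestamp passes the upper bound. Non-MVCC suffixes (17-byte lock
// table versions, unversioned meta keys) are never filtered.
func rowVisible(suffixLen int, wall uint64, logical uint32, t Bound) bool {
	isMVCC := suffixLen == 8 || suffixLen == 12 || suffixLen == 13
	if !isMVCC {
		return true // lock table and unversioned keys pass through
	}
	if wall != t.WallTime {
		return wall < t.WallTime
	}
	return logical <= t.Logical
}

func main() {
	t := Bound{WallTime: 100, Logical: 0}
	fmt.Println(rowVisible(12, 99, 5, t))  // MVCC key below T: visible (true)
	fmt.Println(rowVisible(12, 101, 0, t)) // MVCC key above T: masked (false)
	fmt.Println(rowVisible(17, 200, 0, t)) // lock table key: never filtered (true)
}
```

In the real seeker this would be two `uint64` comparisons against the decoded `mvccWallTimes`/`mvccLogical` columns, with no key materialization.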

Why committed values at ≤ T are correct

By precondition (1), T < resolved TS means no unresolved intents exist at or below T. Therefore every MVCC point key at timestamp ≤ T in the global keyspace is a definitively committed value. No provisional values with ambiguous commit status exist at ≤ T — they've all been resolved:

  • Committed, no timestamp push: Lock table key deleted. Provisional value at write timestamp stays as the committed value. It's at ≤ T and visible through the filter. Correct.
  • Committed, timestamp pushed from T-5 to T-2 (both ≤ T): Intent resolution wrote a DEL at K@T-5 and a SET at K@T-2. Both are at ≤ T, both visible. The DEL shadows the stale provisional. The committed value at T-2 is visible. Correct.
  • Committed, timestamp pushed from T-5 to T+3 (crosses T): Intent resolution wrote a DEL at K@T-5 and a SET at K@T+3. The DEL at T-5 is visible (≤ T) and shadows the stale provisional. The SET at T+3 is masked (> T). Net: key not visible. Correct — the transaction's committed timestamp is after T.
  • Aborted: Intent resolution wrote a DEL at the provisional value's timestamp and deleted the lock table key. The DEL shadows the stale provisional. Key not visible. Correct.
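The four cases above can be checked with a toy model (all names hypothetical). Each engine write is a Pebble-level SET or DEL at an MVCC wall time with a sequence number; the SST-level bound masks entries with wall time > T, and seqnum shadowing resolves a DEL over an older SET at the same version:

```go
package main

import "fmt"

// write models one Pebble point operation on a single key.
type write struct {
	wall  uint64 // MVCC wall time suffix
	seq   int    // Pebble sequence number
	isDel bool   // point deletion (resolution DEL) vs SET
	value string
}

// readAt returns the value visible at time T, or "" if none: mask wall > T,
// resolve each remaining version by highest seqnum, then return the newest
// surviving SET.
func readAt(writes []write, t uint64) string {
	latest := map[uint64]write{}
	for _, w := range writes {
		if w.wall > t { // masked by the SST-level upper bound
			continue
		}
		if cur, ok := latest[w.wall]; !ok || w.seq > cur.seq {
			latest[w.wall] = w
		}
	}
	var best write
	found := false
	for wall, w := range latest {
		if w.isDel {
			continue // version physically deleted by intent resolution
		}
		if !found || wall > best.wall {
			best, found = w, true
		}
	}
	if !found {
		return ""
	}
	return best.value
}

func main() {
	// Committed, pushed from T-5 to T+3 (crosses T=100): the DEL@95 shadows
	// the stale provisional, the SET@103 is masked. Key not visible.
	crossing := []write{
		{wall: 95, seq: 1, value: "prov"},
		{wall: 95, seq: 2, isDel: true},
		{wall: 103, seq: 3, value: "v"},
	}
	fmt.Println(readAt(crossing, 100) == "") // true
}
```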

Why intent resolution writes in different SSTs compose correctly

Intent resolution is async and its writes often land in a different SST than the original provisional value. This is fine because the merge iterator handles cross-SST interactions:

  • SST A (flushed early): contains stale provisional value K@T-5 (a SET)
  • SST B (flushed later, after resolution): contains DEL at K@T-5

The DEL in SST B has a higher sequence number than the SET in SST A. The merge iterator correctly resolves the DEL over the SET. Both keys have the same MVCC suffix (T-5), so both pass or fail the time bound filter identically. The shadowing works regardless of which SSTs the keys land in.

Precondition (2) — flushing memtables — guarantees that all resolution writes are in SSTs. The resolved timestamp could not have reached T unless the resolution Raft commands were applied, which means the resolution writes are in the engine. After flush, they're in SSTs and participate in the merge.

Why the dual-iterator pattern is not needed

The MVCCIncrementalIterator uses a dual-iterator pattern (filtered TBI + unfiltered main iterator) for two reasons:

  1. TBI was historically an optimization hint, not a correctness guarantee — the unfiltered iterator was the source of truth.
  2. MVCCClearTimeRange needs to see older versions below the cleared range to compute correct MVCC stats when writing tombstones (mvcc.go:3394-3405).

Neither applies here:

  • The SST-level bound is a property of the file, not an iterator option. There is no "unfiltered view" — the file's metadata says keys above T don't exist, the same way virtual SST bounds or HideObsoletePoints work. All iterators see the same filtered view.
  • No tombstones are written, so no stats delta computation is needed. Stats can be recalculated lazily or during compaction.
  • The merge iterator composes per-SST filtered views naturally, identical to how virtual SSTs with different bounds already work in Pebble.

Why lock table keys are unaffected

Lock table keys have 17-byte suffixes (lock strength + txn UUID), not MVCC timestamp suffixes (8, 12, or 13 bytes). The cockroachKeySeeker classifies rows by suffixTypes; the MVCC upper bound check only applies to rows with MVCC timestamp suffixes. Lock table keys pass through unfiltered.

By precondition (1), all intents at ≤ T have been resolved, so any remaining lock table keys are either:

  • For intents at > T (their lock table DELs and provisional value DELs are also at > T and may or may not be visible, but the intent itself is above our snapshot point — irrelevant)
  • Already shadowed by their resolution DELs in a newer SST

Why transaction records are not needed

Transaction records are range-local keys at \x01k{anchorKey}{txnSuffix}{txnID} (keys/keys.go:544-546), stored outside the tenant's global key span [/Tenant/N/, /Tenant/N+1/).

For the cluster fork use case (which copies only global keyspace), txn records are not available. This is acceptable because precondition (1) — T < resolved TS — guarantees no unresolved intents exist at ≤ T. There is nothing to resolve, so txn records are never consulted.

What about the resolved timestamp itself?

The resolved timestamp is not persisted. It is computed as min(closedTS, oldest_unresolved_intent_timestamp - 1). Two ways to obtain it:

  1. Active rangefeed: ScheduledProcessor maintains it in-memory per range, updated as Raft applies intent write/commit/abort ops. Published via RangeFeedCheckpoint. State: per-processor, in-memory, not persisted, reconstructed on rangefeed start by scanning the lock table.
  2. On-demand: QueryResolvedTimestamp RPC (cmd_query_resolved_timestamp.go:43-88) computes it by scanning the lock table and combining with the closed timestamp. No active rangefeed needed.

The resolved timestamp typically trails "now" by seconds (bounded by the oldest active transaction's write timestamp), so the constraint T < resolved TS is not restrictive in practice.
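The computation can be sketched with plain integers (a hypothetical simplification: real HLC timestamps carry a logical component and the implementation steps back with `Prev()` rather than subtracting 1):

```go
package main

import "fmt"

// resolvedTimestamp sketches min(closedTS, oldest_unresolved_intent - 1).
// haveIntent is false when the lock table scan found no unresolved intents,
// in which case the closed timestamp itself is resolved.
func resolvedTimestamp(closedTS, oldestIntent uint64, haveIntent bool) uint64 {
	if !haveIntent {
		return closedTS
	}
	if oldestIntent-1 < closedTS {
		return oldestIntent - 1
	}
	return closedTS
}

func main() {
	fmt.Println(resolvedTimestamp(100, 90, true)) // 89: bounded by the oldest intent
	fmt.Println(resolvedTimestamp(100, 0, false)) // 100: no intents, closed TS wins
}
```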

Known limitations

Pebble RANGEDELs (non-MVCC): ClearRange operations produce Pebble-level range deletions with no MVCC suffix. These pass through the suffix filter and shadow keys that should be visible at T. This is a pre-existing limitation of any MVCC time-travel mechanism — RevertRange can't undo a ClearRange either, since the data is physically gone rather than tombstoned at a higher MVCC timestamp. Not a practical concern for PCR or fork use cases where destructive non-MVCC operations on the tenant's span are not expected.

MVCC range keys (range tombstones): Must also be filtered. An unfiltered MVCC range tombstone at T+5 would shadow committed point keys at T-3 that should be live after revert — this is a correctness issue, not a deferrable optimization. The reason TBI disabled range key filtering (issue #86260) was specific to the dual-iterator pattern in MVCCIncrementalIterator — the TBI and main iterator seeing different range key fragmentation. That concern does not apply here: the SST-level bound is a property of the file (not an iterator option), there is no dual iterator, and the bound is applied uniformly to all SSTs. All iterators see the same filtered range keys.

Implementation: the range key iterator within each SST must skip range key entries (RangeKeySet, RangeKeyUnset, RangeKeyDelete) whose suffix compares > T via ComparePointSuffixes. Range key entries with no suffix or with suffix ≤ T pass through unfiltered.
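A sketch of that range key check, with timestamps already decoded (a hypothetical simplification: the real check would compare encoded suffix bytes via ComparePointSuffixes):

```go
package main

import "fmt"

// rangeKeyEntryVisible reports whether a RangeKeySet/Unset/Delete entry
// passes the bound T. Entries with no suffix are non-MVCC range keys and
// always pass, per the rule above.
func rangeKeyEntryVisible(hasSuffix bool, wall uint64, logical uint32, tWall uint64, tLogical uint32) bool {
	if !hasSuffix {
		return true // non-MVCC range key: never filtered
	}
	return wall < tWall || (wall == tWall && logical <= tLogical)
}

func main() {
	// An MVCC range tombstone at T+5 is masked; one at T-3 stays visible,
	// so committed point keys at T-3 remain shadowed only where they should be.
	fmt.Println(rangeKeyEntryVisible(true, 105, 0, 100, 0)) // false: masked
	fmt.Println(rangeKeyEntryVisible(true, 97, 0, 100, 0))  // true: visible
	fmt.Println(rangeKeyEntryVisible(false, 0, 0, 100, 0))  // true: no suffix
}
```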

Existing Building Blocks in Pebble

1. Per-SST Transforms via blockiter.Transforms

pebble/sstable/blockiter/transforms.go:20-30

Already flows from TableMetadata → Reader → SingleLevelIterator → DataBlockIter. Contains SyntheticSeqNum, HideObsoletePoints, SyntheticPrefixAndSuffix. This is where a new "suffix upper bound" would live.

2. HideObsoletePoints — The Exact Skipping Pattern We'd Replicate

pebble/sstable/colblk/data_block.go:1680,1845-1858

Uses a per-block bitmap (isObsolete) and SeekUnsetBitGE/SeekSetBitGE to efficiently skip rows in columnar blocks. Block-level filtering via obsoleteKeyBlockPropertyFilter skips entire blocks where all points are obsolete. Two-level filtering (block + row) is already established.

3. cockroachKeySeeker — Direct Columnar Access to MVCC Timestamps

pebble/cockroachkvs/cockroachkvs.go:765-799

Has mvccWallTimes colblk.UnsafeUints and mvccLogical colblk.UnsafeUints as decoded columns. Filtering would be two uint64 comparisons per row — no key materialization, no byte slicing. Drastically cheaper than the current SkipPoint callback which operates on materialized key bytes.

4. SyntheticPrefixAndSuffix on TableMetadata — Manifest Persistence Model

pebble/internal/manifest/table_metadata.go:204
pebble/internal/manifest/version_edit.go:91 (customTagSyntheticSuffix = 68)

Already stores per-SST suffix transforms in the manifest with custom tags. A new upper bound field would use the same pattern (new custom tag, new field).

5. Block Property Filters — Block-Level MVCC Filtering

pebble/cockroachkvs/blockproperties.go:15-87

MVCCTimeInterval block property collector tracks [min_walltime, max_walltime] per block. NewMVCCTimeIntervalFilter already filters blocks by time range. The existing SyntheticSuffixIntersects method handles suffix replacement.

6. Virtual SSTs — Per-SST Bounds Over Shared Physical Files

pebble/internal/manifest/table_metadata.go:199-200

Virtual SSTs already support per-file key bounds, synthetic prefix/suffix, and are created via single VersionEdit operations. Splitting an SST that straddles a tenant boundary is already a solved problem.

Proposed Implementation Layers

Layer 1: Manifest Metadata

  • New field on TableMetadata: MVCCUpperBound (wall time + logical as uint64s, or encoded MVCC suffix bytes)
  • New custom tag in version_edit.go (next available, ~70)
  • Applied via VersionEdit with DeletedTables + NewTables swapping the same backing file but adding the bound
  • For SSTs straddling tenant boundaries: first virtualize, then apply bound to the tenant-span virtual SST only

Layer 2: Block-Level Filtering

  • When MVCCUpperBound is set on a table, automatically construct an MVCCTimeIntervalFilter(0, upperBound.WallTime+1) and attach it as a PointKeyFilter on every iterator opened on that SST
  • This is NOT via IterOptions.PointKeyFilters (which is per-iterator from the caller) but via the reader/file_cache layer that already handles per-SST transforms
  • Entire blocks where min_walltime > upperBound.WallTime are skipped without loading
  • Blocks where max_walltime <= upperBound.WallTime need no row-level filtering
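The two bullets above imply a three-way classification per block, sketched here with hypothetical names. Note the interval is wall-time granular: "max_walltime <= T" is exact only up to the bound's logical component, so a block whose max wall time equals T's wall time still needs row-level checks when the bound carries a nonzero logical part.

```go
package main

import "fmt"

// blockDecision is the outcome of checking a block's MVCCTimeInterval
// property [min_walltime, max_walltime] against the bound's wall time.
type blockDecision int

const (
	skipBlock  blockDecision = iota // min_walltime > T: skip without loading
	passWhole                       // max_walltime <= T: no row-level filtering
	filterRows                      // block straddles T: filter row by row
)

func classifyBlock(minWall, maxWall, boundWall uint64) blockDecision {
	switch {
	case minWall > boundWall:
		return skipBlock
	case maxWall <= boundWall:
		return passWhole
	default:
		return filterRows
	}
}

func main() {
	fmt.Println(classifyBlock(150, 200, 100) == skipBlock) // true
	fmt.Println(classifyBlock(10, 90, 100) == passWhole)   // true
	fmt.Println(classifyBlock(50, 150, 100) == filterRows) // true
}
```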

Layer 3: Columnar Row-Level Filtering in cockroachKeySeeker

  • New field in Transforms: something like SuffixUpperBound []byte (encoded MVCC timestamp) or a pair of (wallTime uint64, logical uint32)
  • In cockroachKeySeeker, during iteration: compare mvccWallTimes.At(row) and mvccLogical.At(row) against the bound
  • Skip rows exceeding the bound using the same pattern as HideObsoletePoints
  • Two options for the skip mechanism:
    • Option A: Build a bitmap during block init (like isObsolete). Pro: amortized cost, uses existing SeekSetBitGE infrastructure. Con: O(rows) init cost per block.
    • Option B: Check per-row inline. Pro: no init cost, only pays for rows actually visited. Con: branch per row in the hot path.
    • Recommendation: Option B for initial implementation (simpler, and the block-level filter already skips most blocks entirely), with Option A as a future optimization if profiling shows benefit.
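Option A can be sketched as follows (hypothetical names): precompute a mask bitmap at block init from the decoded timestamp columns, analogous to the isObsolete bitmap used by HideObsoletePoints. Option B is the same comparison performed inline per visited row instead of up front.

```go
package main

import "fmt"

// buildMaskBitmap marks every row whose decoded MVCC timestamp exceeds the
// bound. The walls/logicals slices stand in for the block's decoded
// mvccWallTimes/mvccLogical columns.
func buildMaskBitmap(walls []uint64, logicals []uint32, boundWall uint64, boundLogical uint32) []bool {
	masked := make([]bool, len(walls))
	for i := range walls {
		masked[i] = walls[i] > boundWall ||
			(walls[i] == boundWall && logicals[i] > boundLogical)
	}
	return masked
}

func main() {
	walls := []uint64{95, 100, 100, 103}
	logicals := []uint32{0, 0, 7, 0}
	// Bound (100, 0): rows at (100,7) and (103,0) are masked.
	fmt.Println(buildMaskBitmap(walls, logicals, 100, 0)) // [false false true true]
}
```

The O(rows) init cost here is what the recommendation defers: with block-level filtering already eliminating most blocks, the inline check (Option B) pays only for rows actually visited.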

Layer 4: Range Key Filtering

  • Range key entries (RangeKeySet, RangeKeyUnset, RangeKeyDelete) with MVCC suffix > T must be filtered out by the range key iterator within each SST
  • Use ComparePointSuffixes on the entry's suffix vs the bound
  • Entries with no suffix pass through (non-MVCC range keys)
  • The TBI fragmentation concern (#86260) does not apply: that was specific to the dual-iterator pattern where TBI and main iterator saw different fragments. Here the bound is a file property — all iterators see the same filtered view.
  • Block-level range key filtering via RangeKeyFilters on block properties can also be enabled (TBI explicitly disabled these, but we don't have the same constraint)

Layer 5: Compaction Behavior

  • During compaction, keys masked by MVCCUpperBound should be physically dropped (not written to output SST), reclaiming space
  • Output SSTs from such compactions don't inherit the bound (keys are gone)
  • This is the same semantic as HideObsoletePoints during compaction
  • Space reclamation happens asynchronously as compactions run after cutover

Raft Integration and MVCC Stats

Why Raft

The Pebble manifest edit must happen on all replicas of each range consistently. Different replicas have different physical LSM layouts (compaction histories, SST boundaries), but the Raft command specifies the logical intent: "apply MVCCUpperBound=T to all SSTs overlapping span [X, Y)." Each replica independently identifies its own SSTs and performs the manifest edit. This is analogous to how ClearRange works — Raft specifies the key range, each replica applies it.

Stats handling

After applying the bound, the range's MVCC stats are stale — we've hidden keys without computing the exact delta. Computing the exact delta would require iterating every masked key, defeating the purpose of the feature.

Instead, mark stats as estimated via MVCCStats.ContainsEstimates. This is set during eval (before Raft proposal) as part of the stats delta in the Raft log entry. Stats will be corrected eventually by the consistency checker or MVCC GC.

Integration: flag on RevertRange

No new range command needed. Add a flag (e.g., UsePebbleMVCCMask bool) to the existing RevertRange request. RevertRange already has exactly the right semantics — span, target timestamp, goes through Raft, existing callers (PCR cutover) already use it.

  • Flag off (default): Current behavior. Eval iterates via MVCCClearTimeRange, produces a WriteBatch of tombstones with exact stats deltas. Standard Raft proposal — each replica applies the batch.
  • Flag on: Eval does NOT produce a WriteBatch. Instead it produces a different Raft proposal type that carries just the span + timestamp + stats delta (ContainsEstimates = 1). On Raft apply, each replica calls Pebble.ApplyMVCCUpperBound(span, T) — similar to how splits/merges have special proposal handling beyond just applying a batch.

The Pebble API handles the memtable flush internally — the contract is "install this MVCC time mask on span X" and the fact that some data might be in memtables vs SSTs is a Pebble implementation concern. The flush ensures all data (including intent resolution writes) is in SSTs before the VersionEdit attaches the bound.

Operational Flows

PCR Cutover

  1. Quiesce workload on the tenant's spans (already done today)
  2. RevertRange(span, T, UsePebbleMVCCMask=true) — fans out to all ranges
  3. Each range: eval marks stats estimated → Raft → apply calls Pebble.ApplyMVCCUpperBound(span, T)
  4. Done. All reads now see data as-of time T. No iteration, no tombstones.
  5. Background: Compactions gradually physically drop masked keys

Note: No intent resolution step needed — PCR data is from AddSSTable (no intents).

Cluster Fork (Global Keyspace Only)

  1. Pick T < QueryResolvedTimestamp(tenant span) — guarantees no unresolved intents at or below T
  2. RevertRange(span, T, UsePebbleMVCCMask=true) — same Raft flow as PCR
  3. Copy/snapshot the SSTs — fork gets a consistent point-in-time view of the global keyspace at T. No txn records needed, no intent resolution on the fork.
  4. Background: Compactions on the source reclaim space from masked keys

Key Files to Modify

Pebble (../pebble/):

  • sstable/blockiter/transforms.go — add SuffixUpperBound field to Transforms
  • internal/manifest/table_metadata.go — add MVCCUpperBound field
  • internal/manifest/version_edit.go — add custom tag, encode/decode
  • cockroachkvs/cockroachkvs.go — add row-level filtering in cockroachKeySeeker
  • file_cache.go — attach block property filter when MVCCUpperBound is set
  • sstable/colblk/data_block.go — integrate SuffixUpperBound check in Next/SeekGE (likely minimal changes if done in cockroachKeySeeker layer)
  • compaction.go — ensure compaction respects/drops masked keys

CockroachDB (pkg/):

  • pkg/storage/ — API to apply upper bounds to SSTs via Pebble
  • pkg/kv/kvserver/ — new RPC or store method for "apply time bound to tenant span"
  • pkg/ccl/streamingccl/ — integrate into PCR cutover flow

Open Questions

  1. Abstraction level: Should SuffixUpperBound be generic in Pebble (any suffix) or MVCC-specific in cockroachkvs? The cockroachKeySeeker optimization requires MVCC awareness, but the Transforms field could be generic []byte with the KeySeeker interpreting it.

  2. Block property filter attachment: Per-SST filters are currently only set via IterOptions. Need to thread them through file_cache.go similar to how obsoleteKeyBlockPropertyFilter is attached for HideObsoletePoints.

  3. Stats invalidation: After applying an upper bound, table stats become stale. Need a mechanism to mark stats as needing recalculation.

  4. Interaction with snapshots: Pebble snapshots pin sequence numbers. An upper bound is timestamp-based, not seqnum-based. These are orthogonal but need to compose correctly.

  5. Testing strategy: The HideObsoletePoints testing infrastructure (randomized transforms in data_block_test.go:375) provides a good model. Would need equivalent randomized testing for suffix upper bounds.

Verification Plan

  1. Unit tests in cockroachkvs/ verifying row-level filtering in cockroachKeySeeker with various suffix types (MVCC, untyped, empty)
  2. SSTable reader tests verifying block-level + row-level filtering compose correctly with other transforms (synthetic suffix, hide obsolete points)
  3. Manifest round-trip tests for the new custom tag
  4. Integration test: Create SST with keys at various timestamps, apply upper bound, verify only expected keys are visible
  5. Compaction test: Verify masked keys are physically dropped during compaction
  6. CRDB-level test: Apply upper bound to a range, verify reads return point-in-time data