PCR cutover today uses RevertRange RPCs that iterate every key in the cluster
(modulo TBI block skipping) to find and tombstone writes newer than time T. This
is O(changed data) in both reads and writes. The idea: attach an MVCC upper time
bound to each SST via a single manifest edit, instantly masking all keys > T from
all iterators. No loading blocks, no writing tombstones, no compaction of those
tombstones — just one manifest write per store.
A secondary use case — cluster fork — motivates designing the feature to work on SSTs from live traffic (with intents), not just PCR's intent-free AddSSTable data.
- T < QueryResolvedTimestamp(span): The caller picks T below the resolved timestamp for the tenant's key span. The resolved timestamp guarantees that all intents at or below T have been fully resolved — lock table keys removed, provisional values either promoted to their commit timestamp or deleted.
- Memtable flush after picking T: All memtables on all stores are flushed before applying the SST-level bound. This ensures intent resolution writes (which were applied via Raft before the resolved timestamp could reach T) are materialized in SSTs, not stranded in memtables.
- MVCCUpperBound = T applied to every SST overlapping the span: Applied uniformly via a single VersionEdit per store. SSTs straddling the span boundary are first virtualized, with the bound applied only to the virtual SST covering the tenant's portion.
The MVCCUpperBound is a per-SST transform (like HideObsoletePoints or
SyntheticSuffix) that masks point keys whose MVCC timestamp suffix is > T.
It operates at two levels:
- Block level: Existing MVCCTimeInterval block properties track [min_walltime, max_walltime] per block. Entire blocks where min_walltime > T.WallTime are skipped without loading.
- Row level: In the cockroachKeySeeker, mvccWallTimes.At(row) and mvccLogical.At(row) are compared directly against T — two uint64 comparisons, no key materialization. Rows exceeding T are skipped.
Keys with non-MVCC suffixes (lock table keys with 17-byte version, unversioned meta keys) are never filtered — the suffix type check excludes them.
By precondition (1), T < resolved TS means no unresolved intents exist at or below T. Therefore every MVCC point key at timestamp ≤ T in the global keyspace is a definitively committed value. No provisional values with ambiguous commit status exist at ≤ T — they've all been resolved:
- Committed, no timestamp push: Lock table key deleted. Provisional value at write timestamp stays as the committed value. It's at ≤ T and visible through the filter. Correct.
- Committed, timestamp pushed from T-5 to T-2 (both ≤ T): Intent resolution wrote a DEL at K@T-5 and a SET at K@T-2. Both are at ≤ T, both visible. The DEL shadows the stale provisional. The committed value at T-2 is visible. Correct.
- Committed, timestamp pushed from T-5 to T+3 (crosses T): Intent resolution wrote a DEL at K@T-5 and a SET at K@T+3. The DEL at T-5 is visible (≤ T) and shadows the stale provisional. The SET at T+3 is masked (> T). Net: key not visible. Correct — the transaction's committed timestamp is after T.
- Aborted: Intent resolution wrote a DEL at the provisional value's timestamp and deleted the lock table key. The DEL shadows the stale provisional. Key not visible. Correct.
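The cases above reduce to one visibility rule: mask timestamps above T, then let higher sequence numbers shadow lower ones at the same (key, timestamp). A toy model (plain structs, not Pebble code) of the two pushed-commit cases:

```go
package main

import "fmt"

// write is a toy model of one versioned point write: (timestamp, seqnum,
// SET-or-DEL). Not Pebble code; it only demonstrates the shadowing argument.
type write struct {
	ts    uint64
	seq   int
	isDel bool
	value string
}

// readAtT scans writes for one key: among entries with ts <= T, pick the one
// with the highest ts, breaking ties by highest seqnum (merge-iterator order).
func readAtT(writes []write, T uint64) (string, bool) {
	var best *write
	for i := range writes {
		w := &writes[i]
		if w.ts > T {
			continue // masked by MVCCUpperBound
		}
		if best == nil || w.ts > best.ts || (w.ts == best.ts && w.seq > best.seq) {
			best = w
		}
	}
	if best == nil || best.isDel {
		return "", false
	}
	return best.value, true
}

func main() {
	const T = 100
	// Pushed from 95 to 103 (crosses T): resolution DEL@95 shadows the
	// provisional SET@95; the committed SET@103 is masked.
	crossing := []write{
		{ts: 95, seq: 1, value: "provisional"},
		{ts: 95, seq: 2, isDel: true},
		{ts: 103, seq: 3, value: "committed"},
	}
	_, ok := readAtT(crossing, T)
	fmt.Println(ok) // false: key not visible at T

	// Pushed from 95 to 98 (both <= T): DEL@95 shadows the provisional,
	// the committed SET@98 is visible.
	within := []write{
		{ts: 95, seq: 1, value: "provisional"},
		{ts: 95, seq: 2, isDel: true},
		{ts: 98, seq: 3, value: "committed"},
	}
	v, ok := readAtT(within, T)
	fmt.Println(v, ok) // committed true
}
```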
Intent resolution is async and its writes often land in a different SST than the original provisional value. This is fine because the merge iterator handles cross-SST interactions:
- SST A (flushed early): contains stale provisional value K@T-5 (a SET)
- SST B (flushed later, after resolution): contains DEL at K@T-5
The DEL in SST B has a higher sequence number than the SET in SST A. The merge iterator correctly resolves the DEL over the SET. Both keys have the same MVCC suffix (T-5), so both pass or fail the time bound filter identically. The shadowing works regardless of which SSTs the keys land in.
Precondition (2) — flushing memtables — guarantees that all resolution writes are in SSTs. The resolved timestamp could not have reached T unless the resolution Raft commands were applied, which means the resolution writes are in the engine. After flush, they're in SSTs and participate in the merge.
The MVCCIncrementalIterator uses a dual-iterator pattern (filtered TBI +
unfiltered main iterator) for two reasons:
- TBI was historically an optimization hint, not a correctness guarantee — the unfiltered iterator was the source of truth.
- MVCCClearTimeRange needs to see older versions below the cleared range to compute correct MVCC stats when writing tombstones (mvcc.go:3394-3405).

Neither applies here:
- The SST-level bound is a property of the file, not an iterator option. There is no "unfiltered view" — the file's metadata says keys above T don't exist, the same way virtual SST bounds or HideObsoletePoints work. All iterators see the same filtered view.
- No tombstones are written, so no stats delta computation is needed. Stats can be recalculated lazily or during compaction.
- The merge iterator composes per-SST filtered views naturally, identical to how virtual SSTs with different bounds already work in Pebble.
Lock table keys have 17-byte suffixes (lock strength + txn UUID), not MVCC
timestamp suffixes (8, 12, or 13 bytes). The cockroachKeySeeker classifies
rows by suffixTypes; the MVCC upper bound check only applies to rows with MVCC
timestamp suffixes. Lock table keys pass through unfiltered.
By precondition (1), all intents at ≤ T have been resolved, so any remaining lock table keys are either:
- For intents at > T (their lock table DELs and provisional value DELs are also at > T and may or may not be visible, but the intent itself is above our snapshot point — irrelevant)
- Already shadowed by their resolution DELs in a newer SST
Transaction records are range-local keys at
\x01k{anchorKey}{txnSuffix}{txnID} (keys/keys.go:544-546), stored outside
the tenant's global key span [/Tenant/N/, /Tenant/N+1/).
For the cluster fork use case (which copies only global keyspace), txn records are not available. This is acceptable because precondition (1) — T < resolved TS — guarantees no unresolved intents exist at ≤ T. There is nothing to resolve, so txn records are never consulted.
The resolved timestamp is not persisted. It is computed as
min(closedTS, oldest_unresolved_intent_timestamp - 1). Two ways to obtain it:
- Active rangefeed: ScheduledProcessor maintains it in-memory per range, updated as Raft applies intent write/commit/abort ops. Published via RangeFeedCheckpoint. State: per-processor, in-memory, not persisted, reconstructed on rangefeed start by scanning the lock table.
- On-demand: QueryResolvedTimestamp RPC (cmd_query_resolved_timestamp.go:43-88) computes it by scanning the lock table and combining with the closed timestamp. No active rangefeed needed.
The resolved timestamp typically trails "now" by seconds (bounded by the oldest active transaction's write timestamp), so the constraint T < resolved TS is not restrictive in practice.
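The formula above can be sketched directly. Timestamps are plain wall-time integers here; the real computation uses hlc.Timestamp and lives in QueryResolvedTimestamp and the rangefeed processor.

```go
package main

import "fmt"

// resolvedTS sketches: min(closedTS, oldest_unresolved_intent_timestamp - 1).
// With no unresolved intents, the closed timestamp itself is fully resolved.
func resolvedTS(closedTS, oldestIntentTS uint64, haveIntents bool) uint64 {
	if !haveIntents {
		return closedTS
	}
	if oldestIntentTS-1 < closedTS {
		return oldestIntentTS - 1
	}
	return closedTS
}

func main() {
	// Closed TS at 100, oldest unresolved intent at 97: resolved TS is 96,
	// so any T <= 96 satisfies precondition (1).
	fmt.Println(resolvedTS(100, 97, true)) // 96
	fmt.Println(resolvedTS(100, 0, false)) // 100
}
```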
Pebble RANGEDELs (non-MVCC): ClearRange operations produce Pebble-level
range deletions with no MVCC suffix. These pass through the suffix filter and
shadow keys that should be visible at T. This is a pre-existing limitation of
any MVCC time-travel mechanism — RevertRange can't undo a ClearRange either,
since the data is physically gone rather than tombstoned at a higher MVCC
timestamp. Not a practical concern for PCR or fork use cases where destructive
non-MVCC operations on the tenant's span are not expected.
MVCC range keys (range tombstones): Must also be filtered. An unfiltered
MVCC range tombstone at T+5 would shadow committed point keys at T-3 that should
be live after revert — this is a correctness issue, not a deferrable
optimization. The reason TBI disabled range key filtering (issue #86260) was
specific to the dual-iterator pattern in MVCCIncrementalIterator — the TBI and
main iterator seeing different range key fragmentation. That concern does not
apply here: the SST-level bound is a property of the file (not an iterator
option), there is no dual iterator, and the bound is applied uniformly to all
SSTs. All iterators see the same filtered range keys.
Implementation: the range key iterator within each SST must skip range key
entries (RangeKeySet, RangeKeyUnset, RangeKeyDelete) whose suffix compares > T
via ComparePointSuffixes. Range key entries with no suffix or with suffix ≤ T
pass through unfiltered.
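A sketch of that range-key rule, under simplifying assumptions: compareSuffixes stands in for cockroachkvs.ComparePointSuffixes, and plain ascending byte comparison is used here even though real MVCC suffixes sort descending by timestamp. keepRangeKeyEntry is a hypothetical helper, not an existing Pebble function.

```go
package main

import (
	"bytes"
	"fmt"
)

// keepRangeKeyEntry reports whether a RangeKeySet/Unset/Delete entry survives
// the MVCC upper bound: suffix-less (non-MVCC) entries always pass, suffixed
// entries pass only when their timestamp is <= T.
func keepRangeKeyEntry(suffix, bound []byte) bool {
	if len(suffix) == 0 {
		return true // no suffix: not an MVCC timestamp, never filtered
	}
	return compareSuffixes(suffix, bound) <= 0 // keep iff suffix <= T
}

// compareSuffixes is an illustrative stand-in for ComparePointSuffixes,
// treating larger bytes as later timestamps.
func compareSuffixes(a, b []byte) int { return bytes.Compare(a, b) }

func main() {
	bound := []byte{100}
	fmt.Println(keepRangeKeyEntry([]byte{99}, bound))  // <= T: kept
	fmt.Println(keepRangeKeyEntry([]byte{105}, bound)) // > T: filtered
	fmt.Println(keepRangeKeyEntry(nil, bound))         // unsuffixed: kept
}
```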
pebble/sstable/blockiter/transforms.go:20-30
Already flows from TableMetadata → Reader → SingleLevelIterator → DataBlockIter.
Contains SyntheticSeqNum, HideObsoletePoints, SyntheticPrefixAndSuffix.
This is where a new "suffix upper bound" would live.
pebble/sstable/colblk/data_block.go:1680,1845-1858
Uses a per-block bitmap (isObsolete) and SeekUnsetBitGE/SeekSetBitGE to
efficiently skip rows in columnar blocks. Block-level filtering via
obsoleteKeyBlockPropertyFilter skips entire blocks where all points are
obsolete. Two-level filtering (block + row) is already established.
pebble/cockroachkvs/cockroachkvs.go:765-799
Has mvccWallTimes colblk.UnsafeUints and mvccLogical colblk.UnsafeUints as
decoded columns. Filtering would be two uint64 comparisons per row — no key
materialization, no byte slicing. Drastically cheaper than the current SkipPoint
callback which operates on materialized key bytes.
pebble/internal/manifest/table_metadata.go:204
pebble/internal/manifest/version_edit.go:91 (customTagSyntheticSuffix = 68)
Already stores per-SST suffix transforms in the manifest with custom tags. A new upper bound field would use the same pattern (new custom tag, new field).
pebble/cockroachkvs/blockproperties.go:15-87
MVCCTimeInterval block property collector tracks [min_walltime, max_walltime]
per block. NewMVCCTimeIntervalFilter already filters blocks by time range.
The existing SyntheticSuffixIntersects method handles suffix replacement.
pebble/internal/manifest/table_metadata.go:199-200
Virtual SSTs already support per-file key bounds, synthetic prefix/suffix, and
are created via single VersionEdit operations. Splitting an SST that straddles
a tenant boundary is already a solved problem.
- New field on TableMetadata: MVCCUpperBound (wall time + logical as uint64s, or encoded MVCC suffix bytes)
- New custom tag in version_edit.go (next available, ~70)
- Applied via VersionEdit with DeletedTables + NewTables swapping the same backing file but adding the bound
- For SSTs straddling tenant boundaries: first virtualize, then apply the bound to the tenant-span virtual SST only
- When MVCCUpperBound is set on a table, automatically construct an MVCCTimeIntervalFilter(0, upperBound.WallTime+1) and attach it as a PointKeyFilter on every iterator opened on that SST
- This is NOT via IterOptions.PointKeyFilters (which is per-iterator from the caller) but via the reader/file_cache layer that already handles per-SST transforms
- Entire blocks where min_walltime > upperBound.WallTime are skipped without loading
- Blocks where max_walltime <= upperBound.WallTime need no row-level filtering
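The resulting three-way decision per block can be sketched as follows. The names are illustrative, not the Pebble block-property API; the real check goes through MVCCTimeInterval properties and block property filters.

```go
package main

import "fmt"

// blockDecision classifies a block against the upper bound using its
// [min_walltime, max_walltime] property.
type blockDecision int

const (
	skipBlock  blockDecision = iota // every row is > T: never load it
	passBlock                       // every row is <= T: no row filtering
	filterRows                      // straddles T: check each row
)

func classifyBlock(minWall, maxWall, boundWall uint64) blockDecision {
	switch {
	case minWall > boundWall:
		return skipBlock
	case maxWall <= boundWall:
		return passBlock
	default:
		return filterRows
	}
}

func main() {
	fmt.Println(classifyBlock(110, 120, 100)) // 0: skip without loading
	fmt.Println(classifyBlock(50, 90, 100))   // 1: pass, no row filtering
	fmt.Println(classifyBlock(90, 110, 100))  // 2: row-level filtering
}
```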
- New field in Transforms: something like SuffixUpperBound []byte (encoded MVCC timestamp) or a pair of (wallTime uint64, logical uint32)
- In cockroachKeySeeker, during iteration: compare mvccWallTimes.At(row) and mvccLogical.At(row) against the bound
- Skip rows exceeding the bound using the same pattern as HideObsoletePoints
- Two options for the skip mechanism:
  - Option A: Build a bitmap during block init (like isObsolete). Pro: amortized cost, uses existing SeekSetBitGE infrastructure. Con: O(rows) init cost per block.
  - Option B: Check per-row inline. Pro: no init cost, only pays for rows actually visited. Con: branch per row in the hot path.
- Recommendation: Option B for initial implementation (simpler, and the block-level filter already skips most blocks entirely), with Option A as a future optimization if profiling shows benefit.
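Option B amounts to an inline scan over the decoded timestamp columns, modeled here as plain slices. The real columns are colblk.UnsafeUints in cockroachKeySeeker; nextVisibleRow is a hypothetical helper, not an existing Pebble function.

```go
package main

import "fmt"

// decodedBlock stands in for the per-block decoded mvccWallTimes/mvccLogical
// columns in a columnar data block.
type decodedBlock struct {
	wall    []uint64
	logical []uint64
}

// nextVisibleRow returns the first row index >= start whose timestamp is
// <= (boundWall, boundLogical), or len(wall) if none — two uint64 comparisons
// per skipped row, no key materialization.
func (b *decodedBlock) nextVisibleRow(start int, boundWall, boundLogical uint64) int {
	for row := start; row < len(b.wall); row++ {
		w, l := b.wall[row], b.logical[row]
		if w < boundWall || (w == boundWall && l <= boundLogical) {
			return row
		}
	}
	return len(b.wall)
}

func main() {
	// Versions of one key, newest-first, so masked rows cluster at the front
	// and the scan stops at the first visible version.
	b := &decodedBlock{
		wall:    []uint64{120, 105, 100, 90},
		logical: []uint64{0, 0, 7, 0},
	}
	fmt.Println(b.nextVisibleRow(0, 100, 0)) // 3: rows 0-2 masked (100.7 > 100.0)
}
```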
- Range key entries (RangeKeySet, RangeKeyUnset, RangeKeyDelete) with MVCC suffix > T must be filtered out by the range key iterator within each SST
- Use ComparePointSuffixes on the entry's suffix vs the bound
- Entries with no suffix pass through (non-MVCC range keys)
- The TBI fragmentation concern (#86260) does not apply: that was specific to the dual-iterator pattern where TBI and main iterator saw different fragments. Here the bound is a file property — all iterators see the same filtered view.
- Block-level range key filtering via RangeKeyFilters on block properties can also be enabled (TBI explicitly disabled these, but we don't have the same constraint)
- During compaction, keys masked by MVCCUpperBound should be physically dropped (not written to the output SST), reclaiming space
- Output SSTs from such compactions don't inherit the bound (the keys are gone)
- This is the same semantic as HideObsoletePoints during compaction
- Space reclamation happens asynchronously as compactions run after cutover
The Pebble manifest edit must happen on all replicas of each range consistently.
Different replicas have different physical LSM layouts (compaction histories, SST
boundaries), but the Raft command specifies the logical intent: "apply
MVCCUpperBound=T to all SSTs overlapping span [X, Y)." Each replica independently
identifies its own SSTs and performs the manifest edit. This is analogous to how
ClearRange works — Raft specifies the key range, each replica applies it.
After applying the bound, the range's MVCC stats are stale — we've hidden keys without computing the exact delta. Computing the exact delta would require iterating every masked key, defeating the purpose of the feature.
Instead, mark stats as estimated via MVCCStats.ContainsEstimates. This is set
during eval (before Raft proposal) as part of the stats delta in the Raft log
entry. Stats will be corrected eventually by the consistency checker or MVCC GC.
No new range command needed. Add a flag (e.g., UsePebbleMVCCMask bool) to the
existing RevertRange request. RevertRange already has exactly the right
semantics — span, target timestamp, goes through Raft, existing callers (PCR
cutover) already use it.
- Flag off (default): Current behavior. Eval iterates via MVCCClearTimeRange, produces a WriteBatch of tombstones with exact stats deltas. Standard Raft proposal — each replica applies the batch.
- Flag on: Eval does NOT produce a WriteBatch. Instead it produces a different Raft proposal type that carries just the span + timestamp + stats delta (ContainsEstimates = 1). On Raft apply, each replica calls Pebble.ApplyMVCCUpperBound(span, T) — similar to how splits/merges have special proposal handling beyond just applying a batch.
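The shape of the two eval paths can be sketched with stand-in types. Everything here is hypothetical — revertRangeRequest, evalResult, and the placeholder batch contents are illustrative, not CockroachDB's actual proposal plumbing.

```go
package main

import "fmt"

type revertRangeRequest struct {
	startKey, endKey  string
	targetTime        uint64
	usePebbleMVCCMask bool
}

type evalResult struct {
	// Flag off: a batch of MVCC tombstones with exact stats deltas.
	writeBatch []string
	// Flag on: just span + timestamp, with estimated stats; no batch.
	maskSpan          [2]string
	maskTime          uint64
	containsEstimates bool
}

func evalRevertRange(req revertRangeRequest) evalResult {
	if !req.usePebbleMVCCMask {
		// Placeholder for MVCCClearTimeRange iteration producing tombstones.
		return evalResult{writeBatch: []string{"DEL a@105", "DEL b@103"}}
	}
	// Mask path: replicas apply the bound themselves at Raft apply time.
	return evalResult{
		maskSpan:          [2]string{req.startKey, req.endKey},
		maskTime:          req.targetTime,
		containsEstimates: true,
	}
}

func main() {
	res := evalRevertRange(revertRangeRequest{"a", "z", 100, true})
	fmt.Println(len(res.writeBatch), res.containsEstimates) // 0 true: no tombstones, estimated stats
}
```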
The Pebble API handles the memtable flush internally — the contract is "install
this MVCC time mask on span X" and the fact that some data might be in memtables
vs SSTs is a Pebble implementation concern. The flush ensures all data (including
intent resolution writes) is in SSTs before the VersionEdit attaches the bound.
- Quiesce workload on the tenant's spans (already done today)
- RevertRange(span, T, UsePebbleMVCCMask=true) — fans out to all ranges
- Each range: eval marks stats estimated → Raft → apply calls Pebble.ApplyMVCCUpperBound(span, T)
- Done. All reads now see data as-of time T. No iteration, no tombstones.
- Background: Compactions gradually physically drop masked keys
Note: No intent resolution step needed — PCR data is from AddSSTable (no intents).
- Pick T < QueryResolvedTimestamp(tenant span) — guarantees no unresolved intents at or below T
- RevertRange(span, T, UsePebbleMVCCMask=true) — same Raft flow as PCR
- Copy/snapshot the SSTs — fork gets a consistent point-in-time view of the global keyspace at T. No txn records needed, no intent resolution on the fork.
- Background: Compactions on the source reclaim space from masked keys
Pebble (../pebble/):
- sstable/blockiter/transforms.go — add SuffixUpperBound field to Transforms
- internal/manifest/table_metadata.go — add MVCCUpperBound field
- internal/manifest/version_edit.go — add custom tag, encode/decode
- cockroachkvs/cockroachkvs.go — add row-level filtering in cockroachKeySeeker
- file_cache.go — attach block property filter when MVCCUpperBound is set
- sstable/colblk/data_block.go — integrate SuffixUpperBound check in Next/SeekGE (likely minimal changes if done in the cockroachKeySeeker layer)
- compaction.go — ensure compaction respects/drops masked keys

CockroachDB (pkg/):
- pkg/storage/ — API to apply upper bounds to SSTs via Pebble
- pkg/kv/kvserver/ — new RPC or store method for "apply time bound to tenant span"
- pkg/ccl/streamingccl/ — integrate into PCR cutover flow
- Abstraction level: Should SuffixUpperBound be generic in Pebble (any suffix) or MVCC-specific in cockroachkvs? The cockroachKeySeeker optimization requires MVCC awareness, but the Transforms field could be a generic []byte with the KeySeeker interpreting it.
- Block property filter attachment: Per-SST filters are currently only set via IterOptions. Need to thread them through file_cache.go similar to how obsoleteKeyBlockPropertyFilter is attached for HideObsoletePoints.
- Stats invalidation: After applying an upper bound, table stats become stale. Need a mechanism to mark stats as needing recalculation.
- Interaction with snapshots: Pebble snapshots pin sequence numbers. An upper bound is timestamp-based, not seqnum-based. These are orthogonal but need to compose correctly.
- Testing strategy: The HideObsoletePoints testing infrastructure (randomized transforms in data_block_test.go:375) provides a good model. Would need equivalent randomized testing for suffix upper bounds.
- Unit tests in cockroachkvs/ verifying row-level filtering in cockroachKeySeeker with various suffix types (MVCC, untyped, empty)
- SSTable reader tests verifying block-level + row-level filtering compose correctly with other transforms (synthetic suffix, hide obsolete points)
- Manifest round-trip tests for the new custom tag
- Integration test: Create SST with keys at various timestamps, apply upper bound, verify only expected keys are visible
- Compaction test: Verify masked keys are physically dropped during compaction
- CRDB-level test: Apply upper bound to a range, verify reads return point-in-time data