- During Segment Store startup, the container metadata and storage metadata table segments are added as pinned segments inside SLTS.
- During startup of the secondary services, the chunks for the container metadata and storage metadata table segments are discovered and metadata about them is populated in the storage metadata segment. (This is done in order to avoid a circular dependency.)
SLTS stores metadata about all segments and their chunks in special metadata table segments. In this document we refer to them as "Storage System Segments". There are two such segments:
- Container Metadata Segment (and its attribute segment)
- Storage Metadata Segment (and its attribute segment)
- Note that the Storage System Segments are table segments, and as such the data in them is regularly flushed to LTS.
- This means that reading a KV pair sometimes requires additional read requests to fetch pages of those table segments back from LTS.
- Each storage operation must use the metadata stored in the Storage System Segments to fulfill a read, write or any other request on any segment in the system.
- That also includes reads and writes of the Storage System Segments themselves.
- However, the metadata about the Storage System Segments cannot be stored in those segments themselves; it must be stored somewhere else, otherwise it creates a circular dependency.
The solution is to pin the metadata about the Storage System Segments in memory and journal the changes to those records, as sketched after the list below.
- SLTS pins the metadata about the Storage System Segments in memory, which means it is never evicted from memory.
- The metadata about the Storage System Segments is thus never written to those segments themselves.
- In order not to lose changes when the segment store is restarted, when a container is relocated, or on any other such failure, the changes to the in-memory metadata are written to journal files as change log records.
- Each container has a separate journal.
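A minimal sketch of this idea follows; all class and method names are illustrative, not the actual SLTS API. Pinned records live in an in-memory map and every mutation is appended to the per-container journal as a change log record before it is applied.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; not the actual SLTS classes.
class PinnedMetadataStore {
    // Pinned records are held here and are never evicted or written to the metadata segments.
    private final Map<String, SegmentMetadata> pinned = new ConcurrentHashMap<>();
    private final Journal journal; // one journal per container

    PinnedMetadataStore(Journal journal) {
        this.journal = journal;
    }

    void update(SegmentMetadata record) {
        // Persist the change as a change log record first, then apply it in memory.
        journal.append(new MetadataChangeRecord(record));
        pinned.put(record.name(), record);
    }
}

record SegmentMetadata(String name, long length, String lastChunk) {}
record MetadataChangeRecord(SegmentMetadata newValue) {}

interface Journal {
    void append(Object record);
}
```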
To recreate the in-memory state of the Storage System Segment metadata, the journal records are replayed at boot time.
- During SLTS startup, all journal records are applied sequentially to recreate the in-memory state.
- Snapshot records capture the state of the Storage System Segments at a given point in time.
- This means that if we start from a known snapshot, only the journal records created after that snapshot need to be applied.
- The pointer to the latest snapshot is stored in the snapshot info store.
- The snapshot info is stored as core attributes of the container metadata segment.
- The core attributes themselves are saved via BookKeeper.
- Journal records are written sequentially.
- A new journal file is started when:
    - there is a failure during a write,
    - the journal object reaches its size limit, or
    - a new snapshot is created.
- In non-append mode, each journal record is written to its own file.
- In append mode, new records are appended to the same active journal.
- Snapshot records are created when either of the thresholds below is reached, whichever comes first (see the sketch after this list):
    - a fixed number of records has been written to the journal since the last snapshot (default: every 100 records), or
    - a fixed time interval has elapsed since the last snapshot (default: every 5 minutes).
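A small sketch of the snapshot trigger, assuming illustrative field and method names; the defaults mirror the values mentioned above (100 records, 5 minutes).

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch of the snapshot trigger; names and wiring are assumptions.
class SnapshotTrigger {
    private static final int MAX_RECORDS_BETWEEN_SNAPSHOTS = 100;
    private static final Duration MAX_TIME_BETWEEN_SNAPSHOTS = Duration.ofMinutes(5);

    private int recordsSinceLastSnapshot = 0;
    private Instant lastSnapshotTime = Instant.now();

    // Called after each journal record is written.
    boolean shouldCreateSnapshot() {
        recordsSinceLastSnapshot++;
        boolean tooManyRecords = recordsSinceLastSnapshot >= MAX_RECORDS_BETWEEN_SNAPSHOTS;
        boolean tooMuchTime = Duration.between(lastSnapshotTime, Instant.now())
                .compareTo(MAX_TIME_BETWEEN_SNAPSHOTS) >= 0;
        return tooManyRecords || tooMuchTime;
    }

    // Called once the new snapshot has been written and validated.
    void onSnapshotCreated() {
        recordsSinceLastSnapshot = 0;
        lastSnapshotTime = Instant.now();
    }
}
```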
- Find the latest valid snapshot by reading from the snapshot info store.
- Starting from that snapshot, read each journal file written after it sequentially and apply its changes on top of the snapshot.
- Apply the final truncation changes.
- If applicable, reconcile the length of the last chunk of each storage system segment by comparing the metadata with the actual chunk length on LTS.
- Validate the changes and create a new snapshot. (Check that none of the invariants are broken and all data is consistent. Optionally, for each system segment, check that all chunks actually exist on LTS.)
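The overall boot sequence can be sketched as below. The types and method names are assumptions used only to show the order of the steps, not the real SLTS interfaces.

```java
import java.util.List;

// Illustrative bootstrap sketch; types and method names are assumptions.
class JournalBootstrap {
    InMemoryState recover(SnapshotInfoStore infoStore, JournalReader reader, Lts lts) {
        // 1. Find the latest valid snapshot.
        SnapshotPointer pointer = infoStore.readLatestSnapshotPointer();
        InMemoryState state = reader.readSnapshot(pointer);

        // 2. Read each journal file written after the snapshot and apply its changes in order.
        for (JournalFile file : reader.listJournalFilesAfter(pointer)) {
            for (JournalRecord record : file.records()) {
                state.apply(record);
            }
        }

        // 3. Apply the final truncation changes.
        state.applyPendingTruncations();

        // 4. Reconcile the length of the last chunk of each system segment against LTS.
        state.reconcileLastChunkLengths(lts);

        // 5. Validate invariants and persist a fresh snapshot.
        state.validate();
        infoStore.writeSnapshot(state.toSnapshot());
        return state;
    }
}

// Minimal placeholder types so the sketch is self-contained.
interface SnapshotInfoStore { SnapshotPointer readLatestSnapshotPointer(); void writeSnapshot(Object snapshot); }
interface JournalReader { InMemoryState readSnapshot(SnapshotPointer p); List<JournalFile> listJournalFilesAfter(SnapshotPointer p); }
interface JournalFile { List<JournalRecord> records(); }
interface JournalRecord {}
interface Lts {}
interface SnapshotPointer {}
interface InMemoryState {
    void apply(JournalRecord r);
    void applyPendingTruncations();
    void reconcileLastChunkLengths(Lts lts);
    void validate();
    Object toSnapshot();
}
```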
Addition of chunk
The following info is included in the journal record:
- Name of the segment.
- Name of the new chunk added.
- Offset at which the chunk is added.
- Name of the previous last chunk.
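An illustrative shape for this record (field names are assumptions, not the actual serialized format):

```java
// Illustrative shape of a chunk-addition journal record; field names are assumptions.
record ChunkAddedRecord(
        String segmentName,  // name of the segment
        String newChunkName, // name of the new chunk being added
        long offset,         // segment offset at which the chunk is added
        String oldChunkName  // name of the previous last chunk (null for the first chunk)
) {}
```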
Truncation of segment
The following info is included in the journal record:
- Name of the segment.
- Name of the new first chunk.
- Offset inside the chunk at which valid data starts.
- New start offset for the segment.
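An illustrative shape for this record (field names are assumptions, not the actual serialized format):

```java
// Illustrative shape of a segment-truncation journal record; field names are assumptions.
record TruncationRecord(
        String segmentName,         // name of the segment
        String firstChunkName,      // name of the new first chunk
        long firstChunkStartOffset, // offset inside the chunk at which valid data starts
        long startOffset            // new start offset for the segment
) {}
```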
Snapshot record
Each snapshot record is saved in a separate file/object. The following info is included in the record:
- Epoch
- The last journal file written before the snapshot is created.
- For each system segment:
    - Segment record
    - List of chunk records
Snapshot info record
The snapshot info (checkpoint) record is simply a pointer to the latest/most recent snapshot record for a given time interval. It is stored in the snapshot info store. The following info is included in the record:
- Epoch
- Id of the checkpoint record.
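Illustrative shapes for the snapshot record and the snapshot info (checkpoint) record; all names are assumptions used only to show which fields are captured.

```java
import java.util.List;

// Illustrative shapes only; not the actual SLTS serialization classes.
record SystemSnapshotRecord(
        long epoch,                    // epoch of the instance creating the snapshot
        String lastJournalFile,        // last journal file written before the snapshot is created
        List<SegmentSnapshot> segments // one entry per system segment
) {}

record SegmentSnapshot(
        SegmentRecord segment,         // segment record
        List<ChunkRecord> chunks       // list of chunk records
) {}

record SnapshotInfo(
        long epoch,                    // epoch of the instance writing the checkpoint
        long snapshotId                // id of the checkpoint record being pointed to
) {}

record SegmentRecord(String name, long length) {}
record ChunkRecord(String name, long length) {}
```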
- Each instance writes only to its own journal files.
- The epoch is a monotonically increasing number. (Newer instances have a higher epoch.)
- The time interval between two snapshots is several orders of magnitude larger than the possible clock skew. (See https://www.eecis.udel.edu/~mills/ntp/html/discipline.html and the note below.)
- The time interval between two snapshots is orders of magnitude larger than the time taken by a container to boot.
- The chunk naming scheme is such that (a sketch follows this list):
    - no two journal chunks can have the same name,
    - names include the container id,
    - names include the epoch of the instance writing the journal,
    - names include a monotonically increasing sequence number.
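A sketch of such a naming scheme; the exact prefix and format string are assumptions, but they show how the container id, the epoch, and a monotonically increasing sequence number together make every name unique.

```java
// Illustrative naming scheme; the concrete format is an assumption.
class JournalNaming {
    static String journalChunkName(int containerId, long epoch, long sequenceNumber) {
        return String.format("_system/containers/%d/journal.%d.%d", containerId, epoch, sequenceNumber);
    }

    static String snapshotChunkName(int containerId, long epoch, long snapshotId) {
        return String.format("_system/containers/%d/snapshot.%d.%d", containerId, epoch, snapshotId);
    }
}
```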
According to https://www.eecis.udel.edu/~mills/ntp/html/discipline.html:
"If left running continuously, an NTP client on a fast LAN in a home or office environment can maintain synchronization nominally within one millisecond."
To minimize clock skew and keep the K8s cluster functioning smoothly, most K8s systems have NTP installed on the host node. (verify)
- The read operation is retried a fixed number of times (with a fixed interval between the attempts).
- It is possible for a zombie instance to continue to write to old journal files.
- The algorithm is able to detect such zombie records by detecting that the valid owner has a larger epoch number and an alternative history of changes.
- In such cases the history is reset to delete the zombie changes, and only the valid changes are applied (see the sketch after this list).
- Any partially written records are ignored.
- After each write failure a new journal file is started.
- Duplicate records can be created during a retry, when the original request succeeds but the writer encounters a timeout and retries.
- Upon encountering a previously applied record again, such duplicate records are ignored.
- The checkpoint is attempted a fixed number of times. Each attempt gets a unique, predictable name.
- A snapshot is not considered successful if the snapshot file cannot be read back and validated.
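A much simplified sketch of the replay-side checks; the real behavior resets the already applied history when a zombie write is detected, whereas this sketch merely filters records, and all names are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: skip zombie records (stale epoch appearing after a newer owner's
// records) and duplicate records (already applied once) during journal replay.
class ReplayFilter {
    private long highestEpochSeen = Long.MIN_VALUE;
    private final Set<String> appliedRecordIds = new HashSet<>();

    // Returns true if the record should be applied, false if it must be skipped.
    boolean accept(long recordEpoch, String recordId) {
        if (recordEpoch < highestEpochSeen) {
            // Written by a zombie instance with a stale epoch: its alternative history is discarded.
            return false;
        }
        highestEpochSeen = recordEpoch;
        // A record that was already applied (e.g. re-written after a timed-out but
        // actually successful append) is ignored.
        return appliedRecordIds.add(recordId);
    }
}
```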