- During Segment Store startup, the container metadata and storage metadata table segments are added as pinned segments inside SLTS.
- During startup of the secondary services, the chunks for the container metadata and storage metadata table segments are discovered and metadata about them is populated in the storage metadata segment. (This is done in order to avoid a circular dependency.)
SLTS stores metadata about all segments and their chunks in special metadata table segments. In this document we refer to them as "Storage System Segments". There are two such segments:
- Container Metadata Segment (and its attribute segment)
- Storage Metadata Segment (and its attribute segment)
- Note that the Storage System Segments are table segments, and as such the data in them is regularly flushed to LTS.
- This means that reading a KV pair sometimes requires additional read requests to fetch pages of those table segments back from LTS.
- Each storage operation must use the metadata stored in the Storage System Segments to fulfill a read, write or any other request on any segment in the system.
- That also includes reads and writes of the Storage System Segments themselves.
- However, the metadata about the Storage System Segments cannot be stored in those segments themselves; it must be stored somewhere else, otherwise it creates a circular dependency.
The solution is to pin the metadata about the Storage System Segments in memory and journal the changes to those records, as sketched after the list below.
- SLTS pins the metadata about the Storage System Segments in memory, which means it is never evicted from memory.
- The metadata about the Storage System Segments is thus never written to those segments themselves.
- In order not to lose changes when the segment store is restarted, when a container is relocated, or on any other such failure, the changes to the in-memory metadata are written to journal files as change log records.
- Each container has a separate journal.
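A minimal sketch of this idea follows; all class and method names are illustrative, not the actual SLTS API. Pinned records live in an in-memory map and every mutation is appended to the per-container journal as a change log record before it is applied.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; not the actual SLTS classes.
class PinnedMetadataStore {
    // Pinned records are held here and are never evicted or written to the metadata segments.
    private final Map<String, SegmentMetadata> pinned = new ConcurrentHashMap<>();
    private final Journal journal; // one journal per container

    PinnedMetadataStore(Journal journal) {
        this.journal = journal;
    }

    void update(SegmentMetadata record) {
        // Persist the change as a change log record first, then apply it in memory.
        journal.append(new MetadataChangeRecord(record));
        pinned.put(record.name(), record);
    }
}

record SegmentMetadata(String name, long length, String lastChunk) {}
record MetadataChangeRecord(SegmentMetadata newValue) {}

interface Journal {
    void append(Object record);
}
```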
To recreate the in-memory state of the Storage System Segment metadata, the journal records are replayed at boot time.
- During SLTS startup, all journal records are applied sequentially to recreate the in-memory state.
- Snapshot records capture the state of the Storage System Segments at a given point in time.
- This means that if we start from a known snapshot, only the journal records created after that snapshot need to be applied.
- The pointer to the latest snapshot is stored in the snapshot info store.
- The snapshot info is stored as core attributes of the container metadata segment.
- The core attributes themselves are saved via BookKeeper.
- Journal records are written sequentially.
- A new journal file is started when:
    - there is a failure during a write,
    - the journal object reaches its size limit, or
    - a new snapshot is created.
- In non-append mode, each journal record is written to its own file.
- In append mode, new records are appended to the same active journal.
- Snapshot records are created when either of the thresholds below is reached, whichever comes first (see the sketch after this list):
    - a fixed number of records has been written to the journal since the last snapshot (default: every 100 records), or
    - a fixed time interval has elapsed since the last snapshot (default: every 5 minutes).
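A small sketch of the snapshot trigger, assuming illustrative field and method names; the defaults mirror the values mentioned above (100 records, 5 minutes).

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch of the snapshot trigger; names and wiring are assumptions.
class SnapshotTrigger {
    private static final int MAX_RECORDS_BETWEEN_SNAPSHOTS = 100;
    private static final Duration MAX_TIME_BETWEEN_SNAPSHOTS = Duration.ofMinutes(5);

    private int recordsSinceLastSnapshot = 0;
    private Instant lastSnapshotTime = Instant.now();

    // Called after each journal record is written.
    boolean shouldCreateSnapshot() {
        recordsSinceLastSnapshot++;
        boolean tooManyRecords = recordsSinceLastSnapshot >= MAX_RECORDS_BETWEEN_SNAPSHOTS;
        boolean tooMuchTime = Duration.between(lastSnapshotTime, Instant.now())
                .compareTo(MAX_TIME_BETWEEN_SNAPSHOTS) >= 0;
        return tooManyRecords || tooMuchTime;
    }

    // Called once the new snapshot has been written and validated.
    void onSnapshotCreated() {
        recordsSinceLastSnapshot = 0;
        lastSnapshotTime = Instant.now();
    }
}
```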
- Find the latest valid snapshot by reading from the snapshot info store.
- Starting from that snapshot, read each journal file written after it sequentially and apply its changes on top of the snapshot.
- Apply the final truncation changes.
- If applicable, reconcile the length of the last chunk of each storage system segment by comparing the metadata with the actual chunk length on LTS.
- Validate the changes and create a new snapshot. (Check that none of the invariants are broken and all data is consistent. Optionally, for each system segment, check that all chunks actually exist on LTS.)
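The overall boot sequence can be sketched as below. The types and method names are assumptions used only to show the order of the steps, not the real SLTS interfaces.

```java
import java.util.List;

// Illustrative bootstrap sketch; types and method names are assumptions.
class JournalBootstrap {
    InMemoryState recover(SnapshotInfoStore infoStore, JournalReader reader, Lts lts) {
        // 1. Find the latest valid snapshot.
        SnapshotPointer pointer = infoStore.readLatestSnapshotPointer();
        InMemoryState state = reader.readSnapshot(pointer);

        // 2. Read each journal file written after the snapshot and apply its changes in order.
        for (JournalFile file : reader.listJournalFilesAfter(pointer)) {
            for (JournalRecord record : file.records()) {
                state.apply(record);
            }
        }

        // 3. Apply the final truncation changes.
        state.applyPendingTruncations();

        // 4. Reconcile the length of the last chunk of each system segment against LTS.
        state.reconcileLastChunkLengths(lts);

        // 5. Validate invariants and persist a fresh snapshot.
        state.validate();
        infoStore.writeSnapshot(state.toSnapshot());
        return state;
    }
}

// Minimal placeholder types so the sketch is self-contained.
interface SnapshotInfoStore { SnapshotPointer readLatestSnapshotPointer(); void writeSnapshot(Object snapshot); }
interface JournalReader { InMemoryState readSnapshot(SnapshotPointer p); List<JournalFile> listJournalFilesAfter(SnapshotPointer p); }
interface JournalFile { List<JournalRecord> records(); }
interface JournalRecord {}
interface Lts {}
interface SnapshotPointer {}
interface InMemoryState {
    void apply(JournalRecord r);
    void applyPendingTruncations();
    void reconcileLastChunkLengths(Lts lts);
    void validate();
    Object toSnapshot();
}
```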
Addition of chunk
The following info is included in the journal record:
- Name of the segment.
- Name of the new chunk added.
- Offset at which the chunk is added.
- Name of the previous last chunk.
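An illustrative shape for this record (field names are assumptions, not the actual serialized format):

```java
// Illustrative shape of a chunk-addition journal record; field names are assumptions.
record ChunkAddedRecord(
        String segmentName,  // name of the segment
        String newChunkName, // name of the new chunk being added
        long offset,         // segment offset at which the chunk is added
        String oldChunkName  // name of the previous last chunk (null for the first chunk)
) {}
```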
Truncation of segment
The following info is included in the journal record:
- Name of the segment.
- Name of the new first chunk.
- Offset inside the chunk at which valid data starts.
- New start offset for the segment.
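An illustrative shape for this record (field names are assumptions, not the actual serialized format):

```java
// Illustrative shape of a segment-truncation journal record; field names are assumptions.
record TruncationRecord(
        String segmentName,         // name of the segment
        String firstChunkName,      // name of the new first chunk
        long firstChunkStartOffset, // offset inside the chunk at which valid data starts
        long startOffset            // new start offset for the segment
) {}
```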
Snapshot record
Each snapshot record is saved in a separate file/object. The following info is included in the record:
- Epoch
- The last journal file written before the snapshot is created.
- For each system segment:
    - Segment record
    - List of chunk records
Snapshot info record
The snapshot info (checkpoint) record is simply a pointer to the latest/most recent snapshot record for a given time interval. It is stored in the snapshot info store. The following info is included in the record:
- Epoch
- Id of the checkpoint record.
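Illustrative shapes for the snapshot record and the snapshot info (checkpoint) record; all names are assumptions used only to show which fields are captured.

```java
import java.util.List;

// Illustrative shapes only; not the actual SLTS serialization classes.
record SystemSnapshotRecord(
        long epoch,                    // epoch of the instance creating the snapshot
        String lastJournalFile,        // last journal file written before the snapshot is created
        List<SegmentSnapshot> segments // one entry per system segment
) {}

record SegmentSnapshot(
        SegmentRecord segment,         // segment record
        List<ChunkRecord> chunks       // list of chunk records
) {}

record SnapshotInfo(
        long epoch,                    // epoch of the instance writing the checkpoint
        long snapshotId                // id of the checkpoint record being pointed to
) {}

record SegmentRecord(String name, long length) {}
record ChunkRecord(String name, long length) {}
```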
- Each instance writes only to its own journal files.
- The epoch is a monotonically increasing number. (Newer instances have a higher epoch.)
- The time interval between two snapshots is several orders of magnitude larger than the possible clock skew. (See https://www.eecis.udel.edu/~mills/ntp/html/discipline.html and the note below.)
- The time interval between two snapshots is orders of magnitude larger than the time taken by a container to boot.
- The chunk naming scheme is such that (a sketch follows this list):
    - no two journal chunks can have the same name,
    - names include the container id,
    - names include the epoch of the instance writing the journal,
    - names include a monotonically increasing sequence number.
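A sketch of such a naming scheme; the exact prefix and format string are assumptions, but they show how the container id, the epoch, and a monotonically increasing sequence number together make every name unique.

```java
// Illustrative naming scheme; the concrete format is an assumption.
class JournalNaming {
    static String journalChunkName(int containerId, long epoch, long sequenceNumber) {
        return String.format("_system/containers/%d/journal.%d.%d", containerId, epoch, sequenceNumber);
    }

    static String snapshotChunkName(int containerId, long epoch, long snapshotId) {
        return String.format("_system/containers/%d/snapshot.%d.%d", containerId, epoch, snapshotId);
    }
}
```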
According to https://www.eecis.udel.edu/~mills/ntp/html/discipline.html:
"If left running continuously, an NTP client on a fast LAN in a home or office environment can maintain synchronization nominally within one millisecond."
To minimize clock skew and keep the K8s cluster functioning smoothly, most K8s systems have NTP installed on the host node. (verify)
- The read operation is retried a fixed number of times (with a fixed interval between the attempts).
- It is possible for a zombie instance to continue to write to old journal files.
- The algorithm is able to detect such zombie records by detecting that the valid owner has a larger epoch number and an alternative history of changes.
- In such cases the history is reset to delete the zombie changes, and only the valid changes are applied (see the sketch after this list).
- Any partially written records are ignored.
- After each write failure a new journal file is started.
- Duplicate records can be created during a retry, when the original request succeeds but the writer encounters a timeout and retries.
- Upon encountering a previously applied record again, such duplicate records are ignored.
- The checkpoint is attempted a fixed number of times. Each attempt gets a unique, predictable name.
- A snapshot is not considered successful if the snapshot file cannot be read back and validated.
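A much simplified sketch of the replay-side checks; the real behavior resets the already applied history when a zombie write is detected, whereas this sketch merely filters records, and all names are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: skip zombie records (stale epoch appearing after a newer owner's
// records) and duplicate records (already applied once) during journal replay.
class ReplayFilter {
    private long highestEpochSeen = Long.MIN_VALUE;
    private final Set<String> appliedRecordIds = new HashSet<>();

    // Returns true if the record should be applied, false if it must be skipped.
    boolean accept(long recordEpoch, String recordId) {
        if (recordEpoch < highestEpochSeen) {
            // Written by a zombie instance with a stale epoch: its alternative history is discarded.
            return false;
        }
        highestEpochSeen = recordEpoch;
        // A record that was already applied (e.g. re-written after a timed-out but
        // actually successful append) is ignored.
        return appliedRecordIds.add(recordId);
    }
}
```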