- Overview
- Architecture Components
- Checkpoint Lifecycle
- Barrier Mechanism
- State Snapshotting
- Recovery Process
- Configuration
- Performance Tuning
- Troubleshooting
Apache Flink's checkpointing mechanism provides fault tolerance by creating consistent snapshots of the application state at regular intervals. This enables the system to recover from failures while maintaining exactly-once processing guarantees.
- Checkpoint: An automatic snapshot of application state taken by Flink for fault tolerance
- Savepoint: A manual snapshot triggered by users for operational purposes
- State Backend: The storage mechanism for operator state during execution
- Checkpoint Storage: The persistent storage location for checkpoint data
- Barrier: A special marker that flows through the data stream to coordinate checkpoints
```mermaid
graph TB
subgraph "JobManager"
CC["Checkpoint Coordinator"]
CS["Checkpoint Storage"]
end
subgraph "TaskManager 1"
T1["Task 1"]
SB1["State Backend"]
SC1["Subtask Checkpoint Coordinator"]
end
subgraph "TaskManager 2"
T2["Task 2"]
SB2["State Backend"]
SC2["Subtask Checkpoint Coordinator"]
end
subgraph "External Storage"
FS["File System / S3 / HDFS"]
end
CC --> SC1
CC --> SC2
SC1 --> SB1
SC2 --> SB2
CS --> FS
style CC fill:#e1f5fe
style CS fill:#e8f5e8
style SB1 fill:#fff3e0
style SB2 fill:#fff3e0
```
- Checkpoint Coordinator (JobManager):
  - Triggers checkpoints at configured intervals
  - Manages the checkpoint lifecycle
  - Coordinates barrier injection and acknowledgments
  - Handles checkpoint completion and cleanup
- Subtask Checkpoint Coordinator (TaskManager):
  - Manages local checkpoint execution
  - Coordinates state snapshotting
  - Handles barrier alignment and processing
- State Backend:
  - Stores operator state during execution
  - Provides snapshotting capabilities
  - Manages state serialization/deserialization
- Checkpoint Storage (see the code sketch after this list):
  - Persists checkpoint metadata and data
  - Provides recovery capabilities
  - Manages checkpoint retention policies
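As a rough illustration of how the user-facing components map onto application code, the sketch below selects a state backend and a checkpoint storage location for a single job, while the Checkpoint Coordinator itself runs inside the JobManager and needs nothing beyond enabling checkpointing. This is a minimal sketch; package names for the RocksDB backend vary across Flink versions, the bucket path is a placeholder, and the full set of options appears in the Configuration section.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointComponentsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // State Backend: holds operator state locally while the job runs.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Checkpoint Storage: where completed snapshots are persisted
        // (the S3 bucket path is a placeholder).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // The Checkpoint Coordinator in the JobManager triggers checkpoints at
        // this interval; no further user code is required to participate.
        env.enableCheckpointing(60_000);
    }
}
```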
```mermaid
sequenceDiagram
participant CC as Checkpoint Coordinator
participant S as Source
participant O as Operator
participant Sink as Sink
participant CS as Checkpoint Storage
Note over CC: 1. Trigger Checkpoint
CC->>S: Trigger Checkpoint (ID: n)
Note over S: 2. Inject Barriers
S->>S: Record offset position
S->>O: Checkpoint Barrier (n)
Note over O: 3. Barrier Processing
O->>O: Wait for barriers from all inputs
O->>O: Snapshot state
O->>Sink: Forward Barrier (n)
Note over Sink: 4. Acknowledgment
Sink->>Sink: Receive all barriers
Sink->>CC: Acknowledge Checkpoint (n)
Note over CC: 5. Completion
CC->>CS: Persist checkpoint metadata
CC->>CC: Mark checkpoint complete
```
1. Trigger Phase:
   - Checkpoint Coordinator initiates the checkpoint with a unique ID
   - Triggers sent to all source operators
   - Checkpoint metadata created
2. Barrier Injection:
   - Sources record their current position/offset
   - Checkpoint barriers injected into the data streams
   - Barriers carry the checkpoint ID and a timestamp
3. Barrier Propagation:
   - Barriers flow downstream with the data records
   - Barriers never overtake records, so ordering is maintained
   - Barriers from multiple checkpoints can coexist in the stream
4. State Snapshotting:
   - Operators snapshot state once barriers have arrived on all inputs
   - State is serialized asynchronously to storage
   - Barriers forwarded to downstream operators
5. Acknowledgment:
   - Sink operators acknowledge checkpoint completion
   - Checkpoint Coordinator collects all acknowledgments
   - Checkpoint marked as complete (see the listener sketch after this list)
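User functions can observe the completion step of this lifecycle. The sketch below is a minimal, hypothetical operator that implements Flink's `CheckpointListener` interface so it is told when the Checkpoint Coordinator has finalized (or aborted) a checkpoint, a common place to commit external side effects. The class and method bodies are illustrative only, and the interface's package location differs slightly between Flink versions.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.CheckpointListener;

// Hypothetical operator that reacts to checkpoint completion notifications.
public class CommitOnCheckpointFunction extends RichMapFunction<String, String>
        implements CheckpointListener {

    @Override
    public String map(String value) {
        return value.toUpperCase();
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Invoked after the Checkpoint Coordinator has marked checkpoint
        // `checkpointId` complete, e.g. commit a pending transaction here.
    }

    @Override
    public void notifyCheckpointAborted(long checkpointId) {
        // Invoked if the checkpoint was aborted; discard pending work.
    }
}
```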
```mermaid
graph LR
subgraph "Input Streams"
I1["Input 1: ...R1[B]R2..."]
I2["Input 2: ...R3[B]R4..."]
end
subgraph "Operator"
O["Operator<br/>Wait for barriers<br/>from all inputs"]
Buffer["Input Buffer"]
end
subgraph "Output"
Out["Output: ...R1,R3[B]R2,R4..."]
end
I1 --> Buffer
I2 --> Buffer
Buffer --> O
O --> Out
style O fill:#ffecb3
style Buffer fill:#e3f2fd
```
Alignment Process:
- Operator receives barrier from first input stream
- Blocks that input channel until barriers arrive from all inputs
- Buffers incoming records from blocked channels
- Once all barriers received, processes buffered records
- Takes state snapshot and forwards barriers
```mermaid
graph LR
subgraph "Input Streams"
I1["Input 1: ...R1[B]R2..."]
I2["Input 2: ...R3,R4..."]
end
subgraph "Operator"
O["Operator<br/>Process first barrier<br/>immediately"]
InFlight["In-flight Data<br/>becomes part of state"]
end
subgraph "Output"
Out["Output: ...R1[B]R3,R2..."]
end
I1 --> O
I2 --> O
O --> InFlight
O --> Out
style O fill:#c8e6c9
style InFlight fill:#fff3e0
```
Unaligned Process:
- Operator reacts to first barrier immediately
- Forwards barrier downstream without waiting
- In-flight data becomes part of operator state
- Reduces checkpoint alignment time
- Suitable for high-throughput scenarios (a configuration sketch follows this list)
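A hedged sketch of how an application might opt into this behavior: recent Flink versions let you enable unaligned checkpoints and, optionally, an aligned-checkpoint timeout so that barriers are aligned normally and the job only switches to the unaligned mode once alignment takes longer than the timeout. The method names below follow the newer `CheckpointConfig` API and may differ in older releases.

```java
import java.time.Duration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Let barriers overtake buffered records instead of waiting for alignment;
        // the skipped-over in-flight data is stored as part of the checkpoint.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // Optional: start aligned and fall back to unaligned only if alignment
        // takes longer than 30 seconds (useful when backpressure is intermittent).
        env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30));
    }
}
```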
```mermaid
graph TB
subgraph "HashMapStateBackend"
HMS["Heap Memory Storage"]
HMSCP["Checkpoint to<br/>External Storage"]
end
subgraph "EmbeddedRocksDBStateBackend"
RDB["RocksDB<br/>Local Disk"]
RDBCP["Incremental<br/>Checkpoints"]
end
subgraph "External Storage"
FS["FileSystem<br/>S3/HDFS/GCS"]
Meta["Checkpoint<br/>Metadata"]
end
HMS --> HMSCP
RDB --> RDBCP
HMSCP --> FS
RDBCP --> FS
Meta --> FS
style HMS fill:#e1f5fe
style RDB fill:#fff3e0
style FS fill:#e8f5e8
```
1. Synchronous Phase (see the sketch after this list):
   - Operator briefly stops processing new records
   - Creates a consistent state snapshot
   - Blocking time is kept minimal
2. Asynchronous Phase:
   - State serialization to external storage
   - Operator continues processing
   - I/O runs as a background operation
3. Completion:
   - Snapshot persisted successfully
   - Acknowledgment sent to the Checkpoint Coordinator
   - Old state versions become eligible for cleanup
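To make the two phases concrete, here is a minimal, hypothetical buffering sink built on Flink's `CheckpointedFunction` interface: `snapshotState` captures a consistent view of the in-memory buffer during the synchronous phase (the state backend may then write it out asynchronously), and `initializeState` registers the state and repopulates the buffer after a recovery. Names and the flushing logic are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Hypothetical sink whose in-memory buffer survives failures via operator state.
public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

    private transient ListState<String> checkpointedBuffer;
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value);
        // ... flush the buffer to the external system once it is large enough ...
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Synchronous phase: capture a consistent copy of the current buffer.
        checkpointedBuffer.update(new ArrayList<>(buffer));
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<String> descriptor =
                new ListStateDescriptor<>("buffered-elements", String.class);
        checkpointedBuffer = context.getOperatorStateStore().getListState(descriptor);

        // On recovery, repopulate the in-memory buffer from the restored snapshot.
        if (context.isRestored()) {
            for (String element : checkpointedBuffer.get()) {
                buffer.add(element);
            }
        }
    }
}
```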
```mermaid
graph TB
subgraph "Failure Detection"
FD["Task Failure<br/>Detected"]
RS["Restart Strategy<br/>Evaluation"]
end
subgraph "Recovery Planning"
LC["Latest Checkpoint<br/>Selection"]
RD["Recovery Decision<br/>Made"]
end
subgraph "State Restoration"
SR["State Restoration<br/>from Checkpoint"]
SO["Source Offset<br/>Reset"]
end
subgraph "Resumption"
RP["Resume Processing<br/>from Checkpoint"]
EO["Exactly-Once<br/>Guarantees"]
end
FD --> RS
RS --> LC
LC --> RD
RD --> SR
SR --> SO
SO --> RP
RP --> EO
style FD fill:#ffcdd2
style LC fill:#e8f5e8
style SR fill:#fff3e0
style RP fill:#c8e6c9
```
1. Failure Detection:
   - TaskManager failure or task exception
   - Restart strategy evaluation (see the sketch after this list)
   - Recovery decision made
2. Checkpoint Selection:
   - Latest completed checkpoint identified
   - Checkpoint metadata retrieved
   - Recovery plan created
3. State Restoration:
   - Operators restored with checkpoint state
   - Source positions reset to checkpoint offsets
   - Topology redeployed
4. Processing Resumption:
   - Data processing resumes from the checkpoint
   - Exactly-once guarantees maintained
   - Downstream systems remain consistent
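Which restart strategy is evaluated in step 1 is part of the job configuration. The sketch below uses the classic `RestartStrategies` API (newer Flink versions also expose the same settings as configuration options) to retry the job a few times before giving up; the attempt count and delay are illustrative values only.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing must be enabled, otherwise there is no state to restore from.
        env.enableCheckpointing(60_000);

        // On failure: retry at most 3 times, waiting 10 seconds between attempts.
        // Each attempt restores operator state and source offsets from the
        // latest completed checkpoint before processing resumes.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
    }
}
```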
```yaml
# Enable checkpointing
execution.checkpointing.interval: 60s
# Checkpoint mode
execution.checkpointing.mode: EXACTLY_ONCE
# Checkpoint timeout
execution.checkpointing.timeout: 10min
# Minimum pause between checkpoints
execution.checkpointing.min-pause: 5s
# Maximum concurrent checkpoints
execution.checkpointing.max-concurrent-checkpoints: 1
# Checkpoint cleanup
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
# Unaligned checkpoints
execution.checkpointing.unaligned: false
```

```yaml
# State backend type
state.backend.type: rocksdb
# Checkpoint storage directory
state.checkpoints.dir: s3://my-bucket/checkpoints
# Savepoint directory
state.savepoints.dir: s3://my-bucket/savepoints
# Incremental checkpoints
state.backend.incremental: true
```

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Enable checkpointing every 60 seconds
env.enableCheckpointing(60000);
// Configure checkpoint mode
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Set minimum pause between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
// Set checkpoint timeout
env.getCheckpointConfig().setCheckpointTimeout(600000);
// Allow only one checkpoint at a time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// Retain checkpoints on cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// Enable unaligned checkpoints
env.getCheckpointConfig().enableUnalignedCheckpoints();
// Set state backend
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
// Set checkpoint storage
env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
```

```mermaid
graph LR
subgraph "Factors"
RT["Recovery Time"]
OH["Overhead"]
SS["State Size"]
end
subgraph "Trade-offs"
FI["Frequent Intervals<br/>• Lower recovery time<br/>• Higher overhead"]
II["Infrequent Intervals<br/>• Higher recovery time<br/>• Lower overhead"]
end
subgraph "Optimization"
OI["Optimal Interval<br/>Balance based on<br/>requirements"]
end
RT --> FI
OH --> II
SS --> OI
FI --> OI
II --> OI
style OI fill:#c8e6c9
```
- Checkpoint Interval:
  - Balance recovery time vs. overhead
  - Consider state size and I/O capacity
  - Monitor checkpoint duration
- State Backend Selection:
  - HashMapStateBackend for small state
  - EmbeddedRocksDBStateBackend for large state
  - Enable incremental checkpoints for RocksDB (sketched below)
- Unaligned Checkpoints:
  - Use for high-throughput scenarios
  - Avoid when I/O is the bottleneck
  - Monitor checkpoint sizes
- Storage Optimization:
  - Use high-performance storage (SSD)
  - Ensure sufficient I/O bandwidth
  - Configure appropriate replication
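As one way of applying these recommendations in code, the sketch below enables incremental RocksDB checkpoints and keyed-state snapshot compression. Assumptions: the RocksDB state backend dependency is on the classpath, and note that snapshot compression applies to full snapshots of keyed state, not to RocksDB's incremental files, which use RocksDB's own compression.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Incremental checkpoints upload only the RocksDB files that changed
        // since the previous checkpoint, shrinking checkpoints for large state.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Compress full keyed-state snapshots, trading CPU for checkpoint size.
        env.getConfig().setUseSnapshotCompression(true);
    }
}
```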
```yaml
# Key metrics to monitor
checkpoint.duration: "Time to complete checkpoint"
checkpoint.size: "Size of checkpoint data"
checkpoint.alignment_time: "Time waiting for barriers"
checkpoint.count: "Number of completed checkpoints"
checkpoint.failed_count: "Number of failed checkpoints"
```
- Checkpoint Timeouts (see the sketch after this list):
  - Increase the checkpoint timeout
  - Reduce the checkpoint interval
  - Optimize state backend performance
- High Checkpoint Duration:
  - Enable incremental checkpoints
  - Optimize serialization
  - Increase parallelism
- Barrier Alignment Issues:
  - Enable unaligned checkpoints
  - Address backpressure
  - Optimize network configuration
- Storage Issues:
  - Ensure sufficient storage capacity
  - Verify storage accessibility
  - Monitor I/O performance
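When timeouts or slow checkpoints are only occasional, it can help to give individual checkpoints more room rather than failing the whole job. A minimal sketch, assuming the other defaults are acceptable, that raises the timeout and tolerates a few consecutive checkpoint failures before the job is restarted; the numbers are illustrative.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFailureHandlingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Allow a slow checkpoint up to 20 minutes before it is declared failed.
        env.getCheckpointConfig().setCheckpointTimeout(20 * 60 * 1000);

        // Tolerate up to 3 consecutive checkpoint failures before failing the job.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
```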
- Check Checkpoint Metrics:

  ```bash
  # Via REST API
  curl http://jobmanager:8081/jobs/<job-id>/checkpoints

  # Via Web UI
  # http://jobmanager:8081/#/job/<job-id>/checkpoints
  ```

- Analyze Logs:

  ```bash
  # TaskManager logs
  grep -i checkpoint taskmanager.log

  # JobManager logs
  grep -i checkpoint jobmanager.log
  ```

- Monitor Resource Usage:
  - CPU utilization during checkpoints
  - Memory usage patterns
  - Disk I/O during snapshots
  - Network bandwidth utilization
- Checkpoint Corruption:
  - Verify storage integrity
  - Check for concurrent modifications
  - Restore from an earlier checkpoint
- Incompatible State:
  - Review state schema changes
  - Use state migration strategies
  - Consider savepoint compatibility (see the sketch after this list)
- Recovery Failures:
  - Check resource availability
  - Verify checkpoint accessibility
  - Review error logs for the root cause
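Incompatible-state problems are easier to avoid than to repair: when operators carry explicit, stable UIDs, Flink can map the state in a checkpoint or savepoint back to the right operator even after the job graph changes. A minimal sketch below; the source, operator logic, and uid strings are placeholders.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StableUidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        DataStream<String> events = env.socketTextStream("localhost", 9999);

        events
            // A stable uid lets restored state find this operator even if the
            // surrounding topology is modified between deployments.
            .map(value -> value.toUpperCase()).uid("normalize-events")
            .filter(value -> !value.isEmpty()).uid("drop-empty")
            .print();

        env.execute("stable-uid-example");
    }
}
```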
- Configuration:
  - Set appropriate checkpoint intervals
  - Choose a suitable state backend
  - Configure external checkpoint storage
- Monitoring:
  - Track checkpoint metrics
  - Monitor storage performance
  - Set up alerting for failures
- Testing:
  - Test recovery scenarios
  - Validate checkpoint compatibility
  - Perform chaos engineering
- Operations:
  - Implement backup strategies
  - Document recovery procedures
  - Clean up old checkpoints regularly
This comprehensive guide covers the end-to-end checkpointing mechanism in Apache Flink, from basic concepts to advanced troubleshooting. The diagrams and examples provide practical insights for implementing and maintaining fault-tolerant Flink applications.