Serve historical video from video-service to browsers via WebRTC. Pure Go implementation using pion/webrtc. No CGO, no external processes.
Date: 2026-03-09
Status: Draft — reviewed by Gemini, updated with findings
The video-service stores historical H.264 video as SVS1 segments in S3. Today, retrieving video requires either:
- gRPC `GetSegmentFrames` + client-side reconstruction (no browser support)
- gRPC `GetImage` for single JPEG frames (spawns FFmpeg per request, doesn't scale)
- A download CLI that pipes to FFmpeg to produce MP4 files (offline only)
Browsers cannot consume any of these. The UI expects an HTTP video endpoint that doesn't exist. There is no low-latency seeking, no live-to-historical unification path, and every frame extraction spawns an FFmpeg process.
Goal: Stream stored H.264 NALUs directly to browsers via WebRTC — zero transcoding, sub-second seek, hardware-accelerated decoding on the client.
The ingestion pipeline also supports H.265 (HEVC), but most browsers do not
support HEVC over WebRTC. This design targets H.264 only. On session
creation, the server inspects the first segment's initialization data. If the
stream contains HEVC NALUs (NAL type >= 32), the server returns 422 Unprocessable Entity with {"error": "HEVC streams not supported over WebRTC"}.
HEVC transcoding is a future extension, not part of this design.
```mermaid
sequenceDiagram
    participant B as Browser
    participant S as video-service
    participant S3 as S3 / GCS
    B->>S: POST /webrtc/session {stream_id, start_ts, mode, sdp_offer}
    S->>S: Create PeerConnection + H.264 track + data channel
    S-->>B: 201 {session_id, sdp_answer}
    B->>S: PATCH /webrtc/session/{id}/ice {candidates}
    S-->>B: 204
    Note over B,S: ICE connectivity established
    S->>S3: Fetch segment containing start_ts
    S3-->>S: SVS1 blob
    S->>S: Parse index, extract NALUs from keyframe <= start_ts
    S->>B: RTP packets (H.264 NALUs via WebRTC track)
    S->>B: Data channel: {"evt":"playing","timestamp_ms":...}
    B->>S: Data channel: {"cmd":"seek","timestamp_ms":...}
    S->>S3: Fetch segment for new timestamp
    S->>B: RTP packets from new position
    S->>B: Data channel: {"evt":"playing","timestamp_ms":...}
    B->>S: Data channel: {"cmd":"pause"}
    S->>B: Data channel: {"evt":"paused"}
    Note over B,S: On disconnect or DELETE
    B->>S: DELETE /webrtc/session/{id}
    S->>S: Cleanup PeerConnection, goroutines
```
Two modes govern how segments are played and how time gaps are handled.
Play a bounded time window. The server streams frames from start_ts_ms to
end_ts_ms (or end of the containing segment if end_ts_ms is omitted).
Behavior:
- Playback begins at the keyframe <= `start_ts_ms` within the first matching segment.
- Frames are sent in order until `end_ts_ms` is reached or the segment ends.
- If multiple segments fall within the time range, they are played sequentially.
- Gaps between segments are skipped silently — the next segment starts immediately after the previous one ends. A `gap` event is emitted so the browser can show elapsed real time, but playback does not pause.
- When the last frame within the range is sent, the server emits `ended`.
Use case: Reviewing a specific motion event or time window.
Play forward from start_ts_ms through all available segments indefinitely
(or until end_ts_ms if provided).
Behavior:
- Playback begins at the keyframe <= `start_ts_ms`.
- When a segment ends, the server queries the DB for the next segment (`ts_start_ms > current_segment.ts_end_ms`, ordered ascending, limit 1).
- If there is a gap: The server emits a `gap` event with the gap duration and the next segment's start time, then pauses the RTP clock for a configurable hold period (default: 0 — no hold, jump immediately). After the hold, playback resumes from the next segment.
- If there is no next segment: The server emits `waiting` and polls for new segments every 2 seconds. When one appears, playback resumes.
- Near-live latency caveat: The ingestion pipeline flushes chunks to S3 every ~60 seconds (on keyframe boundaries). This means continuous-mode tailing has an inherent ~60-second delay behind the live camera feed. True low-latency live streaming requires a separate `mode: "live"` path that subscribes to NATS directly (see Future Extensions).
- When `end_ts_ms` is reached, the server emits `ended`.
- If no `end_ts_ms` is set, playback continues until the browser disconnects or sends `stop`.
Use case: Watching a full recording session, or tailing a live recording.
| Parameter | Required | Default | Description |
|---|---|---|---|
| `stream_id` | Yes | — | UUID of the video stream |
| `start_ts_ms` | Yes | — | Playback start (epoch ms) |
| `end_ts_ms` | No | None | Playback end (epoch ms). Omit for open-ended. |
| `mode` | No | `continuous` | `"clip"` or `"continuous"` |
Added to the video-service's HTTP listener alongside any future REST endpoints.
POST /webrtc/session
Content-Type: application/json
```json
{
  "stream_id": "uuid-string",
  "start_ts_ms": 1709942400000,
  "end_ts_ms": 1709942460000,   // optional
  "mode": "clip",               // "clip" | "continuous", default "continuous"
  "sdp_offer": "v=0\r\n..."
}
```
Response (201 Created):

```json
{
  "session_id": "uuid-string",
  "sdp_answer": "v=0\r\n..."
}
```

Errors:
- `400` — Invalid SDP, missing `stream_id`, or `start_ts` after `end_ts`
- `404` — `stream_id` not found or no segments in range
- `422` — Stream contains HEVC (unsupported codec for WebRTC)
- `429` — Max concurrent sessions reached
- `403` — RLS: principal lacks access to this stream
Authentication: Same x-sky-principal header used by existing gRPC endpoints.
The signaling handler extracts the principal and enforces RLS on segment queries.
PATCH /webrtc/session/{session_id}/ice
Content-Type: application/json
```json
{
  "candidate": "candidate:842163049 1 udp ...",
  "sdp_mid": "0",
  "sdp_mline_index": 0
}
```
Response: 204 No Content
DELETE /webrtc/session/{session_id}
Response: 204 No Content
Idempotent. Also triggered automatically by ICE disconnection or idle timeout.
The stored H.264 NALUs are sent directly to the browser's hardware decoder. No FFmpeg, no re-encoding, no CGO.
```text
S3 blob (SVS1)
  → SegmentReader: parse index, read [uint32 len][VideoFrameSpec protobuf] entries
  → FrameExtractor: unmarshal pb.VideoFrameSpec → raw H.264 frame bytes
  → NALUExtractor: strip Annex-B start codes → bare NALUs
  → pion H264Packetizer: NALUs → RTP packets
  → pion TrackLocalStaticRTP → SRTP → DTLS → ICE → Browser
  → Browser RTCPeerConnection → MediaStream → <video> element
  → Browser hardware H.264 decoder
```
SVS1 frame format note: SVS1 frames are NOT raw H.264 bytes. Each frame
is a length-prefixed marshaled pb.VideoFrameSpec protobuf message. The
SegmentReader must read the 4-byte uint32 length prefix, read that many
bytes, unmarshal the VideoFrameSpec, then extract the Data field which
contains the actual H.264 Annex-B frame bytes.
Wraps existing S3 fetch + SVS1 parsing logic. Yields frames as an iterator:
```go
type FrameData struct {
	TimestampMs uint64
	NALUs       [][]byte // Bare NALUs (no start codes)
	IsKeyFrame  bool
}

type SegmentReader interface {
	// ReadFrom returns frames from the keyframe <= startTs to endTs.
	// If endTs is 0, reads to end of segment.
	ReadFrom(ctx context.Context, segmentPath string, startTs, endTs uint64) iter.Seq2[FrameData, error]
}
```

Annex-B stripping: Use `h264.AnnexB.Unmarshal()` from `github.com/bluenviron/mediacommon/v2/pkg/codecs/h264` (already a transitive dependency via gortsplib). No hand-rolled byte parsing. The same library provides `IsRandomAccess()` for keyframe detection, `SPS.Unmarshal()` for profile extraction, and NAL type constants. HEVC detection uses the companion `h265` package.
Controls the rate at which frames are written to the WebRTC track.
```go
type Pacer struct {
	track    *webrtc.TrackLocalStaticRTP
	speed    float64 // 1.0 = realtime, 2.0 = 2x, etc.
	rtpTS    uint32  // Current RTP timestamp (90kHz clock)
	ticker   *time.Timer
	pauseCh  chan struct{}
	resumeCh chan struct{}
}
```

Timing logic:
- Compute inter-frame delay: `frame[n+1].TimestampMs - frame[n].TimestampMs`
- Adjust for speed: `delay = delay / speed`
- Sleep for the adjusted delay between frames
- RTP timestamp increment: `delta_ms * 90` (90kHz clock)
SPS/PPS injection:
- Sent before the first frame of every segment.
- Re-sent on every seek (decoder needs re-initialization).
- Read from the `initialization_segment` field of the `pb.SegmentIndex` stored inside the S3 blob header (NOT from an Ent schema field — `VideoSegment` does not have this field). The server must fetch the blob header (first ~8KB) from S3 to parse the `SegmentIndex` protobuf and extract `initialization_segment`.
- Missing SPS/PPS fallback: If the segment's `SegmentIndex` has no `initialization_segment`, the server scans the segment's first keyframe for NAL types 7 (SPS) and 8 (PPS). If still not found, it walks backwards through previous segments in the same stream (max 10 segments, to cap S3 calls) until SPS/PPS are located. If none exist anywhere, the session emits `error` and tears down.
SDP codec negotiation:
- The server sets `profile-level-id` in the H.264 RTP codec parameters to match the stored stream's actual profile, extracted from bytes 1-3 of the SPS NAL. If unknown, it defaults to Constrained Baseline (`42e01f`), which all browsers support.
RTP timestamp overflow:
- The 90kHz RTP clock wraps `uint32` every ~13.25 hours. The pacer tracks cumulative playback time independently and lets the `uint32` wrap naturally. pion/webrtc and browser jitter buffers handle the wrap correctly as long as consecutive timestamps are monotonically increasing modulo 2^32.
```text
[Playing segment N] ──── 80% through ────→ [Prefetch segment N+1 from S3]
                                                    │
[Segment N ends] ─────────────────────────→ [Start from prefetched buffer]
```
A background goroutine fetches the next segment when playback reaches 80% of the current segment's duration. The fetched blob is held in memory (segments are ~60s of H.264, typically 2-10 MB). If the prefetch fails, it retries. If the next segment doesn't exist yet (continuous mode), it polls.
A data channel named "control" carries JSON messages for playback control.
```json
{"cmd": "seek", "timestamp_ms": 1709942430000}
{"cmd": "pause"}
{"cmd": "resume"}
{"cmd": "speed", "rate": 2.0}
{"cmd": "stop"}
```

| Command | Behavior |
|---|---|
| `seek` | Pacer stops. Any in-flight S3 fetch is cancelled via context. Server finds segment + keyframe for timestamp. SPS/PPS re-sent. Pacer resumes from new position. Rapid seeks are coalesced — only the last seek within a 100ms window is executed. |
| `pause` | Pacer suspends. Track stays open. No RTP sent. |
| `resume` | Pacer resumes from where it paused. |
| `speed` | Pacer adjusts inter-frame delay. Accepted range: 0.25 to 16.0. |
| `stop` | Server emits `ended`, tears down session. |
```json
{"evt": "playing", "timestamp_ms": 1709942400000}
{"evt": "paused", "timestamp_ms": 1709942415000}
{"evt": "buffering"}
{"evt": "seeking", "timestamp_ms": 1709942430000}
{"evt": "seeked", "timestamp_ms": 1709942428000}
{"evt": "gap", "from_ms": 1709942460000, "to_ms": 1709942520000, "duration_ms": 60000}
{"evt": "waiting"}
{"evt": "ended"}
{"evt": "error", "message": "segment not found"}
{"evt": "position", "timestamp_ms": 1709942430000}
```

| Event | When |
|---|---|
| `playing` | Playback started or resumed |
| `paused` | Pacer suspended |
| `buffering` | Fetching segment from S3, track temporarily idle |
| `seeking` | Seek command received, looking up segment |
| `seeked` | Seek complete, includes actual position (keyframe-aligned) |
| `gap` | Time gap detected between segments. In clip mode: skipped. In continuous mode: skipped after hold period. |
| `waiting` | Continuous mode: no next segment yet, polling. |
| `ended` | Reached `end_ts`, end of last segment (clip mode), or `stop` command. |
| `error` | Unrecoverable error, session will be torn down. |
| `position` | Periodic timestamp update. Sent every 1 second of playback. |
```go
type Session struct {
	ID            string
	StreamID      uuid.UUID
	StartTsMs     uint64
	EndTsMs       uint64       // 0 = open-ended
	Mode          PlaybackMode // Clip | Continuous
	PC            *webrtc.PeerConnection
	VideoTrack    *webrtc.TrackLocalStaticRTP
	DataChannel   *webrtc.DataChannel
	Pacer         *Pacer
	SegmentReader SegmentReader

	// Lifecycle
	ctx          context.Context
	cancel       context.CancelFunc
	createdAt    time.Time
	lastActivity time.Time
}

type SessionManager struct {
	sessions    sync.Map      // session_id → *Session
	maxSessions int           // default: 50
	idleTimeout time.Duration // default: 5 minutes
	iceConfig   ICEConfig
}

func (sm *SessionManager) Create(req CreateSessionRequest) (*Session, error)
func (sm *SessionManager) Get(id string) (*Session, error)
func (sm *SessionManager) AddICECandidate(id string, candidate webrtc.ICECandidateInit) error
func (sm *SessionManager) Delete(id string) error
```

Sessions are cleaned up when:
- Browser calls `DELETE /webrtc/session/{id}`
- Browser sends `stop` on the data channel
- ICE connection state goes to `disconnected` or `failed` — detected via the `PeerConnection.OnConnectionStateChange` callback for immediate teardown (don't rely solely on the reaper)
- No data channel activity for `idleTimeout` (default 5 min)
- Server shutdown (graceful drain: close all PeerConnections)
A reaper goroutine runs every 30 seconds, scanning for idle sessions.
Memory budget: With maxSessions=50 and 2 blobs per session (current +
prefetched, ~10MB each), worst case is ~1GB. The SessionManager should track
total buffered bytes and reject new sessions if a global memory limit (env:
WEBRTC_MAX_BUFFER_MB, default 512) would be exceeded.
```go
type ICEConfig struct {
	STUNServers []string     // env: WEBRTC_STUN_SERVERS (comma-separated)
	TURNServers []TURNServer // env: WEBRTC_TURN_SERVERS (JSON array)
}

type TURNServer struct {
	URL        string `json:"url"` // "turn:turn.example.com:3478"
	Username   string `json:"username"`
	Credential string `json:"credential"`
}
```

Defaults:
- STUN: `stun:stun.l.google.com:19302`
- TURN: None (optional, configure for restrictive NATs)
Future: Embed pion/turn (pure Go) for self-hosted TURN.
```mermaid
stateDiagram-v2
    [*] --> Created: POST /webrtc/session
    Created --> Connecting: ICE negotiation
    Connecting --> Buffering: ICE connected, fetching first segment
    Buffering --> Playing: Frames ready, pacer started
    Playing --> Paused: pause command
    Paused --> Playing: resume command
    Playing --> Seeking: seek command
    Seeking --> Buffering: Loading segment for new position
    Playing --> Gap: Segment ended, gap detected
    Gap --> Buffering: Next segment available
    Gap --> Waiting: No next segment (continuous mode)
    Waiting --> Buffering: New segment appeared
    Playing --> Ended: end_ts reached or last segment done
    Ended --> Seeking: seek command (restart)
    Ended --> [*]: DELETE or timeout
    Playing --> [*]: disconnect / error / stop
    Paused --> [*]: disconnect / error / stop / timeout
    Waiting --> [*]: disconnect / error / stop / timeout
```
| Scenario | Behavior |
|---|---|
| S3 fetch fails | Retry 3x with 1s backoff. If still failing, emit error event + teardown. |
| Segment has no keyframe before start_ts | Fall back to segment start. Emit seeked with actual position. |
| No segments found for stream_id + time range | Return 404 on session creation. |
| ICE fails to connect within 30s | Server tears down session. |
| Data channel message parse error | Log + ignore. Max message size: 4KB. Messages exceeding this are dropped. |
| Segment has no SPS/PPS | Scan first keyframe NALUs, then walk backwards through prior segments. If none found, emit error. |
| HEVC stream requested | Return 422 on session creation. |
| start_ts before first segment | Snap to first segment start. Emit seeked with actual position. |
| start_ts after last segment | Return 404 (clip mode) or snap to last segment and emit waiting (continuous mode). |
| Segment written during read | Safe — S3 objects are immutable once written. In-progress chunks are not yet in S3. |
| Rapid seek commands | Debounced: 100ms window, only last seek executes. In-flight S3 fetches cancelled via context. |
| Browser disappears (no ICE keepalive) | ICE state → disconnected. Cleanup after 10s grace period. |
| Max sessions reached | Return 429 on session creation. |
| Playback speed out of range | Clamp to [0.25, 16.0], emit speed event with actual rate. |
| Environment Variable | Default | Description |
|---|---|---|
| `WEBRTC_ENABLED` | `false` | Enable WebRTC endpoints |
| `WEBRTC_HTTP_PORT` | `8083` | HTTP port for signaling |
| `WEBRTC_MAX_SESSIONS` | `50` | Max concurrent sessions |
| `WEBRTC_IDLE_TIMEOUT` | `5m` | Session idle timeout |
| `WEBRTC_STUN_SERVERS` | `stun:stun.l.google.com:19302` | Comma-separated STUN URIs |
| `WEBRTC_TURN_SERVERS` | `""` | JSON array of TURN configs |
| `WEBRTC_PREFETCH_THRESHOLD` | `0.8` | Prefetch next segment at this % of segment duration |
| `WEBRTC_GAP_HOLD_MS` | `0` | Pause duration at gaps (continuous mode) |
| `WEBRTC_POLL_INTERVAL` | `2s` | Poll interval for new segments (continuous mode) |
| Package | Version | Purpose | CGO |
|---|---|---|---|
| `github.com/pion/webrtc/v4` | latest | WebRTC peer connections, tracks, data channels | No |
| `github.com/pion/rtp` | v1.8.x | RTP packet types (already in go.mod) | No |
pion/webrtc transitively pulls in pion/ice, pion/dtls, pion/srtp,
pion/interceptor, pion/sdp — all pure Go.
No new CGO dependencies. No external processes.
```text
cloud/video-service/
├── internal/
│   ├── webrtc/
│   │   ├── session.go         # Session struct, PeerConnection lifecycle
│   │   ├── session_manager.go # Create/get/delete, enforce limits, reaper
│   │   ├── signaling.go       # HTTP handlers (POST/PATCH/DELETE)
│   │   ├── pacer.go           # Frame timing, speed control, RTP writes
│   │   ├── segment_reader.go  # SVS1 → NALU iterator (wraps existing S3/parse)
│   │   ├── nalu.go            # Annex-B start code stripping
│   │   ├── data_channel.go    # Control protocol encode/decode/dispatch
│   │   ├── playback.go        # Playback loop: segment iteration, gap handling
│   │   ├── config.go          # ICE + session config from env vars
│   │   └── metrics.go         # Prometheus: active sessions, bytes sent, etc.
│   ├── handler/
│   │   └── query_handler.go   # Existing (unchanged)
│   ├── server/
│   │   └── ingestion.go       # Existing (unchanged)
│   └── ...
├── cmd/
│   └── server/
│       └── main.go            # Add HTTP listener + webrtc.RegisterHandlers()
└── ...
```
The signaling handler extracts x-sky-principal from HTTP headers (same as
gRPC metadata). On session creation:
- Parse the principal from the `x-sky-principal` header using the existing `security` package. The header may be JSON or Base64-encoded JSON — reuse the auto-detection logic from `security.UnaryServerInterceptor`.
- Attach it to the context via `security.WithPrincipal(ctx, &principal)` so that all Ent queries pass through the existing `RLSInterceptor`.
- Query `VideoSegment` with the RLS interceptor to verify the principal can access the requested `stream_id` + time range.
- If there are no accessible segments, return `403`.
- Store the principal in the session for subsequent segment fetches (seek).
All segment queries during playback (seek, prefetch, continuous-mode polling)
go through the same RLS-filtered Ent queries. The session holds a reference
to the *ent.Client from cmd/server/main.go and the existing *minio.Client
and s3Bucket for S3 access — no new connections are created.
- Live streaming: Same WebRTC track, fed by a NATS subscriber instead of S3. Session creation with `mode: "live"` subscribes to the camera's NATS topic.
- Embedded TURN: `pion/turn` server in the same binary.
- Adaptive bitrate: If multiple quality levels are stored, switch tracks based on RTCP receiver reports.
- Audio: Add a second track for audio when audio streams are stored.
- Recording bookmarks: Browser sends timestamp markers via data channel for annotation.
| Question | Decision | Rationale |
|---|---|---|
| HTTP port | Separate port (8083) | Avoids cmux complexity. gRPC stays on 8082, HTTP signaling on 8083. Both share the same Ent/S3 clients. |
| Segment cache | Yes, per-session LRU | Current segment blob is held in memory for seeks within the same segment. Evicted on segment transition. Max 2 blobs per session (current + prefetched). |
| CORS | Configurable via `WEBRTC_CORS_ORIGINS` env var | Defaults to `*` in dev, must be set explicitly in production. Applied via standard Go CORS middleware on the HTTP mux. |
| Goroutine lifecycle | Session-scoped context | All goroutines (pacer, prefetcher, data channel reader, reaper) derive from session.ctx. Calling session.cancel() tears down everything. |
| Finding | Severity | Resolution |
|---|---|---|
| HEVC streams unsupported by browsers | CRITICAL | Added codec detection + 422 rejection (Section 1) |
| Missing SPS/PPS in segments | CRITICAL | Added fallback: scan keyframes, then walk prior segments (Section 5.3) |
| ~60s latency for near-live tailing | IMPORTANT | Documented as inherent limitation, deferred to mode: "live" (Section 3.2) |
| RTP timestamp overflow at 13h | IMPORTANT | Addressed: uint32 wraps naturally, pion handles it (Section 5.3) |
| SDP profile-level-id mismatch | IMPORTANT | Added: extract from SPS, default to Constrained Baseline (Section 5.3) |
| Rapid seek interleaving | IMPORTANT | Added: 100ms debounce + context cancellation (Section 6.1, 10) |
| RLS context propagation | IMPORTANT | Explicit security.WithContext call documented (Section 14) |
| S3 client reuse | MINOR | Explicit: reuse existing minio.Client + s3Bucket (Section 14) |
| Data channel max message size | MINOR | Added: 4KB limit (Section 10) |
| Finding | Severity | Resolution |
|---|---|---|
| SVS1 frames are Protobuf-wrapped, not raw H.264 | CRITICAL | Updated Section 5.1 + 5.2: SegmentReader must unmarshal pb.VideoFrameSpec |
| `initialization_segment` not an Ent field | CRITICAL | Updated Section 5.3: read from `pb.SegmentIndex` in S3 blob header |
| SPS/PPS fallback can trigger many S3 calls | IMPORTANT | Added max 10 segment walk-back cap (Section 5.3) |
| Memory pressure at 50 concurrent sessions | IMPORTANT | Added WEBRTC_MAX_BUFFER_MB global memory limit (Section 7.3) |
| ICE disconnect needs immediate cleanup | IMPORTANT | Added OnConnectionStateChange callback (Section 7.3) |
| `security.WithContext` should be `WithPrincipal` | MINOR | Fixed in Section 14 |
| x-sky-principal may be Base64-encoded | MINOR | Noted in Section 14 |