Cross-Chain Gateway — Phase 1 Implementation Plan
★ Insight ─────────────────────────────────────
Two operator-gateway communication patterns exist:
1. Sync path (newt_createTask): Gateway calls operator RPC (newt_evaluateAndSign), collects SignedTaskResponse as inline response.
Gateway drives the flow.
2. Async path (newt_sendTask): Gateway fires process_task to operator (returns immediately), operator processes in background, then
pushes SignedTaskResponse back to gateway's /aggregate Axum endpoint via ClientAggregator.
The /aggregate endpoint is a separate Axum route (not on the /rpc jsonrpsee server) with no API key auth — BLS signature verification
serves as authentication. The watcher's /watcher endpoint follows the same route-level pattern (separate Axum route, no API key), but
relies on on-chain event provenance rather than BLS.
─────────────────────────────────────────────────
Context
Newton Protocol currently requires deploying a full protocol stack (gateway, operators, challenger) per chain. The
docs/CROSS_CHAIN_GATEWAY.md architecture doc defines a multi-chain design where one gateway and one operator instance can serve N
chains. This plan breaks Phase 1 into incremental sub-phases, each independently shippable, testable, and rollback-safe.
Branch strategy: Each sub-phase is a separate PR to main. No long-lived feature branch.
Key constraint: Backward compatibility — existing single-chain deployments must work unchanged at every step.
---
Chain Watcher: Smart Filter Design
The Chain Watcher is NOT a dumb event relay. It is a smart filter that distinguishes direct on-chain tasks from gateway-originated
tasks.
Communication: Hybrid Approach (Redis SET + RPC)
Optimized for minimum gateway hot-path load. The gateway's only overhead is one async Redis SADD per offchain task (~0.1ms,
non-blocking, fire-and-forget).
Offchain path (99.9%):
Client → Gateway → broadcasts task to operators → BLS aggregation → on-chain
→ spawns background SADD to Redis SET (fire-and-forget)
key: newton:seen-tasks:{chain_id}, value: task_id, TTL: 1 hour
Direct on-chain path (0.1% — censorship resistance):
Someone → creates task directly on-chain via TaskManager.createTask()
Chain Watcher → detects NewTaskCreated event via WebSocket
Chain Watcher → checks Redis SET: SISMEMBER newton:seen-tasks:{chain_id} task_id
EXISTS → gateway already handled it → skip
NOT EXISTS → direct on-chain task → POSTs the event to the gateway's /watcher endpoint
Gateway → processes task (initialize aggregation, wait for quorum, submit response)
Why this design over alternatives:
┌──────────────────────┬───────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ Direction │ Approach │ Why │
├──────────────────────┼───────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Gateway → Watcher │ Redis SET │ Non-blocking, off hot path. If Redis down, watcher relays duplicates — gateway │
│ (task ID tracking) │ (SADD + TTL) │ deduplicates via gateway_tasks DashMap. Zero impact on offchain path. │
├──────────────────────┼───────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Watcher → Gateway │ Axum POST to │ Reliable delivery with retry. Direct tasks are censorship-resistance critical (0.1% of │
│ (direct task relay) │ /watcher │ traffic). Consistent with operator → gateway /aggregate pattern. No API key auth │
│ │ │ (internal service). │
├──────────────────────┼───────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Watcher → Gateway │ Axum POST to │ Same endpoint, routed by event_type. Same rationale — reliable, low-frequency, │
│ (operator events) │ /watcher │ consistent pattern. │
└──────────────────────┴───────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
Why NOT Redis Pub/Sub for watcher → gateway:
- Pub/Sub is fire-and-forget — if gateway restarts, messages are lost
- Direct on-chain tasks are censorship-resistance critical — reliability > simplicity
- HTTP POST gives watcher confirmation that gateway received the event (retry on failure)
- Consistent with existing operator → gateway /aggregate pattern (Axum route, no API key, reqwest POST)
Why NOT migrate operator /aggregate to Redis:
- Operators need request/response semantics (signature acceptance confirmation, quorum progress)
- /aggregate already works well — no reason to change a working pattern
- Future migration to NATS JetStream (docs/STREAMING_CONSENSUS.md) is the right time to revisit
Why Chain Watcher and Challenger Are Separate Binaries
┌──────────────┬────────────────────────────────────────────┬─────────────────────────────────────────────────────────┐
│ Concern │ Chain Watcher │ Challenger │
├──────────────┼────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤
│ CPU profile │ Lightweight — event filtering, LRU lookups │ Heavy — ZK proof generation (SP1), policy re-evaluation │
├──────────────┼────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤
│ Availability │ Always-on, real-time (~ms latency) │ Tolerates delays (100-block challenge window) │
├──────────────┼────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤
│ Failure mode │ Must never miss events │ Can retry proofs within window │
├──────────────┼────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤
│ Scaling │ 1 per chain (HA pair later) │ 1 per chain, CPU-bound │
└──────────────┴────────────────────────────────────────────┴─────────────────────────────────────────────────────────┘
Risk of combining: ZK proof generation is CPU-intensive. A catch_unwind-escaped panic or CPU starvation during proof gen could
degrade or kill the watcher — gateway loses its on-chain event feed. That's a censorship-resistance failure.
Combine later if: running 2N containers creates material ops overhead AND profiling confirms challenger CPU load doesn't compete with
watcher event loop.
---
Sub-Phase 0: Chain Watcher Crate (Smart Event Filter)
Goal: Create crates/chain-watcher/ crate with a smart event filter that tracks gateway-broadcasted tasks and only relays true direct
on-chain tasks.
0a: Create crates/chain-watcher/ crate — listener + event types
Files to create:
- crates/chain-watcher/Cargo.toml
- crates/chain-watcher/src/lib.rs
- crates/chain-watcher/src/listener.rs — ChainWatcherListener (WebSocket → mpsc)
- crates/chain-watcher/src/event.rs — ChainEvent, ChainEventType enums
Pattern to reuse: crates/challenger/src/dest/listener.rs:57-87
- Spawned background task with tokio::spawn()
- Bounded mpsc::channel(100) for backpressure
- WebSocket reconnection with exponential backoff + max retries
- CancellationToken for graceful shutdown
- Returns mpsc::Receiver<ChainEvent>
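A minimal sketch of that pattern applied to the watcher (ChainWatcherListener's fields and the subscribe_and_forward helper are illustrative placeholders, and eyre/tracing are assumed error/log crates; the real reconnection logic should follow the challenger listener referenced above):
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;

pub struct ChainWatcherListener {
    ws_url: String,
    cancel: CancellationToken,
}

impl ChainWatcherListener {
    /// Spawn the background subscription loop and return the bounded event channel.
    pub fn start(self) -> mpsc::Receiver<ChainEvent> {
        let (tx, rx) = mpsc::channel(100); // bounded for backpressure
        tokio::spawn(async move {
            let mut backoff = std::time::Duration::from_secs(1);
            loop {
                tokio::select! {
                    // Graceful shutdown
                    _ = self.cancel.cancelled() => break,
                    // Connect, subscribe, decode logs into ChainEvents, forward via tx
                    res = subscribe_and_forward(&self.ws_url, &tx) => {
                        if res.is_err() {
                            // Reconnect with exponential backoff (retry cap omitted here)
                            tokio::time::sleep(backoff).await;
                            backoff = (backoff * 2).min(std::time::Duration::from_secs(60));
                        }
                    }
                }
            }
        });
        rx
    }
}

/// Hypothetical stand-in for the WebSocket subscribe/decode loop.
async fn subscribe_and_forward(_ws_url: &str, _tx: &mpsc::Sender<ChainEvent>) -> eyre::Result<()> {
    unimplemented!("subscribe to TaskManager / operator set events and forward them")
}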
Event types:
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChainEvent {
    pub chain_id: u64,
    pub event_type: ChainEventType,
    pub block_number: u64,
    pub tx_hash: B256,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum ChainEventType {
    /// Direct on-chain task (not seen in gateway broadcasts)
    DirectOnchainTask { task_index: u32, task: Task },
    /// Operator added to an operator set
    OperatorAdded { operator_id: FixedBytes<32>, operator_set_id: u32 },
    /// Operator removed from an operator set
    OperatorRemoved { operator_id: FixedBytes<32>, operator_set_id: u32 },
}
Key files to reference:
- crates/challenger/src/dest/listener.rs — WebSocket reconnection pattern
- crates/transporter/src/listener.rs — SourceChainListener (multi-event subscription)
- crates/gateway/src/rpc/api/events.rs — current gateway event listener logic
Tests: Unit tests for event serialization. Integration test with anvil — emit NewTaskCreated, verify listener captures it.
Verification: cargo test -p newton-chain-watcher
---
0b: Add smart filter logic + gateway RPC client
Files to create/modify:
- crates/chain-watcher/src/filter.rs — TaskFilter (checks Redis SET for seen task IDs)
- crates/chain-watcher/src/gateway_client.rs — HTTP client that POSTs events to the gateway's /watcher endpoint
- crates/chain-watcher/Cargo.toml — add redis, reqwest dependencies
TaskFilter design (Redis SET-backed):
pub struct TaskFilter {
    redis: deadpool_redis::Pool,
    chain_id: u64,
}

impl TaskFilter {
    /// Check if a task was already handled by gateway (exists in Redis SET)
    pub async fn is_seen_by_gateway(&self, task_id: &B256) -> Result<bool> {
        let key = format!("newton:seen-tasks:{}", self.chain_id);
        let mut conn = self.redis.get().await?;
        let exists: bool = redis::cmd("SISMEMBER")
            .arg(&key)
            .arg(task_id.as_slice())
            .query_async(&mut conn)
            .await?;
        Ok(exists)
    }

    /// Check if task is a direct on-chain task (NOT seen by gateway)
    pub async fn is_direct_onchain(&self, task_id: &B256) -> Result<bool> {
        Ok(!self.is_seen_by_gateway(task_id).await?)
    }
}
GatewayClient (HTTP POST to /watcher with retry):
/// Client for pushing events to gateway's /watcher endpoint.
/// Pattern matches operator's ClientAggregator (crates/operator/src/client.rs)
/// which POSTs to gateway's /aggregate endpoint.
pub struct GatewayClient {
    client: reqwest::Client,
    watcher_url: String, // e.g., "http://gateway:9001/watcher"
    max_retries: u32,
}

impl GatewayClient {
    /// Push a ChainEvent to gateway's /watcher endpoint
    pub async fn submit_event(&self, event: &ChainEvent) -> Result<()> {
        // Retry with exponential backoff for transient errors
        // Returns Ok when gateway acknowledges receipt
    }
}
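For reference, a minimal sketch of that retry loop, assuming reqwest's JSON support, tracing for logs, and an eyre-style Result (all assumptions about the surrounding crate; backoff constants are illustrative):
impl GatewayClient {
    pub async fn submit_event(&self, event: &ChainEvent) -> Result<()> {
        let mut backoff = std::time::Duration::from_millis(200);
        for attempt in 0..=self.max_retries {
            match self.client.post(&self.watcher_url).json(event).send().await {
                // Gateway acknowledged receipt: done
                Ok(resp) if resp.status().is_success() => return Ok(()),
                // Transient failures: log, back off, retry
                Ok(resp) => tracing::warn!(status = %resp.status(), attempt, "gateway rejected watcher event"),
                Err(err) => tracing::warn!(%err, attempt, "failed to reach gateway /watcher"),
            }
            tokio::time::sleep(backoff).await;
            backoff *= 2; // exponential backoff
        }
        Err(eyre::eyre!("gateway did not acknowledge event after {} retries", self.max_retries))
    }
}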
Orchestration flow in watcher main loop:
loop {
    tokio::select! {
        // On-chain event from WebSocket listener
        Some(onchain_event) = ws_rx.recv() => {
            match onchain_event.event_type {
                NewTaskCreated { task_id, task, .. } => {
                    if filter.is_direct_onchain(&task_id).await? {
                        gateway_client.submit_event(&onchain_event).await?;
                    }
                    // else: gateway already handled, skip
                }
                OperatorAdded { .. } | OperatorRemoved { .. } => {
                    // Always relay operator set changes
                    gateway_client.submit_event(&onchain_event).await?;
                }
            }
        }
        _ = cancellation_token.cancelled() => break,
    }
}
Tests: Integration test with Redis + a mock gateway /watcher endpoint — write task IDs to the Redis SET, emit matching on-chain events,
and verify they are filtered out. Emit on-chain events with unseen IDs and verify submit_event is called.
Verification: cargo test -p newton-chain-watcher (requires Redis: make local-dev)
---
0c: Create bin/newton-chain-watcher/ binary
Files to create:
- bin/newton-chain-watcher/Cargo.toml
- bin/newton-chain-watcher/src/main.rs
- bin/newton-chain-watcher/src/cli/mod.rs
Binary design:
- Uses single-chain NewtonAvsConfig<ChainWatcherConfig> (per-chain deployment, like challenger)
- Spawns: WebSocket listener task + main filter loop (checks Redis SET, POSTs to the gateway's /watcher endpoint)
- Exposes Prometheus metrics: events seen, events relayed, filter hit rate, WebSocket status
- Standalone — challenger remains a separate binary
CLI:
newton-chain-watcher start --chain-id 31337 [--config-path ...] [--redis-url ...] [--gateway-watcher-url ...]
Config:
pub struct ChainWatcherConfig {
    pub redis_url: String,
    pub gateway_watcher_url: String, // Gateway /watcher endpoint (e.g., "http://gateway:9001/watcher")
    pub max_retries: u32,            // Retries for gateway HTTP calls (default: 3)
}

impl ConfigLoader for ChainWatcherConfig {
    const FILE_NAME: &'static str = "chain-watcher";
    const ENV_PREFIX: &'static str = "CHAIN_WATCHER";
}
Tests: Smoke test with anvil + Redis — start binary, create task via gateway (should be filtered), create task directly on-chain
(should be relayed to the gateway's /watcher endpoint).
Verification: cargo run --bin newton-chain-watcher -- start --chain-id 31337
---
Sub-Phase 1: Gateway Watcher Integration (Redis SET + /watcher Endpoint)
Goal: Gateway marks task IDs in a Redis SET (for watcher filtering) and exposes a /watcher endpoint for the watcher to push direct
on-chain tasks and operator events. Dual-path: the existing WebSocket listener coexists.
1a: Gateway marks task IDs in Redis SET
Files to modify:
- crates/gateway/src/rpc/api/sync.rs — after broadcasting task to operators, spawn background SADD to Redis SET
- crates/gateway/src/rpc/api/async_task.rs — same for async task path
- crates/gateway/src/handler/mod.rs — add optional deadpool_redis::Pool to GatewayServiceImpl
- crates/gateway/src/config.rs — add chain_watcher_redis_url: Option<String> to GatewayConfig
Design: If chain_watcher_redis_url is configured:
- Gateway creates a Redis connection pool at startup
- After each task broadcast, spawns a background tokio task:
// Fire-and-forget — does NOT block the offchain hot path
let pool = redis_pool.clone();
tokio::spawn(async move {
    if let Ok(mut conn) = pool.get().await {
        let key = format!("newton:seen-tasks:{}", chain_id);
        let _: Result<(), _> = redis::cmd("SADD")
            .arg(&key)
            .arg(task_id.as_slice())
            .query_async(&mut conn)
            .await;
        // Refresh the 1-hour TTL on every write (safe to repeat)
        let _: Result<(), _> = redis::cmd("EXPIRE")
            .arg(&key)
            .arg(3600)
            .query_async(&mut conn)
            .await;
    }
});
- Gateway hot path cost: spawning a tokio task (~0 overhead). Redis write happens asynchronously.
- If Redis is down: task still processes normally. Watcher may relay duplicates — gateway deduplicates via gateway_tasks DashMap.
Backward compatibility: If chain_watcher_redis_url is not set, no Redis operations occur.
Verification: Gateway creates task → Redis SET contains the task ID (redis-cli SISMEMBER newton:seen-tasks:31337 <task_id>).
---
1b: Gateway exposes Axum endpoint for watcher events
Files to modify:
- crates/gateway/src/lib.rs — add /watcher Axum route (alongside existing /aggregate)
- crates/gateway/src/handler/mod.rs — add watcher_event_handler function
Design:
- New Axum POST endpoint: /watcher (NOT a jsonrpsee RPC method on /rpc)
- Follows the same pattern as /aggregate endpoint (line 119 of lib.rs):
- Separate Axum route, not on the /rpc jsonrpsee server
- No API key auth — same as /aggregate. API keys are for frontend clients/devs only. Internal infrastructure services (operators,
watchers) use other verification (BLS signatures for operators, on-chain event provenance for watchers)
- Auth middleware is soft (forwards without AuthenticatedApiKey extension) — but /watcher bypasses the RPC handler entirely, so API
key check is never reached
- Handler routes by ChainEvent.event_type:
- DirectOnchainTask → same flow as process_direct_onchain_tasks() (initialize aggregation, broadcast to operators, wait for quorum,
submit response)
- OperatorAdded/OperatorRemoved → update operator pool
- Returns acknowledgment JSON (watcher knows delivery succeeded, can retry on failure)
- Both paths coexist (WebSocket + watcher) — gateway deduplicates via gateway_tasks DashMap
Why /watcher Axum route instead of jsonrpsee RPC method on /rpc:
- /rpc requires API key authentication (enforced in the RPC handler). Watchers should not need API keys.
- Follows the established precedent: operators push to /aggregate (Axum route, no auth), not to /rpc
- Keeps internal service communication separated from client-facing RPC
// crates/gateway/src/lib.rs — alongside existing routes
.route("/aggregate", post(handler::operator_response_handler))
.route("/watcher", post(handler::watcher_event_handler)) // NEW
Verification:
- With chain watcher: direct on-chain tasks arrive via /watcher endpoint (no API key)
- Without chain watcher: direct on-chain tasks arrive via WebSocket (unchanged)
- make gateway_e2e passes in both modes
---
1c: Operator — no changes needed for watcher integration
Why no operator changes: In the hybrid design, the watcher pushes direct on-chain tasks to the gateway (via 1b's /watcher endpoint). The
gateway then broadcasts these tasks to operators via the existing RPC broadcast path — same as offchain tasks. Operators don't need
to know whether a task originated offchain or was a direct on-chain task relayed by the watcher.
Flow: Watcher → POST to gateway /watcher → Gateway broadcasts to operators → Operators evaluate + BLS sign → /aggregate back to
gateway.
This simplifies the plan: Operators remain pure task processors. No new config, no Redis, no watcher awareness.
Verification: make gateway_e2e passes unchanged. Direct on-chain tasks flow through gateway broadcast to operators.
---
1d: Gateway handles operator set events via /watcher endpoint
Files to modify:
- crates/gateway/src/handler/mod.rs — extend watcher_event_handler to route OperatorAdded/OperatorRemoved events
- crates/gateway/src/task/operator.rs — add method to update operator pool on add/remove events
Design: The /watcher endpoint from 1b already handles routing by ChainEvent.event_type. This step adds the
OperatorAdded/OperatorRemoved handlers that update the gateway's operator pool in real-time. This is the "event-driven" part of the
layered discovery (resolved Q1).
Verification: Register new operator on anvil → watcher detects event → POSTs to /watcher → gateway updates pool → new operator
receives tasks.
---
Sub-Phase 2: Multi-Chain Configuration Model
Goal: Define MultiChainConfig<T> that loads per-chain contract addresses and RPC configs. Single-chain mode remains the default —
multi-chain is opt-in.
2a: Define MultiChainConfig<T> and ChainContext
Files to create/modify:
- crates/core/src/config/multichain.rs — new module
- crates/core/src/config/mod.rs — add pub mod multichain;
Key structs:
pub struct ChainContext {
    pub chain_id: u64,
    pub contracts: ContractsConfig,
    pub is_destination_chain: bool,
}

pub struct MultiChainConfig<T: ConfigLoader> {
    pub source_chain_id: u64,
    pub chains: HashMap<u64, ChainContext>,
    pub rpc: ChainRpcProviderConfig,
    pub service: T,
    pub ipfs: IpfsConfig,
    pub data_provider: DataProviderConfig,
}
// Backward compat
impl<T: ConfigLoader> From<NewtonAvsConfig<T>> for MultiChainConfig<T> { ... }
Contract loading: Loop over chain IDs, call existing ContractsConfig::load(chain_id, env) per chain.
Tests: Unit tests for config loading, From conversion.
Verification: cargo test -p newton-prover-core
---
2b: Gateway and Operator CLIs accept --chains flag
Files to modify:
- bin/newton-prover-gateway/src/cli/mod.rs — add --chains (comma-separated)
- bin/newton-prover-operator/src/cli/mod.rs — add --chains
Logic: If --chains → load MultiChainConfig. If --chain-id (existing) → load NewtonAvsConfig, convert via From. Mutual exclusion:
error if both provided.
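A minimal sketch of the flag shape, assuming the CLIs use clap's derive API (struct and field names here are illustrative, not the actual CLI definitions):
use clap::Parser;

#[derive(Parser)]
struct StartArgs {
    /// Existing single-chain mode (loads NewtonAvsConfig, converted via From)
    #[arg(long, conflicts_with = "chains")]
    chain_id: Option<u64>,

    /// Opt-in multi-chain mode, e.g. --chains 31337,31338 (loads MultiChainConfig)
    #[arg(long, value_delimiter = ',')]
    chains: Option<Vec<u64>>,
}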
Verification: Gateway starts with --chains 31337,31338 and loads both chain configs.
---
Sub-Phase 3: Multi-Chain Gateway (Core Refactor)
Goal: Gateway initializes per-chain ChainService instances and routes requests by chainId.
3a: Define ChainService and per-chain initialization
Files to modify:
- crates/gateway/src/handler/mod.rs — add ChainService struct, refactor init to loop over chains
ChainService:
pub struct ChainService {
    pub chain_id: u64,
    pub operator_pool: Arc<dyn OperatorPool>,
    pub aggregator: Arc<dyn AggregatorService>,
    pub avs_writer: Arc<dyn TaskSubmitter>,
    pub task_response_window_block: u64,
    pub block_time_ms: u64,
    pub block_cache: Arc<BlockCache>,
    pub task_manager_address: Address,
}
GatewayServiceImpl additions:
- chains: HashMap<u64, ChainService> field
- get_chain_service(chain_id) -> Result<&ChainService> method
- Single-chain: one entry (identical behavior to today)
- Multi-chain: one entry per configured chain
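A minimal sketch of the lookup helper (error construction via eyre is an assumption; only the signature is specified above):
impl GatewayServiceImpl {
    /// Resolve the per-chain service bundle for a request.
    pub fn get_chain_service(&self, chain_id: u64) -> Result<&ChainService> {
        self.chains
            .get(&chain_id)
            .ok_or_else(|| eyre::eyre!("no ChainService configured for chain id {chain_id}"))
    }
}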
Per-chain initialization (loop over chains, existing code per iteration):
- OperatorPoolImpl — different operators per chain (per-chain operator sets)
- AggregatorCore — different TaskManager, quorum config
- AvsWriter — different chain RPC + contract
- BlockCache — per-chain block tracking
Shared across chains: DatabaseManager, AuthManager, RateLimiter, HPKE keypair, DataProvider, IndexerDb
Gateway operator mode: Disabled if chains.len() > 1 (resolved decision Q5).
Verification: Gateway starts with two chains, both ChainService instances initialize.
---
3b: Chain routing in RPC handlers
Files to modify:
- crates/gateway/src/rpc/api/sync.rs — extract chainId, route to ChainService
- crates/gateway/src/rpc/api/async_task.rs — same routing
- crates/gateway/src/handler/gateway.rs — chain routing helper
Routing: parse_intent() in crates/core/src/common/task.rs:351 already extracts chainId. Use it to call get_chain_service(chain_id).
If chainId missing, fall back to primary chain (backward compat).
Verification: Submit tasks with different chainId → routed to correct chain's TaskManager.
---
3c: Per-chain event handling + remove gateway WebSocket
Files to modify:
- crates/gateway/src/rpc/api/events.rs — route watcher events by chain_id to correct ChainService
- crates/gateway/src/handler/mod.rs — in multi-chain mode, don't spawn process_direct_onchain_tasks() (watcher handles it)
Single-chain mode: Retains WebSocket listener for backward compat.
Multi-chain mode: No WebSocket. Events arrive from per-chain watchers via the /watcher endpoint (1b, 1d).
Verification: Multi-chain gateway processes tasks for both chains. Direct on-chain task on chain A → watcher POSTs to /watcher →
gateway routes to chain A's ChainService → processes via chain A's aggregator.
---
Sub-Phase 4: Multi-Chain Operators
Goal: Operators accept tasks from any chain they're registered on.
4a: Operator per-chain context and chain routing
Files to modify:
- crates/operator/src/builder.rs — accept multiple chain IDs, create per-chain contexts
- crates/operator/src/core.rs — route task processing by chainId
Per-chain context:
pub struct OperatorChainContext {
    pub chain_id: u64,
    pub task_signer: TaskSigner,
    pub task_manager_addr: Address,
}
Shared: BLS keypair, ECDSA key, operator_id, policy engine (chain-agnostic).
Routing: Operator RPC handler receives full Task struct → extract chain_id → select correct TaskSigner → sign.
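A minimal sketch of that selection step, assuming the per-chain contexts live in a HashMap keyed by chain_id (names are illustrative):
impl OperatorCore {
    /// Pick the per-chain signer for an incoming task; unknown chains are rejected.
    fn signer_for_chain(&self, chain_id: u64) -> Result<&TaskSigner> {
        self.chain_contexts
            .get(&chain_id)
            .map(|ctx| &ctx.task_signer)
            .ok_or_else(|| eyre::eyre!("operator is not registered on chain {chain_id}"))
    }
}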
Verification: Gateway sends tasks for chains A and B to same operator. Operator signs both correctly.
---
4b: Remove operator WebSocket listener (multi-chain mode)
Files to modify:
- crates/operator/src/builder.rs — skip start_listener() in multi-chain mode
Single-chain: Retains WebSocket listener for backward compat.
Multi-chain: No WebSocket. Operators receive ALL tasks (offchain + direct on-chain) via gateway RPC broadcast. The watcher pushes
direct tasks to the gateway (1b), which then broadcasts to operators via the standard path.
Verification: Multi-chain operator receives tasks exclusively via gateway RPC broadcast. No WebSocket connections.
---
Sub-Phase 5: End-to-End Testing
5a: Multi-chain gateway E2E test
Files to create:
- integration-tests/tests/multichain_gateway.rs
- Makefile — multichain-gateway-e2e target
Test flow:
1. Start dual anvil (31337 + 31338) + Redis
2. Deploy contracts on both chains
3. Register operator on both chains' operator sets
4. Start ONE chain watcher per chain
5. Start ONE multi-chain gateway (--chains 31337,31338)
6. Start ONE multi-chain operator (--chains 31337,31338)
7. Submit offchain task for chain A → verify settlement on chain A
8. Submit offchain task for chain B → verify settlement on chain B
9. Create direct on-chain task on chain A → verify watcher filters and relays → gateway processes it
10. Create offchain task, then verify watcher does NOT relay it (smart filter)
Verification: make multichain-gateway-e2e
---
5b: Performance benchmarks
Metrics: RPC response time (p50/p95/p99) single vs multi-chain, memory per chain, Redis relay latency.
Acceptance: Multi-chain routing overhead < 1ms on hot path.
---
Execution Order
┌───────────┬──────────────────────────────────────────┬──────────────┬─────────────────────┐
│ Sub-Phase │ PR │ Dependencies │ Ships Independently │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 0a │ Chain watcher listener + events │ None │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 0b │ Smart filter + gateway RPC client │ 0a │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 0c │ Chain watcher binary │ 0a, 0b │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 1a │ Gateway marks task IDs in Redis SET │ None │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 1b │ Gateway RPC endpoint for direct tasks │ 1a │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 1c │ (No operator changes needed) │ — │ — │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 1d │ Gateway RPC endpoint for operator events │ None │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 2a │ MultiChainConfig model │ None │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 2b │ CLI --chains flag (gateway + operator) │ 2a │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 3a │ ChainService + per-chain init │ 2b │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 3b │ RPC chain routing │ 3a │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 3c │ Per-chain event handling + remove WS │ 3a, 1b │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 4a │ Operator per-chain context + routing │ 2b │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 4b │ Remove operator WebSocket │ 4a │ Yes │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 5a │ Multi-chain E2E test │ 3b, 4a, 0c │ Final validation │
├───────────┼──────────────────────────────────────────┼──────────────┼─────────────────────┤
│ 5b │ Performance benchmarks │ 5a │ Final validation │
└───────────┴──────────────────────────────────────────┴──────────────┴─────────────────────┘
Parallel tracks:
- Track A (Sub-Phase 0): Chain watcher crate + binary — start immediately
- Track B (Sub-Phase 2): Multi-chain config model — parallel with Track A
- Track C (Sub-Phase 1): Gateway Redis SET + RPC endpoints — parallel with Track A (no dependency on watcher)
- Track D (Sub-Phase 3, 4): Multi-chain gateway/operator — after Tracks B and C
Note: Tracks A, B, and C can all start in parallel. Track C (gateway changes) no longer depends on Track A (watcher) because the
gateway just writes to Redis SET and exposes RPC endpoints — both are independent of the watcher binary existing.
---
Verification Strategy
At each sub-phase:
1. Unit tests: New code has tests, existing tests pass
2. Clippy + fmt: make clippy && make fmt
3. Single-chain E2E: make gateway_e2e passes (backward compat)
4. Multi-chain E2E: make multichain-gateway-e2e passes (from 5a onward)
5. No performance regression: Benchmark hot paths before/after


Gateway-as-Operator Latency Analysis

★ Insight ─────────────────────────────────────
To understand the latency benefit, we need to trace the critical path — the chain of operations that determines the total
request-to-response time. The key insight is that quorum-based aggregation returns as soon as sufficient stake signs, so the gateway
operator's value isn't adding throughput — it's reducing the minimum time to quorum.
─────────────────────────────────────────────────

The Two Paths Compared

External Operator Path (network hop)

Gateway External Operator
│ │
├─ try_send(WorkerMessage) ──────────► mpsc queue (~0.01ms)
│ │
│ ├─ OperatorClient::single_call()
│ │ ├─ HTTP/2 POST serialize (~0.1ms)
│ │ ├─ Network round-trip (TCP+TLS) ◄── THIS IS THE BIG ONE
│ │ │ ├─ localhost: ~0.1-0.5ms
│ │ │ ├─ same datacenter: ~0.5-2ms
│ │ │ └─ cross-region: ~10-50ms
│ │ ├─ Operator RPC deserialization (~0.1ms)
│ │ ├─ IPFS policy fetch (cached: ~0ms, miss: ~50-200ms)
│ │ ├─ Rego evaluation (~1-10ms)
│ │ ├─ BLS signing (~0.5-1ms)
│ │ ├─ HTTP/2 response serialize (~0.1ms)
│ │ └─ Network return (~same as outbound)
│ ◄──────────────────────────────────── oneshot response

├─ Parse ConsensusCommitResponse (~0.05ms)
├─ Process BLS signature via aggregator (~0.1ms)
└─ Check quorum (stake-weighted) (~0.05ms)

Total for external operator: ~3-15ms (localhost) to ~30-100ms+ (cross-region)

Gateway Operator Path (in-process)

Gateway (same process)

├─ worker.submit_with_consensus_and_wait()
│ ├─ mpsc channel send (~0.01ms)
│ ├─ GatewayOperatorWorker picks up task
│ │ ├─ IPFS policy fetch (cached: ~0ms, miss: ~50-200ms)
│ │ ├─ Rego evaluation (~1-10ms)
│ │ ├─ BLS signing (~0.5-1ms)
│ │ └─ NO serialization, NO network hop
│ └─ oneshot response (~0.01ms)

├─ SignedTaskResponse directly available (~0ms)
└─ Process in unified quorum loop (~0.05ms)

Total for gateway operator: ~2-12ms (dominated by Rego eval + BLS sign)

Where the Savings Come From

┌───────────────────────────────────┬───────────────────┬──────────────────┬──────────────────────────┐
│ Component │ External Operator │ Gateway Operator │ Savings │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ Request serialization (JSON) │ ~0.1ms │ 0 │ 0.1ms │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ Network round-trip (localhost) │ ~0.2-1ms │ 0 │ 0.2-1ms │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ Network round-trip (same DC) │ ~1-4ms │ 0 │ 1-4ms │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ Network round-trip (cross-region) │ ~20-100ms │ 0 │ 20-100ms │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ HTTP/2 overhead (TLS, framing) │ ~0.1-0.3ms │ 0 │ 0.1-0.3ms │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ Response deserialization │ ~0.1ms │ 0 │ 0.1ms │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ JSON-RPC parsing │ ~0.05ms │ 0 │ 0.05ms │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ Total overhead eliminated │ │ │ ~0.5-2ms (localhost) │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ │ │ │ ~2-5ms (same DC) │
├───────────────────────────────────┼───────────────────┼──────────────────┼──────────────────────────┤
│ │ │ │ ~20-100ms (cross-region) │
└───────────────────────────────────┴───────────────────┴──────────────────┴──────────────────────────┘

The Critical Path: Quorum Completion

The real question isn't "how much faster is one operator?" — it's "how much faster does quorum complete?"

Looking at broadcast_with_unified_quorum (operator.rs:2090), the flow is:

  1. All operators (external + gateway) are launched in parallel via tokio::select!
  2. The loop checks quorum_reached after each response arrives
  3. It exits early as soon as quorum is met

With a 67% quorum threshold, here are the scenarios:

Scenario A: 2 operators (gateway + 1 external), 67% quorum

  • Need 67% of total stake → need at least 1 operator's stake (assuming roughly equal)
  • With gateway operator: First response is the gateway operator (~2-12ms, no network). Quorum met immediately. External operator's
    response is irrelevant.
  • Without gateway operator: Must wait for external operator (~3-15ms localhost, ~5-20ms same DC). Quorum met on first external
    response.
  • Savings: The entire network round-trip — gateway operator always wins the race because it has zero network overhead.

Scenario B: 3 operators (gateway + 2 external), 67% quorum

  • Need 67% of total stake → need at least 2 operators' signatures
  • With gateway operator: Gateway responds first (~2-12ms). Need 1 more external. Total = gateway time + fastest external.
  • Without gateway operator: Need 2 externals. Total = second-fastest external response time.
  • Savings: Gateway response is "free" — you're waiting for 1 external instead of 2. In practice this means the slower external
    operator doesn't block you.

Scenario C: N operators (gateway + N-1 external), 67% quorum

  • Gateway always provides one "free" signature immediately
  • You need ceil(0.67 * N) - 1 external responses instead of ceil(0.67 * N)
  • Savings diminish as N grows, because you're saving 1 out of many required responses
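
In order-statistic terms (our notation, not from the codebase): if the quorum threshold requires k signatures out of N operators, and
t(i) is the i-th fastest external response time, then

  quorum wait without gateway operator ≈ t(k)
  quorum wait with gateway operator    ≈ max(t_gateway, t(k-1)),  with t_gateway ≈ 2-12ms

The saving is the gap between t(k) and t(k-1): large when k is 1 or 2 (the 2-3 operator deployments above), marginal as N and k grow.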

The Real Latency Impact

Latency breakdown (centralized mode, 67% quorum):

Task received ──► Generate policyTaskData ──► Broadcast ──► Wait for quorum ──► Aggregate ──► Submit
  • Generate policyTaskData: ~1-5ms (data gen + attestation)
  • Broadcast: ~0.1ms (try_send)
  • Wait for quorum: VARIABLE (THE BOTTLENECK)
  • Aggregate: ~1ms (BLS aggregate)

The quorum wait is the bottleneck. Everything else is fixed overhead. Gateway-as-operator directly reduces this wait:

┌────────────────────┬──────────────────────────┬──────────────────────────┐
│ Deployment │ Quorum Wait (2 ops, 67%) │ Quorum Wait (3 ops, 67%) │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Localhost (dev) │ │ │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Without gateway op │ ~3-5ms (1 ext needed) │ ~3-5ms (fastest 2 ext) │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ With gateway op │ ~2ms (gateway wins) │ ~3-5ms (gw + 1 ext) │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Savings │ ~1-3ms (~30-50%) │ ~0-1ms (~10%) │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Same datacenter │ │ │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Without gateway op │ ~5-10ms │ ~8-15ms (wait for 2nd) │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ With gateway op │ ~2-3ms (gateway wins) │ ~5-10ms (gw + fastest) │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Savings │ ~3-7ms (~50-70%) │ ~3-5ms (~30%) │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Cross-region │ │ │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Without gateway op │ ~30-80ms │ ~50-120ms │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ With gateway op │ ~2-3ms (gateway wins) │ ~30-80ms (gw + fastest) │
├────────────────────┼──────────────────────────┼──────────────────────────┤
│ Savings │ ~25-75ms (~80-90%) │ ~20-40ms (~30-40%) │
└────────────────────┴──────────────────────────┴──────────────────────────┘

Key Takeaways

  1. Gateway operator is most impactful with 2 operators at 67% quorum — the most common deployment. It turns the quorum wait from
    "wait for network" to "instant" because 1 signature is enough. This is where the 80-90% savings happen in cross-region setups.
  2. The benefit is proportional to network latency. On localhost (dev), you save ~1-3ms. Cross-region, you save ~25-75ms. In
    production deployments where operators are geographically distributed, this is significant.
  3. The benefit diminishes with more operators. With 5 operators and 67% quorum, you need 4 signatures. The gateway's free signature
    means you wait for the 3rd-fastest external response instead of the 4th-fastest — marginal.
  4. IPFS cache misses dominate. If the policy CID isn't cached, the ~50-200ms IPFS fetch dwarfs any network savings. But the gateway
    operator uses the same IPFS cache as the gateway, so cache hits are more likely.
  5. For multi-chain: The argument for gateway-as-operator is actually stronger than single-chain in cross-region setups, because
    multi-chain operators are more likely to be distributed across regions. The registration complexity cost is real but the latency
    benefit per-task is larger.

Recommendation

The current decision to disable gateway operator mode in multi-chain is correct for Phase 1 — it avoids the registration complexity.
But for Phase 2, it's worth revisiting with metrics:

  • If operators are co-located (same DC), savings are ~3-5ms — probably not worth the complexity
  • If operators are cross-region, savings are ~25-75ms per task — significant enough to justify per-chain registration

The implementation path would be: embed OperatorChainContext into ChainService (you already built the abstraction in Sub-Phase 4a),
register the gateway as an operator on each chain during deployment, and switch the primary chain's existing operator_core to a
per-chain model.
