Cross-Chain Gateway — Phase 1 Implementation Plan
★ Insight ─────────────────────────────────────
Two operator-gateway communication patterns exist:
1. Sync path (newt_createTask): Gateway calls the operator RPC (newt_evaluateAndSign) and collects the SignedTaskResponse as the inline response. The gateway drives the flow.
2. Async path (newt_sendTask): Gateway fires process_task to the operator (returns immediately), the operator processes in the background, then pushes the SignedTaskResponse back to the gateway's /aggregate Axum endpoint via ClientAggregator.
The /aggregate endpoint is a separate Axum route (not on the /rpc jsonrpsee server), with no API key auth — BLS signature verification serves as authentication. The watcher's /watcher endpoint follows this exact same pattern.
─────────────────────────────────────────────────
Context
Newton Protocol currently requires deploying a full protocol stack (gateway, operators, challenger) per chain. The docs/CROSS_CHAIN_GATEWAY.md architecture doc defines a multi-chain design where one gateway and one operator instance can serve N chains. This plan breaks Phase 1 into incremental sub-phases, each independently shippable, testable, and rollback-safe.
Branch strategy: Each sub-phase is a separate PR to main. No long-lived feature branch.
Key constraint: Backward compatibility — existing single-chain deployments must work unchanged at every step.
---
Chain Watcher: Smart Filter Design
The Chain Watcher is NOT a dumb event relay. It is a smart filter that distinguishes direct on-chain tasks from gateway-originated tasks.
Communication: Hybrid Approach (Redis SET + HTTP Push)
Optimized for minimum gateway hot-path load. The gateway's only overhead is one async Redis SADD per offchain task (~0.1ms, non-blocking, fire-and-forget).
Offchain path (99.9%):

Client → Gateway → broadcasts task to operators → BLS aggregation → on-chain
                 → spawns background SADD to Redis SET (fire-and-forget)
                   key: newton:seen-tasks:{chain_id}, value: task_id, TTL: 1 hour

Direct on-chain path (0.1% — censorship resistance):

Someone → creates task directly on-chain via TaskManager.createTask()
Chain Watcher → detects NewTaskCreated event via WebSocket
Chain Watcher → checks Redis SET: SISMEMBER newton:seen-tasks:{chain_id} task_id
  EXISTS → gateway already handled it → skip
  NOT EXISTS → direct on-chain task → POSTs it to the gateway's /watcher endpoint
Gateway → processes task (initialize aggregation, wait for quorum, submit response)
Why this design over alternatives:

| Direction | Approach | Why |
|---|---|---|
| Gateway → Watcher (task ID tracking) | Redis SET (SADD + TTL) | Non-blocking, off hot path. If Redis is down, watcher relays duplicates — gateway deduplicates via gateway_tasks DashMap. Zero impact on offchain path. |
| Watcher → Gateway (direct task relay) | Axum POST to /watcher | Reliable delivery with retry. Direct tasks are censorship-resistance critical (0.1% of traffic). Consistent with operator → gateway /aggregate pattern. No API key auth (internal service). |
| Watcher → Gateway (operator events) | Axum POST to /watcher | Same endpoint, routed by event_type. Same rationale — reliable, low-frequency, consistent pattern. |
Why NOT Redis Pub/Sub for watcher → gateway:
- Pub/Sub is fire-and-forget — if the gateway restarts, messages are lost
- Direct on-chain tasks are censorship-resistance critical — reliability > simplicity
- HTTP POST gives the watcher confirmation that the gateway received the event (retry on failure)
- Consistent with the existing operator → gateway /aggregate pattern (Axum route, no API key, reqwest POST)
Why NOT migrate operator /aggregate to Redis:
- Operators need request/response semantics (signature acceptance confirmation, quorum progress)
- /aggregate already works well — no reason to change a working pattern
- Future migration to NATS JetStream (docs/STREAMING_CONSENSUS.md) is the right time to revisit
Why Chain Watcher and Challenger Are Separate Binaries

| Concern | Chain Watcher | Challenger |
|---|---|---|
| CPU profile | Lightweight — event filtering, LRU lookups | Heavy — ZK proof generation (SP1), policy re-evaluation |
| Availability | Always-on, real-time (~ms latency) | Tolerates delays (100-block challenge window) |
| Failure mode | Must never miss events | Can retry proofs within window |
| Scaling | 1 per chain (HA pair later) | 1 per chain, CPU-bound |

Risk of combining: ZK proof generation is CPU-intensive. A catch_unwind-escaped panic or CPU starvation during proof gen could degrade or kill the watcher — the gateway loses its on-chain event feed. That's a censorship-resistance failure.
Combine later if: running 2N containers creates material ops overhead AND profiling confirms challenger CPU load doesn't compete with the watcher event loop.
---
Sub-Phase 0: Chain Watcher Crate (Smart Event Filter)
Goal: Create the crates/chain-watcher/ crate with a smart event filter that tracks gateway-broadcast tasks and only relays true direct on-chain tasks.
0a: Create crates/chain-watcher/ crate — listener + event types
Files to create:
- crates/chain-watcher/Cargo.toml
- crates/chain-watcher/src/lib.rs
- crates/chain-watcher/src/listener.rs — ChainWatcherListener (WebSocket → mpsc)
- crates/chain-watcher/src/event.rs — ChainEvent, ChainEventType enums
Pattern to reuse: crates/challenger/src/dest/listener.rs:57-87
- Spawned background task with tokio::spawn()
- Bounded mpsc::channel(100) for backpressure
- WebSocket reconnection with exponential backoff + max retries
- CancellationToken for graceful shutdown
- Returns mpsc::Receiver<ChainEvent> (a minimal skeleton is sketched below)
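A minimal sketch of that listener skeleton, assuming the ChainEvent type defined under "Event types" below; connect_and_stream is a hypothetical helper standing in for the actual eth_subscribe + event-decoding logic, and the backoff constants are illustrative rather than the challenger's exact values.

use std::time::Duration;
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;

pub struct ChainWatcherListener;

impl ChainWatcherListener {
    /// Spawn the WebSocket listener; returns the receiving end of a bounded channel.
    pub fn start(ws_url: String, cancellation: CancellationToken) -> mpsc::Receiver<ChainEvent> {
        let (tx, rx) = mpsc::channel(100); // bounded for backpressure
        tokio::spawn(async move {
            let mut backoff = Duration::from_millis(500);
            loop {
                tokio::select! {
                    _ = cancellation.cancelled() => break, // graceful shutdown
                    // connect_and_stream (hypothetical) subscribes to NewTaskCreated and
                    // operator-set events and forwards decoded ChainEvents into `tx`
                    // until the WebSocket drops or errors.
                    result = connect_and_stream(&ws_url, &tx) => {
                        if let Err(err) = result {
                            tracing::warn!(?err, "websocket dropped, reconnecting");
                        }
                        tokio::time::sleep(backoff).await;
                        backoff = (backoff * 2).min(Duration::from_secs(30)); // capped exponential backoff
                    }
                }
            }
        });
        rx
    }
}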
Event types:

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChainEvent {
    pub chain_id: u64,
    pub event_type: ChainEventType,
    pub block_number: u64,
    pub tx_hash: B256,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum ChainEventType {
    /// Direct on-chain task (not seen in gateway broadcasts)
    DirectOnchainTask { task_index: u32, task: Task },
    /// Operator added to an operator set
    OperatorAdded { operator_id: FixedBytes<32>, operator_set_id: u32 },
    /// Operator removed from an operator set
    OperatorRemoved { operator_id: FixedBytes<32>, operator_set_id: u32 },
}
Key files to reference:
- crates/challenger/src/dest/listener.rs — WebSocket reconnection pattern
- crates/transporter/src/listener.rs — SourceChainListener (multi-event subscription)
- crates/gateway/src/rpc/api/events.rs — current gateway event listener logic
Tests: Unit tests for event serialization. Integration test with anvil — emit NewTaskCreated, verify the listener captures it.
Verification: cargo test -p newton-chain-watcher
---
0b: Add smart filter logic + gateway client
Files to create/modify:
- crates/chain-watcher/src/filter.rs — TaskFilter (checks Redis SET for seen task IDs)
- crates/chain-watcher/src/gateway_client.rs — HTTP client that pushes events to the gateway's /watcher endpoint
- crates/chain-watcher/Cargo.toml — add redis, reqwest dependencies
TaskFilter design (Redis SET-backed):

pub struct TaskFilter {
    redis: deadpool_redis::Pool,
    chain_id: u64,
}

impl TaskFilter {
    /// Check if a task was already handled by the gateway (exists in the Redis SET)
    pub async fn is_seen_by_gateway(&self, task_id: &B256) -> Result<bool> {
        let key = format!("newton:seen-tasks:{}", self.chain_id);
        let mut conn = self.redis.get().await?;
        let exists: bool = redis::cmd("SISMEMBER")
            .arg(&key)
            .arg(task_id.as_slice())
            .query_async(&mut conn)
            .await?;
        Ok(exists)
    }

    /// Check if a task is a direct on-chain task (NOT seen by the gateway)
    pub async fn is_direct_onchain(&self, task_id: &B256) -> Result<bool> {
        Ok(!self.is_seen_by_gateway(task_id).await?)
    }
}
GatewayClient (HTTP POST to /watcher with retry):

/// Client for pushing events to the gateway's /watcher endpoint.
/// Pattern matches the operator's ClientAggregator (crates/operator/src/client.rs),
/// which POSTs to the gateway's /aggregate endpoint.
pub struct GatewayClient {
    client: reqwest::Client,
    watcher_url: String, // e.g., "http://gateway:9001/watcher"
    max_retries: u32,
}

impl GatewayClient {
    /// Push a ChainEvent to the gateway's /watcher endpoint
    pub async fn submit_event(&self, event: &ChainEvent) -> Result<()> {
        // Retry with exponential backoff for transient errors
        // Returns Ok when the gateway acknowledges receipt
    }
}
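The retry body could look roughly like the sketch below, assuming an anyhow-style Result alias and a serde-serializable ChainEvent; the backoff constants and error handling are illustrative, not a finalized design.

impl GatewayClient {
    pub async fn submit_event(&self, event: &ChainEvent) -> Result<()> {
        let mut attempt = 0u32;
        loop {
            let result = self
                .client
                .post(&self.watcher_url)
                .json(event)
                .send()
                .await
                .and_then(|resp| resp.error_for_status());
            match result {
                // Gateway acknowledged receipt
                Ok(_) => return Ok(()),
                Err(err) if attempt < self.max_retries => {
                    attempt += 1;
                    // Capped exponential backoff: 200ms, 400ms, 800ms, ...
                    let delay = std::time::Duration::from_millis(200 * (1 << attempt));
                    tracing::warn!(?err, attempt, "gateway /watcher push failed, retrying");
                    tokio::time::sleep(delay).await;
                }
                Err(err) => return Err(err.into()),
            }
        }
    }
}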
Orchestration flow in watcher main loop:

loop {
    tokio::select! {
        // On-chain event from the WebSocket listener (raw contract events:
        // NewTaskCreated, operator set changes) — classified here before relaying
        Some(onchain_event) = ws_rx.recv() => {
            match onchain_event.event_type {
                NewTaskCreated { task_id, task, .. } => {
                    if filter.is_direct_onchain(&task_id).await? {
                        gateway_client.submit_event(&onchain_event).await?;
                    }
                    // else: gateway already handled it, skip
                }
                OperatorAdded { .. } | OperatorRemoved { .. } => {
                    // Always relay operator set changes
                    gateway_client.submit_event(&onchain_event).await?;
                }
            }
        }
        _ = cancellation_token.cancelled() => break,
    }
}
Tests: Integration test with Redis + a mock gateway /watcher endpoint — write task IDs to the Redis SET, emit matching on-chain events, verify they are filtered out. Emit on-chain events with unseen IDs, verify submit_event POSTs them to the mock gateway.
Verification: cargo test -p newton-chain-watcher (requires Redis: make local-dev)
---
0c: Create bin/newton-chain-watcher/ binary
Files to create:
- bin/newton-chain-watcher/Cargo.toml
- bin/newton-chain-watcher/src/main.rs
- bin/newton-chain-watcher/src/cli/mod.rs
Binary design:
- Uses single-chain NewtonAvsConfig<ChainWatcherConfig> (per-chain deployment, like the challenger)
- Spawns: WebSocket listener task + main filter loop (checks the Redis SET, POSTs to the gateway /watcher endpoint)
- Exposes Prometheus metrics: events seen, events relayed, filter hit rate, WebSocket status
- Standalone — challenger remains a separate binary
CLI:

newton-chain-watcher start --chain-id 31337 [--config-path ...] [--redis-url ...] [--gateway-watcher-url ...]

Config:

pub struct ChainWatcherConfig {
    pub redis_url: String,
    pub gateway_watcher_url: String, // Gateway /watcher endpoint (e.g., "http://gateway:9001/watcher")
    pub max_retries: u32,            // Retries for gateway HTTP calls (default: 3)
}

impl ConfigLoader for ChainWatcherConfig {
    const FILE_NAME: &'static str = "chain-watcher";
    const ENV_PREFIX: &'static str = "CHAIN_WATCHER";
}

Tests: Smoke test with anvil + Redis — start the binary, create a task via the gateway (should be filtered), create a task directly on-chain (should be relayed to the gateway).
Verification: cargo run --bin newton-chain-watcher -- start --chain-id 31337
---
Sub-Phase 1: Gateway Watcher Integration (Redis SET + /watcher Endpoint)
Goal: The gateway marks task IDs in a Redis SET (for watcher filtering) and exposes a /watcher endpoint for the watcher to push direct on-chain tasks and operator events. Dual-path: the existing WebSocket listener coexists.
1a: Gateway marks task IDs in Redis SET
Files to modify:
- crates/gateway/src/rpc/api/sync.rs — after broadcasting a task to operators, spawn a background SADD to the Redis SET
- crates/gateway/src/rpc/api/async_task.rs — same for the async task path
- crates/gateway/src/handler/mod.rs — add an optional deadpool_redis::Pool to GatewayServiceImpl
- crates/gateway/src/config.rs — add chain_watcher_redis_url: Option<String> to GatewayConfig
Design: If chain_watcher_redis_url is configured:
- Gateway creates a Redis connection pool at startup
- After each task broadcast, it spawns a background tokio task:

// Fire-and-forget — does NOT block the offchain hot path
let pool = redis_pool.clone();
tokio::spawn(async move {
    if let Ok(mut conn) = pool.get().await {
        let key = format!("newton:seen-tasks:{}", chain_id);
        let _: Result<(), _> = redis::cmd("SADD")
            .arg(&key).arg(task_id.as_slice())
            .query_async(&mut conn).await;
        // Refresh the 1-hour TTL on every write (idempotent if the key already exists)
        let _: Result<(), _> = redis::cmd("EXPIRE")
            .arg(&key).arg(3600)
            .query_async(&mut conn).await;
    }
});

- Gateway hot-path cost: spawning a tokio task (~0 overhead). The Redis write happens asynchronously.
- If Redis is down: the task still processes normally. The watcher may relay duplicates — the gateway deduplicates via the gateway_tasks DashMap.
Backward compatibility: If chain_watcher_redis_url is not set, no Redis operations occur.
Verification: Gateway creates a task → the Redis SET contains the task ID (redis-cli SISMEMBER newton:seen-tasks:31337 <task_id>).
---
1b: Gateway exposes Axum endpoint for watcher events
Files to modify:
- crates/gateway/src/lib.rs — add a /watcher Axum route (alongside the existing /aggregate)
- crates/gateway/src/handler/mod.rs — add a watcher_event_handler function
Design:
- New Axum POST endpoint: /watcher (NOT a jsonrpsee RPC method on /rpc)
- Follows the same pattern as the /aggregate endpoint (line 119 of lib.rs):
  - Separate Axum route, not on the /rpc jsonrpsee server
  - No API key auth — same as /aggregate. API keys are for frontend clients/devs only. Internal infrastructure services (operators, watchers) use other verification (BLS signatures for operators, on-chain event provenance for watchers)
  - Auth middleware is soft (forwards without the AuthenticatedApiKey extension) — but /watcher bypasses the RPC handler entirely, so the API key check is never reached
- Handler routes by ChainEvent.event_type:
  - DirectOnchainTask → same flow as process_direct_onchain_tasks() (initialize aggregation, broadcast to operators, wait for quorum, submit response)
  - OperatorAdded/OperatorRemoved → update the operator pool
- Returns an acknowledgment JSON (the watcher knows delivery succeeded, can retry on failure)
- Both paths coexist (WebSocket + watcher) — the gateway deduplicates via the gateway_tasks DashMap
Why a /watcher Axum route instead of a jsonrpsee RPC method on /rpc:
- /rpc requires API key authentication (enforced in the RPC handler). Watchers should not need API keys.
- Follows the established precedent: operators push to /aggregate (Axum route, no auth), not to /rpc
- Keeps internal service communication separated from client-facing RPC

// crates/gateway/src/lib.rs — alongside existing routes
.route("/aggregate", post(handler::operator_response_handler))
.route("/watcher", post(handler::watcher_event_handler)) // NEW
Verification:
- With the chain watcher: direct on-chain tasks arrive via the /watcher endpoint (no API key)
- Without the chain watcher: direct on-chain tasks arrive via WebSocket (unchanged)
- make gateway_e2e passes in both modes
---
1c: Operator — no changes needed for watcher integration
Why no operator changes: In the hybrid design, the watcher pushes direct on-chain tasks to the gateway (via 1b's /watcher endpoint). The gateway then broadcasts these tasks to operators via the existing RPC broadcast path — same as offchain tasks. Operators don't need to know whether a task originated offchain or was a direct on-chain task relayed by the watcher.
Flow: Watcher → gateway /watcher endpoint → gateway broadcasts to operators → operators evaluate + BLS sign → /aggregate back to gateway.
This simplifies the plan: operators remain pure task processors. No new config, no Redis, no watcher awareness.
Verification: make gateway_e2e passes unchanged. Direct on-chain tasks flow through the gateway broadcast to operators.
---
1d: Gateway handles operator set events via /watcher endpoint
Files to modify:
- crates/gateway/src/handler/mod.rs — extend watcher_event_handler to route OperatorAdded/OperatorRemoved events
- crates/gateway/src/task/operator.rs — add a method to update the operator pool on add/remove events
Design: The /watcher endpoint from 1b already handles routing by ChainEvent.event_type. This step adds the OperatorAdded/OperatorRemoved handlers that update the gateway's operator pool in real time. This is the "event-driven" part of the layered discovery (resolved Q1).
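A sketch of what the pool update could look like, assuming a DashMap-backed OperatorPoolImpl; the field name, the fetch_operator_info helper, and the method name are assumptions, not the existing OperatorPool API.

impl OperatorPoolImpl {
    /// Apply an operator set change pushed by the chain watcher.
    pub async fn apply_operator_event(&self, event: &ChainEventType) -> Result<()> {
        match event {
            ChainEventType::OperatorAdded { operator_id, operator_set_id } => {
                // Look up socket/stake/pubkey info on-chain, then insert into the pool.
                let info = self.fetch_operator_info(operator_id, *operator_set_id).await?;
                self.operators.insert(*operator_id, info);
            }
            ChainEventType::OperatorRemoved { operator_id, .. } => {
                self.operators.remove(operator_id);
            }
            // Task events are handled by the DirectOnchainTask branch of the handler.
            _ => {}
        }
        Ok(())
    }
}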
Verification: Register a new operator on anvil → watcher detects the event → POSTs to /watcher → gateway updates the pool → the new operator receives tasks.
---
Sub-Phase 2: Multi-Chain Configuration Model
Goal: Define MultiChainConfig<T> that loads per-chain contract addresses and RPC configs. Single-chain mode remains the default — multi-chain is opt-in.
2a: Define MultiChainConfig<T> and ChainContext
Files to create/modify:
- crates/core/src/config/multichain.rs — new module
- crates/core/src/config/mod.rs — add pub mod multichain;
Key structs:

pub struct ChainContext {
    pub chain_id: u64,
    pub contracts: ContractsConfig,
    pub is_destination_chain: bool,
}

pub struct MultiChainConfig<T: ConfigLoader> {
    pub source_chain_id: u64,
    pub chains: HashMap<u64, ChainContext>,
    pub rpc: ChainRpcProviderConfig,
    pub service: T,
    pub ipfs: IpfsConfig,
    pub data_provider: DataProviderConfig,
}

// Backward compat
impl<T: ConfigLoader> From<NewtonAvsConfig<T>> for MultiChainConfig<T> { ... }

Contract loading: Loop over chain IDs, call the existing ContractsConfig::load(chain_id, env) per chain (sketched below).
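A sketch of the conversion and the per-chain loading loop; the field names on NewtonAvsConfig, the Environment type, and the exact ContractsConfig::load signature are assumptions for illustration.

impl<T: ConfigLoader> From<NewtonAvsConfig<T>> for MultiChainConfig<T> {
    fn from(cfg: NewtonAvsConfig<T>) -> Self {
        // Single-chain config becomes a one-entry chain map (identical behavior to today).
        let mut chains = HashMap::new();
        chains.insert(cfg.chain_id, ChainContext {
            chain_id: cfg.chain_id,
            contracts: cfg.contracts,
            is_destination_chain: true,
        });
        Self {
            source_chain_id: cfg.chain_id,
            chains,
            rpc: cfg.rpc,
            service: cfg.service,
            ipfs: cfg.ipfs,
            data_provider: cfg.data_provider,
        }
    }
}

/// Multi-chain loading: reuse the existing per-chain contract loader.
fn load_chain_contexts(chain_ids: &[u64], env: &Environment) -> Result<HashMap<u64, ChainContext>> {
    chain_ids.iter().map(|&chain_id| {
        let contracts = ContractsConfig::load(chain_id, env)?;
        Ok((chain_id, ChainContext { chain_id, contracts, is_destination_chain: true }))
    }).collect()
}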
Tests: Unit tests for config loading and the From conversion.
Verification: cargo test -p newton-prover-core
---
2b: Gateway and Operator CLIs accept --chains flag
Files to modify:
- bin/newton-prover-gateway/src/cli/mod.rs — add --chains (comma-separated)
- bin/newton-prover-operator/src/cli/mod.rs — add --chains
Logic: If --chains → load MultiChainConfig. If --chain-id (existing) → load NewtonAvsConfig and convert via From. Mutual exclusion: error if both are provided.
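With clap's derive API, this mutual exclusion can be expressed directly on the args struct; the struct and field names below are illustrative, not the existing CLI types.

#[derive(clap::Args)]
pub struct ChainArgs {
    /// Single-chain mode (existing behavior)
    #[arg(long, conflicts_with = "chains")]
    pub chain_id: Option<u64>,

    /// Multi-chain mode: comma-separated chain IDs, e.g. --chains 31337,31338
    #[arg(long, value_delimiter = ',', conflicts_with = "chain_id")]
    pub chains: Option<Vec<u64>>,
}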
Verification: Gateway starts with --chains 31337,31338 and loads both chain configs.
---
Sub-Phase 3: Multi-Chain Gateway (Core Refactor)
Goal: Gateway initializes per-chain ChainService instances and routes requests by chainId.
3a: Define ChainService and per-chain initialization
Files to modify:
- crates/gateway/src/handler/mod.rs — add a ChainService struct, refactor init to loop over chains
ChainService:

pub struct ChainService {
    pub chain_id: u64,
    pub operator_pool: Arc<dyn OperatorPool>,
    pub aggregator: Arc<dyn AggregatorService>,
    pub avs_writer: Arc<dyn TaskSubmitter>,
    pub task_response_window_block: u64,
    pub block_time_ms: u64,
    pub block_cache: Arc<BlockCache>,
    pub task_manager_address: Address,
}

GatewayServiceImpl additions (sketched below):
- chains: HashMap<u64, ChainService> field
- get_chain_service(chain_id) -> Result<&ChainService> method
- Single-chain: one entry (identical behavior to today)
- Multi-chain: one entry per configured chain
Per-chain initialization (loop over chains, existing code per iteration):
- OperatorPoolImpl — different operators per chain (per-chain operator sets)
- AggregatorCore — different TaskManager, quorum config
- AvsWriter — different chain RPC + contract
- BlockCache — per-chain block tracking
Shared across chains: DatabaseManager, AuthManager, RateLimiter, HPKE keypair, DataProvider, IndexerDb
Gateway operator mode: Disabled if chains.len() > 1 (resolved decision Q5).
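A sketch of those additions; the error type and the surrounding shared fields are assumptions.

pub struct GatewayServiceImpl {
    /// One ChainService per configured chain; single-chain mode has exactly one entry.
    chains: HashMap<u64, ChainService>,
    // ...existing shared fields (DatabaseManager, AuthManager, RateLimiter, HPKE keypair, ...)
}

impl GatewayServiceImpl {
    pub fn get_chain_service(&self, chain_id: u64) -> Result<&ChainService> {
        self.chains
            .get(&chain_id)
            .ok_or_else(|| anyhow::anyhow!("no ChainService configured for chain_id {chain_id}"))
    }
}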
Verification: Gateway starts with two chains; both ChainService instances initialize.
---
3b: Chain routing in RPC handlers
Files to modify:
- crates/gateway/src/rpc/api/sync.rs — extract chainId, route to the ChainService
- crates/gateway/src/rpc/api/async_task.rs — same routing
- crates/gateway/src/handler/gateway.rs — chain routing helper
Routing: parse_intent() in crates/core/src/common/task.rs:351 already extracts chainId. Use it to call get_chain_service(chain_id). If chainId is missing, fall back to the primary chain (backward compat).
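The routing helper is then a thin wrapper over the map lookup; a sketch, assuming the parsed intent exposes an optional chain_id and a primary_chain_id field names the fallback (both assumptions).

impl GatewayServiceImpl {
    /// Resolve which chain a task belongs to, defaulting to the primary chain
    /// when the intent carries no chainId (backward compat).
    fn chain_service_for_intent(&self, intent_chain_id: Option<u64>) -> Result<&ChainService> {
        let chain_id = intent_chain_id.unwrap_or(self.primary_chain_id);
        self.get_chain_service(chain_id)
    }
}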
Verification: Submit tasks with different chainId values → routed to the correct chain's TaskManager.
---
3c: Per-chain event handling + remove gateway WebSocket
Files to modify:
- crates/gateway/src/rpc/api/events.rs — route watcher events by chain_id to the correct ChainService
- crates/gateway/src/handler/mod.rs — in multi-chain mode, don't spawn process_direct_onchain_tasks() (the watcher handles it)
Single-chain mode: Retains the WebSocket listener for backward compat.
Multi-chain mode: No WebSocket. Events arrive from per-chain watchers via the /watcher endpoint (1b, 1d).
Verification: Multi-chain gateway processes tasks for both chains. Direct on-chain task on chain A → watcher POSTs it to /watcher → gateway routes to chain A's ChainService → processes via chain A's aggregator.
---
Sub-Phase 4: Multi-Chain Operators
Goal: Operators accept tasks from any chain they're registered on.
4a: Operator per-chain context and chain routing
Files to modify:
- crates/operator/src/builder.rs — accept multiple chain IDs, create per-chain contexts
- crates/operator/src/core.rs — route task processing by chainId
Per-chain context:

pub struct OperatorChainContext {
    pub chain_id: u64,
    pub task_signer: TaskSigner,
    pub task_manager_addr: Address,
}

Shared: BLS keypair, ECDSA key, operator_id, policy engine (chain-agnostic).
Routing: Operator RPC handler receives the full Task struct → extract chain_id → select the correct TaskSigner → sign (see the sketch below).
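A sketch of the signer selection, assuming a HashMap<u64, OperatorChainContext> field on the operator core (type and field names illustrative).

impl OperatorCore {
    /// Pick the per-chain signing context for an incoming task.
    fn signer_for(&self, chain_id: u64) -> Result<&TaskSigner> {
        self.chain_contexts
            .get(&chain_id)
            .map(|ctx| &ctx.task_signer)
            .ok_or_else(|| anyhow::anyhow!("operator not registered for chain {chain_id}"))
    }
}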
Verification: Gateway sends tasks for chains A and B to the same operator. The operator signs both correctly.
---
4b: Remove operator WebSocket listener (multi-chain mode)
Files to modify:
- crates/operator/src/builder.rs — skip start_listener() in multi-chain mode
Single-chain: Retains the WebSocket listener for backward compat.
Multi-chain: No WebSocket. Operators receive ALL tasks (offchain + direct on-chain) via gateway RPC broadcast. The watcher pushes direct tasks to the gateway (1b), which then broadcasts to operators via the standard path.
Verification: Multi-chain operator receives tasks exclusively via gateway RPC broadcast. No WebSocket connections.
---
Sub-Phase 5: End-to-End Testing
5a: Multi-chain gateway E2E test
Files to create:
- integration-tests/tests/multichain_gateway.rs
- Makefile — multichain-gateway-e2e target
Test flow:
1. Start dual anvil (31337 + 31338) + Redis
2. Deploy contracts on both chains
3. Register an operator on both chains' operator sets
4. Start ONE chain watcher per chain
5. Start ONE multi-chain gateway (--chains 31337,31338)
6. Start ONE multi-chain operator (--chains 31337,31338)
7. Submit an offchain task for chain A → verify settlement on chain A
8. Submit an offchain task for chain B → verify settlement on chain B
9. Create a direct on-chain task on chain A → verify the watcher filters and relays it → gateway processes it
10. Create an offchain task, then verify the watcher does NOT relay it (smart filter)
Verification: make multichain-gateway-e2e
---
5b: Performance benchmarks
Metrics: RPC response time (p50/p95/p99) single vs multi-chain, memory per chain, Redis relay latency.
Acceptance: Multi-chain routing overhead < 1ms on the hot path.
---
Execution Order

| Sub-Phase | PR | Dependencies | Ships Independently |
|---|---|---|---|
| 0a | Chain watcher listener + events | None | Yes |
| 0b | Smart filter + gateway client | 0a | Yes |
| 0c | Chain watcher binary | 0a, 0b | Yes |
| 1a | Gateway marks task IDs in Redis SET | None | Yes |
| 1b | Gateway /watcher endpoint for direct tasks | 1a | Yes |
| 1c | (No operator changes needed) | — | — |
| 1d | Gateway /watcher endpoint for operator events | 1b | Yes |
| 2a | MultiChainConfig model | None | Yes |
| 2b | CLI --chains flag (gateway + operator) | 2a | Yes |
| 3a | ChainService + per-chain init | 2b | Yes |
| 3b | RPC chain routing | 3a | Yes |
| 3c | Per-chain event handling + remove WS | 3a, 1b | Yes |
| 4a | Operator per-chain context + routing | 2b | Yes |
| 4b | Remove operator WebSocket | 4a | Yes |
| 5a | Multi-chain E2E test | 3b, 4a, 0c | Final validation |
| 5b | Performance benchmarks | 5a | Final validation |
Parallel tracks:
- Track A (Sub-Phase 0): Chain watcher crate + binary — start immediately
- Track B (Sub-Phase 2): Multi-chain config model — parallel with Track A
- Track C (Sub-Phase 1): Gateway Redis SET + /watcher endpoint — parallel with Track A (no dependency on the watcher)
- Track D (Sub-Phases 3, 4): Multi-chain gateway/operator — after Tracks B and C
Note: Tracks A, B, and C can all start in parallel. Track C (gateway changes) no longer depends on Track A (watcher) because the gateway just writes to the Redis SET and exposes the /watcher endpoint — both are independent of the watcher binary existing.
---
Verification Strategy
At each sub-phase:
1. Unit tests: New code has tests, existing tests pass
2. Clippy + fmt: make clippy && make fmt
3. Single-chain E2E: make gateway_e2e passes (backward compat)
4. Multi-chain E2E: make multichain-gateway-e2e passes (from 5a onward)
5. No performance regression: Benchmark hot paths before/after
Gateway-as-Operator Latency Analysis
★ Insight ─────────────────────────────────────
To understand the latency benefit, we need to trace the critical path — the chain of operations that determines the total
request-to-response time. The key insight is that quorum-based aggregation returns as soon as sufficient stake signs, so the gateway
operator's value isn't adding throughput — it's reducing the minimum time to quorum.
─────────────────────────────────────────────────
The Two Paths Compared
External Operator Path (network hop)
Gateway External Operator
│ │
├─ try_send(WorkerMessage) ──────────► mpsc queue (~0.01ms)
│ │
│ ├─ OperatorClient::single_call()
│ │ ├─ HTTP/2 POST serialize (~0.1ms)
│ │ ├─ Network round-trip (TCP+TLS) ◄── THIS IS THE BIG ONE
│ │ │ ├─ localhost: ~0.1-0.5ms
│ │ │ ├─ same datacenter: ~0.5-2ms
│ │ │ └─ cross-region: ~10-50ms
│ │ ├─ Operator RPC deserialization (~0.1ms)
│ │ ├─ IPFS policy fetch (cached: ~0ms, miss: ~50-200ms)
│ │ ├─ Rego evaluation (~1-10ms)
│ │ ├─ BLS signing (~0.5-1ms)
│ │ ├─ HTTP/2 response serialize (~0.1ms)
│ │ └─ Network return (~same as outbound)
│ ◄──────────────────────────────────── oneshot response
│
├─ Parse ConsensusCommitResponse (~0.05ms)
├─ Process BLS signature via aggregator (~0.1ms)
└─ Check quorum (stake-weighted) (~0.05ms)
Total for external operator: ~3-15ms (localhost) to ~30-100ms+ (cross-region)
Gateway Operator Path (in-process)
Gateway (same process)
│
├─ worker.submit_with_consensus_and_wait()
│ ├─ mpsc channel send (~0.01ms)
│ ├─ GatewayOperatorWorker picks up task
│ │ ├─ IPFS policy fetch (cached: ~0ms, miss: ~50-200ms)
│ │ ├─ Rego evaluation (~1-10ms)
│ │ ├─ BLS signing (~0.5-1ms)
│ │ └─ NO serialization, NO network hop
│ └─ oneshot response (~0.01ms)
│
├─ SignedTaskResponse directly available (~0ms)
└─ Process in unified quorum loop (~0.05ms)
Total for gateway operator: ~2-12ms (dominated by Rego eval + BLS sign)
Where the Savings Come From
| Component | External Operator | Gateway Operator | Savings |
|---|---|---|---|
| Request serialization (JSON) | ~0.1ms | 0 | 0.1ms |
| Network round-trip (localhost) | ~0.2-1ms | 0 | 0.2-1ms |
| Network round-trip (same DC) | ~1-4ms | 0 | 1-4ms |
| Network round-trip (cross-region) | ~20-100ms | 0 | 20-100ms |
| HTTP/2 overhead (TLS, framing) | ~0.1-0.3ms | 0 | 0.1-0.3ms |
| Response deserialization | ~0.1ms | 0 | 0.1ms |
| JSON-RPC parsing | ~0.05ms | 0 | 0.05ms |
| Total overhead eliminated | | | ~0.5-2ms (localhost), ~2-5ms (same DC), ~20-100ms (cross-region) |
The Critical Path: Quorum Completion
The real question isn't "how much faster is one operator?" — it's "how much faster does quorum complete?"
Looking at broadcast_with_unified_quorum (operator.rs:2090), the flow is:
With a 67% quorum threshold, here are the scenarios:
Scenario A: 2 operators (gateway + 1 external), 67% quorum
response is irrelevant.
response.
Scenario B: 3 operators (gateway + 2 external), 67% quorum
operator doesn't block you.
Scenario C: N operators (gateway + N-1 external), 67% quorum
The Real Latency Impact
Task received ──► Generate policyTaskData ──► Broadcast ──► Wait for quorum ──► Aggregate ──► Submit
- Generate policyTaskData (data gen + attestation): ~1-5ms
- Broadcast (try_send): ~0.1ms
- Wait for quorum: VARIABLE (THE BOTTLENECK)
- Aggregate (BLS aggregate) + submit: ~1ms
The quorum wait is the bottleneck. Everything else is fixed overhead. Gateway-as-operator directly reduces this wait:
| Deployment | Quorum Wait (2 ops, 67%) | Quorum Wait (3 ops, 67%) |
|---|---|---|
| Localhost (dev) | | |
| Without gateway op | ~3-5ms (1 ext needed) | ~3-5ms (fastest 2 ext) |
| With gateway op | ~2ms (gateway wins) | ~3-5ms (gw + 1 ext) |
| Savings | ~1-3ms (~30-50%) | ~0-1ms (~10%) |
| Same datacenter | | |
| Without gateway op | ~5-10ms | ~8-15ms (wait for 2nd) |
| With gateway op | ~2-3ms (gateway wins) | ~5-10ms (gw + fastest) |
| Savings | ~3-7ms (~50-70%) | ~3-5ms (~30%) |
| Cross-region | | |
| Without gateway op | ~30-80ms | ~50-120ms |
| With gateway op | ~2-3ms (gateway wins) | ~30-80ms (gw + fastest) |
| Savings | ~25-75ms (~80-90%) | ~20-40ms (~30-40%) |
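The quorum check itself is cheap and purely stake-arithmetic; a sketch of the threshold test the wait loop converges on (names, types, and the basis-point representation are assumed for illustration, not the actual aggregator API):

// Quorum completes as soon as signed stake reaches the threshold, so the
// fastest responders determine the wait — an in-process gateway signature
// effectively shifts "fastest" by one position.
fn quorum_reached(signed_stake: u128, total_stake: u128, threshold_bps: u128) -> bool {
    signed_stake * 10_000 >= total_stake * threshold_bps
}

#[cfg(test)]
mod tests {
    use super::quorum_reached;

    #[test]
    fn four_equal_stake_operators_need_three_signatures_at_67_percent() {
        assert!(!quorum_reached(2, 4, 6_700)); // 50% of stake — keep waiting
        assert!(quorum_reached(3, 4, 6_700));  // 75% of stake — quorum reached
    }
}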
Key Takeaways
"wait for network" to "instant" because 1 signature is enough. This is where the 80-90% savings happen in cross-region setups.
production deployments where operators are geographically distributed, this is significant.
gateway saves you waiting for the 4th-fastest instead of the 4th-fastest external — marginal.
operator uses the same IPFS cache as the gateway, so cache hits are more likely.
multi-chain operators are more likely to be distributed across regions. The registration complexity cost is real but the latency
benefit per-task is larger.
Recommendation
The current decision to disable gateway operator mode in multi-chain is correct for Phase 1 — it avoids the registration complexity.
But for Phase 2, it's worth revisiting with metrics.
The implementation path would be: embed OperatorChainContext into ChainService (you already built the abstraction in Sub-Phase 4a),
register the gateway as an operator on each chain during deployment, and switch the primary chain's existing operator_core to a
per-chain model.