❯ Load Test Report:
  Configuration:
  - Users: 10
  - Spawn rate: 10 users/sec (all spawn immediately)
  - Pacing: 15 seconds between requests per user
  - Duration: 5 minutes
  - Effective throughput: ~10 requests every 15 seconds (~40 req/min)
  HTTP Layer — all 200 requests succeeded, median 2.8s, p95 3.5s
  On-chain Submit (RPC → NewTaskCreated)
  - 141/200 seen (70.5%)
  - p50: 50s · p95: 84s · p99: 94s
  End-to-End (RPC → on-chain response)
  - 82/200 completed (41%)
  - p50: 138s · p95: 192s · p99: 204s
  Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
  Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain — only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx submission layer.

  There is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the on-chain task creation and task response submission flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/
⎿ Read crates/gateway/src/rpc/api/tx_worker.rs (597 lines)
⎿ Listed directory .claude/
⏺ This is a critical performance issue. The load test data tells a clear story: the gateway's HTTP layer is fine (sub-4s), but the single-threaded TransactionSubmissionWorker is creating a fatal bottleneck — 59% of tasks never complete end-to-end.
  Let me deeply explore the architecture before proposing solutions.
⏺ 3 Explore agents finished (ctrl+o to expand)
  ├─ Explore tx submission flow · 35 tool uses · 114.9k tokens
  │  ⎿ Done
  ├─ Explore AvsWriter nonce mgmt · 22 tool uses · 118.9k tokens
  │  ⎿ Done
  └─ Explore gateway config and flow · 29 tool uses · 106.3k tokens
     ⎿ Done
❯ Load Test Report:
  Configuration:
  - Users: 10
  - Spawn rate: 10 users/sec (all spawn immediately)
  - Pacing: 15 seconds between requests per user
  - Duration: 5 minutes
  - Effective throughput: ~10 requests every 15 seconds (~40 req/min)
  HTTP Layer — all 200 requests succeeded, median 2.8s, p95 3.5s
  On-chain Submit (RPC → NewTaskCreated)
  - 141/200 seen (70.5%)
  - p50: 50s · p95: 84s · p99: 94s
  End-to-End (RPC → on-chain response)
  - 82/200 completed (41%)
  - p50: 138s · p95: 192s · p99: 204s
  Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
  Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain — only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx submission layer.

  There is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the on-chain task creation and task response submission flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/ /clarify-plan
⏺ Skill(/clarify-plan)
⎿ Successfully loaded skill
⏺ Good — I've completed a deep exploration of the gateway's transaction submission architecture. Here's what I found before asking questions:
★ Insight ─────────────────────────────────────
The core bottleneck is architectural, not configurational. The TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask AND respondToTask) through a single sequential loop — one tokio task, one mpsc channel, one-at-a-time processing. Each transaction blocks on get_receipt() (12-60s waiting for block confirmation). At 40 req/min generating 80 txs/min, but draining at ~2-4 txs/min, the queue is fundamentally oversubscribed by 20-40x.
─────────────────────────────────────────────────
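The oversubscription claim can be sanity-checked with simple queueing arithmetic. A sketch, using the rates from the load test above (40 req/min arriving, 2 transactions per task, a drain rate taken as the midpoint of the observed 2-4 txs/min):

```rust
// Back-of-envelope queue growth for a single sequential worker.
// Rates are taken from the load test report; this is illustrative math,
// not gateway code.
fn backlog_after(minutes: f64, arrival_tx_per_min: f64, drain_tx_per_min: f64) -> f64 {
    // Backlog grows linearly whenever arrivals outpace the drain rate.
    (arrival_tx_per_min - drain_tx_per_min).max(0.0) * minutes
}

fn main() {
    let arrival = 40.0 * 2.0; // 80 txs/min enqueued (create + respond per task)
    let drain = 3.0;          // ~3 txs/min completed by the sequential worker
    println!("oversubscription: {:.0}x", arrival / drain);
    println!("backlog after 5 min: {} txs", backlog_after(5.0, arrival, drain));
}
```

At these rates the backlog after the 5-minute test is several hundred transactions deep, which matches the observed 64.5% timeout rate.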
Here are my first three questions:
1. Which chain was this load test running against, and what's the target chain for production?
   Block time matters enormously here — Base mainnet (2s blocks) vs Sepolia (12s) vs Base Sepolia (2s) changes the theoretical max throughput by 6x. And the nonce pipelining strategy differs by chain's mempool behavior.
2. Is the single-signer constraint (one ECDSA key for both createNewTask and respondToTask) a hard requirement, or can the gateway operate with multiple signer keys?
   The current serialization exists solely to prevent nonce conflicts on a shared signer. If we can use N signers (e.g., a pool of hot wallets), we get N× parallelism trivially. If not, we need nonce pipelining (submit tx N+1 before tx N confirms), which is more complex but still achievable.
3. What's the target throughput and latency you need?
   The load test hit 40 req/min and broke at 41% completion. Are we aiming for 40 req/min sustained with >99% completion? 100 req/min? 1000? And what's the acceptable e2e latency — is the current 2.8s HTTP response fine (since on-chain is async), or do callers need on-chain confirmation within a specific window?
✻ Brewed for 2m 59s
❯ 1. Sepolia; for the prod environment, 10 gateway replicas with 10 task generator signer private keys, 1 for each. 2. Not a hard requirement I think, but please review the code base (including the Solidity contract code for createNewTask and respondToTask) to confirm my understanding: the gateway can operate with multiple signer keys, but all task generators have to be configured on-chain (createNewTask and respondToTask are guarded with the onlyTaskGenerator guard). If my understanding is not correct, then nonce pipelining would be needed. 3. We are targeting at least 100 RPS (100 requests per second). Acceptable e2e latency for the offchain response is around 3 seconds or lower, ideally around 2 seconds. For on-chain we have more room, but we definitely want on-chain e2e to happen within 1 min of request submission to the gateway, ideally within 15-30 seconds (on Ethereum L1/Sepolia).
⏺ 2 Explore agents finished (ctrl+o to expand)
  ├─ Explore TaskManager access control · 21 tool uses · 90.8k tokens
  │  ⎿ Done
  └─ Explore nonce pipelining feasibility · 28 tool uses · 109.1k tokens
     ⎿ Done
⏺ Excellent findings. Let me synthesize what we now know before asking the next batch of questions.
★ Insight ─────────────────────────────────────
Multiple task generators are fully supported on-chain. The OperatorRegistry uses EnumerableSet.AddressSet — both createNewTask and respondToTask use onlyTaskGenerator, and the Sepolia config already has 2 generators configured. The old aggregator role is deprecated. This means the multi-signer parallelism path is wide open — no nonce pipelining complexity needed.
The math for 100 RPS on Sepolia (12s blocks): Each task needs 2 txs (create + respond). At 100 RPS, that's 200 txs/sec. With 12s block time, each signer can pipeline ~1 tx per block optimistically, or ~5 txs/block with nonce pipelining. With N signers doing round-robin, we need enough signers to absorb the throughput. But Sepolia's block gas limit and mempool behavior also constrain this.
─────────────────────────────────────────────────
Batch 2 — Architecture & Deployment Constraints:
4. You mentioned 10 gateway replicas with 10 task generator signer keys. Is each replica currently assigned exactly 1 signer key, or do all 10 replicas share the same key today?
   This determines whether we're solving a within-replica parallelism problem (single replica needs multiple signers) or a cross-replica coordination problem (multiple replicas competing for the same nonce space). The 10-key setup suggests you've already anticipated this — I want to confirm the current deployment topology.
5. For the 100 RPS target — is that 100 newt_createTask calls/sec, or 100 total RPC calls/sec across all methods?
   At 100 task creations/sec on Sepolia (12s blocks), we'd need to land ~1,200 createNewTask + ~1,200 respondToTask = 2,400 txs per 12-second block. That's aggressive for a single chain. Are you planning to batch tasks, or is each RPC call = 1 on-chain task?
6. Is the respondToTask submission latency-critical, or only createNewTask?
   Currently both flow through the same sequential queue. If respondToTask can be more relaxed (e.g., batched every few seconds), we could prioritize createNewTask in a fast lane and batch responses separately — significantly reducing queue contention.
✻ Cooked for 2m 12s
❯ 4. Each replica is currently assigned exactly 1 signer key, but this can change if needed. For the deploy pipeline for stagef vs prod, you can refer to @~/projects/deploy. 5. 100 newt_createTask calls/sec. 6. Neither has a hard latency requirement, but we definitely don't want to incur too much delay or discrepancy between the offchain task/task response and on-chain. And if possible, yes, we should explore a batching mechanism so that we can, for example, batch-submit task creations/responses every interval (that will be our "block/batch" time). Also on 6: I initially made it the same sequential queue in order to ensure that, for a single task, the task response NEVER gets submitted if the task has not been created successfully. There is no difference between createNewTask and respondToTask in terms of how latency-critical they are.
⎿ Listed directory ../deploy/
⏺ Great — the picture is becoming very clear. Let me ask one final batch to nail down the solution design.
★ Insight ─────────────────────────────────────
The math at 100 tasks/sec on Sepolia is extreme. 100 tasks/sec × 2 txs × 12s block time = 2,400 txs per block across the fleet. With 10 replicas at 1 signer each, that's 240 txs/block/replica. Individual transaction submission simply cannot scale to this — batching is mandatory. A BatchTaskManager contract registered as a task generator could batch N createNewTask calls into a single tx, reducing the problem by 1-2 orders of magnitude.
─────────────────────────────────────────────────
Batch 3 — Solution Constraints:
7. Is deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) acceptable?
   This is the highest-leverage change. A contract registered as a task generator that batches 50-100 tasks per transaction would cut the tx count by 50-100x. Same for respondToTask. The alternative — submitting 240 individual txs per block per replica — is infeasible on any EVM chain regardless of nonce pipelining. Is adding a new contract to the deployment pipeline acceptable, or are we restricted to off-chain-only changes?
8. For the create→respond ordering invariant: is it acceptable if creates and responds are in different batches submitted sequentially (batch of creates first, then batch of responds a few seconds later)?
   This preserves your invariant (no response without a successful create) while allowing independent batching windows. The tradeoff is slightly higher latency for the respond step (it waits for the next batch interval).
9. What's the gas budget per transaction on Sepolia/mainnet for your task generator signers?
   Batching 100 createNewTask calls in one tx could cost 20-50M gas. Sepolia's block gas limit is 30M. If an individual createNewTask costs ~200-300K gas, we can fit ~100 per batch. But if it costs 500K+, we're limited to ~60. Do you know the approximate gas cost of a single createNewTask and respondToTask today?
✻ Crunched for 39s
❯ 7. Deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) is acceptable. 8. Yes, acceptable. 9. I do not — is there a way to estimate?
⏺ Let me estimate gas costs from the existing test suite and contract code.
⏺ Bash(cd /Users/dennis.won/projects/newton-prover-avs/contracts && forge test --match-test "test_createTask\|test_respondToTask\|test_createNewTask" --gas-report -vv…)
⎿ Compiling 150 files with Solc 0.8.27
   Solc 0.8.27 finished in 22.85s
   Compiler run successful with warnings:
   … +13 lines (ctrl+o to expand)
⎿ (timeout 2m)
⏺ Explore(Explore createNewTask gas usage)
⎿ Done (36 tool uses · 91.3k tokens · 59s)
   (ctrl+o to expand)
⏺ Good — gas estimates from code analysis:
- createNewTask: ~100-150K gas
- respondToTask: ~400-800K gas (BLS verification dominates)
- Sepolia block gas limit: 30M
★ Insight ─────────────────────────────────────
Batching math is very favorable. At 150K gas/create, we can fit ~200 creates per batch tx. At 800K gas/respond, ~37 responds per batch. Per replica at 10 tasks/sec over a 12s block: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = ~5 txs per block per signer. That's trivially achievable. Batching turns a 240-tx-per-block problem into a 5-tx-per-block problem.
─────────────────────────────────────────────────
I now have enough context to design the solution. Let me move to the Design phase.
⏺ Plan(Design batch tx submission plan)
⎿ Done (49 tool uses · 99.0k tokens · 3m 32s)
   (ctrl+o to expand)
⏺ The plan from the design agent is comprehensive. Let me verify one critical detail before writing the final plan.
⎿ Loaded ../.claude/rules/solidity.md
   Searched for 1 pattern, read 1 file (ctrl+o to expand)
⏺ Now I have the full picture. Let me read the rest of respondToTask to confirm the batch contract's parameter requirements, then write the final plan.
   Read 1 file (ctrl+o to expand)
⏺ Good — I have the full picture. Now let me write the final plan.
⏺ Updated plan
⎿ /plan to preview
⏺ The plan is written. Here's the summary:
★ Insight ─────────────────────────────────────
The key insight is that batching transforms the problem entirely. Instead of 240 individual transactions competing for block space per replica per block (impossible), we get 5 batch transactions per block (trivial). The BatchTaskManager contract is the highest-leverage change — it reduces the on-chain tx count by ~50x. The Rust-side batch accumulator then collects tasks into time-windowed batches (2s intervals), and nonce pipelining ensures those 5 txs don't block sequentially on receipt confirmation. The entire solution is feature-gated behind batch_submitter.enabled, so the old sequential worker stays as fallback.
The ordering invariant (no respondToTask before createNewTask confirms) is preserved via a DashMap<TaskId, ()> shared between the create and respond accumulators — responds are held until their corresponding create is confirmed on-chain.
─────────────────────────────────────────────────
4 layers, 6 implementation steps:
1. BatchTaskManager.sol — batch N calls into 1 tx (Solidity, no Rust dependency)
2. AvsWriter batch methods — Rust bindings to call the batch contract
3. BatchTransactionSubmitter — replaces tx_worker.rs with time-windowed batch accumulation
4. Nonce pipelining — submit 5 batch txs concurrently, await receipts in parallel
5. Integration testing + load test at 100 tasks/sec
6. Multi-signer pool (only if needed — math says it won't be)
⏺ User approved Claude's plan
⎿ Plan saved to: ~/.claude/plans/valiant-giggling-moler.md · /plan to edit
Plan: High-Throughput Batch Transaction Submission
Context
Load testing at just 40 req/min revealed a critical bottleneck: TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask and respondToTask) through a single sequential loop, blocking on get_receipt() (12-60s per tx on Sepolia). Only 41% of tasks completed end-to-end; 64.5% hit the 180s timeout.
Target: 100 tasks/sec sustained across 10 gateway replicas (10/sec/replica), with <3s offchain response and <30s on-chain e2e on Sepolia (12s blocks).
Root cause: The worker can process ~2-4 txs/min but needs 200 txs/sec. Throughput gap: ~3000x.
Solution: Four coordinated layers — a Solidity batch contract, a Rust batch accumulator, nonce pipelining, and an optional multi-signer pool.
---
Layer 1: BatchTaskManager Solidity Contract
What
A new standalone contract that batches N createNewTask and N respondToTask calls into single transactions. Registered as a task generator on OperatorRegistry.
Why
At 150K gas/create, one batch tx fits ~200 creates (30M block gas limit). At 800K gas/respond, ~37 per batch. Per replica per 12s block at 10 tasks/sec: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = 5 txs per block instead of 240. This is the highest-leverage change.
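The packing figures above can be reproduced with a small calculation. A sketch, using the estimated per-call gas costs from the analysis (not measured values) and Sepolia's 30M block gas limit:

```rust
// How many batched calls fit under the block gas limit, and how many batch txs
// one replica needs per 12s block at 10 tasks/sec. Gas figures are the
// estimates from the analysis above, not measured values.
const BLOCK_GAS_LIMIT: u64 = 30_000_000;

fn calls_per_batch(gas_per_call: u64) -> u64 {
    BLOCK_GAS_LIMIT / gas_per_call
}

fn batches_needed(calls: u64, gas_per_call: u64) -> u64 {
    // Ceiling division: partial batches still cost one transaction.
    calls.div_ceil(calls_per_batch(gas_per_call))
}

fn main() {
    println!("creates/batch:  {}", calls_per_batch(150_000)); // ~200
    println!("responds/batch: {}", calls_per_batch(800_000)); // ~37
    // Per replica per 12s block at 10 tasks/sec: 120 creates + 120 responds
    let total = batches_needed(120, 150_000) + batches_needed(120, 800_000);
    println!("batch txs per block per replica: {}", total); // 1 + 4 = 5
}
```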
Design
Error semantics: try/catch per item (not revert-all). A single TaskAlreadyExists or TaskAlreadyResponded in a batch of 120 must not revert the other 119.
File: contracts/src/BatchTaskManager.sol

    contract BatchTaskManager {
        INewtonProverTaskManager public immutable taskManager;

        // Per-item result events
        event BatchCreateResult(bytes32 indexed taskId, bool success, bytes reason);
        event BatchRespondResult(bytes32 indexed taskId, bool success, bytes reason);

        constructor(address _taskManager) {
            taskManager = INewtonProverTaskManager(_taskManager);
        }

        function batchCreateTasks(
            INewtonProverTaskManager.Task[] calldata tasks
        ) external returns (uint256 successCount) {
            for (uint256 i; i < tasks.length;) {
                try taskManager.createNewTask(tasks[i]) {
                    emit BatchCreateResult(tasks[i].taskId, true, "");
                    unchecked { ++successCount; }
                } catch (bytes memory reason) {
                    emit BatchCreateResult(tasks[i].taskId, false, reason);
                }
                unchecked { ++i; }
            }
        }

        function batchRespondToTasks(
            INewtonProverTaskManager.Task[] calldata tasks,
            INewtonProverTaskManager.TaskResponse[] calldata responses,
            bytes[] calldata signatureDataArray
        ) external returns (uint256 successCount) {
            require(
                tasks.length == responses.length && responses.length == signatureDataArray.length,
                "length mismatch"
            );
            for (uint256 i; i < tasks.length;) {
                try taskManager.respondToTask(tasks[i], responses[i], signatureDataArray[i]) {
                    emit BatchRespondResult(responses[i].taskId, true, "");
                    unchecked { ++successCount; }
                } catch (bytes memory reason) {
                    emit BatchRespondResult(responses[i].taskId, false, reason);
                }
                unchecked { ++i; }
            }
        }
    }

Access control: The BatchTaskManager address is registered as a task generator via OperatorRegistry.addTaskGenerator(). The contract itself needs no auth modifier — only registered task generators can call taskManager.createNewTask (the contract is the msg.sender to TaskManager). Optionally add an onlyAuthorized mapping to prevent gas griefing from arbitrary callers.
Files to create/modify
┌─────────────────────────────────────────────────┬────────────────────────────────────────────────────────────┐
│ File                                            │ Action                                                     │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/src/BatchTaskManager.sol              │ Create — batch contract                                    │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/src/interfaces/IBatchTaskManager.sol  │ Create — interface for Rust bindings                       │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/test/BatchTaskManager.t.sol           │ Create — unit tests                                        │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/script/BatchTaskManagerDeployer.s.sol │ Create — deployment script following DeploymentLib pattern │
└─────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
Deployment steps
1. Deploy BatchTaskManager with the existing TaskManager address
2. Call OperatorRegistry.addTaskGenerator(batchTaskManagerAddress)
3. Write deployment to script/deployments/batch-task-manager/{chainId}-{env}.json
4. Run make generate-bindings to auto-generate Rust types in crates/core/src/generated/
---
Layer 2: Batch Accumulator (Rust — replaces tx_worker.rs)
What
Replace the sequential TransactionSubmissionWorker with a BatchTransactionSubmitter that accumulates tasks into time-windowed batches and submits via the BatchTaskManager contract.
Architecture

    sync.rs ─────► CreateBatch Accumulator ──flush every 2s──► batchCreateTasks() on-chain
                            │
              (tracks confirmations via DashMap)
                            │
    sync.rs ─────► RespondBatch Accumulator ─flush after creates confirm──► batchRespondToTasks()

Two independent accumulator loops run as background tokio tasks, connected by a shared DashMap<TaskId, ()> that tracks which creates have been confirmed on-chain.
Batch flush triggers
1. Timer: batch_interval_ms elapses (default 2000ms; on Sepolia's 12s blocks, several flush windows can land in the same block)
2. Size: the batch reaches max_batch_size
The two accumulator loops run and flush independently.
Ordering invariant
The respond accumulator holds items in a pending queue. On each flush tick, it moves items whose task_id appears in the confirmed_creates: DashMap<TaskId, ()> into the current respond batch. Items waiting longer than respond_wait_timeout_ms (default 60s) are dropped with an error log.
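The gating step of the flush tick can be sketched as a self-contained snippet. This is a simplified model: a plain HashSet stands in for the shared DashMap, the TaskId type and queue shapes are illustrative, and timers/timeouts are omitted.

```rust
use std::collections::{HashSet, VecDeque};

// Sketch of the respond-side ordering gate: pending responds move into the
// next batch only once their create is confirmed on-chain. `TaskId` and the
// container choices are illustrative, not the real gateway types.
type TaskId = u64;

struct RespondAccumulator {
    pending: VecDeque<TaskId>,          // responds waiting on their create
    confirmed_creates: HashSet<TaskId>, // stand-in for DashMap<TaskId, ()>
}

impl RespondAccumulator {
    // Called on each flush tick: drain only the responds whose create confirmed,
    // up to the batch size cap; everything else stays queued for a later tick.
    fn take_ready(&mut self, max_batch_size: usize) -> Vec<TaskId> {
        let mut batch = Vec::new();
        let mut still_pending = VecDeque::new();
        while let Some(id) = self.pending.pop_front() {
            if batch.len() < max_batch_size && self.confirmed_creates.contains(&id) {
                batch.push(id);
            } else {
                still_pending.push_back(id);
            }
        }
        self.pending = still_pending;
        batch
    }
}

fn main() {
    let mut acc = RespondAccumulator {
        pending: VecDeque::from([1, 2, 3]),
        confirmed_creates: HashSet::from([1, 3]),
    };
    // Only tasks 1 and 3 are ready; task 2 stays queued for a later tick.
    assert_eq!(acc.take_ready(30), vec![1, 3]);
    assert_eq!(acc.pending, VecDeque::from([2]));
}
```

The real implementation would also track per-item enqueue times so that anything older than respond_wait_timeout_ms is dropped with an error log rather than requeued forever.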
Public interface (backward compatible)
BatchTransactionSubmitter exposes the same methods as TransactionSubmissionWorker:
- submit_create_task(task_id, params) -> oneshot::Receiver<Result<...>>
- submit_aggregated_response(task_id, task_request, service_response) -> Result<()>
- queue_capacity() -> usize / queue_depth() -> usize
This means sync.rs and handler/mod.rs only swap the type name.
Config

    pub struct BatchSubmitterConfig {
        pub enabled: bool,                   // default: false (feature-gated rollout)
        pub max_create_batch_size: usize,    // default: 120
        pub max_respond_batch_size: usize,   // default: 30
        pub batch_interval_ms: u64,          // default: 2000
        pub respond_wait_timeout_ms: u64,    // default: 60000
        pub batch_contract_address: Address, // from deployment JSON
    }
Files to create/modify
┌───────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────┐
│ File                                          │ Action                                                                                 │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Create — new batch accumulator                                                         │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/mod.rs             │ Modify — add mod batch_submitter export                                                │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/config.rs                  │ Modify — add BatchSubmitterConfig to GatewayConfig                                     │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/handler/mod.rs             │ Modify — instantiate BatchTransactionSubmitter when enabled (lines ~549, ~856)         │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs              │ Modify — add batch_create_tasks() and batch_respond_to_tasks() methods                 │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/task/submitter.rs          │ Modify — add batch methods to TaskSubmitter trait (or create BatchTaskSubmitter trait) │
└───────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────┘
Migration path
1. Phase A: Ship behind batch_submitter.enabled = false. The old TransactionSubmissionWorker remains the default.
2. Phase B: Enable on stagef, load test. Flip the default to true.
3. Phase C: Remove TransactionSubmissionWorker after production validation.
---
Layer 3: Nonce Pipelining
What
After batching reduces txs from 240/block to ~5/block, we still need those 5 txs submitted without blocking sequentially on each receipt. Split send() from get_receipt().
Design
In the batch flush path:

    // Fire phase: submit all batch txs with incrementing nonces
    let mut pending_txs: Vec<PendingTransaction> = Vec::new();
    for batch in batches {
        let pending = batch_contract.batchCreateTasks(batch).send().await?;
        pending_txs.push(pending);
    }
    // Await phase: concurrently wait for all receipts
    let receipts = futures::future::join_all(
        pending_txs.into_iter().map(|tx| tx.get_receipt())
    ).await;

Alloy's nonce manager auto-increments on each send() call, so nonces [N, N+1, N+2, ...] are assigned correctly without manual tracking.
Recovery
If the tx at nonce N fails after N+1 is in the mempool, N+1 is stuck. The recovery path:
1. Detect the failure from the receipt
2. Re-submit N with bumped gas (replacement tx)
3. Or cancel N with a zero-value self-transfer and re-submit the batch
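The recovery decision can be modeled as a pure function over the per-nonce outcomes of the await phase. A sketch only: real code would inspect alloy receipt and error types, and the TxOutcome/Recovery names are illustrative, not gateway types.

```rust
// Sketch: decide what to do after awaiting a run of pipelined nonces.
// A failure at nonce N strands N+1.. in the mempool, so N must be replaced
// (bumped gas or a cancel) before the later nonces can land.
#[derive(Clone, Copy, PartialEq, Debug)]
enum TxOutcome { Confirmed, Failed, StillPending }

#[derive(PartialEq, Debug)]
enum Recovery {
    None,              // every nonce landed or is still progressing normally
    ResubmitFrom(u64), // replace this nonce first; later nonces unblock after it
}

fn plan_recovery(first_nonce: u64, outcomes: &[TxOutcome]) -> Recovery {
    for (i, o) in outcomes.iter().enumerate() {
        if *o == TxOutcome::Failed {
            return Recovery::ResubmitFrom(first_nonce + i as u64);
        }
    }
    Recovery::None
}

fn main() {
    use TxOutcome::*;
    // Nonce 11 failed, so nonce 12 is stuck behind it until 11 is replaced.
    assert_eq!(
        plan_recovery(10, &[Confirmed, Failed, StillPending]),
        Recovery::ResubmitFrom(11)
    );
    assert_eq!(plan_recovery(10, &[Confirmed, Confirmed]), Recovery::None);
}
```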
Files to create/modify
┌───────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────┐
│ File                                          │ Action                                                               │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs              │ Modify — add send_without_receipt() variant for pipelined submission │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Modify — use pipelined send in flush logic                           │
└───────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘
---
Layer 4: Multi-Signer Pool (Optional — only if needed)
What
Round-robin across N signer keys per replica. Each signer has an independent nonce space.
When to implement
Only if load testing after Layers 1-3 shows a single signer can't keep up. The math says 5 txs/block is well within single-signer capacity, so this is likely unnecessary.
Design

    pub struct SignerPool {
        signers: Vec<Arc<AvsWriter>>,
        next: AtomicUsize,
    }

All signer addresses call the same BatchTaskManager contract (which is the registered task generator). No on-chain changes needed beyond Layer 1.
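Round-robin selection over the pool reduces to a single atomic counter. A sketch: AvsWriter is replaced by a placeholder type here, since only the rotation logic is being illustrated.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Placeholder for the real AvsWriter; only the rotation logic matters here.
struct Writer { id: usize }

struct SignerPool {
    signers: Vec<Arc<Writer>>,
    next: AtomicUsize,
}

impl SignerPool {
    // Relaxed ordering is sufficient: we only need a fair-ish rotation,
    // not any happens-before relationship between callers.
    fn next_signer(&self) -> Arc<Writer> {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.signers.len();
        Arc::clone(&self.signers[i])
    }
}

fn main() {
    let pool = SignerPool {
        signers: (0..3).map(|id| Arc::new(Writer { id })).collect(),
        next: AtomicUsize::new(0),
    };
    let picks: Vec<usize> = (0..4).map(|_| pool.next_signer().id).collect();
    assert_eq!(picks, vec![0, 1, 2, 0]); // wraps around the pool
}
```

Because each Arc<Writer> wraps a signer with its own nonce space, no cross-signer coordination is needed beyond this counter.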
Files
┌───────────────────────────────────┬────────────────────────────────────────────────────────┐
│ File                              │ Action                                                 │
├───────────────────────────────────┼────────────────────────────────────────────────────────┤
│ crates/gateway/src/signer_pool.rs │ Create (only if needed)                                │
├───────────────────────────────────┼────────────────────────────────────────────────────────┤
│ crates/gateway/src/config.rs      │ Modify — add additional_signers to TaskGeneratorConfig │
└───────────────────────────────────┴────────────────────────────────────────────────────────┘
---
Expected Performance
┌────────────────────────┬──────────────────┬────────────────────────────────────────┐
│ Metric                 │ Current          │ After Fix                              │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ Offchain response      │ 2.8s (OK)        │ 2.8s (unchanged)                       │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ On-chain task creation │ p50=50s          │ ~14s (2s accumulate + 12s block)       │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ On-chain e2e           │ p50=138s         │ ~28s (create confirm + 2s + 12s block) │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ Completion rate        │ 41%              │ >99%                                   │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ Max throughput/replica │ ~2-4 txs/min     │ 10 tasks/sec (120/block)               │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ Txs per block/replica  │ 240 (impossible) │ 5 (trivial)                            │
└────────────────────────┴──────────────────┴────────────────────────────────────────┘
---
Implementation Sequence
Step 1: BatchTaskManager contract (no Rust dependency)
1. Create contracts/src/BatchTaskManager.sol and its interface
2. Create contracts/test/BatchTaskManager.t.sol with tests:
   - All succeed, partial failure, auth checks, gas measurement
3. Create a deployment script following the DeploymentLib pattern
4. forge test && forge fmt
5. Run make generate-bindings
Step 2: AvsWriter batch methods (depends on Step 1 bindings)
1. Add batch_create_tasks() and batch_respond_to_tasks() to AvsWriter
2. Unit tests for the new methods
Step 3: BatchTransactionSubmitter (depends on Step 2)
1. Implement BatchTransactionSubmitter in batch_submitter.rs
2. Add BatchSubmitterConfig to GatewayConfig
3. Wire into handler/mod.rs behind the enabled flag
4. Unit tests for accumulator logic, ordering invariant, flush triggers
Step 4: Nonce pipelining (depends on Step 3)
1. Add pipelined send to the flush path
2. Add nonce gap recovery logic
Step 5: Integration testing
1. Update E2E tests to exercise the batch path
2. Load test: 100 tasks/sec sustained for 5 minutes
3. Verify the ordering invariant
4. Verify partial batch failure handling
Step 6: Multi-signer pool (only if the Step 5 load test shows the need)
---
Verification
1. forge test — all BatchTaskManager contract tests pass
2. make clippy — no new warnings
3. make tests — existing unit tests unaffected (feature-gated)
4. make gateway_e2e — existing E2E still works with the old worker (default)
5. make gateway_e2e with BATCH_SUBMITTER_ENABLED=true — E2E works with the new batch path
6. Load test at 100 tasks/sec for 5 min: >99% completion, <30s on-chain e2e
| Risks and Mitigations | |
| ┌──────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────┐ | |
| │ Risk │ Mitigation │ | |
| ├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤ | |
| │ Batch contract deployment fails │ Old sequential worker remains as fallback (feature flag) │ | |
| ├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤ | |
| │ Partial batch failure loses tasks │ try/catch per item; failures logged with task ID for retry │ | |
| ├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤ | |
| │ Nonce gap from failed pipeline tx │ Re-query nonce, re-submit remaining; fallback to sequential │ | |
| ├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤ | |
| │ Respond submitted before create confirms │ DashMap gating enforces ordering; items time out rather than submit early │ | |
| ├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤ | |
| │ Gas estimation wrong for large batches │ Conservative defaults (120 creates, 30 responds); configurable per env │ | |
| ├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤ | |
| │ Offchain response latency regresses │ Batch accumulator is completely decoupled from RPC response path — no impact │ | |
| └──────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────┘ |
⏺ Clippy clean, but formatting needs a fix. Let me auto-format.
⏺ Bash(cargo fmt --package newton-prover-gateway --package newton-prover-chainio)
⎿ (No output)
⏺ Now let me mark the tasks and commit.
⏺ Bash(git add -A && git commit -m "feat: add strip-and-retry failure handling for batch tx submission" && git push
origin feat/batch-task-manager 2>&1 | tail -10)
⎿ [feat/batch-task-manager 7f3e5d89] feat: add strip-and-retry failure handling for batch tx submission
2 files changed, 254 insertions(+), 115 deletions(-)
remote:
… +6 lines (ctrl+o to expand)
⏺ Done. Three commits on feat/batch-task-manager:
┌──────────┬──────────────────────────────────────────────────────────────────────┐
│ Commit │ Description │
├──────────┼──────────────────────────────────────────────────────────────────────┤
│ e1be99c6 │ BatchTaskManager.sol contract, interface, tests, deployer │
├──────────┼──────────────────────────────────────────────────────────────────────┤
│ 319e31b4 │ BatchTransactionSubmitter + AvsWriter batch methods + gateway wiring │
├──────────┼──────────────────────────────────────────────────────────────────────┤
│ 7f3e5d89 │ Strip-and-retry failure handling + TODO deprecation comments │
└──────────┴──────────────────────────────────────────────────────────────────────┘
★ Insight ─────────────────────────────────────
The strip-and-retry logic handles the three failure classes differently:
- Per-item reverts (e.g., TaskAlreadyExists): handled by Solidity try/catch — batch succeeds, individual items fail gracefully
- Whole-batch revert (non-transient, contract-level): simulate each item individually via eth_call to find poison tasks, strip them, retry healthy items
- Transient errors (RPC, nonce, network): exponential backoff retry of the whole batch, no item stripping
All paths are bounded by max_retries (default 3). No infinite loops, no trading one set of problems for another.
─────────────────────────────────────────────────
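The "bounded by max_retries" guarantee above can be sketched as a plain retry helper with exponential backoff delays. This is an illustrative sketch, not the real batch_submitter code; the actual implementation also sleeps between attempts, which is elided here for testability.

```rust
use std::time::Duration;

/// Delays an exponential-backoff schedule would sleep between attempts:
/// base, 2*base, 4*base, ... up to max_retries entries.
fn backoff_delays(base: Duration, max_retries: u32) -> Vec<Duration> {
    (0..max_retries).map(|attempt| base * 2u32.pow(attempt)).collect()
}

/// Retry `op` up to `max_retries` times; return the last error if every
/// attempt fails. No infinite loops: the attempt count is the only bound.
fn retry_bounded<T, E>(
    max_retries: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last_err = None;
    for _attempt in 0..max_retries {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("max_retries must be > 0"))
}
```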
Reply posted: https://github.com/newt-foundation/newton-prover-avs/pull/476#discussion_r2998301785
Summary of the response: Albert's gas concern is valid (~20-30K wasted per failed item in try/catch), but we chose
emit-and-continue because:
- Race conditions between 10 concurrent replicas — between eth_call simulation and send(), another replica can create the same task (TaskAlreadyExists) or blocks advance past the buffer window. Reverting the entire batch on these expected races wastes MORE gas than the per-item try/catch overhead, because all successful items' execution is thrown away.
- The strip-and-retry path already exists — when a whole-batch revert happens (systemic errors, gas exhaustion), the Rust side simulates each item individually, strips poison tasks, and retries. So the "detect before submit" workflow Albert describes is already implemented — it just activates on actual reverts rather than as a mandatory pre-flight.
- Proposed as follow-up: a hybrid approach (succeed when all pass, revert with structured error data when any fail) could give gas savings on the happy path. Worth exploring once we have load test data on real failure rates.
No code change needed — the current design is the correct default for a production system with concurrent replicas.
✻ Cogitated for 1m 12s
❯ Actually, each replica for sure will not have overlapping tasks, so they will never be submitting/processing the same
tasks. Given this context, re-evaluate Albert's comment
⏺ You're right — that changes things significantly. If replicas never overlap on tasks, then TaskAlreadyExists from
concurrent submission doesn't happen. The main race condition I argued is invalid.
★ Insight ─────────────────────────────────────
Re-evaluation with no replica overlap:
- Happy path (~99%+): ALL items in a batch succeed. Currently we emit 120 BatchCreateResult events (~2-3K gas each = ~300K gas wasted on success events alone).
- Albert's approach: the happy path commits with zero events. On failure, collect all failures into an array and revert with structured error data at the end. The Rust side parses the revert data to identify failed items, strips them, and retries.
- The Rust-side strip-and-retry already handles batch reverts — Albert's approach just makes the contract give us the failure details directly instead of needing per-item simulation.
Albert is right. No events on success, structured revert on failure.
─────────────────────────────────────────────────
Let me implement the change.
⏺ First, update the Solidity contract and interface, then the Rust side.
Remaining Steps to Deploy BatchTaskManager to Stagef/Prod
What PR #476 gives you (code-complete, ready to merge)
- BatchTaskManager.sol contract + tests
- BatchTaskManagerDeployer.s.sol deployment script
- Rust bindings, AvsWriter batch methods, BatchTransactionSubmitter
- Gateway wiring (opt-in via batch_submitter config)
What still needs to happen after merge
Step 1: Deploy BatchTaskManager contract on each chain
The deployer script (BatchTaskManagerDeployer.s.sol) handles everything in one tx:
- Deploys BatchTaskManager(taskManagerAddr, deployer)
- Registers it as a task generator on OperatorRegistry
- Authorizes all taskGenerator addresses from config as batch callers
Per chain:
Sepolia (source chain)
DEPLOYMENT_ENV=stagef forge script script/BatchTaskManagerDeployer.s.sol \
  --rpc-url $SEPOLIA_RPC_URL --private-key $DEPLOYER_KEY --broadcast -vvvv
Base Sepolia (destination chain)
DEPLOYMENT_ENV=stagef forge script script/BatchTaskManagerDeployer.s.sol \
  --rpc-url $BASE_SEPOLIA_RPC_URL --private-key $DEPLOYER_KEY --broadcast -vvvv
This writes output to script/deployments/batch-task-manager/{chainId}-stagef.json.
★ Insight ─────────────────────────────────────
The taskGenerator array in newton_prover_config.stagef.json already has the gateway signer addresses for each chain
(e.g., two addresses for Sepolia: 0xD450... and 0xd116...). The deployer reads these and calls addAuthorized() for each,
so every gateway replica can submit batch txs. No manual authorization step needed.
─────────────────────────────────────────────────
Step 2: Configure gateway env vars in deploy repo
Add to the ECS task definition (in newton-prover-avs-deploy):
GATEWAY__BATCH_SUBMITTER__ENABLED=true
GATEWAY__BATCH_SUBMITTER__BATCH_CONTRACT_ADDRESS=<address from step 1>
All other config fields have sane defaults (120 max create batch, 30 max respond batch, 2s interval, 3 retries). The
gateway falls back to the legacy TransactionSubmissionWorker when batch_submitter is absent or enabled=false.
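The defaults mentioned above (120 creates, 30 responds, 2s interval, 3 retries, opt-in enabled flag) might look like the following. This is a sketch with assumed field names; the real BatchSubmitterConfig lives in GatewayConfig and is populated from the GATEWAY__BATCH_SUBMITTER__* env vars.

```rust
/// Hypothetical shape of the batch submitter config with the documented
/// defaults. `enabled: false` models the fallback to the legacy
/// TransactionSubmissionWorker when batch_submitter is absent.
#[derive(Debug, Clone, PartialEq)]
struct BatchSubmitterConfig {
    enabled: bool,
    max_create_batch_size: usize,
    max_respond_batch_size: usize,
    batch_interval_ms: u64,
    max_retries: u32,
}

impl Default for BatchSubmitterConfig {
    fn default() -> Self {
        Self {
            enabled: false, // legacy worker remains the default until opted in
            max_create_batch_size: 120,
            max_respond_batch_size: 30,
            batch_interval_ms: 2_000,
            max_retries: 3,
        }
    }
}
```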
Step 3: Deploy updated gateway binary
The deploy trigger auto-deploys to stagef on merge to main. No separate action needed — the CDK pipeline picks up both
the new binary and new env vars.
Step 4: Verify on stagef
Run a load test similar to the one in the gist (10 users, 15s pacing, 5 min):
- Expect on-chain completion rate to jump from ~41% → >95%
- Expect e2e latency to drop from ~138s p50 → <30s target
Step 5: Prod deployment
- Deploy BatchTaskManager on mainnet chains:
  DEPLOYMENT_ENV=prod forge script script/BatchTaskManagerDeployer.s.sol \
    --rpc-url $MAINNET_RPC_URL --private-key $DEPLOYER_KEY --broadcast -vvvv
- Add the same env vars to the prod ECS config
- Manual workflow_dispatch with deploy_prod: true (reminder: prod deploys are never automatic)
What does NOT need integration with other Newton contracts
- No upgrade to NewtonProverTaskManager — BatchTaskManager is a standalone wrapper that calls the existing TaskManager's createNewTask and respondToTask
- No changes to ChallengeVerifier, ServiceManager, or cross-chain contracts — batch submission is purely a gateway-side optimization
- No transporter or challenger changes — they don't submit tasks
Summary table
┌──────┬────────────────────────────────────────────────────────┬──────────────┬────────────────────────────────────┐
│ Step │ Action │ Blocking? │ Who │
├──────┼────────────────────────────────────────────────────────┼──────────────┼────────────────────────────────────┤
│ 0 │ Merge PR #476 │ Yes │ Dennis (pending 0xkr8os │
│ │ │ │ re-approve) │
├──────┼────────────────────────────────────────────────────────┼──────────────┼────────────────────────────────────┤
│ 1 │ Deploy BatchTaskManager on Sepolia + Base Sepolia │ Yes │ Dennis (forge script) │
├──────┼────────────────────────────────────────────────────────┼──────────────┼────────────────────────────────────┤
│ 2 │ Add GATEWAY__BATCH_SUBMITTER__* env vars to deploy │ Yes │ Dennis (deploy repo PR) │
│ │ repo │ │ │
├──────┼────────────────────────────────────────────────────────┼──────────────┼────────────────────────────────────┤
│ 3 │ Gateway deploys to stagef │ Auto on │ CI │
│ │ │ merge │ │
├──────┼────────────────────────────────────────────────────────┼──────────────┼────────────────────────────────────┤
│ 4 │ Load test on stagef │ Yes │ Dennis │
├──────┼────────────────────────────────────────────────────────┼──────────────┼────────────────────────────────────┤
│ 5 │ Deploy contract + gateway to prod │ Manual │ Dennis │
└──────┴────────────────────────────────────────────────────────┴──────────────┴────────────────────────────────────┘
The contract deployment is the only on-chain step. Everything else is config and binary deployment.
Implementation Plan
- Document the config architecture pattern
Add to CLAUDE.md the three-tier config pattern:
┌───────────────────────────────────────┬───────────────────────────────┬──────────────────────────────────────────────────┐
│ Source │ What goes here │ Examples │
├───────────────────────────────────────┼───────────────────────────────┼──────────────────────────────────────────────────┤
│ Deployment JSONs │ Addresses of contracts we │ newtonProverTaskManager, batchTaskManager, │
│ (script/deployments/) │ deploy and manage │ operatorRegistry │
├───────────────────────────────────────┼───────────────────────────────┼──────────────────────────────────────────────────┤
│ Config JSONs │ External/configurable values │ aggregator_addr (EOA), taskGenerator (EOAs), │
│ (newton_prover_config.{env}.json) │ we don't manage │ task_response_window_block, feature flags │
├───────────────────────────────────────┼───────────────────────────────┼──────────────────────────────────────────────────┤
│ Environment variables │ Secrets and runtime-only │ PRIVATE_KEY, DATABASE_URL, REDIS_URL │
│ │ config │ │
└───────────────────────────────────────┴───────────────────────────────┴──────────────────────────────────────────────────┘
- Add batchTaskManager to DeploymentLib.DeploymentData
- Add field to DeploymentData struct in DeploymentLib.sol
- Use readAddressOr pattern so existing JSONs don't break
- BatchTaskManagerDeployer writes to newton-prover/{chainId}-{env}.json (not a separate directory)
- Update existing deployment JSONs to include the field
- Integrate into deploy-avs.sh
- Post-step after core AVS contracts: deploy BatchTaskManager if not already deployed
- On upgrade=true: skip BatchTaskManager if address already exists in deployment JSON
- Register as task generator on OperatorRegistry, authorize configured taskGenerator addresses
- Support batch=false to skip (for debugging)
- Gateway config refactor
- Remove batch_contract_address from BatchSubmitterConfig
- Gateway reads batchTaskManager from deployment JSON / NewtonAvsConfig at startup
- batch_submitter.enabled defaults to true
- Keep batch=false / GATEWAY__BATCH_SUBMITTER__ENABLED=false as escape hatch for legacy path
- Integration tests default to batch mode
- All E2E test infra deploys BatchTaskManager as part of setup
- Gateway starts with batch submitter enabled by default
- Add batch=false flag to test targets for legacy path testing
- Update make gateway (local dev) to include batch contract in deploy-all
- CI cache version bump
- Bump anvil state cache version so CI picks up fresh state with BatchTaskManager
- Makefile target
- Add deploy-batch-task-manager standalone target (for manual one-off deploys on stagef/prod)
- deploy-all calls it automatically after deploy-avs
Want me to proceed with implementation? This is a substantial set of changes across Solidity deploy scripts, Rust gateway
config, Makefile, E2E test infra, and CLAUDE.md. I'd suggest tackling them in the order listed above.
The batchTaskManager address (0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900) IS present in newton-cross-chain/84532-stagef.json, so the
batch submitter should be initialized. The issue is likely that the on-chain submission is silently failing.
★ Insight ─────────────────────────────────────
Your Datadog filter is searching by task ID (l393a98f3a585a7cb754dcfa6b). The batch submitter's flush logs ("flushing create batch",
"flushing respond batch") don't include individual task IDs — they batch multiple tasks together. So these logs won't appear in a
task-ID-filtered search. Try removing the task ID filter and searching for "flushing" or "batch" to see if the on-chain submissions
are happening and/or failing.
─────────────────────────────────────────────────
That said, the most likely reason all submissions are failing is that the BatchTaskManager contract isn't registered as a
taskGenerator on the NewtonProverTaskManager. Let me verify.
Searched for 2 patterns (ctrl+o to expand)
⏺ Now let me check how the BatchTaskManager contract calls the TaskManager — specifically whether msg.sender to createNewTask would be
the BatchTaskManager contract address.
Read 1 file (ctrl+o to expand)
⏺ There it is. The BatchTaskManager contract calls taskManager.createNewTask(tasks[i]) at line 77 and taskManager.respondToTask(...) at
line 118. When it does, msg.sender to the TaskManager is the BatchTaskManager contract address, not the gateway EOA.
The TaskManager's onlyTaskGenerator modifier at NewtonProverTaskManagerShared.sol:30-36 checks:
IOperatorRegistry(operatorRegistry).isTaskGenerator(msg.sender)
If the BatchTaskManager contract (0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900) was not registered as a taskGenerator in the
OperatorRegistry on Base Sepolia, every batch call reverts with OnlyTaskGenerator. The gateway's EOA was already registered, but the
new BatchTaskManager contract is a separate address.
★ Insight ─────────────────────────────────────
This is a classic delegation pattern issue. Before PR #476, the gateway EOA called createNewTask/respondToTask directly on the
TaskManager (EOA → TaskManager). Now with batching, the call chain is EOA → BatchTaskManager → TaskManager. The TaskManager sees
msg.sender = BatchTaskManager contract, not the gateway EOA. The onlyTaskGenerator check fails because the BatchTaskManager address
was never added to the OperatorRegistry's task generator allowlist.
The batch contract's NatSpec even says: "This contract is registered as a task generator on OperatorRegistry" (line 9) — but this
registration is a deployment step that may have been missed.
─────────────────────────────────────────────────
Two issues to fix
- Register BatchTaskManager as a taskGenerator (critical — blocking all submissions)
On Base Sepolia (84532), call OperatorRegistry.addTaskGenerator()
cast send <OPERATOR_REGISTRY_ADDR> \
  "addTaskGenerator(address)" \
  0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900 \
  --rpc-url <BASE_SEPOLIA_RPC> \
  --private-key <ADMIN_KEY>
The OperatorRegistry address for stagef on Base Sepolia is 0xfd779c14ca9e4e2275fe11ff7b80742bf3e0da4f (from the deployment JSON).
Also register the gateway EOA as authorized on the BatchTaskManager itself (line 31 onlyAuthorized modifier):
cast send 0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900 \
  "addAuthorized(address)" \
  <GATEWAY_TASK_GENERATOR_EOA> \
  --rpc-url <BASE_SEPOLIA_RPC> \
  --private-key <BATCH_TASK_MANAGER_OWNER_KEY>
- Policy schema mismatch (the original error from dd.csv)
The max_daily_spend schema validation failure is a separate, pre-existing issue — the IPFS schema expects max_daily_spend at root but
the on-chain params nest it under agents.
Why you see no error logs in Datadog
The batch submitter's flush errors would appear at error! level but potentially under a different service tag or without the task ID
you're filtering on. Try searching Datadog for:
- "flushing create batch" or "batch create" without a task ID filter
- "OnlyTaskGenerator" in error messages
- Error/Warn levels in the newton-prover-gateway service around the same timeframe
Can you check Datadog for these (no task ID filter, full time range):
- "batch create accumulator cancelled" or "batch respond accumulator cancelled"
- "batch transaction submitter started" — when was the last occurrence?
- Any Error level logs between 12:39 and 15:43 from newton-prover-gateway
Immediate fix: restart the gateway service. This will restart the batch accumulator loops.
Root cause to investigate: what killed the accumulator. The flush_create_batch_with_retry has catch_unwind (line 390-392) to protect
against panics during flush, but a panic elsewhere in the loop (e.g., during channel receive or timeout handling) would silently kill
the spawned task.
✻ Sautéed for 1m 10s
❯ [Image #7] I don't see "batch create accumulator cancelled" or "batch respond accumulator cancelled". No Error-level
logs between 12:39 and 12:43 from newton-prover-gateway
⎿ [Image #7]
⏺ This confirms it. The batch submitter started at 11:15 and was working at 12:38-12:39. Then it silently died — no cancellation, no
error, no crash log.
The accumulator loops run inside tokio::spawn. When a spawned task panics, the panic goes to stderr — not to structured
tracing/Datadog unless a custom panic hook is configured. This is why there are zero Error logs.
Immediate fix
Restart the gateway. This will restart the batch accumulator loops and the pending creates will start flowing again.
Code fix needed
The accumulator loops need panic protection around the entire loop body, not just during flush. Currently catch_unwind only wraps the
flush_create_batch_with_retry call (line 390), but a panic anywhere else in the loop (channel receive, item processing, timer tick
handling) silently kills the spawned task.
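The fix described above, sketched in a synchronous analogue: wrap the entire loop body in catch_unwind so a panic anywhere in it counts the failure and continues, instead of killing the loop. The real accumulator loop is async under tokio (where a panic aborts the spawned task), but the principle is the same. Names here are illustrative.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Run one "loop iteration" per input; a panicking iteration is caught
/// and tallied rather than propagating and killing the whole loop.
/// Returns (successful iterations, panicked iterations).
fn run_iterations(inputs: &[i32]) -> (u32, u32) {
    let (mut ok, mut panicked) = (0u32, 0u32);
    for &input in inputs {
        let result = catch_unwind(AssertUnwindSafe(|| {
            // stand-in for channel receive + item processing + flush
            if input < 0 {
                panic!("poison item: {input}");
            }
        }));
        match result {
            Ok(()) => ok += 1,
            Err(_) => panicked += 1, // would be logged with the panic payload
        }
    }
    (ok, panicked)
}
```

AssertUnwindSafe is needed because the closure captures mutable state across the unwind boundary; in the async version the equivalent is FutureExt::catch_unwind around the whole body, plus a panic hook so payloads reach structured logging rather than bare stderr.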
Read 1 file (ctrl+o to expand)
⏺ Looking at this code, I now think the issue isn't a panic — it's a hung RPC call. The accumulator loop is likely stuck inside
flush_create_batch_with_retry at line 344, which calls:
// writer.rs:503-507
let send_result = contract.batchCreateTasks(batch_tasks).send().await;
// ...
let receipt = pending.get_receipt().await // ← can hang forever
If a batch tx was sent but the receipt never arrives (tx dropped from mempool, gas too low, RPC node stall), get_receipt().await
blocks indefinitely. The accumulator loop is stuck waiting, all subsequent items queue up in the channel but are never processed.
This explains everything:
- No error/panic logs (nothing failed — it's just waiting)
- No cancellation logs (loop didn't exit)
- Creates queued successfully (channel has capacity)
- But no flush ever fires (loop is blocked on the stuck receipt)
Immediate fix
Restart the gateway.
Code fix
The batch_create_tasks call needs a timeout. Want me to add one? It would wrap the send+receipt in tokio::time::timeout:
// In AvsWriter::batch_create_tasks, wrap the receipt wait
let receipt = tokio::time::timeout(
Duration::from_secs(60),
pending.get_receipt()
).await
.map_err(|_| ChainIoError::CreateNewTaskCallFail {
reason: "batch create receipt timed out after 60s".into()
})??;
Also need to check receipt.status() — per lessons.md, get_receipt() doesn't propagate on-chain reverts.
★ Insight ─────────────────────────────────────
What these changes fix:
- Receipt timeout (60s) — get_receipt().await can hang indefinitely if a tx is dropped from the mempool. The timeout ensures the accumulator loop recovers and continues processing subsequent batches instead of silently dying.
- Receipt status check — per lessons.md, .get_receipt() returns Ok even for on-chain reverts. Without checking receipt.status(), the batch submitter would mark failed creates as "confirmed" in the DashMap, then the respond accumulator would try to submit responses for tasks that don't actually exist on-chain.
- Improved logging — respond submissions now log at INFO level (was DEBUG), making successful on-chain submissions visible in Datadog. The batch summary log includes success/fail counts for quick health monitoring.
─────────────────────────────────────────────────
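The two guards above reduce to a three-way outcome for every receipt wait: timed out, reverted (status false, which get_receipt() does not surface as Err), or confirmed. A sketch with stand-in types (the real code uses the alloy receipt type and tokio's timeout):

```rust
/// Outcome of a bounded receipt wait.
#[derive(Debug, PartialEq)]
enum ReceiptOutcome {
    Confirmed,
    TimedOut,
    Reverted,
}

/// `wait_result` models tokio::time::timeout around the receipt wait:
/// None  = the 60s timeout elapsed before a receipt arrived,
/// Some(status) = a receipt arrived with the given on-chain status bit.
fn check_receipt(wait_result: Option<bool>) -> ReceiptOutcome {
    match wait_result {
        None => ReceiptOutcome::TimedOut,
        Some(false) => ReceiptOutcome::Reverted, // receipt.status() == false
        Some(true) => ReceiptOutcome::Confirmed,
    }
}
```

Only the Confirmed arm should mark tasks in the DashMap; the other two arms fail the batch items so the respond accumulator never acts on tasks that were never created on-chain.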
Summary of changes
┌───────────────────────────────────────────────┬───────────────────────────────────────────────┬─────────────────────────────────┐
│ File │ Change │ Why │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs │ 60s timeout on get_receipt() in both │ Prevents accumulator loop from │
│ │ batch_create_tasks and batch_respond_to_tasks │ hanging forever on dropped txs │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs │ receipt.status() check after receipt │ Detects on-chain reverts that │
│ │ │ get_receipt() silently swallows │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs │ Log tx_hash on success │ Makes on-chain txs traceable in │
│ │ │ Datadog │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Panic payload in error log (?panic_info) │ Makes panic root cause visible │
│ │ │ in Datadog │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Respond success log promoted to INFO │ Makes successful responds │
│ │ │ visible in Datadog │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Batch flush summary with success/fail counts │ Quick health monitoring per │
│ │ │ flush cycle │
└───────────────────────────────────────────────┴───────────────────────────────────────────────┴─────────────────────────────────┘
Great questions — both are important architectural decisions. Let me address them directly.
You're right, the dual path adds confusion. My rationale was a safe rollout — flip enabled: false in prod, test on
stagef first, then switch. But in practice, having two code paths means two things to maintain and two things that can
diverge.
The cleaner approach: make batch the default and only path. The "sequential" behavior is just batch_interval_ms=0,
max_create_batch_size=1, max_respond_batch_size=1 — single-item batches that flush immediately. No need for two
separate implementations. We can remove TransactionSubmissionWorker entirely and have BatchTransactionSubmitter be the
sole submitter, always present (not Option).
I'll refactor this if you agree — it simplifies ChainService from two optional fields to one required field.
You're identifying the real risk. Let me break down the failure modes:
What try/catch handles today: Per-item reverts like TaskAlreadyExists, TaskAlreadyResponded. The batch tx succeeds,
individual items fail gracefully. This is the happy path.
What try/catch does NOT handle: The entire batch tx itself reverting — out of gas mid-execution, gas estimation
failure, nonce issues, RPC errors. When this happens, ALL items in the batch fail together. And this is where the
cascading failure scenario you described kicks in:
Batch of 120 tasks → tx reverts (e.g., 1 task has a poison policy client that OOGs)
→ Retry full batch → same poison task → reverts again
→ Retry → reverts again → max retries → 120 tasks dropped
The fix needs to happen in the Rust-side flush logic, not Solidity. Here's the approach:
Strip-and-retry with bounded attempts:
The simulation step (step 2) is key — eth_call is free and fast. We call batchCreateTasks([item]) for each item in the
failed batch. Items that revert in simulation are removed from the retry batch. This catches poison tasks without
needing to actually submit them.
For truly systemic failures (RPC down, nonce stuck, gas price spike): these are transient and should NOT trigger the
strip-and-retry logic. Only contract-level reverts (non-transient) trigger item stripping.
The decision matrix:
┌────────────────────────────────────┬─────────────────────────────────────┐
│ Failure Type │ Action │
├────────────────────────────────────┼─────────────────────────────────────┤
│ Individual item revert (try/catch) │ Continue batch, log failure │
├────────────────────────────────────┼─────────────────────────────────────┤
│ Whole batch tx reverts (contract) │ Simulate items, strip poison, retry │
├────────────────────────────────────┼─────────────────────────────────────┤
│ Transient (RPC, nonce, timeout) │ Retry whole batch with backoff │
├────────────────────────────────────┼─────────────────────────────────────┤
│ Max retries exceeded │ Fail all remaining items │
└────────────────────────────────────┴─────────────────────────────────────┘
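The decision matrix above, expressed as a classifier so each failure class maps to exactly one bounded action. Variant and function names are hypothetical, not the real batch_submitter types.

```rust
/// The three failure classes from the matrix.
#[derive(Debug, PartialEq)]
enum BatchFailure {
    ItemRevert,       // individual item revert, caught by Solidity try/catch
    WholeBatchRevert, // contract-level revert of the entire batch tx
    Transient,        // RPC error, nonce issue, timeout
}

/// The four possible actions from the matrix.
#[derive(Debug, PartialEq)]
enum Action {
    ContinueAndLog,     // batch succeeded; log the failed item
    SimulateStripRetry, // eth_call each item, strip poison tasks, retry
    BackoffRetry,       // retry the whole batch with backoff
    FailRemaining,      // max retries exceeded: fail all remaining items
}

/// Max-retries check dominates; otherwise the failure class decides.
fn next_action(failure: BatchFailure, attempts: u32, max_retries: u32) -> Action {
    if attempts >= max_retries {
        return Action::FailRemaining;
    }
    match failure {
        BatchFailure::ItemRevert => Action::ContinueAndLog,
        BatchFailure::WholeBatchRevert => Action::SimulateStripRetry,
        BatchFailure::Transient => Action::BackoffRetry,
    }
}
```

Keeping the mapping in one total function makes the key invariant checkable in a unit test: transient failures never trigger item stripping, and no path escapes the max_retries bound.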
Does this approach address your concern? The key guarantees: retries are bounded by max_retries, failed items are logged with task IDs rather than silently dropped, and transient failures never trigger item stripping.
Want me to implement this strip-and-retry logic in batch_submitter.rs, or should we discuss the "remove the Option /
single path" refactor first?