❯ Load Test Report:
  Configuration:
  - Users: 10
  - Spawn rate: 10 users/sec (all spawn immediately)
  - Pacing: 15 seconds between requests per user
  - Duration: 5 minutes
  - Effective throughput: ~10 requests every 15 seconds (~40 req/min)
  HTTP Layer - all 200 requests succeeded, median 2.8s, p95 3.5s
  On-chain Submit (RPC → NewTaskCreated)
  - 141/200 seen (70.5%)
  - p50: 50s · p95: 84s · p99: 94s
  End-to-End (RPC → on-chain response)
  - 82/200 completed (41%)
  - p50: 138s · p95: 192s · p99: 204s
  Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
  Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain - only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx submission layer.
❯ there is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the on-chain task creation and task response submission flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/
⎿  Read crates/gateway/src/rpc/api/tx_worker.rs (597 lines)
⎿  Listed directory .claude/
⏺ This is a critical performance issue. The load test data tells a clear story: the gateway's HTTP layer is fine (sub-4s), but the single-threaded TransactionSubmissionWorker is creating a fatal bottleneck - 59% of tasks never complete end-to-end.
  Let me deeply explore the architecture before proposing solutions.
⏺ 3 Explore agents finished (ctrl+o to expand)
  ├─ Explore tx submission flow · 35 tool uses · 114.9k tokens
  │  ⎿  Done
  ├─ Explore AvsWriter nonce mgmt · 22 tool uses · 118.9k tokens
  │  ⎿  Done
  └─ Explore gateway config and flow · 29 tool uses · 106.3k tokens
     ⎿  Done
❯ Load Test Report:
  Configuration:
  - Users: 10
  - Spawn rate: 10 users/sec (all spawn immediately)
  - Pacing: 15 seconds between requests per user
  - Duration: 5 minutes
  - Effective throughput: ~10 requests every 15 seconds (~40 req/min)
  HTTP Layer - all 200 requests succeeded, median 2.8s, p95 3.5s
  On-chain Submit (RPC → NewTaskCreated)
  - 141/200 seen (70.5%)
  - p50: 50s · p95: 84s · p99: 94s
  End-to-End (RPC → on-chain response)
  - 82/200 completed (41%)
  - p50: 138s · p95: 192s · p99: 204s
  Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
  Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain - only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx submission layer.
❯ there is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the on-chain task creation and task response submission flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/ /clarify-plan
⏺ Skill(/clarify-plan)
  ⎿  Successfully loaded skill
⏺ Good - I've completed a deep exploration of the gateway's transaction submission architecture. Here's what I found before asking questions:
✻ Insight ─────────────────────────────────────
  The core bottleneck is architectural, not configurational. The TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask AND respondToTask) through a single sequential loop - one tokio task, one mpsc channel, one-at-a-time processing. Each transaction blocks on get_receipt() (12-60s waiting for block confirmation). At 40 req/min generating 80 txs/min, but draining at ~2-4 txs/min, the queue is fundamentally oversubscribed by 20-40x.
─────────────────────────────────────────────────
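The oversubscription figure is plain arithmetic and easy to sanity-check. The sketch below is a hypothetical back-of-envelope model; the rates come from the load test above, not from gateway instrumentation:

```rust
// Back-of-envelope model of the tx_worker queue under the observed load:
// 40 req/min, each request producing 2 txs (createNewTask + respondToTask).
fn queue_backlog(arrival_per_min: f64, drain_per_min: f64, minutes: f64) -> f64 {
    (arrival_per_min - drain_per_min) * minutes
}

fn main() {
    let arrival = 80.0; // 40 req/min x 2 txs per request
    let drain = 3.0;    // midpoint of the observed ~2-4 txs/min drain rate
    println!("oversubscribed ~{:.0}x", arrival / drain);
    println!("backlog after 5 min: ~{} txs", queue_backlog(arrival, drain, 5.0));
}
```

At these rates the queue grows by several hundred transactions over the 5-minute test, consistent with the 64.5% timeout rate in the report.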
Here are my first three questions:
1. Which chain was this load test running against, and what's the target chain for production?
Block time matters enormously here - Base mainnet (2s blocks) vs Sepolia (12s) vs Base Sepolia (2s) changes the theoretical max throughput by 6x. And the nonce pipelining strategy differs by chain's mempool behavior.
2. Is the single-signer constraint (one ECDSA key for both createNewTask and respondToTask) a hard requirement, or can the gateway operate with multiple signer keys?
The current serialization exists solely to prevent nonce conflicts on a shared signer. If we can use N signers (e.g., a pool of hot wallets), we get N× parallelism trivially. If not, we need nonce pipelining (submit tx N+1 before tx N confirms), which is more complex but still achievable.
3. What's the target throughput and latency you need?
The load test hit 40 req/min and broke at 41% completion. Are we aiming for 40 req/min sustained with >99% completion? 100 req/min? 1000? And what's the acceptable e2e latency - is the current 2.8s HTTP response fine (since on-chain is async), or do callers need on-chain confirmation within a specific window?
✻ Brewed for 2m 59s
❯ 1. sepolia, for a prod environment with 10 gateway replicas and 10 task generator signer private keys, 1 for each 2. not a hard requirement I think, but please review the code base (including the Solidity contracts code for createNewTask and respondToTask to confirm my understanding); the gateway can operate with multiple signer keys but all task generators have to be configured on-chain (createNewTask and respondToTask are guarded with the onlyTaskGenerator guard). If my understanding is not correct though, then nonce pipelining would be needed 3. we are targeting at least 100 RPS (100 requests per second). Acceptable e2e latency for the offchain response is around 3 seconds or lower, not longer, ideally around 2 seconds. For onchain, we have more room, but we definitely want onchain e2e to happen within 1 min of request submission to the gateway, ideally within 15-30 seconds (on Ethereum L1/sepolia)
⏺ 2 Explore agents finished (ctrl+o to expand)
  ├─ Explore TaskManager access control · 21 tool uses · 90.8k tokens
  │  ⎿  Done
  └─ Explore nonce pipelining feasibility · 28 tool uses · 109.1k tokens
     ⎿  Done
⏺ Excellent findings. Let me synthesize what we now know before asking the next batch of questions.
✻ Insight ─────────────────────────────────────
  Multiple task generators are fully supported on-chain. The OperatorRegistry uses EnumerableSet.AddressSet - both createNewTask and respondToTask use onlyTaskGenerator, and the Sepolia config already has 2 generators configured. The old aggregator role is deprecated. This means the multi-signer parallelism path is wide open - no nonce pipelining complexity needed.
  The math for 100 RPS on Sepolia (12s blocks): Each task needs 2 txs (create + respond). At 100 RPS, that's 200 txs/sec. With 12s block time, each signer can land ~1 tx per block optimistically, or ~5 txs/block with nonce pipelining. With N signers doing round-robin, we need enough signers to absorb the throughput. But Sepolia's block gas limit and mempool behavior also constrain this.
─────────────────────────────────────────────────
Batch 2 - Architecture & Deployment Constraints:
4. You mentioned 10 gateway replicas with 10 task generator signer keys. Is each replica currently assigned exactly 1 signer key, or do all 10 replicas share the same key today?
This determines whether we're solving a within-replica parallelism problem (single replica needs multiple signers) or a cross-replica coordination problem (multiple replicas competing for the same nonce space). The 10-key setup suggests you've already anticipated this - I want to confirm the current deployment topology.
5. For the 100 RPS target - is that 100 newt_createTask calls/sec, or 100 total RPC calls/sec across all methods?
At 100 task creations/sec on Sepolia (12s blocks), we'd need to land ~1,200 createNewTask + ~1,200 respondToTask = 2,400 txs per 12-second block. That's aggressive for a single chain. Are you planning to batch tasks, or is each RPC call = 1 on-chain task?
6. Is the respondToTask submission latency-critical, or only createNewTask?
Currently both flow through the same sequential queue. If respondToTask can be more relaxed (e.g., batched every few seconds), we could prioritize createNewTask in a fast lane and batch responses separately - significantly reducing queue contention.
✻ Cooked for 2m 12s
❯ 4. each replica is currently assigned exactly 1 signer key. But this can change if needed. For the deploy pipeline for stagef vs prod, you can refer to @~/projects/deploy 5. 100 newt_createTask calls/sec 6. Neither has a strict latency requirement, but for sure we don't want to incur too much delay or discrepancy between the offchain task/task response vs onchain. And if possible, yes, we should explore a batching mechanism so that we can for example batch-submit task creation/response every interval (that will be our "block/batch" time). I initially made it the same sequential queue in order to ensure that for a single task, the task response NEVER gets submitted if the task has not been created successfully. There is no difference between createNewTask vs respondToTask in terms of how latency-critical they are
⎿  Listed directory ../deploy/
⏺ Great - the picture is becoming very clear. Let me ask one final batch to nail down the solution design.
✻ Insight ─────────────────────────────────────
  The math at 100 tasks/sec on Sepolia is extreme. 100 tasks/sec × 2 txs × 12s block time = 2,400 txs per block across the fleet. With 10 replicas at 1 signer each, that's 240 txs/block/replica. Individual transaction submission simply cannot scale to this - batching is mandatory. A BatchTaskManager contract registered as a task generator could batch N createNewTask calls into a single tx, reducing the problem by 1-2 orders of magnitude.
─────────────────────────────────────────────────
Batch 3 - Solution Constraints:
7. Is deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) acceptable?
This is the highest-leverage change. A contract registered as a task generator that batches 50-100 tasks per transaction would cut the tx count by 50-100x. Same for respondToTask. The alternative - submitting 240 individual txs per block per replica - is infeasible on any EVM chain regardless of nonce pipelining. Is adding a new contract to the deployment pipeline acceptable, or are we restricted to off-chain-only changes?
8. For the create→respond ordering invariant: is it acceptable if creates and responds are in different batches submitted sequentially (batch of creates first, then batch of responds a few seconds later)?
This preserves your invariant (no response without a successful create) while allowing independent batching windows. The tradeoff is slightly higher latency for the respond step (waits for the next batch interval).
9. What's the gas budget per transaction on Sepolia/mainnet for your task generator signers?
Batching 100 createNewTask calls in one tx could cost 20-50M gas. Sepolia's block gas limit is 30M. If an individual createNewTask costs ~200-300K gas, we can fit ~100 per batch. But if it costs 500K+, we're limited to ~60. Do you know the approximate gas cost of a single createNewTask and respondToTask today?
✻ Crunched for 39s
❯ 7. deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) is acceptable 8. yes, acceptable 9. I do not, is there a way to estimate?
⏺ Let me estimate gas costs from the existing test suite and contract code.
⏺ Bash(cd /Users/dennis.won/projects/newton-prover-avs/contracts && forge test --match-test "test_createTask\|test_respondToTask\|test_createNewTask" --gas-report -vv…)
  ⎿  Compiling 150 files with Solc 0.8.27
     Solc 0.8.27 finished in 22.85s
     Compiler run successful with warnings:
     … +13 lines (ctrl+o to expand)
  ⎿  (timeout 2m)
⏺ Explore(Explore createNewTask gas usage)
  ⎿  Done (36 tool uses · 91.3k tokens · 59s)
     (ctrl+o to expand)
⏺ Good - gas estimates from code analysis:
  - createNewTask: ~100-150K gas
  - respondToTask: ~400-800K gas (BLS verification dominates)
  - Sepolia block gas limit: 30M
✻ Insight ─────────────────────────────────────
  Batching math is very favorable. At 150K gas/create, we can fit ~200 creates per batch tx. At 800K gas/respond, ~37 responds per batch. Per replica at 10 tasks/sec over a 12s block: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = ~5 txs per block per signer. That's trivially achievable. Batching turns a 240-tx-per-block problem into a 5-tx-per-block problem.
─────────────────────────────────────────────────
I now have enough context to design the solution. Let me move to the Design phase.
⏺ Plan(Design batch tx submission plan)
  ⎿  Done (49 tool uses · 99.0k tokens · 3m 32s)
     (ctrl+o to expand)
⏺ The plan from the design agent is comprehensive. Let me verify one critical detail before writing the final plan.
  ⎿  Loaded ../.claude/rules/solidity.md
     Searched for 1 pattern, read 1 file (ctrl+o to expand)
⏺ Now I have the full picture. Let me read the rest of respondToTask to confirm the batch contract's parameter requirements, then write the final plan.
  Read 1 file (ctrl+o to expand)
⏺ Good - I have the full picture. Now let me write the final plan.
⏺ Updated plan
  ⎿  /plan to preview
⏺ The plan is written. Here's the summary:
✻ Insight ─────────────────────────────────────
  The key insight is that batching transforms the problem entirely. Instead of 240 individual transactions competing for block space per replica per block (impossible), we get 5 batch transactions per block (trivial). The BatchTaskManager contract is the highest-leverage change - it reduces the on-chain tx count by ~50x. The Rust-side batch accumulator then collects tasks into time-windowed batches (2s intervals), and nonce pipelining ensures those 5 txs don't block sequentially on receipt confirmation. The entire solution is feature-gated behind batch_submitter.enabled, so the old sequential worker stays as fallback.
  The ordering invariant (no respondToTask before createNewTask confirms) is preserved via a DashMap<TaskId, ()> shared between the create and respond accumulators - responds are held until their corresponding create is confirmed on-chain.
─────────────────────────────────────────────────
4 layers, 6 implementation steps:
1. BatchTaskManager.sol - batch N calls into 1 tx (Solidity, no Rust dependency)
2. AvsWriter batch methods - Rust bindings to call the batch contract
3. BatchTransactionSubmitter - replaces tx_worker.rs with time-windowed batch accumulation
4. Nonce pipelining - submit 5 batch txs concurrently, await receipts in parallel
5. Integration testing + load test at 100 tasks/sec
6. Multi-signer pool (only if needed - math says it won't be)
⏺ User approved Claude's plan
  ⎿  Plan saved to: ~/.claude/plans/valiant-giggling-moler.md · /plan to edit
Plan: High-Throughput Batch Transaction Submission

Context

Load testing at just 40 req/min revealed a critical bottleneck: TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask and respondToTask) through a single sequential loop, blocking on get_receipt() (12-60s per tx on Sepolia). Only 41% of tasks completed end-to-end; 64.5% hit the 180s response timeout.

Target: 100 tasks/sec sustained across 10 gateway replicas (10/sec/replica), with <3s offchain response and <30s on-chain e2e on Sepolia (12s blocks).

Root cause: The worker can process ~2-4 txs/min but needs 200 txs/sec. Throughput gap: ~3000x.

Solution: Four coordinated layers - a Solidity batch contract, a Rust batch accumulator, nonce pipelining, and an optional multi-signer pool.

---

Layer 1: BatchTaskManager Solidity Contract

What

A new standalone contract that batches N createNewTask and N respondToTask calls into single transactions. Registered as a task generator on OperatorRegistry.

Why

At 150K gas/create, one batch tx fits ~200 creates (30M block gas limit). At 800K gas/respond, ~37 per batch. Per replica per 12s block at 10 tasks/sec: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = 5 txs per block instead of 240. This is the highest-leverage change.
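These batch sizes follow directly from the gas figures; the sketch below re-derives them (the constants are the plan's estimated gas costs, not measured values):

```rust
// How many calls fit per batch tx under Sepolia's 30M block gas limit,
// using the plan's estimated per-call gas costs (assumptions, not measurements).
const BLOCK_GAS_LIMIT: u64 = 30_000_000;
const CREATE_GAS: u64 = 150_000;
const RESPOND_GAS: u64 = 800_000;

fn main() {
    let creates_per_batch = BLOCK_GAS_LIMIT / CREATE_GAS;   // 200
    let responds_per_batch = BLOCK_GAS_LIMIT / RESPOND_GAS; // 37
    // Per replica per 12s block at 10 tasks/sec:
    let tasks_per_block: u64 = 10 * 12;
    let batch_txs = tasks_per_block.div_ceil(creates_per_batch)
        + tasks_per_block.div_ceil(responds_per_batch);
    println!("{creates_per_batch} creates/batch, {responds_per_batch} responds/batch");
    println!("{batch_txs} batch txs per block per replica");
}
```

In practice a batch should leave headroom well below the full block gas limit; the plan's conservative defaults of 120 creates and 30 responds do exactly that.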
Design

Error semantics: try/catch per item (not revert-all). A single TaskAlreadyExists or TaskAlreadyResponded in a batch of 120 must not revert the other 119.

File: contracts/src/BatchTaskManager.sol

contract BatchTaskManager {
    INewtonProverTaskManager public immutable taskManager;

    // Per-item result events
    event BatchCreateResult(bytes32 indexed taskId, bool success, bytes reason);
    event BatchRespondResult(bytes32 indexed taskId, bool success, bytes reason);

    constructor(address _taskManager) {
        taskManager = INewtonProverTaskManager(_taskManager);
    }

    function batchCreateTasks(
        INewtonProverTaskManager.Task[] calldata tasks
    ) external returns (uint256 successCount) {
        for (uint256 i; i < tasks.length;) {
            try taskManager.createNewTask(tasks[i]) {
                emit BatchCreateResult(tasks[i].taskId, true, "");
                unchecked { ++successCount; }
            } catch (bytes memory reason) {
                emit BatchCreateResult(tasks[i].taskId, false, reason);
            }
            unchecked { ++i; }
        }
    }

    function batchRespondToTasks(
        INewtonProverTaskManager.Task[] calldata tasks,
        INewtonProverTaskManager.TaskResponse[] calldata responses,
        bytes[] calldata signatureDataArray
    ) external returns (uint256 successCount) {
        require(tasks.length == responses.length && responses.length == signatureDataArray.length);
        for (uint256 i; i < tasks.length;) {
            try taskManager.respondToTask(tasks[i], responses[i], signatureDataArray[i]) {
                emit BatchRespondResult(responses[i].taskId, true, "");
                unchecked { ++successCount; }
            } catch (bytes memory reason) {
                emit BatchRespondResult(responses[i].taskId, false, reason);
            }
            unchecked { ++i; }
        }
    }
}

Access control: The BatchTaskManager address is registered as a task generator via OperatorRegistry.addTaskGenerator(). The contract itself needs no auth modifier - only registered task generators can call taskManager.createNewTask (the contract is the msg.sender to TaskManager). Optionally add an onlyAuthorized mapping to prevent gas griefing from arbitrary callers.

Files to create/modify

| File | Action |
| --- | --- |
| contracts/src/BatchTaskManager.sol | Create - batch contract |
| contracts/src/interfaces/IBatchTaskManager.sol | Create - interface for Rust bindings |
| contracts/test/BatchTaskManager.t.sol | Create - unit tests |
| contracts/script/BatchTaskManagerDeployer.s.sol | Create - deployment script following DeploymentLib pattern |

Deployment steps
1. Deploy BatchTaskManager with the existing TaskManager address
2. Call OperatorRegistry.addTaskGenerator(batchTaskManagerAddress)
3. Write deployment to script/deployments/batch-task-manager/{chainId}-{env}.json
4. Run make generate-bindings to auto-generate Rust types in crates/core/src/generated/
---

Layer 2: Batch Accumulator (Rust - replaces tx_worker.rs)

What

Replace the sequential TransactionSubmissionWorker with BatchTransactionSubmitter, which accumulates tasks into time-windowed batches and submits via the BatchTaskManager contract.

Architecture

sync.rs ─────▶ CreateBatch Accumulator ──flush every 2s──▶ batchCreateTasks() on-chain
                        │
            (tracks confirmations via DashMap)
                        │
sync.rs ─────▶ RespondBatch Accumulator ─flush after creates confirm─▶ batchRespondToTasks()

Two independent accumulator loops run as background tokio tasks, connected by a shared DashMap<TaskId, ()> that tracks which creates have been confirmed on-chain.

Batch flush triggers
1. Timer: batch_interval_ms (default 2000ms on Sepolia)
2. Size: batch reaches max_batch_size
3. Both accumulator loops run independently

Ordering invariant

The respond accumulator holds items in a pending queue. On each flush tick, it moves items whose task_id appears in the confirmed_creates: DashMap<TaskId, ()> into the current respond batch. Items waiting longer than respond_wait_timeout_ms (default 60s) are dropped with an error log.
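The gating rule above can be sketched with std collections. This is illustrative only: the real design shares a DashMap between tokio tasks and would track wall-clock wait time rather than the tick counter used here; the type and field names are hypothetical:

```rust
use std::collections::{HashSet, VecDeque};

type TaskId = u64; // stand-in for the real task ID type

struct PendingRespond {
    task_id: TaskId,
    waited_ticks: u32, // flush ticks spent waiting for the create to confirm
}

/// On each flush tick: move responds whose create is confirmed into the next
/// batch, drop responds that have waited too long, keep the rest pending.
fn drain_ready(
    pending: &mut VecDeque<PendingRespond>,
    confirmed_creates: &HashSet<TaskId>,
    max_wait_ticks: u32,
) -> Vec<TaskId> {
    let mut batch = Vec::new();
    pending.retain_mut(|item| {
        if confirmed_creates.contains(&item.task_id) {
            batch.push(item.task_id);
            false // leaves the pending queue: goes into this flush's batch
        } else if item.waited_ticks >= max_wait_ticks {
            false // timed out waiting for its create; real code logs an error
        } else {
            item.waited_ticks += 1;
            true // create not confirmed yet; keep waiting
        }
    });
    batch
}

fn main() {
    let mut pending: VecDeque<_> = [
        PendingRespond { task_id: 1, waited_ticks: 0 },
        PendingRespond { task_id: 2, waited_ticks: 0 },
    ]
    .into_iter()
    .collect();
    let confirmed = HashSet::from([1u64]);
    let batch = drain_ready(&mut pending, &confirmed, 3);
    println!("batched: {batch:?}, still pending: {}", pending.len());
}
```

This preserves the invariant: a respond can only enter a batch after its create's confirmation has been recorded.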
Public interface (backward compatible)

BatchTransactionSubmitter exposes the same methods as TransactionSubmissionWorker:
- submit_create_task(task_id, params) -> oneshot::Receiver<Result<...>>
- submit_aggregated_response(task_id, task_request, service_response) -> Result<()>
- queue_capacity() -> usize / queue_depth() -> usize

This means sync.rs and handler/mod.rs only swap the type name.

Config

pub struct BatchSubmitterConfig {
    pub enabled: bool,                   // default: false (feature-gated rollout)
    pub max_create_batch_size: usize,    // default: 120
    pub max_respond_batch_size: usize,   // default: 30
    pub batch_interval_ms: u64,          // default: 2000
    pub respond_wait_timeout_ms: u64,    // default: 60000
    pub batch_contract_address: Address, // from deployment JSON
}
Files to create/modify

| File | Action |
| --- | --- |
| crates/gateway/src/rpc/api/batch_submitter.rs | Create - new batch accumulator |
| crates/gateway/src/rpc/api/mod.rs | Modify - add mod batch_submitter export |
| crates/gateway/src/config.rs | Modify - add BatchSubmitterConfig to GatewayConfig |
| crates/gateway/src/handler/mod.rs | Modify - instantiate BatchTransactionSubmitter when enabled (lines ~549, ~856) |
| crates/chainio/src/avs/writer.rs | Modify - add batch_create_tasks() and batch_respond_to_tasks() methods |
| crates/gateway/src/task/submitter.rs | Modify - add batch methods to TaskSubmitter trait (or create BatchTaskSubmitter trait) |

Migration path
1. Phase A: Ship behind batch_submitter.enabled = false. Old TransactionSubmissionWorker remains default.
2. Phase B: Enable on stagef, load test. Flip default to true.
3. Phase C: Remove TransactionSubmissionWorker after production validation.
| Layer 3: Nonce Pipelining | |
| What | |
| After batching reduces txs from 240/block to ~5/block, we still need those 5 txs submitted without blocking on each receipt sequentially. Split send() from get_receipt(). | |
| Design | |
| In the batch flush path: | |
// Fire phase: submit all batch txs with incrementing nonces
let mut pending_txs = Vec::new();
for batch in batches {
    let pending = batch_contract.batchCreateTasks(batch).send().await?;
    pending_txs.push(pending);
}

// Await phase: concurrently wait for all receipts
let receipts = futures::future::join_all(
    pending_txs.into_iter().map(|tx| tx.get_receipt())
).await;

Alloy's nonce manager auto-increments on each send() call, so nonces [N, N+1, N+2, ...] are assigned correctly without manual tracking.
Recovery

If the tx at nonce N fails after N+1 is in the mempool, N+1 is stuck. The recovery path:
1. Detect the failure from the receipt
2. Re-submit N with bumped gas (replacement tx)
3. Or cancel N with a zero-value self-transfer and re-submit the batch

Files to create/modify

| File | Action |
| --- | --- |
| crates/chainio/src/avs/writer.rs | Modify - add send_without_receipt() variant for pipelined submission |
| crates/gateway/src/rpc/api/batch_submitter.rs | Modify - use pipelined send in flush logic |
---

Layer 4: Multi-Signer Pool (Optional - only if needed)

What

Round-robin across N signer keys per replica. Each signer has an independent nonce space.

When to implement

Only if load testing after Layers 1-3 shows a single signer can't keep up. The math says 5 txs/block is well within single-signer capacity, so this is likely unnecessary.

Design

pub struct SignerPool {
    signers: Vec<Arc<AvsWriter>>,
    next: AtomicUsize,
}

All signer addresses call the same BatchTaskManager contract (which is the registered task generator). No on-chain changes needed beyond Layer 1.
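A minimal sketch of the round-robin selection, made generic over the writer type so it compiles without the real AvsWriter (which lives in crates/chainio):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Round-robin signer pool sketch: each signer owns an independent nonce space,
// so rotating between them multiplies submission parallelism with no nonce
// coordination. Generic over W so the example is self-contained.
struct SignerPool<W> {
    signers: Vec<Arc<W>>,
    next: AtomicUsize,
}

impl<W> SignerPool<W> {
    fn new(signers: Vec<Arc<W>>) -> Self {
        assert!(!signers.is_empty(), "pool needs at least one signer");
        Self { signers, next: AtomicUsize::new(0) }
    }

    /// Pick the next signer; safe to call concurrently from many tasks.
    fn next_signer(&self) -> Arc<W> {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.signers.len();
        Arc::clone(&self.signers[i])
    }
}

fn main() {
    let pool = SignerPool::new(vec![Arc::new("signer-a"), Arc::new("signer-b")]);
    for _ in 0..3 {
        println!("using {}", pool.next_signer());
    }
}
```

Relaxed ordering suffices here because the counter only distributes load; no other memory is synchronized through it.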
Files

| File | Action |
| --- | --- |
| crates/gateway/src/signer_pool.rs | Create (only if needed) |
| crates/gateway/src/config.rs | Modify - add additional_signers to TaskGeneratorConfig |

---

Expected Performance

| Metric | Current | After Fix |
| --- | --- | --- |
| Offchain response | 2.8s (OK) | 2.8s (unchanged) |
| On-chain task creation | p50=50s | ~14s (2s accumulate + 12s block) |
| On-chain e2e | p50=138s | ~28s (create confirm + 2s + 12s block) |
| Completion rate | 41% | >99% |
| Max throughput/replica | ~2-4 txs/min | 10 tasks/sec (120/block) |
| Txs per block/replica | 240 (impossible) | 5 (trivial) |
---

Implementation Sequence

Step 1: BatchTaskManager contract (no Rust dependency)
1. Create contracts/src/BatchTaskManager.sol and interface
2. Create contracts/test/BatchTaskManager.t.sol with tests:
   - All succeed, partial failure, auth checks, gas measurement
3. Create deployment script following DeploymentLib pattern
4. forge test && forge fmt
5. Run make generate-bindings

Step 2: AvsWriter batch methods (depends on Step 1 bindings)
1. Add batch_create_tasks() and batch_respond_to_tasks() to AvsWriter
2. Unit tests for the new methods

Step 3: BatchTransactionSubmitter (depends on Step 2)
1. Implement BatchTransactionSubmitter in batch_submitter.rs
2. Add BatchSubmitterConfig to GatewayConfig
3. Wire into handler/mod.rs behind the enabled flag
4. Unit tests for accumulator logic, ordering invariant, flush triggers

Step 4: Nonce pipelining (depends on Step 3)
1. Add pipelined send to the flush path
2. Add nonce gap recovery logic

Step 5: Integration testing
1. Update E2E tests to exercise the batch path
2. Load test: 100 tasks/sec sustained for 5 minutes
3. Verify ordering invariant
4. Verify partial batch failure handling

Step 6: Multi-signer pool (only if Step 5 load test shows need)

---

Verification
1. forge test - all BatchTaskManager contract tests pass
2. make clippy - no new warnings
3. make tests - existing unit tests unaffected (feature-gated)
4. make gateway_e2e - existing E2E still works with old worker (default)
5. make gateway_e2e with BATCH_SUBMITTER_ENABLED=true - E2E works with new batch path
6. Load test at 100 tasks/sec for 5 min: >99% completion, <30s on-chain e2e

---

Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| Batch contract deployment fails | Old sequential worker remains as fallback (feature flag) |
| Partial batch failure loses tasks | try/catch per item; failures logged with task ID for retry |
| Nonce gap from failed pipeline tx | Re-query nonce, re-submit remaining; fallback to sequential |
| Respond submitted before create confirms | DashMap gating enforces ordering; items time out rather than submit early |
| Gas estimation wrong for large batches | Conservative defaults (120 creates, 30 responds); configurable per env |
| Offchain response latency regresses | Batch accumulator is completely decoupled from RPC response path - no impact |
Good question. Let me evaluate both approaches for check_tasks_on_chain.
Sequential reads vs Multicall
Context: After a receipt timeout on a batch of N items, we need to check allTaskHashes[taskId] for each item to determine which tasks
actually landed on-chain.
Option A: Sequential reads
for item in &items {
let hash = task_manager.allTaskHashes(item.task_id).call().await?;
landed.push(hash != B256::ZERO);
}
Pros:
- Simple, no extra dependency
- Each call is independent, so a partial failure doesn't lose results
- alloy already has the generated binding for allTaskHashes
Cons:
- N RPC round-trips (one per task in the batch)
- At batch size 120, that's 120 sequential calls after a timeout, adding 10-30s latency on Sepolia
- Compounds the timeout problem we're trying to fix
Option B: Multicall3
let multicall = Multicall3::new(provider, MULTICALL3_ADDR);
let calls: Vec<_> = items.iter().map(|item| Call3 {
target: task_manager_addr,
callData: allTaskHashesCall { taskId: item.task_id }.abi_encode(),
allowFailure: true,
}).collect();
let results = multicall.aggregate3(calls).call().await?;
Pros:
- Single RPC round-trip regardless of batch size
- 120 items checked in ~200ms instead of ~30s
- Multicall3 is deployed at a deterministic address on all EVM chains (including anvil with --auto-impersonate)
Cons:
- Needs Multicall3 address constant (but it's 0xcA11bde05977b3631167028862bE2a173976CA11 on every chain)
- Slightly more complex ABI encoding
- If Multicall3 isn't deployed (unlikely; it's on 99%+ of chains), the entire check fails
Option C: eth_call batch simulation (middle ground)
// Simulate the entire batch via eth_call: if it succeeds, nothing is on-chain yet
// If it reverts with BatchPartialFailure, decode which items exist
let result = batch_contract
.batchCreateAndRespondToTasks(tasks, responses, sigs)
.call()
.await;
This is what we already planned for simulate_batch_create_and_respond. It doubles as the on-chain check: items that revert with TaskAlreadyExists are confirmed on-chain.
Pros:
- Single RPC call
- Reuses existing contract ABI (no Multicall3 dependency)
- Gives us BOTH information: what's on-chain AND what would succeed if resubmitted
Cons:
- Heavier than a simple storage read (runs full EVM execution per item)
- Gas estimation overhead (~400K per item Γ 120 = 48M gas simulated)
My recommendation: Option C (eth_call simulation)
Here's why: we already need simulate_batch_create_and_respond for the poison detection path. After a receipt timeout, the question isn't just "which tasks landed?" but "which tasks can be resubmitted?" The eth_call simulation answers both in one call:
- Items that revert with TaskAlreadyExists (0x2e98c533) → confirmed on-chain, send idempotent success
- Items that revert with TaskAlreadyResponded (0x68905dff) → confirmed on-chain AND responded, send idempotent success
- Items that succeed → not on-chain, safe to retry
- Items that revert with other errors → poison, strip them
This means we don't need a separate check_tasks_on_chain method at all. The receipt timeout path becomes:
timeout → simulate_batch(remaining_items) → classify each item → retry/succeed/fail
One method, one RPC call, complete information.
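The per-item classification above can be sketched as a plain selector match. Everything here is an illustrative stand-in, not the gateway's real API; only the two 4-byte selectors come from the plan itself:

```rust
/// Illustrative outcome for one batch item after a receipt timeout,
/// based on the revert data returned by the eth_call simulation.
/// Names are hypothetical sketches, not the real crate types.
#[derive(Debug, PartialEq)]
enum ItemOutcome {
    /// Simulation succeeded: task is not on-chain, safe to retry.
    Retry,
    /// TaskAlreadyExists / TaskAlreadyResponded: confirmed on-chain,
    /// report idempotent success to the caller.
    IdempotentSuccess,
    /// Any other revert: poison, strip it from the batch.
    Poison,
}

fn classify_item(sim_revert: Option<&[u8]>) -> ItemOutcome {
    match sim_revert {
        // No revert: the item would land if resubmitted.
        None => ItemOutcome::Retry,
        Some(data) if data.len() >= 4 => match [data[0], data[1], data[2], data[3]] {
            // TaskAlreadyExists() selector 0x2e98c533
            [0x2e, 0x98, 0xc5, 0x33] => ItemOutcome::IdempotentSuccess,
            // TaskAlreadyResponded() selector 0x68905dff
            [0x68, 0x90, 0x5d, 0xff] => ItemOutcome::IdempotentSuccess,
            _ => ItemOutcome::Poison,
        },
        // Malformed revert data: treat as poison rather than retry blindly.
        Some(_) => ItemOutcome::Poison,
    }
}

fn main() {
    assert_eq!(classify_item(None), ItemOutcome::Retry);
    assert_eq!(
        classify_item(Some(&[0x2e, 0x98, 0xc5, 0x33])),
        ItemOutcome::IdempotentSuccess
    );
    assert_eq!(classify_item(Some(&[0xde, 0xad, 0xbe, 0xef])), ItemOutcome::Poison);
    println!("ok");
}
```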
Insight ─────────────────────────────────────
The key insight is that eth_call simulation is strictly more informative than allTaskHashes reads. A storage read tells you "this task exists on-chain." A simulation tells you "this task exists on-chain AND here's exactly what would happen if you tried to resubmit it." Since we need the simulation for poison detection anyway, using it for the timeout path too eliminates an entire method from the trait surface.
─────────────────────────────────────────────
So the revised TaskSubmitter trait change is:
// Add ONE new method (not two):
async fn simulate_batch_create_and_respond(
&self,
batch_contract_addr: Address,
tasks: Vec,
responses: Vec,
signature_data_array: Vec,
) -> Result<(), ChainIoError>;
// Returns Ok(()) if all would succeed, BatchPartialFailure if some fail,
// or other ChainIoError for RPC issues
// Change return type on existing method:
async fn batch_create_and_respond_to_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
// (was an anyhow Result)
Does this approach make sense, or do you still want the separate check_tasks_on_chain method?
βΊ Ran 3 stop hooks (ctrl+o to expand)
βΏ Β Stop hook error: [ECOMODE #42/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.
βΊ Waiting for your response on Q7: whether Option C (single simulate_batch_create_and_respond method that handles both timeout
recovery and poison detection) works, or if you'd prefer separate methods.
β» Cooked for 1m 23s
βΊ Plan is written. Let me exit plan mode so you can review and approve before implementation begins.
---
Ready to code?
Here is Claude's plan:
---
Harden Batch Transaction Submission Pipeline
Context
PR #507 (feat/batch-create-and-respond) merged and deployed to stagef+prod. Load testing (15 tasks, centralized mode, single gateway) showed 60% success (9/15) with 6 recoverable tasks misclassified as poison. Root cause chain:
- TransportError(NullResp) on batch 8 triggers retry
- Receipt timeout (60s × 3 retries = 180s) blocks the accumulator
- parse_batch_partial_failure() returns None (stub) → falls through to simulation
- Simulation sends REAL transactions (not eth_call) and misclassifies TaskAlreadyExists inside BatchPartialFailure as poison
- 6 tasks that were actually on-chain get failed to callers
- 15 zombie BLS aggregator slots linger for 360s
Every failure in this cascade is recoverable. The protocol is correct; the error-handling pipeline loses information at each layer.
Design Decisions (from grill-me interview)
| # | Decision | Rationale |
| --- | --- | --- |
| 1 | TaskSubmitter returns Result<_, ChainIoError> | Stop losing structured error data via format!() |
| 2 | Receipt timeout → eth_call simulation | Determine which items landed on-chain without blind retry |
| 3 | Poison detection via eth_call (not real txs) | Current simulation sends real transactions: wasteful and dangerous |
| 4 | Keep accumulator architecture, timeout=30s, max_retries=2 | Accumulator is sound; reduce wasted time in retry loops |
| 5 | Decode BatchPartialFailure at writer level | Use as_revert_data() (proven at writer.rs:1091) |
| 6 | Exhaustive selector-based decoding | No string matching, no guessing, no fallback defaults |
| 7 | Single simulate_batch method for both timeout recovery and poison detection | One eth_call gives complete classification |
Files to Modify
Layer 1: Error types and classification
- crates/chainio/src/error.rs: add new ChainIoError variants, expand is_transient()
- crates/chainio/src/avs/errors.rs: add exhaustive selector constants and classify_batch_item_error()
Layer 2: Writer (on-chain interaction)
- crates/chainio/src/avs/writer.rs: implement parse_batch_partial_failure(), reduce timeout, add simulate_batch_create_and_respond()
Layer 3: TaskSubmitter trait bridge
- crates/gateway/src/task/submitter.rs: change return types to ChainIoError, add simulate_batch_create_and_respond()
Layer 4: Batch submitter (retry logic)
- crates/gateway/src/rpc/api/batch_submitter.rs: rewrite flush logic to use structured errors, remove string matching
Layer 5: Config
- crates/gateway/src/rpc/api/batch_submitter.rs: change defaults to max_retries=2
Layer 6: Mock impls (10 total)
- All TaskSubmitter test impls: add simulate_batch_create_and_respond(), update return types
Step-by-Step Implementation
Step 1: New ChainIoError variants (crates/chainio/src/error.rs)
Add these variants to the existing enum:
/// Task already exists on-chain (idempotent, not an error)
#[error("Task {task_id} already exists on-chain")]
TaskAlreadyExists { task_id: B256 },
/// Task already responded to on-chain (idempotent, not an error)
#[error("Task {task_id} already responded")]
TaskAlreadyResponded { task_id: B256 },
/// Contract reverted with a known error selector
#[error("Contract reverted: {name} (0x{selector})")]
ContractRevert {
    selector: String,  // 4-byte hex, e.g. "2e98c533"
    name: String,      // e.g. "TaskAlreadyExists"
    raw_data: Vec<u8>, // full revert data for downstream decoding
},
/// Per-item failure within a batch, with selector-decoded reason
#[derive(Debug, Clone)]
pub enum BatchItemError {
    TaskAlreadyExists,
    TaskAlreadyResponded,
    ContractRevert { selector: String, name: String },
    Unknown { raw: Vec<u8> },
}
Update is_transient() to also match TransactionTimeout:
pub fn is_transient(&self) -> bool {
    matches!(
        self,
        ChainIoError::RpcError(_)
            | ChainIoError::AlloyProviderError(_)
            | ChainIoError::TransactionTimeout { .. }
    )
}
Step 2: Exhaustive selector classification (crates/chainio/src/avs/errors.rs)
Add classify_batch_item_revert(data: &[u8]) -> BatchItemError:
pub fn classify_batch_item_revert(data: &[u8]) -> BatchItemError {
if data.len() < 4 {
return BatchItemError::Unknown { raw: data.to_vec() };
}
let selector: [u8; 4] = data[..4].try_into().unwrap();
match selector {
[0x2e, 0x98, 0xc5, 0x33] => BatchItemError::TaskAlreadyExists,
[0x68, 0x90, 0x5d, 0xff] => BatchItemError::TaskAlreadyResponded,
_ => {
let name = decode_error_selector(selector)
.unwrap_or_else(|| format!("0x{}", hex::encode(selector)));
BatchItemError::ContractRevert {
selector: hex::encode(selector),
name,
}
}
}
}
Here decode_error_selector is the existing function in writer.rs that maps selectors to names; we'll make it pub and move the selector-to-name mapping to errors.rs.
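A minimal sketch of what the relocated selector-to-name table could look like in errors.rs. Only the two entries below come from this plan; a real table would enumerate every custom error in the contract ABI:

```rust
/// Illustrative selector-to-name table backing decode_error_selector.
/// Only the two entries below are taken from the plan; the rest of the
/// contract's custom errors would be listed here too.
const KNOWN_SELECTORS: &[([u8; 4], &str)] = &[
    ([0x2e, 0x98, 0xc5, 0x33], "TaskAlreadyExists"),
    ([0x68, 0x90, 0x5d, 0xff], "TaskAlreadyResponded"),
];

/// Map a 4-byte error selector to a human-readable name, if known.
pub fn decode_error_selector(selector: [u8; 4]) -> Option<&'static str> {
    KNOWN_SELECTORS
        .iter()
        .find(|(sel, _)| *sel == selector)
        .map(|(_, name)| *name)
}

fn main() {
    assert_eq!(
        decode_error_selector([0x2e, 0x98, 0xc5, 0x33]),
        Some("TaskAlreadyExists")
    );
    assert_eq!(decode_error_selector([0x00, 0x00, 0x00, 0x00]), None);
    println!("ok");
}
```

A static slice keeps the mapping dependency-free; a HashMap would only pay off with many more selectors.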
Step 3: Implement parse_batch_partial_failure() (crates/chainio/src/avs/writer.rs)
Replace the stub at line 795 with real implementation using as_revert_data():
fn parse_batch_partial_failure(
    error: &alloy::contract::Error,
) -> Option<Vec<BatchFailedItem>> {
let revert_data = error.as_revert_data()?;
if revert_data.len() < 4 {
return None;
}
let selector: [u8; 4] = revert_data[..4].try_into().ok()?;
// BatchPartialFailure selector: 0x19ed9977
if selector != [0x19, 0xed, 0x99, 0x77] {
return None;
}
// Decode: BatchPartialFailure(FailedItem[] failures)
// FailedItem = (uint256 index, bytes32 taskId, bytes reason)
// Use alloy ABI decoding
use alloy::sol_types::SolType;
type FailedItemTuple = alloy::sol_types::sol_data::Array<
(alloy::sol_types::sol_data::Uint<256>,
alloy::sol_types::sol_data::FixedBytes<32>,
alloy::sol_types::sol_data::Bytes)
>;
let decoded = FailedItemTuple::abi_decode(&revert_data[4..], true).ok()?;
let items = decoded.into_iter().map(|(index, task_id, reason)| {
    BatchFailedItem {
        index: index.to::<usize>(),
        task_id: B256::from(task_id),
        reason: reason.into(),
    }
}).collect();
Some(items)
}
On the error path in batch_create_and_respond_to_tasks, use it to return ChainIoError::BatchPartialFailure:
Err(e) => {
if let Some(failures) = parse_batch_partial_failure(&e) {
return Err(ChainIoError::BatchPartialFailure { failures });
}
// Classify the top-level revert
if let Some(revert_data) = e.as_revert_data() {
let item_err = errors::classify_batch_item_revert(&revert_data);
// Convert to appropriate ChainIoError variant
return Err(match item_err { ... });
}
Err(ChainIoError::ContractError(e))
}
Step 4: Add simulate_batch_create_and_respond() to AvsWriter
New method using .call() (eth_call) instead of .send():
pub async fn simulate_batch_create_and_respond(
&self,
batch_contract_addr: Address,
tasks: Vec,
responses: Vec,
signature_data_array: Vec,
) -> Result<(), ChainIoError> {
// Same ABI encoding as batch_create_and_respond_to_tasks
// but use .call().await instead of .send().await
contract.batchCreateAndRespondToTasks(tasks, responses, sigs)
.call()
.await
.map(|_| ())
.map_err(|e| {
if let Some(failures) = parse_batch_partial_failure(&e) {
return ChainIoError::BatchPartialFailure { failures };
}
ChainIoError::ContractError(e)
})
}
Step 5: Reduce BATCH_RECEIPT_TIMEOUT (crates/chainio/src/avs/writer.rs)
const BATCH_RECEIPT_TIMEOUT: Duration = Duration::from_secs(30); // was 60
Step 6: Change TaskSubmitter trait (crates/gateway/src/task/submitter.rs)
Change batch method return types from an anyhow Result to Result<TransactionReceipt, ChainIoError>:
async fn batch_create_and_respond_to_tasks(
&self, batch_contract_addr: Address, tasks: Vec,
responses: Vec, signature_data_array: Vec,
) -> Result<TransactionReceipt, ChainIoError>;
async fn simulate_batch_create_and_respond(
&self, batch_contract_addr: Address, tasks: Vec,
responses: Vec, signature_data_array: Vec,
) -> Result<(), ChainIoError>;
// Also change these existing methods:
async fn batch_create_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
async fn batch_respond_to_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
Remove the format!() error wrapping in the AvsWriter impl; return ChainIoError directly.
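To illustrate why the format!() wrapping matters: once an error is collapsed to a string, downstream code can only do contains() matching, which is exactly what this rewrite removes. The error type below is a simplified stand-in, not the real ChainIoError:

```rust
use std::fmt;

// Simplified stand-in for one structured chain error variant.
#[derive(Debug)]
enum ChainIoError {
    TransactionTimeout { timeout_secs: u64 },
}

impl fmt::Display for ChainIoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ChainIoError::TransactionTimeout { timeout_secs } => {
                write!(f, "transaction timed out after {timeout_secs}s")
            }
        }
    }
}

// The anti-pattern: collapsing to a string discards the variant identity
// and the timeout_secs field, forcing fragile string matching downstream.
fn lossy(e: ChainIoError) -> String {
    format!("batch submit failed: {e}")
}

fn main() {
    let e = ChainIoError::TransactionTimeout { timeout_secs: 30 };
    let s = lossy(e);
    // Only substring matching remains possible on the flattened error.
    assert!(s.contains("timed out"));
    println!("ok");
}
```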
Step 7: Rewrite flush_combined_batch_with_retry() (crates/gateway/src/rpc/api/batch_submitter.rs)
The core rewrite. Replace string-matching error classification with pattern matching on ChainIoError variants.
Change max_retries default from 3 to 2:
fn default_batch_max_retries() -> u32 { 2 } // was 3
New flush logic (pseudocode):
loop {
match task_submitter.batch_create_and_respond_to_tasks(...).await {
Ok(receipt) => {
// Success β notify all callers
for (item, task, ..) in built_items {
let _ = item.result_tx.send(Ok((receipt.clone(), task)));
}
return;
}
Err(ChainIoError::BatchPartialFailure { failures }) => {
// Structured per-item failures β classify each
let failed_map: HashMap<usize, &BatchFailedItem> = ...;
let mut healthy = vec![];
for (i, item) in built_items.into_iter().enumerate() {
if let Some(failure) = failed_map.get(&i) {
let classification = classify_batch_item_revert(&failure.reason);
match classification {
BatchItemError::TaskAlreadyExists
| BatchItemError::TaskAlreadyResponded => {
// Idempotent success
let _ = item.0.result_tx.send(Ok((dummy_receipt(), item.1)));
}
_ => {
// Non-recoverable β fail this item
let _ = item.0.result_tx.send(Err(GatewayError::Internal(...)));
}
}
} else {
healthy.push(item);
}
}
if healthy.is_empty() { return; }
built_items = healthy;
attempt += 1;
if attempt > max_retries { fail_all; return; }
continue; // retry healthy remainder
}
Err(e) if e.is_transient() => {
attempt += 1;
if attempt > max_retries { fail_all; return; }
tokio::time::sleep(backoff).await;
backoff = min(backoff * 2, Duration::from_secs(5));
continue;
}
Err(ChainIoError::TransactionTimeout { .. }) => {
// Receipt timeout β simulate to classify
match task_submitter.simulate_batch_create_and_respond(...).await {
Ok(()) => {
// Nothing on-chain yet β safe to retry
attempt += 1;
if attempt > max_retries { fail_all; return; }
continue;
}
Err(ChainIoError::BatchPartialFailure { failures }) => {
// Some items landed β classify each (same as above)
...
}
Err(e) => {
// Simulation itself failed β treat as transient
attempt += 1;
if attempt > max_retries { fail_all; return; }
continue;
}
}
}
Err(e) => {
// Unknown non-transient error β try simulation to salvage
match task_submitter.simulate_batch_create_and_respond(...).await {
// Same classification as timeout path
...
}
}
}
}
Remove:
- extract_failed_indices(): replaced by direct pattern match on BatchPartialFailure
- identify_combined_poison_items(): replaced by simulate_batch_create_and_respond()
- is_transient_error() helper: replaced by ChainIoError::is_transient()
- All error_str.contains(...) string matching
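The transient-error branch in the flush loop implies a bounded backoff schedule. As a sketch of that arithmetic under the stated defaults (doubling, 5s cap, max_retries = 2), where the 500ms starting value is an assumption not fixed by the plan:

```rust
use std::time::Duration;

/// Compute the sleep schedule for the transient-error retry branch:
/// doubling each attempt, capped at 5s. The 500ms starting value is an
/// assumption; the plan only fixes the doubling, the cap, and max_retries.
fn backoff_schedule(max_retries: u32) -> Vec<Duration> {
    let cap = Duration::from_secs(5);
    let mut backoff = Duration::from_millis(500);
    let mut schedule = Vec::new();
    for _ in 0..max_retries {
        schedule.push(backoff);
        backoff = std::cmp::min(backoff * 2, cap);
    }
    schedule
}

fn main() {
    // With max_retries = 2 the worker sleeps 500ms, then 1s, then fails.
    let s = backoff_schedule(2);
    assert_eq!(s, vec![Duration::from_millis(500), Duration::from_secs(1)]);
    println!("ok");
}
```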
Step 8: Update identify_combined_poison_items → simulation via eth_call
Replace the current implementation that sends REAL single-item batch transactions with eth_call simulation:
async fn simulate_and_classify_items(
items: Vec,
task_submitter: &Arc,
batch_contract_addr: Address,
) -> (Vec, Vec, Vec) {
// (healthy, poison, idempotent)
// For each item, call simulate_batch_create_and_respond with a single-item batch
// Classify the result using BatchItemError
}
Step 9: Return TransactionTimeout from writer on receipt timeout
Currently the timeout path in batch_create_and_respond_to_tasks returns a generic error. Change to:
Err(_elapsed) => {
Err(ChainIoError::TransactionTimeout { timeout_secs: BATCH_RECEIPT_TIMEOUT.as_secs() })
}
Step 10: Update all mock TaskSubmitter implementations (10 impls)
Add simulate_batch_create_and_respond returning Ok(()) (default: all items would succeed).
Change return types on batch methods from an anyhow Result to Result<TransactionReceipt, ChainIoError>.
Impls to update:
- SuccessSubmitter β batch_submitter.rs tests
- FailingSubmitter β batch_submitter.rs tests
- PoisonDetectingSubmitter β batch_submitter.rs tests
- StallTaskSubmitter β tx_worker.rs tests
- MockAvsWriter β chain_routing.rs tests
- MockAvsWriter β common/mod.rs tests
- MockAvsWriter β send_task.rs tests
- DummySubmitter β secrets_encrypted.rs tests
- RecordingTaskSubmitter β create_task_proof_cid.rs tests
- AvsWriter production impl β submitter.rs
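A simplified, synchronous sketch of what the mock update amounts to. The trait and struct here are hypothetical stand-ins for the real async TaskSubmitter and its test impls, showing only the shape of the change (a simulate method defaulting to "all items would succeed"):

```rust
/// Simplified stand-in for the chainio error type; details elided.
#[derive(Debug)]
enum ChainIoError {
    BatchPartialFailure,
}

/// Hypothetical, synchronous stand-in for the async TaskSubmitter trait.
/// The default body gives mocks the behavior described in Step 10:
/// simulation reports that every item would land if submitted.
trait TaskSubmitterSketch {
    fn simulate_batch_create_and_respond(&self) -> Result<(), ChainIoError> {
        Ok(())
    }
}

/// Stand-in for a test mock such as SuccessSubmitter; it inherits the
/// default "all items succeed" simulation.
struct SuccessSubmitterSketch;
impl TaskSubmitterSketch for SuccessSubmitterSketch {}

fn main() {
    let mock = SuccessSubmitterSketch;
    assert!(mock.simulate_batch_create_and_respond().is_ok());
    println!("ok");
}
```

Giving the trait method a default body would let most mocks pick it up for free; only mocks that test failure paths need an override.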
Step 11: Update batch_submitter tests
Update existing tests to verify the new error handling:
- test_batch_create_success: verify structured receipt propagation
- test_batch_create_with_poison_items: verify BatchPartialFailure decoding and strip
- test_batch_create_panic_isolation: unchanged (panic handling)
- test_batch_submit_multiple_items: verify idempotent success for TaskAlreadyExists
Add new tests:
- test_batch_timeout_then_simulate: receipt timeout → simulation classifies items
- test_batch_partial_failure_decoding: verify BatchPartialFailure selector decoding
- test_batch_max_retries_2: verify max 2 retries (not 3)
Verification
- make fmt && make clippy: no warnings
- cargo test -p newton-prover-chainio --lib: writer/error tests pass
- cargo test -p newton-prover-gateway --lib: batch_submitter tests pass
- make gateway_e2e: single-chain E2E
- make e2e scenario=basic: unified E2E (production config path)
Insight ─────────────────────────────────────
Here's what this implementation achieves across the full error pipeline:
- No more string matching: every error classification uses typed ChainIoError variants or 4-byte selector matching via classify_batch_item_revert(). Zero contains() calls on error strings.
- No more stub functions: parse_batch_partial_failure() now uses as_revert_data() to extract and ABI-decode FailedItem[] from BatchPartialFailure reverts. extract_failed_indices() (the empty stub) is deleted entirely.
- Simulation uses eth_call, not real txs: simulate_and_classify() calls simulate_batch_create_and_respond(), which uses .call() (read-only) instead of .send(). No more wasting gas on poison detection.
- Reduced retry budget: max_retries defaults to 2 (was 3), BATCH_RECEIPT_TIMEOUT is 30s (was 60s). Worst-case cascade: 30s × 2 = 60s, down from 60s × 3 = 180s.
─────────────────────────────────────────────
Here's a summary of all changes made:
Layer 1: Error types (crates/chainio/src/error.rs)
- Added BatchItemError enum with selector-based classification
- Added ChainIoError variants: TaskAlreadyExists, TaskAlreadyResponded, ContractRevert, TransactionTimeout, RetriesExhausted, TransactionReverted
- Added is_transient() (includes TransactionTimeout) and is_idempotent() methods
Layer 2: Error classification (crates/chainio/src/avs/errors.rs)
- Added selectors module with byte-array constants
- Added classify_batch_item_revert(): exhaustive 4-byte selector classification
- Added classify_top_level_revert(): maps a top-level revert to a ChainIoError variant
Layer 3: Writer (crates/chainio/src/avs/writer.rs)
- BATCH_RECEIPT_TIMEOUT: 60s → 30s
- Implemented parse_batch_partial_failure() (was stub returning None)
- Added simulate_batch_create_and_respond() using .call() (eth_call)
- Fixed all batch method error paths with typed ChainIoError variants
Layer 4: TaskSubmitter trait (crates/gateway/src/task/submitter.rs)
- Batch methods return Result<_, ChainIoError> (was an anyhow Result via GatewayError)
- Added simulate_batch_create_and_respond() to trait + AvsWriter impl
Layer 5: Batch submitter (crates/gateway/src/rpc/api/batch_submitter.rs)
- max_retries default: 3 → 2
- Removed is_transient_error(), extract_failed_indices(), strip_combined_by_indices(), strip_by_indices()
- Rewrote flush_combined_batch_with_retry() with typed ChainIoError pattern matching
- Rewrote flush_respond_batch() similarly
- Replaced identify_combined_poison_items() (real txs) with simulate_and_classify() (eth_call)
Layer 6: Mock impls (6 test files)
- Updated all 9 mock TaskSubmitter impls with new return types and simulate_batch_create_and_respond()
Question 6: We need new ChainIoError variants for the exhaustive classification. Here's the proposed enum expansion:
Currently the batch path can produce these ChainIoError variants:
Proposed additions:
// New variants in ChainIoError:
/// Task already exists on-chain (idempotent: prior attempt landed)
TaskAlreadyExists { task_id: Option<B256> },
/// Task already responded (idempotent: prior create+respond landed)
TaskAlreadyResponded { task_id: Option<B256> },
/// Contract reverted with a recognized but non-recoverable error
ContractRevert { selector: [u8; 4], data: Bytes },
/// Receipt wait timed out; tx may or may not be on-chain
ReceiptTimeout { timeout_secs: u64 },
And update BatchPartialFailure.failures items to carry a classification enum instead of raw bytes:
pub enum BatchItemFailure {
AlreadyExists,
AlreadyResponded,
Revert { selector: [u8; 4], data: Bytes },
Unknown { data: Bytes },
}
pub struct BatchFailedItem {
pub index: usize,
pub task_id: B256,
pub failure: BatchItemFailure,
}
This means extract_failed_indices in the batch submitter becomes a simple pattern match: no string parsing, no simulation. Each failed item is already classified.
My recommendation: this is the right set of variants. ReceiptTimeout replaces the current CreateNewTaskCallFail { reason: "...timed out..." } string hack. TaskAlreadyExists/TaskAlreadyResponded surface directly instead of being detected via string matching. ContractRevert catches everything else with preserved data.
The task_id on TaskAlreadyExists/TaskAlreadyResponded is an Option<B256> because when a whole batch reverts with these, we may not know which specific task; but for BatchPartialFailure nested reasons, we always know the task_id from the FailedItem.
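With classified failures, the pattern match described above might look like this. The enum and struct follow the proposal; the is_idempotent helper itself is an illustrative addition (and Vec<u8> stands in for the alloy Bytes type to keep the sketch self-contained):

```rust
/// Per-item classification enum from the proposal above. Vec<u8> stands
/// in for alloy's Bytes so the sketch compiles without dependencies.
#[derive(Debug)]
pub enum BatchItemFailure {
    AlreadyExists,
    AlreadyResponded,
    Revert { selector: [u8; 4], data: Vec<u8> },
    Unknown { data: Vec<u8> },
}

impl BatchItemFailure {
    /// Idempotent failures mean a prior attempt already landed on-chain,
    /// so the caller gets a success rather than an error. The helper name
    /// is an illustrative addition, not part of the proposal.
    pub fn is_idempotent(&self) -> bool {
        matches!(
            self,
            BatchItemFailure::AlreadyExists | BatchItemFailure::AlreadyResponded
        )
    }
}

fn main() {
    assert!(BatchItemFailure::AlreadyExists.is_idempotent());
    assert!(BatchItemFailure::AlreadyResponded.is_idempotent());
    assert!(!BatchItemFailure::Unknown { data: vec![] }.is_idempotent());
    println!("ok");
}
```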
Agree with these variants?