@denniswon
Created March 26, 2026 22:34
Plan: High-Throughput Batch Transaction Submission
❯ Load Test Report:
Configuration:
- Users: 10
- Spawn rate: 10 users/sec (all spawn immediately)
- Pacing: 15 seconds between requests per user
- Duration: 5 minutes
- Effective throughput: ~10 requests every 15 seconds (~40 req/min)
HTTP Layer β€” all 200 requests succeeded, median 2.8s, p95 3.5s
On-chain Submit (RPC β†’ NewTaskCreated)
- 141/200 seen (70.5%)
- p50: 50s Β· p95: 84s Β· p99: 94s
End-to-End (RPC β†’ on-chain response)
- 82/200 completed (41%)
- p50: 138s Β· p95: 192s Β· p99: 204s
Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain — only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx submission layer.
there is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the onchain task creation and task response submission flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/
⎿  Read crates/gateway/src/rpc/api/tx_worker.rs (597 lines)
⎿  Listed directory .claude/
⏺ This is a critical performance issue. The load test data tells a clear story: the gateway's HTTP layer is fine (sub-4s), but the single-threaded TransactionSubmissionWorker is creating a fatal bottleneck — 59% of tasks never complete end-to-end.
Let me deeply explore the architecture before proposing solutions.
⏺ 3 Explore agents finished (ctrl+o to expand)
├─ Explore tx submission flow · 35 tool uses · 114.9k tokens
│ ⎿ Done
├─ Explore AvsWriter nonce mgmt · 22 tool uses · 118.9k tokens
│ ⎿ Done
└─ Explore gateway config and flow · 29 tool uses · 106.3k tokens
⎿  Done
❯ /clarify-plan
⏺ Skill(/clarify-plan)
⎿  Successfully loaded skill
⏺ Good β€” I've completed a deep exploration of the gateway's transaction submission architecture. Here's what I found before asking questions:
★ Insight ─────────────────────────────────────
The core bottleneck is architectural, not configurational. The TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask AND respondToTask) through a single sequential loop — one tokio task, one mpsc channel, one-at-a-time processing. Each transaction blocks on get_receipt() (12-60s waiting for block confirmation). With 40 req/min generating 80 txs/min but the worker draining only ~2-4 txs/min, the queue is fundamentally oversubscribed by 20-40x.
─────────────────────────────────────────────────
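The oversubscription arithmetic above can be checked directly. A sketch using only the figures quoted in the report; the 2-4 txs/min drain rate is the estimate from the get_receipt() discussion, not a measured value:

```rust
// Back-of-envelope queue math behind the 20-40x figure.
fn main() {
    let req_per_min = 40; // load test arrival rate
    let txs_per_task = 2; // createNewTask + respondToTask
    let enqueued_per_min = req_per_min * txs_per_task; // 80 txs/min in

    // Sequential worker drains one receipt at a time: ~15-30s each.
    let drain_best = 4; // txs/min at ~15s per receipt
    let drain_worst = 2; // txs/min at ~30s per receipt

    println!(
        "oversubscribed {}x-{}x; backlog after 5 min: ~{} txs",
        enqueued_per_min / drain_best,
        enqueued_per_min / drain_worst,
        (enqueued_per_min - drain_best) * 5
    );
    assert_eq!(enqueued_per_min / drain_best, 20);
    assert_eq!(enqueued_per_min / drain_worst, 40);
}
```

The ~380-tx backlog after a 5-minute run matches the observed pattern: 129/200 tasks hitting the 180s timeout.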
Here are my first three questions:
1. Which chain was this load test running against, and what's the target chain for production?
Block time matters enormously here — Base mainnet (2s blocks) vs Sepolia (12s) vs Base Sepolia (2s) changes the theoretical max throughput by 6x. And the nonce pipelining strategy differs by chain's mempool behavior.
2. Is the single-signer constraint (one ECDSA key for both createNewTask and respondToTask) a hard requirement, or can the gateway operate with multiple signer keys?
The current serialization exists solely to prevent nonce conflicts on a shared signer. If we can use N signers (e.g., a pool of hot wallets), we get N× parallelism trivially. If not, we need nonce pipelining (submit tx N+1 before tx N confirms), which is more complex but still achievable.
3. What's the target throughput and latency you need?
The load test hit 40 req/min and broke at 41% completion. Are we aiming for 40 req/min sustained with >99% completion? 100 req/min? 1000? And what's the acceptable e2e latency — is the current 2.8s HTTP response fine (since on-chain is async), or do callers need on-chain confirmation within a specific window?
✻ Brewed for 2m 59s
❯ 1. sepolia, for prod environment with 10 gateway replicas with 10 task generator signer private keys, 1 for each 2. not a hard requirement I think, but please review the code base (including the solidity contracts code for createNewTask and respondToTask to confirm my understanding); the gateway can operate with multiple signer keys, but all task generators have to be configured onchain (createNewTask and respondToTask are guarded with the onlyTaskGenerator modifier). If my understanding is not correct, then nonce pipelining would be needed 3. we are targeting at least 100 RPS (100 requests per second). Acceptable e2e latency for offchain response is around 3 seconds or lower, ideally around 2 seconds. For onchain, we have more room, but we definitely want onchain e2e to happen within 1 min of request submission to the gateway, ideally within 15-30 seconds (on Ethereum L1/sepolia)
⏺ 2 Explore agents finished (ctrl+o to expand)
├─ Explore TaskManager access control · 21 tool uses · 90.8k tokens
│ ⎿ Done
└─ Explore nonce pipelining feasibility · 28 tool uses · 109.1k tokens
⎿  Done
⏺ Excellent findings. Let me synthesize what we now know before asking the next batch of questions.
★ Insight ─────────────────────────────────────
Multiple task generators are fully supported on-chain. The OperatorRegistry uses EnumerableSet.AddressSet — both createNewTask and respondToTask use onlyTaskGenerator, and the Sepolia config already has 2 generators configured. The old aggregator role is deprecated. This means the multi-signer parallelism path is wide open — no nonce pipelining complexity needed.
The math for 100 RPS on Sepolia (12s blocks): Each task needs 2 txs (create + respond). At 100 RPS, that's 200 txs/sec. With 12s block time, each signer can pipeline ~1 tx per block optimistically, or ~5 txs/block with nonce pipelining. With N signers doing round-robin, we need enough signers to absorb the throughput. But Sepolia's block gas limit and mempool behavior also constrain this.
─────────────────────────────────────────────────
Batch 2 — Architecture & Deployment Constraints:
4. You mentioned 10 gateway replicas with 10 task generator signer keys. Is each replica currently assigned exactly 1 signer key, or do all 10 replicas share the same key today?
This determines whether we're solving a within-replica parallelism problem (single replica needs multiple signers) or a cross-replica coordination problem (multiple replicas competing for the same nonce space). The 10-key setup suggests you've already anticipated this — I want to confirm the current deployment topology.
5. For the 100 RPS target — is that 100 newt_createTask calls/sec, or 100 total RPC calls/sec across all methods?
At 100 task creations/sec on Sepolia (12s blocks), we'd need to land ~1,200 createNewTask + ~1,200 respondToTask = 2,400 txs per 12-second block. That's aggressive for a single chain. Are you planning to batch tasks, or is each RPC call = 1 on-chain task?
6. Is the respondToTask submission latency-critical, or only createNewTask?
Currently both flow through the same sequential queue. If respondToTask can be more relaxed (e.g., batched every few seconds), we could prioritize createNewTask in a fast lane and batch responses separately — significantly reducing queue contention.
✻ Cooked for 2m 12s
❯ 4. each replica is currently assigned exactly 1 signer key, but this can change if needed. For the deploy pipeline for stagef vs prod, you can refer to @~/projects/deploy 5. 100 newt_createTask calls/sec 6. Neither has a strict latency requirement, but we definitely don't want to incur too much delay or discrepancy between the offchain task/task response and onchain. And if possible, yes, we should explore a batching mechanism so that we can, for example, batch submit task creation/response every interval (that will be our "block/batch" time). I initially made it the same sequential queue in order to ensure that for a single task, the task response NEVER gets submitted if the task has not been created successfully. There is no difference between createNewTask vs respondToTask in terms of how latency critical they are
⎿  Listed directory ../deploy/
⏺ Great — the picture is becoming very clear. Let me ask one final batch to nail down the solution design.
★ Insight ─────────────────────────────────────
The math at 100 tasks/sec on Sepolia is extreme. 100 tasks/sec × 2 txs × 12s block time = 2,400 txs per block across the fleet. With 10 replicas at 1 signer each, that's 240 txs/block/replica. Individual transaction submission simply cannot scale to this — batching is mandatory. A BatchTaskManager contract registered as a task generator could batch N createNewTask calls into a single tx, reducing the problem by 1-2 orders of magnitude.
─────────────────────────────────────────────────
Batch 3 — Solution Constraints:
7. Is deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) acceptable?
This is the highest-leverage change. A contract registered as a task generator that batches 50-100 tasks per transaction would cut the tx count by 50-100x. Same for respondToTask. The alternative — submitting 240 individual txs per block per replica — is infeasible on any EVM chain regardless of nonce pipelining. Is adding a new contract to the deployment pipeline acceptable, or are we restricted to off-chain-only changes?
8. For the create→respond ordering invariant: is it acceptable if creates and responds are in different batches submitted sequentially (batch of creates first, then batch of responds a few seconds later)?
This preserves your invariant (no response without a successful create) while allowing independent batching windows. The tradeoff is slightly higher latency for the respond step (waits for the next batch interval).
9. What's the gas budget per transaction on Sepolia/mainnet for your task generator signers?
Batching 100 createNewTask calls in one tx could cost 20-50M gas. Sepolia's block gas limit is 30M. If individual createNewTask costs ~200-300K gas, we can fit ~100 per batch. But if it costs 500K+, we're limited to ~60. Do you know the approximate gas cost of a single createNewTask and respondToTask today?
✻ Crunched for 39s
❯ 7. yes, deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) is acceptable 8. yes, acceptable 9. I do not; is there a way to estimate?
⏺ Let me estimate gas costs from the existing test suite and contract code.
⏺ Bash(cd /Users/dennis.won/projects/newton-prover-avs/contracts && forge test --match-test "test_createTask\|test_respondToTask\|test_createNewTask" --gas-report -vv…)
⎿  Compiling 150 files with Solc 0.8.27
Solc 0.8.27 finished in 22.85s
Compiler run successful with warnings:
… +13 lines (ctrl+o to expand)
⎿  (timeout 2m)
⏺ Explore(Explore createNewTask gas usage)
⎿  Done (36 tool uses · 91.3k tokens · 59s)
(ctrl+o to expand)
⏺ Good — gas estimates from code analysis:
- createNewTask: ~100-150K gas
- respondToTask: ~400-800K gas (BLS verification dominates)
- Sepolia block gas limit: 30M
★ Insight ─────────────────────────────────────
Batching math is very favorable. At 150K gas/create, we can fit ~200 creates per batch tx. At 800K gas/respond, ~37 responds per batch. Per replica at 10 tasks/sec over a 12s block: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = ~5 txs per block per signer. That's trivially achievable. Batching turns a 240-tx-per-block problem into a 5-tx-per-block problem.
─────────────────────────────────────────────────
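As a sanity check, the batch-sizing arithmetic works out as follows. All gas figures are the code-analysis estimates above, not measured values:

```rust
// Batch sizing on Sepolia: how many batch txs per block per signer?
fn main() {
    let block_gas: u64 = 30_000_000; // Sepolia block gas limit
    let create_gas: u64 = 150_000; // estimated createNewTask cost
    let respond_gas: u64 = 800_000; // estimated respondToTask cost (BLS-heavy)

    let creates_per_batch = block_gas / create_gas; // how many fit in one tx
    let responds_per_batch = block_gas / respond_gas;

    let tasks_per_block: u64 = 10 * 12; // 10 tasks/sec/replica over a 12s block

    // Ceiling division: batch txs needed per block per signer.
    let create_txs = (tasks_per_block + creates_per_batch - 1) / creates_per_batch;
    let respond_txs = (tasks_per_block + responds_per_batch - 1) / responds_per_batch;

    assert_eq!(creates_per_batch, 200);
    assert_eq!(responds_per_batch, 37);
    assert_eq!(create_txs, 1);
    assert_eq!(respond_txs, 4);
    assert_eq!(create_txs + respond_txs, 5); // vs 240 individual txs
}
```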
I now have enough context to design the solution. Let me move to the Design phase.
⏺ Plan(Design batch tx submission plan)
⎿  Done (49 tool uses · 99.0k tokens · 3m 32s)
(ctrl+o to expand)
⏺ The plan from the design agent is comprehensive. Let me verify one critical detail before writing the final plan.
⎿  Loaded ../.claude/rules/solidity.md
Searched for 1 pattern, read 1 file (ctrl+o to expand)
⏺ Now I have the full picture. Let me read the rest of respondToTask to confirm the batch contract's parameter requirements, then write the final plan.
Read 1 file (ctrl+o to expand)
⏺ Good β€” I have the full picture. Now let me write the final plan.
⏺ Updated plan
⎿  /plan to preview
⏺ The plan is written. Here's the summary:
★ Insight ─────────────────────────────────────
The key insight is that batching transforms the problem entirely. Instead of 240 individual transactions competing for block space per replica per block (impossible), we get 5 batch transactions per block (trivial). The BatchTaskManager contract is the highest-leverage change — it reduces the on-chain tx count by ~50x. The Rust-side batch accumulator then collects tasks into time-windowed batches (2s intervals), and nonce pipelining ensures those 5 txs don't block sequentially on receipt confirmation. The entire solution is feature-gated behind batch_submitter.enabled, so the old sequential worker stays as fallback.
The ordering invariant (no respondToTask before createNewTask confirms) is preserved via a DashMap<TaskId, ()> shared between the create and respond accumulators — responds are held until their corresponding create is confirmed on-chain.
─────────────────────────────────────────────────
4 layers, 6 implementation steps:
1. BatchTaskManager.sol — batch N calls into 1 tx (Solidity, no Rust dependency)
2. AvsWriter batch methods — Rust bindings to call the batch contract
3. BatchTransactionSubmitter — replaces tx_worker.rs with time-windowed batch accumulation
4. Nonce pipelining — submit 5 batch txs concurrently, await receipts in parallel
5. Integration testing + load test at 100 tasks/sec
6. Multi-signer pool (only if needed — math says it won't be)
⏺ User approved Claude's plan
⎿  Plan saved to: ~/.claude/plans/valiant-giggling-moler.md · /plan to edit
Plan: High-Throughput Batch Transaction Submission
Context
Load testing at just 40 req/min revealed a critical bottleneck: TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask and respondToTask) through a single sequential loop, blocking on get_receipt() (12-60s per tx on Sepolia). Only 41% of tasks completed end-to-end; 64.5% hit the 180s timeout.
Target: 100 tasks/sec sustained across 10 gateway replicas (10/sec/replica), with <3s offchain response and <30s on-chain e2e on Sepolia (12s blocks).
Root cause: The worker can process ~2-4 txs/min but needs 200 txs/sec. Throughput gap: ~3000x.
Solution: Four coordinated layers — a Solidity batch contract, a Rust batch accumulator, nonce pipelining, and an optional multi-signer pool.
---
Layer 1: BatchTaskManager Solidity Contract
What
A new standalone contract that batches N createNewTask and N respondToTask calls into single transactions. Registered as a task generator on OperatorRegistry.
Why
At 150K gas/create, one batch tx fits ~200 creates (30M block gas limit). At 800K gas/respond, ~37 per batch. Per replica per 12s block at 10 tasks/sec: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = 5 txs per block instead of 240. This is the highest-leverage change.
Design
Error semantics: try/catch per item (not revert-all). A single TaskAlreadyExists or TaskAlreadyResponded in a batch of 120 must not revert the other 119.
File: contracts/src/BatchTaskManager.sol
contract BatchTaskManager {
    INewtonProverTaskManager public immutable taskManager;

    // Per-item result events
    event BatchCreateResult(bytes32 indexed taskId, bool success, bytes reason);
    event BatchRespondResult(bytes32 indexed taskId, bool success, bytes reason);

    constructor(address _taskManager) {
        taskManager = INewtonProverTaskManager(_taskManager);
    }

    function batchCreateTasks(
        INewtonProverTaskManager.Task[] calldata tasks
    ) external returns (uint256 successCount) {
        for (uint256 i; i < tasks.length;) {
            try taskManager.createNewTask(tasks[i]) {
                emit BatchCreateResult(tasks[i].taskId, true, "");
                unchecked { ++successCount; }
            } catch (bytes memory reason) {
                emit BatchCreateResult(tasks[i].taskId, false, reason);
            }
            unchecked { ++i; }
        }
    }

    function batchRespondToTasks(
        INewtonProverTaskManager.Task[] calldata tasks,
        INewtonProverTaskManager.TaskResponse[] calldata responses,
        bytes[] calldata signatureDataArray
    ) external returns (uint256 successCount) {
        require(tasks.length == responses.length && responses.length == signatureDataArray.length);
        for (uint256 i; i < tasks.length;) {
            try taskManager.respondToTask(tasks[i], responses[i], signatureDataArray[i]) {
                emit BatchRespondResult(responses[i].taskId, true, "");
                unchecked { ++successCount; }
            } catch (bytes memory reason) {
                emit BatchRespondResult(responses[i].taskId, false, reason);
            }
            unchecked { ++i; }
        }
    }
}
Access control: The BatchTaskManager address is registered as a task generator via OperatorRegistry.addTaskGenerator(). The contract itself needs no auth modifier — only registered task generators can call taskManager.createNewTask (the contract is the msg.sender to TaskManager). Optionally add an onlyAuthorized mapping to prevent gas griefing from arbitrary callers.
Files to create/modify
┌─────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────┐
│ File                                            │ Action                                                      │
├─────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────┤
│ contracts/src/BatchTaskManager.sol              │ Create — batch contract                                     │
│ contracts/src/interfaces/IBatchTaskManager.sol  │ Create — interface for Rust bindings                        │
│ contracts/test/BatchTaskManager.t.sol           │ Create — unit tests                                         │
│ contracts/script/BatchTaskManagerDeployer.s.sol │ Create — deployment script following DeploymentLib pattern │
└─────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────┘
Deployment steps
1. Deploy BatchTaskManager with existing TaskManager address
2. Call OperatorRegistry.addTaskGenerator(batchTaskManagerAddress)
3. Write deployment to script/deployments/batch-task-manager/{chainId}-{env}.json
4. Run make generate-bindings to auto-generate Rust types in crates/core/src/generated/
---
Layer 2: Batch Accumulator (Rust — replaces tx_worker.rs)
What
Replace the sequential TransactionSubmissionWorker with BatchTransactionSubmitter that accumulates tasks into time-windowed batches and submits via the BatchTaskManager contract.
Architecture
sync.rs ─────► CreateBatch Accumulator ──flush every 2s──► batchCreateTasks() on-chain
│
(tracks confirmations via DashMap)
│
sync.rs ─────► RespondBatch Accumulator ─flush after creates confirm──► batchRespondToTasks()
Two independent accumulator loops as background tokio tasks, connected by a shared DashMap<TaskId, ()> that tracks which creates have been confirmed on-chain.
Batch flush triggers
1. Timer: batch_interval_ms (default 2000ms — one batch per ~1 block on Sepolia)
2. Size: batch reaches max_batch_size
3. Both accumulator loops run independently
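The dual-trigger flush loop can be sketched with std channels standing in for the real tokio mpsc (run_accumulator and its parameters are illustrative names, not the actual gateway API):

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

// Drain a channel into batches, flushing on whichever fires first:
// the interval timer or the max-batch-size threshold. `flush` stands
// in for the on-chain batchCreateTasks() call.
fn run_accumulator<T>(
    rx: Receiver<T>,
    max_batch: usize,
    interval: Duration,
    mut flush: impl FnMut(Vec<T>),
) {
    let mut batch = Vec::new();
    let mut deadline = Instant::now() + interval;
    loop {
        match rx.recv_timeout(deadline.saturating_duration_since(Instant::now())) {
            Ok(item) => {
                batch.push(item);
                if batch.len() >= max_batch {
                    flush(std::mem::take(&mut batch)); // size trigger
                    deadline = Instant::now() + interval;
                }
            }
            Err(RecvTimeoutError::Timeout) => {
                if !batch.is_empty() {
                    flush(std::mem::take(&mut batch)); // timer trigger
                }
                deadline = Instant::now() + interval;
            }
            Err(RecvTimeoutError::Disconnected) => {
                if !batch.is_empty() {
                    flush(batch); // final flush on shutdown
                }
                return;
            }
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    for i in 0..5 {
        tx.send(i).unwrap();
    }
    drop(tx); // close the channel so the loop exits after draining
    let mut flush_sizes = Vec::new();
    run_accumulator(rx, 3, Duration::from_millis(50), |b| flush_sizes.push(b.len()));
    assert_eq!(flush_sizes, vec![3, 2]); // size-triggered flush, then final flush
}
```

The real implementation would run two of these loops (create and respond) as background tokio tasks.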
Ordering invariant
The respond accumulator holds items in a pending queue. On each flush tick, it moves items whose task_id appears in the confirmed_creates: DashMap<TaskId, ()> into the current respond batch. Items waiting longer than respond_wait_timeout_ms (default 60s) are dropped with an error log.
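The gating step for the respond accumulator can be sketched as a pure function, with a plain HashSet standing in for the shared DashMap (ready_responds and TaskId are illustrative stand-ins):

```rust
use std::collections::HashSet;

type TaskId = u64; // stand-in for the real task id type

// Split the pending respond queue: items whose create has confirmed go
// into the next respond batch; everything else stays queued. (The real
// code would also drop items older than respond_wait_timeout_ms.)
fn ready_responds(pending: &mut Vec<TaskId>, confirmed: &HashSet<TaskId>) -> Vec<TaskId> {
    let (ready, waiting): (Vec<_>, Vec<_>) =
        pending.drain(..).partition(|id| confirmed.contains(id));
    *pending = waiting;
    ready
}

fn main() {
    let confirmed: HashSet<TaskId> = [1, 3].into_iter().collect();
    let mut pending = vec![1, 2, 3, 4];
    let batch = ready_responds(&mut pending, &confirmed);
    assert_eq!(batch, vec![1, 3]); // creates 1 and 3 confirmed, responds flush
    assert_eq!(pending, vec![2, 4]); // still waiting for create confirmation
}
```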
Public interface (backward compatible)
BatchTransactionSubmitter exposes the same methods as TransactionSubmissionWorker:
- submit_create_task(task_id, params) -> oneshot::Receiver<Result<...>>
- submit_aggregated_response(task_id, task_request, service_response) -> Result<()>
- queue_capacity() -> usize / queue_depth() -> usize
This means sync.rs and handler/mod.rs only swap the type name.
Config
pub struct BatchSubmitterConfig {
    pub enabled: bool,                   // default: false (feature-gated rollout)
    pub max_create_batch_size: usize,    // default: 120
    pub max_respond_batch_size: usize,   // default: 30
    pub batch_interval_ms: u64,          // default: 2000
    pub respond_wait_timeout_ms: u64,    // default: 60000
    pub batch_contract_address: Address, // from deployment JSON
}
Files to create/modify
┌───────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────┐
│ File                                          │ Action                                                                                 │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Create — new batch accumulator                                                         │
│ crates/gateway/src/rpc/api/mod.rs             │ Modify — add mod batch_submitter export                                                │
│ crates/gateway/src/config.rs                  │ Modify — add BatchSubmitterConfig to GatewayConfig                                     │
│ crates/gateway/src/handler/mod.rs             │ Modify — instantiate BatchTransactionSubmitter when enabled (lines ~549, ~856)         │
│ crates/chainio/src/avs/writer.rs              │ Modify — add batch_create_tasks() and batch_respond_to_tasks() methods                 │
│ crates/gateway/src/task/submitter.rs          │ Modify — add batch methods to TaskSubmitter trait (or create BatchTaskSubmitter trait) │
└───────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────┘
Migration path
1. Phase A: Ship behind batch_submitter.enabled = false. Old TransactionSubmissionWorker remains default.
2. Phase B: Enable on stagef, load test. Flip default to true.
3. Phase C: Remove TransactionSubmissionWorker after production validation.
---
Layer 3: Nonce Pipelining
What
After batching reduces txs from 240/block to ~5/block, we still need those 5 txs submitted without blocking on each receipt sequentially. Split send() from get_receipt().
Design
In the batch flush path:
// Fire phase: submit all batch txs with incrementing nonces.
// `send()` returns once the tx is broadcast, without waiting for inclusion.
let mut pending_txs = Vec::new();
for batch in batches {
    let pending = batch_contract.batchCreateTasks(batch).send().await?;
    pending_txs.push(pending);
}

// Await phase: concurrently wait for all receipts
let receipts = futures::future::join_all(
    pending_txs.into_iter().map(|tx| tx.get_receipt())
).await;
Alloy's nonce manager auto-increments on each send() call, so nonces [N, N+1, N+2, ...] are assigned correctly without manual tracking.
Recovery
If tx at nonce N fails after N+1 is in mempool: N+1 is stuck. The recovery path:
1. Detect the failure from the receipt
2. Re-submit N with bumped gas (replacement tx)
3. Or cancel N with a zero-value self-transfer and re-submit the batch
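For step 2, the replacement tx must raise the gas price by at least the node's price-bump threshold (geth's default is 10% via --txpool.pricebump; exact policy varies by client, so treat the threshold as an assumption). A small helper sketch:

```rust
// Replacement-tx helper: bump the previous gas price by `percent`,
// rounding up so the result never lands below the node's threshold.
fn bump_gas(prev_wei: u128, percent: u128) -> u128 {
    prev_wei + (prev_wei * percent + 99) / 100
}

fn main() {
    let prev = 1_000_000_000; // 1 gwei in wei
    let replacement = bump_gas(prev, 13); // 13% margin, an arbitrary choice above 10%
    assert_eq!(replacement, 1_130_000_000);
    assert!(replacement * 100 >= prev * 110); // clears an assumed 10% minimum bump
}
```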
Files to create/modify
┌───────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────┐
│ File                                          │ Action                                                               │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs              │ Modify — add send_without_receipt() variant for pipelined submission │
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Modify — use pipelined send in flush logic                           │
└───────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘
---
Layer 4: Multi-Signer Pool (Optional — only if needed)
What
Round-robin across N signer keys per replica. Each signer has independent nonce space.
When to implement
Only if load testing after Layers 1-3 shows a single signer can't keep up. The math says 5 txs/block is well within single-signer capacity, so this is likely unnecessary.
Design
pub struct SignerPool {
    signers: Vec<Arc<AvsWriter>>,
    next: AtomicUsize,
}
All signer addresses call the same BatchTaskManager contract (which is the registered task generator). No on-chain changes needed beyond Layer 1.
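The round-robin selection itself is a few lines over the AtomicUsize. A self-contained sketch, with Signer standing in for the real Arc<AvsWriter> target:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

type Signer = String; // illustrative stand-in for the real AvsWriter

pub struct SignerPool {
    signers: Vec<Arc<Signer>>,
    next: AtomicUsize,
}

impl SignerPool {
    pub fn new(signers: Vec<Arc<Signer>>) -> Self {
        assert!(!signers.is_empty(), "pool needs at least one signer");
        Self { signers, next: AtomicUsize::new(0) }
    }

    // Lock-free round-robin: fetch_add wraps on overflow and the modulo
    // keeps the index in range, so this is safe to call from many tasks.
    pub fn next_signer(&self) -> Arc<Signer> {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.signers.len();
        Arc::clone(&self.signers[i])
    }
}

fn main() {
    let pool = SignerPool::new(vec![Arc::new("signer-a".into()), Arc::new("signer-b".into())]);
    let picks: Vec<String> = (0..4).map(|_| pool.next_signer().as_ref().clone()).collect();
    assert_eq!(picks, vec!["signer-a", "signer-b", "signer-a", "signer-b"]);
}
```

Each signer keeps its own nonce space, so no cross-signer coordination is needed.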
Files
┌───────────────────────────────────┬────────────────────────────────────────────────────────┐
│ File                              │ Action                                                 │
├───────────────────────────────────┼────────────────────────────────────────────────────────┤
│ crates/gateway/src/signer_pool.rs │ Create (only if needed)                                │
│ crates/gateway/src/config.rs      │ Modify — add additional_signers to TaskGeneratorConfig │
└───────────────────────────────────┴────────────────────────────────────────────────────────┘
---
Expected Performance
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric β”‚ Current β”‚ After Fix β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Offchain response β”‚ 2.8s (OK) β”‚ 2.8s (unchanged) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ On-chain task creation β”‚ p50=50s β”‚ ~14s (2s accumulate + 12s block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ On-chain e2e β”‚ p50=138s β”‚ ~28s (create confirm + 2s + 12s block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Completion rate β”‚ 41% β”‚ >99% β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Max throughput/replica β”‚ ~2-4 txs/min β”‚ 10 tasks/sec (120/block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Txs per block/replica β”‚ 240 (impossible) β”‚ 5 (trivial) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
---
Implementation Sequence
Step 1: BatchTaskManager contract (no Rust dependency)
1. Create contracts/src/BatchTaskManager.sol and interface
2. Create contracts/test/BatchTaskManager.t.sol with tests:
- All succeed, partial failure, auth checks, gas measurement
3. Create deployment script following DeploymentLib pattern
4. forge test && forge fmt
5. Run make generate-bindings
Step 2: AvsWriter batch methods (depends on Step 1 bindings)
1. Add batch_create_tasks() and batch_respond_to_tasks() to AvsWriter
2. Unit tests for the new methods
Step 3: BatchTransactionSubmitter (depends on Step 2)
1. Implement BatchTransactionSubmitter in batch_submitter.rs
2. Add BatchSubmitterConfig to GatewayConfig
3. Wire into handler/mod.rs behind enabled flag
4. Unit tests for accumulator logic, ordering invariant, flush triggers
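The flush-trigger half of that accumulator logic can be sketched as follows; the field names (max_batch_size, accumulate_window) are illustrative rather than actual BatchSubmitterConfig fields, and u64 task ids stand in for full batch items:

```rust
use std::time::{Duration, Instant};

struct Accumulator {
    pending: Vec<u64>,       // task ids, stand-in for full batch items
    oldest: Option<Instant>, // arrival time of the oldest pending item
    max_batch_size: usize,
    accumulate_window: Duration,
}

impl Accumulator {
    fn push(&mut self, task_id: u64) {
        if self.pending.is_empty() {
            self.oldest = Some(Instant::now());
        }
        self.pending.push(task_id);
    }

    // Flush when the batch is full, or when the oldest item has waited
    // past the accumulation window, whichever comes first.
    fn should_flush(&self, now: Instant) -> bool {
        if self.pending.len() >= self.max_batch_size {
            return true;
        }
        match self.oldest {
            Some(t) => now.duration_since(t) >= self.accumulate_window,
            None => false,
        }
    }
}
```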
Step 4: Nonce pipelining (depends on Step 3)
1. Add pipelined send to the flush path
2. Add nonce gap recovery logic
Step 5: Integration testing
1. Update E2E tests to exercise batch path
2. Load test: 100 tasks/sec sustained for 5 minutes
3. Verify ordering invariant
4. Verify partial batch failure handling
Step 6: Multi-signer pool (only if Step 5 load test shows need)
---
Verification
1. forge test β€” all BatchTaskManager contract tests pass
2. make clippy β€” no new warnings
3. make tests β€” existing unit tests unaffected (feature-gated)
4. make gateway_e2e β€” existing E2E still works with old worker (default)
5. make gateway_e2e with BATCH_SUBMITTER_ENABLED=true β€” E2E works with new batch path
6. Load test at 100 tasks/sec for 5 min: >99% completion, <30s on-chain e2e
---
Risks and Mitigations
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Risk β”‚ Mitigation β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Batch contract deployment fails β”‚ Old sequential worker remains as fallback (feature flag) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Partial batch failure loses tasks β”‚ try/catch per item; failures logged with task ID for retry β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Nonce gap from failed pipeline tx β”‚ Re-query nonce, re-submit remaining; fallback to sequential β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Respond submitted before create confirms β”‚ DashMap gating enforces ordering; items time out rather than submit early β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Gas estimation wrong for large batches β”‚ Conservative defaults (120 creates, 30 responds); configurable per env β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Offchain response latency regresses β”‚ Batch accumulator is completely decoupled from RPC response path β€” no impact β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
@denniswon (Author)

Load Test Analysis

Consensus Mode Segmentation

All 15 tasks used centralized mode (centralized_unified_quorum); no two-phase consensus or threshold tasks were included in this load test.
So the failures lie entirely in the batch submission pipeline and are not consensus-mode-specific.

Timeline Reconstruction

Phase 1: Success (23:24:55 – 23:26:39) β€” 9 tasks across 7 batch flushes, all succeed

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Batch Flush β”‚ Items β”‚ Time to Confirm β”‚ Outcome β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:24:55 β”‚ 2 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:09 β”‚ 1 β”‚ 22s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:31 β”‚ 1 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:40 β”‚ 1 β”‚ 13s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:09 β”‚ 2 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:25 β”‚ 1 β”‚ 15s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:39 β”‚ 1 β”‚ ... β”‚ cascade starts here β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Phase 2: Cascade Begins (23:26:39 – 23:26:54) β€” Batch 8 (1 item)

  1. 23:26:39 β€” Batch 8 flushed (1 item for task 0xaf9d...)
  2. 23:26:54 β€” TransportError(NullResp) β€” RPC returned null (likely Sepolia node hiccup)
  3. Retry β†’ gas estimation call β†’ contract reverts (task already on-chain from the first attempt that got NullResp)
  4. could not parse revert data β€” parse_batch_partial_failure returns None (the known TODO)
  5. Simulation detects idempotent success β†’ 2 items stripped as "already on-chain"
  6. No healthy items remaining β†’ batch complete. These 2 tasks actually succeeded (idempotent).

Phase 3: Fatal Cascade (23:26:54 – 23:30:57) β€” Batch 9 (1 item) and Batch 10 (5 items)

Batch 9 (1 item, flushed 23:26:54):

  • 23:27:54 β€” Receipt timeout (60s) β€” tx likely dropped from mempool
  • Classified as transient β†’ retry
  • 23:28:55 β€” Receipt timeout again (60s)
  • Classified as transient β†’ retry
  • 23:28:55 β€” gas estimation reverts (item on-chain from prior attempt)
  • could not parse revert data β†’ simulation
  • Simulation classifies as poison (not idempotent) β†’ 1 "combined create+respond failed"
  • Batch 10 (5 items) flushed at 23:28:56

Batch 10 (5 items, flushed 23:28:56):

  • 23:29:56 β€” Receipt timeout (60s)
  • Classified as transient β†’ retry
  • 23:29:56 β€” gas estimation reverts
  • could not parse revert data β†’ simulation
  • 23:30:57 β€” ALL 5 items classified as poison β†’ 5 "combined create+respond failed"

Phase 4: Zombie Aggregators (23:30:51 – 23:34:06) β€” All 15 BLS aggregator slots expire

All 15 tasks had BLS aggregation slots that were never cancelled. They expired after 360s (6 min), each having processed 0
signatures. The failed to send task expired error on 6 of them means the oneshot::Receiver had already been dropped (the caller in
sync.rs moved on after the batch failure).

Root Causes (Ordered by Severity)

  1. parse_batch_partial_failure always returns None (writer.rs:795)
    The as_revert_data() API exists and is already used at writer.rs:1091, but the batch parser is a TODO stub. This means every contract
    revert during retry falls through to simulation, which is slow and unreliable.

  2. Simulation misclassifies TaskAlreadyExists inside BatchPartialFailure as poison
    When identify_combined_poison_items tests a single item, the contract wraps TaskAlreadyExists inside BatchPartialFailure. The
    is_task_already_exists() check sees BatchPartialFailure, not TaskAlreadyExists, and classifies the item as poison instead of
    idempotent.

  3. Receipt timeout (60s) is classified as transient and retried with same items
    After timeout, the first tx may have already landed. Retrying the same batch guarantees a contract revert. The system should
    pre-check on-chain state before retrying after a timeout.

  4. BLS aggregator slots are never cancelled on batch failure
    The cancel_task call exists in sync.rs error paths, but the batch submitter communicates failure via the oneshot channel. If the
    sync.rs caller has already timed out or dropped the receiver, cancel_task is never called. The slots linger for 360s consuming
    memory.

  5. Second batch (5 items) flushed while prior batch is still retrying
    The accumulator loop flushes new items even though the prior batch is stuck in the 60s receipt timeout loop. These 5 items enter the
    same failing retry cascade.

Quantitative Summary

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric β”‚ Value β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Total tasks β”‚ 15 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Successfully on-chain β”‚ 9 (60%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Idempotent success (recovered) β”‚ 2 (13%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Failed (misclassified as poison) β”‚ 6 (40%) β€” all recoverable β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Zombie aggregator slots β”‚ 15 (100%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Receipt timeouts β”‚ 4 (across 2 batch retry cycles) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Time wasted in retry loops β”‚ ~4 min β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The core issue hasn't changed from the prior session: implementing parse_batch_partial_failure would have prevented 100% of the
failures. The 6 failed tasks were all on-chain already β€” they were just misidentified as poison because the revert data couldn't be
decoded.

Want me to proceed with implementing the fixes? The priority order would be:

  1. Implement parse_batch_partial_failure (the single-function fix that unblocks the whole chain)
  2. Fix simulation to detect TaskAlreadyExists nested inside BatchPartialFailure
  3. Add pre-check on-chain state before retry after receipt timeout
  4. Ensure cancel_task is called from the batch submitter on final failure

@denniswon (Author)

❯ we need to fix all these issues. proceed. But we need much more fundamental fix to prevent Time wasted in retry loops (not
acceptable). we should leave NO TODOs at all in regards to this problem related items. Review this part of the codebase for onchain
transaction submission handling from the first principles and perform a thorough analysis on the architecture/design/code/logic before
implementing. /grill-me

⏺ Skill(grill-me)
⎿  Successfully loaded skill

⏺ Let me start by thoroughly reading the current batch submission pipeline to understand the architecture from first principles before
asking questions.

⏺ Explore(Read batch submission pipeline)
⎿  Done (56 tool uses · 143.1k tokens · 2m 16s)

⏺ Good, I have the full architecture. Now let me read the exact code for the critical paths to ground my understanding.

Read 2 files
⎿  Loaded crates/CLAUDE.md
⎿  Loaded crates/chainio/src/CLAUDE.md
⎿  Loaded crates/chainio/src/avs/CLAUDE.md
⎿  Loaded crates/gateway/CLAUDE.md
⎿  Loaded crates/gateway/src/CLAUDE.md
⎿  Loaded crates/gateway/src/rpc/CLAUDE.md
⎿  Loaded crates/gateway/src/rpc/api/CLAUDE.md

⏺ Now let me read the identify_combined_poison_items function and the error type flow.

Read 2 files

⏺ Now let me check how the error flows across the trait boundary β€” the TaskSubmitter wraps ChainIoError into GatewayError, which is the
key information loss point.

Read 2 files
⎿  Loaded crates/gateway/src/task/CLAUDE.md

⏺ Now I have the complete picture of the architecture. Here's my first question:


Question 1: The TaskSubmitter trait erases ChainIoError into GatewayError::Internal(String). Should we change the trait's return type
to preserve structured errors?

The fundamental design flaw is here in submitter.rs:105-114:

async fn batch_create_and_respond_to_tasks(...) -> Result<TransactionReceipt, GatewayError> {
    AvsWriter::batch_create_and_respond_to_tasks(...)
        .await
        .map_err(|e| GatewayError::Internal(format!("Failed to batch create and respond: {e}")))
}

This .map_err(format!(...)) destroys ALL structured error data. The ChainIoError::BatchPartialFailure { failures } with per-item
indices and reasons becomes the string "Failed to batch create and respond: Batch partial failure: 3 items failed". Then
extract_failed_indices in batch_submitter.rs tries to recover that data from the string β€” and obviously can't.

My recommendation: Change the TaskSubmitter trait's batch methods to return Result<TransactionReceipt, ChainIoError> directly instead
of Result<TransactionReceipt, GatewayError>. The batch submitter is the only consumer of these methods, and it needs the structured
error data. This is a one-layer change β€” batch_submitter.rs already imports ChainIoError types via newton_prover_chainio. The
GatewayError wrapping would move to the callers that actually need it (sync.rs), not happen at the trait boundary.

The alternative is to implement parse_batch_partial_failure in writer.rs so the structured data is decoded before it crosses the
trait boundary β€” but that only solves one error type. The trait-level fix solves all current and future structured error needs.

Which approach do you prefer?

✻ Crunched for 3m 5s

❯ agree with your recommendation

⏺ Question 2: Receipt timeout currently returns ChainIoError::CreateNewTaskCallFail (a string). After a timeout, the tx may have landed
on-chain. Should the batch submitter check on-chain state before retrying?

Currently at writer.rs:759-763:
let receipt = timeout(Self::BATCH_RECEIPT_TIMEOUT, pending.get_receipt())
    .await
    .map_err(|_| ChainIoError::CreateNewTaskCallFail {
        reason: "batch create+respond receipt timed out after 60s".to_string(),
    })?

This is classified as transient by is_transient_rpc_error (matches "timeout"), so the batch submitter retries the same items. But the
tx may have already been mined β€” the receipt poll just didn't catch it. The retry sends a duplicate tx, which reverts with
BatchPartialFailure(TaskAlreadyExists), triggering the broken simulation fallback.

My recommendation: Two changes working together:

  1. Return a distinct error variant β€” ChainIoError::TransactionTimeout { timeout_secs: 60 } (already exists in the enum but unused).
    This lets the batch submitter handle timeouts differently from other transient errors.
  2. In the batch submitter, on receipt timeout: instead of blind retry, call identify_combined_poison_items immediately. This
    simulates each item as a single-item batch. Items already on-chain get TaskAlreadyExists β†’ idempotent success. Items not on-chain
    pass simulation β†’ healthy, re-batched and retried. This eliminates the entire 60sΓ—3 retry cascade (the "4 minutes wasted" from the
    load test).

The alternative is adding an on-chain eth_call check per item (e.g., querying allTaskHashes[taskId]), but simulation already does
this implicitly and handles edge cases like partially-landed batches. The simulation path just needs the TaskAlreadyExists detection
to work β€” which it will once we fix the parse_batch_partial_failure at the writer level.

Does this approach make sense, or do you want a direct on-chain state check instead?

❯ what is your recommendation? is simulation an overkill compared to just eth call?

⏺ Good pushback. Let me think about this concretely.

Simulation = call batch_create_and_respond_to_tasks([single_item]) per item. This is a full contract call that tries to execute
create+respond, consuming significant gas estimation compute on the RPC node. For N items, that's N separate eth_calls, each encoding
the full batch contract ABI with task/response/signature data.

Direct on-chain check = call taskManager.allTaskHashes(taskId) per item. This is a single storage slot read β€” trivially cheap, ~200
gas equivalent, returns bytes32(0) if not created or the hash if created.

For the combined path, if allTaskHashes(taskId) != bytes32(0), the task was created AND responded (atomic), so it's idempotent
success. If it's zero, the task never landed β€” safe to retry.

My revised recommendation: Direct on-chain check via allTaskHashes. It's:

  • 10-100x cheaper per item than simulation
  • Semantically precise (answers exactly "did this task land?")
  • No false positives from unrelated revert reasons
  • Can check all items in a single multicall or sequential loop

We'd add a method to TaskSubmitter like is_task_on_chain(task_id) -> bool that reads allTaskHashes, and call it on receipt timeout
before deciding whether to retry or declare idempotent success.

Simulation stays as the last-resort fallback only for non-timeout contract reverts where we can't parse BatchPartialFailure (which
should be rare once we implement the parser).
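Once the on-chain lookup exists, the timeout-recovery path reduces to a pure partition. A sketch, with is_on_chain standing in for the allTaskHashes(taskId) != bytes32(0) read and u64 ids in place of real task ids:

```rust
// Items that landed (create+respond is atomic) are idempotent successes;
// items that never landed are safe to re-batch and retry.
fn classify_after_timeout<F>(task_ids: &[u64], is_on_chain: F) -> (Vec<u64>, Vec<u64>)
where
    F: Fn(u64) -> bool,
{
    let mut landed = Vec::new();
    let mut retry = Vec::new();
    for &id in task_ids {
        if is_on_chain(id) {
            landed.push(id);
        } else {
            retry.push(id);
        }
    }
    (landed, retry)
}
```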

❯ sounds good

⏺ Question 3: The identify_combined_poison_items simulation sends actual on-chain transactions (not eth_call). Each "simulation" is a
real batch_create_and_respond_to_tasks call that goes through .send().await and get_receipt(). Is this intentional?

Looking at batch_submitter.rs:648-655:
let sim_result = task_submitter
    .batch_create_and_respond_to_tasks(
        batch_contract_addr,
        vec![task.clone()],
        vec![response.clone()],
        vec![sig.clone()],
    )
    .await;

This calls the real AvsWriter::batch_create_and_respond_to_tasks, which calls .send().await β€” submitting an actual transaction, not a
read-only eth_call. If the task is NOT on-chain, the "simulation" actually creates+responds to it on-chain as a side effect. If the
task IS on-chain, the .send() call reverts during gas estimation (before tx submission), so no tx is sent.

This means:

  • For items that are already on-chain: reverts at gas estimation β†’ correctly classified
  • For healthy items: actually submits a real tx β†’ succeeds, but now you have N individual txs instead of one batch

My recommendation: This is actually fine functionally (items get submitted either way), but it's wasteful β€” defeats the purpose of
batching. However, since we're moving to direct allTaskHashes checks for receipt timeouts, simulation will only be reached for
non-timeout contract reverts where parse_batch_partial_failure fails. Once we implement the parser, simulation becomes a truly rare
last-resort fallback.

So: keep simulation as-is for now, but it should almost never be reached after our fixes. The priority path becomes: parse structured
error β†’ strip β†’ retry. Timeout path becomes: check allTaskHashes β†’ classify β†’ retry healthy only.

Agree with keeping simulation as the rare fallback, or do you want to convert it to use eth_call instead of real transactions?

❯ which one is better? which one do you suggest?

⏺ Converting to eth_call is strictly better. Here's why:

Real tx submission (current):

  • Healthy items get submitted as individual txs (N txs instead of 1 batch β€” wastes gas and nonce space)
  • Each goes through the 60s receipt timeout path, potentially cascading further
  • If 2 of 5 items are poison, the 3 healthy ones get submitted individually rather than re-batched

eth_call simulation (proposed):

  • Zero on-chain side effects β€” purely diagnostic
  • Fast (no mempool, no receipt wait β€” just gas estimation)
  • After classification, healthy items stay in the batch for a single re-batched submission
  • No nonce consumption

My recommendation: Convert identify_combined_poison_items to use eth_call via a new TaskSubmitter method like
simulate_batch_create_and_respond(single_item) -> Result<(), ChainIoError>. Under the hood, this calls .call().await instead of
.send().await on the contract instance. The classification logic stays identical β€” we just use the return value to determine healthy
vs poison vs already-exists without actually submitting.

This also eliminates a subtle bug: if simulation succeeds for a "healthy" item (submitting a real tx), and then the re-batched retry
also includes it, you get a TaskAlreadyExists from the item you just accidentally submitted during simulation.

❯ sounds good

⏺ Question 4: The batch accumulator loop flushes new items while a prior batch is stuck in the retry/timeout loop. Should we block
accumulation during flush, or decouple them?

From the load test: Batch 9 (1 item) entered a 60sΓ—2 receipt timeout loop. Meanwhile, 5 new items completed BLS aggregation and
queued up. When batch 9 finally failed at 23:28:56, those 5 items were immediately flushed as batch 10 β€” which then hit the same
cascading timeout pattern.

The current architecture runs the accumulator and flush in the same tokio::spawn task. While flush_combined_batch_with_retry is
await-ing receipt timeouts (up to 180s with 3 retries), the mpsc channel buffers incoming items. When flush returns, the loop
immediately drains the channel and flushes the next batch.

My recommendation: This is actually the correct design β€” we should NOT block accumulation. Items need somewhere to queue while flush
is in progress. The real fix is making flush fast:

  1. Receipt timeout β†’ immediate allTaskHashes check (seconds, not 60s retry)
  2. parse_batch_partial_failure works β†’ instant strip-and-retry (no simulation)
  3. Simulation (rare fallback) uses eth_call β†’ fast, no receipt wait

With these fixes, the worst-case flush time drops from 180s+ (3Γ— receipt timeout) to ~5-10s (one timeout + quick on-chain check + one
retry). The accumulator loop stays as-is β€” the problem was never the architecture, it was the slow error recovery paths.

One small addition: we should cap the number of items that accumulate during a flush to prevent unbounded growth. The channel is
already bounded (max_create_batch_size * 2 = 240), which provides backpressure. That's sufficient.
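That backpressure behavior can be demonstrated in miniature; std's sync_channel stands in here for tokio's bounded mpsc, and the capacity value mirrors the max_create_batch_size * 2 bound:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Returns true if the channel rejects the (cap+1)-th item while nothing
// drains the receiving side, i.e. senders feel backpressure at capacity.
fn channel_applies_backpressure(cap: usize) -> bool {
    let (tx, _rx) = sync_channel::<u64>(cap);
    for i in 0..cap {
        tx.try_send(i as u64).expect("within capacity");
    }
    matches!(tx.try_send(cap as u64), Err(TrySendError::Full(_)))
}
```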

@denniswon (Author)

❯ await-ing receipt timeouts (up to 180s with 3 retries) --> this is way too long. definitely remove and retries should be just 2 for
batches. agreed overall, but double evaluate your suggestion/recommendation

⏺ You're right β€” let me re-evaluate.

The 60s receipt timeout itself is the problem. On Sepolia, block time is 12s. A legitimate transaction should be mined within 1-2
blocks (12-24s). If we haven't gotten a receipt in 30s, the tx is almost certainly either:

  1. Dropped from mempool (nonce replaced, gas too low)
  2. Mined but the RPC node is lagging

Waiting 60s before checking is just dead time. And with 3 retries, that's potentially 180s of waiting for nothing.

My revised recommendation:

  1. Reduce BATCH_RECEIPT_TIMEOUT to 30s β€” still generous for Sepolia (2.5 blocks). On L2s like Base (2s block time), 30s is 15 blocks
    β€” more than enough.
  2. Max retries = 2 for batch methods (down from 3). After the first timeout, we check allTaskHashes. If items are on-chain β†’
    idempotent success. If not β†’ one more retry with fresh nonce. If that also fails, items are failed out.
  3. On receipt timeout, don't retry the batch blindly. The sequence becomes:
    - Timeout fires after 30s
    - Check allTaskHashes for each item in the batch (~1-2s)
    - Items on-chain β†’ idempotent success
    - Items NOT on-chain β†’ re-batch and retry once (attempt 2)
    - If attempt 2 also times out β†’ fail remaining items

Worst case: 30s timeout + 2s check + 30s retry = ~62s. Current worst case: 180s+. That's a 3x improvement, and the common case (items
landed but receipt timed out) resolves in ~32s.

The max_retries = 2 config applies to the batch submitter config, not the writer-level timeout. The writer always does a single
attempt with timeout β€” retries are orchestrated by the batch submitter.
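The worst-case arithmetic above can be pinned down in a few constants; the values are the proposed ones, not existing config:

```rust
const RECEIPT_TIMEOUT_SECS: u64 = 30; // down from 60s
const ONCHAIN_CHECK_SECS: u64 = 2;    // allTaskHashes read after a timeout
const MAX_ATTEMPTS: u64 = 2;          // batch submitter retries, down from 3

// Attempt 1 times out, one on-chain check, attempt 2 times out, fail out.
fn worst_case_flush_secs() -> u64 {
    MAX_ATTEMPTS * RECEIPT_TIMEOUT_SECS + ONCHAIN_CHECK_SECS
}

// Common case: the tx landed but the receipt poll missed it, so the
// post-timeout check resolves the batch without a second send.
fn common_case_flush_secs() -> u64 {
    RECEIPT_TIMEOUT_SECS + ONCHAIN_CHECK_SECS
}
```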

❯ sounds good.

⏺ Question 5: The parse_batch_partial_failure implementation. The as_revert_data() API is already proven at writer.rs:1091. Should we
decode BatchPartialFailure at the writer level (before the trait boundary) or at the batch submitter level?

Looking at the two options:

Option A: Decode at writer level (current stub location, writer.rs:795):

  • parse_batch_partial_failure uses error.as_revert_data() to get raw bytes
  • Checks for 0x19ed9977 selector, decodes FailedItem[] via BatchPartialFailure::abi_decode_raw
  • Returns Some(Vec) with per-item index, task_id, and nested revert reason
  • The existing Err(ChainIoError::BatchPartialFailure { failures }) path already works β€” it just never gets populated because the
    parser returns None

Option B: Decode at batch submitter level:

  • Would require passing the raw alloy::contract::Error through the trait somehow
  • Trait returns Result<_, ChainIoError> (per Question 1), but ContractError is already a variant

My recommendation: Option A β€” decode at the writer level. This is the natural location because:

  1. The writer has direct access to alloy::contract::Error and as_revert_data()
  2. The ChainIoError::BatchPartialFailure variant already exists with the right fields
  3. With the Question 1 fix (trait returns ChainIoError), the batch submitter gets the structured BatchPartialFailure { failures }
    directly
  4. extract_failed_indices in batch_submitter becomes trivial: just match on the ChainIoError::BatchPartialFailure variant and extract
    indices

The nested revert reason bytes inside each FailedItem.reason should also be decoded where possible β€” check for TaskAlreadyExists
(0x2e98c533), TaskAlreadyResponded (0x68905dff), and other known selectors. This lets strip_combined_by_indices classify each
stripped item correctly (idempotent vs poison) without simulation.

@denniswon
Author

Excellent point. Let me think about this more carefully.

The root of the cascading failure was ambiguity in error classification. When the system couldn't definitively identify an error, it
fell back to heuristics (string matching, simulation) that guessed wrong. We need to eliminate guessing entirely.

The definitive approach: Every error that crosses the writer β†’ batch submitter boundary must be exhaustively classified at the source
β€” the writer level β€” where we have full access to the raw error type.

Here's the concrete design:

At writer.rs, the batch_create_and_respond_to_tasks method currently has two exit paths:

  1. .send().await succeeds β†’ pending.get_receipt() β†’ either success or timeout
  2. .send().await fails β†’ ContractError (gas estimation revert)

For path 2, instead of the current parse_batch_partial_failure || fallback to ContractError, we do exhaustive error decoding:

let revert_data = error.as_revert_data();
match revert_data {
    None => ChainIoError::ContractError(e), // not a revert (RPC/transport error)
    Some(data) if data.is_empty() => ChainIoError::CreateNewTaskCallFail {
        reason: "empty revert (assembly-level failure)"
    },
    Some(data) if data.len() < 4 => ChainIoError::CreateNewTaskCallFail {
        reason: "malformed revert data (< 4 bytes)"
    },
    Some(data) => {
        let selector: [u8; 4] = data[..4].try_into().unwrap();
        match selector {
            BatchPartialFailure::SELECTOR => // decode FailedItem[], return BatchPartialFailure
            TaskAlreadyExists::SELECTOR => // return new variant TaskAlreadyExists
            TaskAlreadyResponded::SELECTOR => // return new variant TaskAlreadyResponded
            TaskCreatedBlockTooOld::SELECTOR => // return specific variant
            _ => // return ContractRevert { selector, data } β€” preserves raw bytes
        }
    }
}

Key principle: the selector is always available and always definitive. We decode what we recognize, and for anything we don't
recognize, we preserve the raw selector + data in a new ChainIoError::ContractRevert { selector: [u8; 4], data: Bytes } variant. No
string matching. No guessing. No fallback to simulation for classification.

Then the BatchPartialFailure payload itself contains FailedItem[] where each item has a reason: bytes β€” that's the nested revert from
each individual _createAndRespond call. We apply the same selector-based decoding to each nested reason:

for failure in &failures {
    let nested_selector = failure.reason[..4];
    match nested_selector {
        TaskAlreadyExists::SELECTOR => classification = Idempotent,
        TaskAlreadyResponded::SELECTOR => classification = Idempotent,
        _ => classification = Poison(selector, data),
    }
}

This way the batch submitter never needs to string-match, simulate, or guess. Every item is classified definitively by its 4-byte
selector at the writer level.
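To make the classification concrete, here is a runnable, stdlib-only sketch of the nested-reason decoding. The two selectors come from this thread; `Classification` and `classify_reason` are illustrative names, not the crate's:

```rust
// Selector constants from the discussion above.
const TASK_ALREADY_EXISTS: [u8; 4] = [0x2e, 0x98, 0xc5, 0x33];
const TASK_ALREADY_RESPONDED: [u8; 4] = [0x68, 0x90, 0x5d, 0xff];

#[derive(Debug, PartialEq)]
enum Classification {
    Idempotent,        // prior attempt already landed on-chain
    Poison([u8; 4]),   // unrecognized revert; strip this item
    Unknown,           // malformed reason bytes (< 4 bytes)
}

// Classify a nested FailedItem.reason purely by its 4-byte selector.
fn classify_reason(reason: &[u8]) -> Classification {
    if reason.len() < 4 {
        return Classification::Unknown;
    }
    let selector: [u8; 4] = reason[..4].try_into().unwrap();
    match selector {
        TASK_ALREADY_EXISTS | TASK_ALREADY_RESPONDED => Classification::Idempotent,
        other => Classification::Poison(other),
    }
}

fn main() {
    assert_eq!(classify_reason(&TASK_ALREADY_EXISTS), Classification::Idempotent);
    assert_eq!(
        classify_reason(&[0xde, 0xad, 0xbe, 0xef]),
        Classification::Poison([0xde, 0xad, 0xbe, 0xef])
    );
    assert_eq!(classify_reason(&[0x01]), Classification::Unknown);
    println!("ok");
}
```

Note there is no string path at all: every outcome is decided by the selector bytes.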

For receipt timeout (path 1): we already agreed β€” direct allTaskHashes check, not guessing.

Simulation becomes truly dead-letter: only reachable if somehow as_revert_data() returns None for a contract revert (which would be
an alloy bug, not our code). We can log a warning and still fall through to eth_call-based simulation as the absolute last resort,
but it should never fire in practice.

Does this exhaustive selector-based decoding approach address your concern?

@denniswon
Author

Question 6: We need new ChainIoError variants for the exhaustive classification. Here's the proposed enum expansion:

Currently the batch path can produce these ChainIoError variants:

  • CreateNewTaskCallFail { reason: String } β€” catch-all string bucket
  • BatchPartialFailure { failures } β€” structured but never populated
  • ContractError(alloy::contract::Error) β€” raw alloy error
  • AlloyProviderError β€” provider-level error

Proposed additions:

// New variants in ChainIoError:

/// Task already exists on-chain (idempotent β€” prior attempt landed)
TaskAlreadyExists { task_id: Option<B256> },

/// Task already responded (idempotent β€” prior create+respond landed)
TaskAlreadyResponded { task_id: Option<B256> },

/// Contract reverted with a recognized but non-recoverable error
ContractRevert { selector: [u8; 4], data: Bytes },

/// Receipt wait timed out β€” tx may or may not be on-chain
ReceiptTimeout { timeout_secs: u64 },

And update BatchPartialFailure.failures items to carry a classification enum instead of raw bytes:

pub enum BatchItemFailure {
    AlreadyExists,
    AlreadyResponded,
    Revert { selector: [u8; 4], data: Bytes },
    Unknown { data: Bytes },
}

pub struct BatchFailedItem {
    pub index: usize,
    pub task_id: B256,
    pub failure: BatchItemFailure,
}

This means extract_failed_indices in the batch submitter becomes a simple pattern match β€” no string parsing, no simulation. Each
failed item is already classified.

My recommendation: This is the right set of variants. ReceiptTimeout replaces the current CreateNewTaskCallFail { reason: "...timed
out..." } string hack. TaskAlreadyExists/TaskAlreadyResponded surface directly instead of being detected via string matching.
ContractRevert catches everything else with preserved data.

The task_id on TaskAlreadyExists/TaskAlreadyResponded is Option<B256> because when a whole batch reverts with these, we may
not know which specific task β€” but for BatchPartialFailure nested reasons, we always know the task_id from the FailedItem.
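A minimal sketch of how the batch submitter would consume these variants, partitioning failed items by classification with a plain pattern match (type and function names here are illustrative, simplified from the proposal above):

```rust
#[derive(Debug, Clone, PartialEq)]
enum BatchItemFailure {
    AlreadyExists,
    AlreadyResponded,
    Revert { selector: [u8; 4] },
    Unknown,
}

struct BatchFailedItem {
    index: usize,
    failure: BatchItemFailure,
}

/// Split failed items into (idempotent_indices, poison_indices).
/// Idempotent items get a success reply; poison items get stripped.
fn partition_failures(failures: &[BatchFailedItem]) -> (Vec<usize>, Vec<usize>) {
    let mut idempotent = Vec::new();
    let mut poison = Vec::new();
    for f in failures {
        match f.failure {
            BatchItemFailure::AlreadyExists | BatchItemFailure::AlreadyResponded => {
                idempotent.push(f.index)
            }
            _ => poison.push(f.index),
        }
    }
    (idempotent, poison)
}

fn main() {
    let failures = vec![
        BatchFailedItem { index: 2, failure: BatchItemFailure::AlreadyExists },
        BatchFailedItem { index: 5, failure: BatchItemFailure::Revert { selector: [0; 4] } },
    ];
    let (idem, poison) = partition_failures(&failures);
    assert_eq!(idem, vec![2]);
    assert_eq!(poison, vec![5]);
    println!("ok");
}
```

With pre-classified items, extract_failed_indices really does collapse to this one match, with no string parsing or simulation.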

Agree with these variants?

@denniswon
Author

Good question. Let me evaluate both approaches for check_tasks_on_chain.

Sequential reads vs Multicall

Context: After a receipt timeout on a batch of N items, we need to check allTaskHashes[taskId] for each item to determine which tasks
actually landed on-chain.

Option A: Sequential reads

for item in &items {
    let hash = task_manager.allTaskHashes(item.task_id).call().await?;
    landed.push(hash != B256::ZERO);
}

Pros:

  • Simple, no extra dependency
  • Each call is independent β€” partial failure doesn't lose results
  • alloy already has the generated binding for allTaskHashes

Cons:

  • N RPC round-trips (one per task in the batch)
  • At batch size 120, that's 120 sequential calls after a timeout β€” adding 10-30s latency on Sepolia
  • Compounds the timeout problem we're trying to fix

Option B: Multicall3

let multicall = Multicall3::new(provider, MULTICALL3_ADDR);
let calls: Vec<_> = items.iter().map(|item| Call3 {
    target: task_manager_addr,
    callData: allTaskHashesCall { taskId: item.task_id }.abi_encode(),
    allowFailure: true,
}).collect();
let results = multicall.aggregate3(calls).call().await?;

Pros:

  • Single RPC round-trip regardless of batch size
  • 120 items checked in ~200ms instead of ~30s
  • Multicall3 is deployed at a deterministic address on all EVM chains (including anvil with --auto-impersonate)

Cons:

  • Needs Multicall3 address constant (but it's 0xcA11bde05977b3631167028862bE2a173976CA11 on every chain)
  • Slightly more complex ABI encoding
  • If Multicall3 isn't deployed (unlikely β€” it's on 99%+ of chains), entire check fails

Option C: eth_call batch simulation (middle ground)

// Simulate the entire batch via eth_call β€” if it succeeds, nothing is on-chain yet
// If it reverts with BatchPartialFailure, decode which items exist
let result = batch_contract
    .batchCreateAndRespondToTasks(tasks, responses, sigs)
    .call()
    .await;

This is what we already planned for simulate_batch_create_and_respond. It doubles as the on-chain check β€” items that revert with
TaskAlreadyExists are confirmed on-chain.

Pros:

  • Single RPC call
  • Reuses existing contract ABI (no Multicall3 dependency)
  • Gives us BOTH information: what's on-chain AND what would succeed if resubmitted

Cons:

  • Heavier than a simple storage read (runs full EVM execution per item)
  • Gas estimation overhead (~400K per item Γ— 120 = 48M gas simulated)

My recommendation: Option C (eth_call simulation)

Here's why β€” we already need simulate_batch_create_and_respond for the poison detection path. After a receipt timeout, the question
isn't just "which tasks landed?" β€” it's "which tasks can be resubmitted?" The eth_call simulation answers both in one call:

  • Items that revert with TaskAlreadyExists (0x2e98c533) β†’ confirmed on-chain, send idempotent success
  • Items that revert with TaskAlreadyResponded (0x68905dff) β†’ confirmed on-chain AND responded, send idempotent success
  • Items that succeed β†’ not on-chain, safe to retry
  • Items that revert with other errors β†’ poison, strip them

This means we don't need a separate check_tasks_on_chain method at all. The receipt timeout path becomes:

timeout β†’ simulate_batch(remaining_items) β†’ classify each item β†’ retry/succeed/fail

One method, one RPC call, complete information.

β˜… Insight ─────────────────────────────────────
The key insight is that eth_call simulation is strictly more informative than allTaskHashes reads. A storage read tells you "this
task exists on-chain." A simulation tells you "this task exists on-chain AND here's exactly what would happen if you tried to
resubmit it." Since we need the simulation for poison detection anyway, using it for the timeout path too eliminates an entire method
from the trait surface.
─────────────────────────────────────────────────
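The per-item outcome-to-action mapping described above can be sketched as a small total function (enum and function names are illustrative, not from the codebase):

```rust
// Each item's fate after one eth_call simulation of the batch.
#[derive(Debug, PartialEq)]
enum SimOutcome {
    Succeeded,         // item not on-chain yet; safe to resubmit
    AlreadyExists,     // landed in a prior attempt
    AlreadyResponded,  // landed and responded in a prior attempt
    Reverted([u8; 4]), // reverted with some other selector
}

#[derive(Debug, PartialEq)]
enum Action {
    Retry,
    SucceedIdempotent,
    Strip,
}

// Total mapping: every simulation outcome has exactly one action,
// so the timeout path never guesses.
fn action_for(outcome: &SimOutcome) -> Action {
    match outcome {
        SimOutcome::Succeeded => Action::Retry,
        SimOutcome::AlreadyExists | SimOutcome::AlreadyResponded => Action::SucceedIdempotent,
        SimOutcome::Reverted(_) => Action::Strip,
    }
}

fn main() {
    assert_eq!(action_for(&SimOutcome::Succeeded), Action::Retry);
    assert_eq!(action_for(&SimOutcome::AlreadyExists), Action::SucceedIdempotent);
    assert_eq!(action_for(&SimOutcome::Reverted([0; 4])), Action::Strip);
    println!("ok");
}
```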

So the revised TaskSubmitter trait change is:

// Add ONE new method (not two):
async fn simulate_batch_create_and_respond(
&self,
batch_contract_addr: Address,
tasks: Vec,
responses: Vec,
signature_data_array: Vec,
) -> Result<(), ChainIoError>;
// Returns Ok(()) if all would succeed, BatchPartialFailure if some fail,
// or other ChainIoError for RPC issues

// Change return type on existing method:
async fn batch_create_and_respond_to_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
// (was Result with anyhow)

Does this approach make sense, or do you still want the separate check_tasks_on_chain method?

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #42/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your response on Q7 β€” whether Option C (single simulate_batch_create_and_respond method that handles both timeout
recovery and poison detection) works, or if you'd prefer separate methods.

✻ Cooked for 1m 23s

@denniswon
Author

⏺ Plan is written. Let me exit plan mode so you can review and approve before implementation begins.

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Ready to code?

Here is Claude's plan:
β•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œ
Harden Batch Transaction Submission Pipeline

Context

PR #507 (feat/batch-create-and-respond) merged and deployed to stagef+prod. Load testing (15 tasks, centralized mode, single gateway)
showed 60% success (9/15) with 6 recoverable tasks misclassified as poison. Root cause chain:

  1. TransportError(NullResp) on batch 8 triggers retry
  2. Receipt timeout (60s x 3 retries = 180s) blocks accumulator
  3. parse_batch_partial_failure() returns None (stub) β€” falls to simulation
  4. Simulation sends REAL transactions (not eth_call), misclassifies TaskAlreadyExists inside BatchPartialFailure as poison
  5. 6 tasks that were actually on-chain get failed to callers
  6. 15 zombie BLS aggregator slots linger 360s

Every failure in this cascade is recoverable. The protocol is correct β€” the error handling pipeline loses information at each layer.

Design Decisions (from grill-me interview)

  1. TaskSubmitter returns Result<_, ChainIoError>: stop losing structured error data via format!()
  2. Receipt timeout β†’ eth_call simulation: determine which items landed on-chain without a blind retry
  3. Poison detection via eth_call (not real txs): current simulation sends real transactions, which is wasteful and dangerous
  4. Keep accumulator architecture, timeout=30s, max_retries=2: the accumulator is sound; reduce wasted time in retry loops
  5. Decode BatchPartialFailure at writer level: use as_revert_data() (proven at writer.rs:1091)
  6. Exhaustive selector-based decoding: no string matching, no guessing, no fallback defaults
  7. Single simulate_batch method for both timeout recovery and poison detection: one eth_call gives complete classification

Files to Modify

Layer 1: Error types and classification

  1. crates/chainio/src/error.rs β€” Add new ChainIoError variants, expand is_transient()
  2. crates/chainio/src/avs/errors.rs β€” Add exhaustive selector constants and classify_batch_item_error()

Layer 2: Writer (on-chain interaction)

  1. crates/chainio/src/avs/writer.rs β€” Implement parse_batch_partial_failure(), reduce timeout, add
    simulate_batch_create_and_respond()

Layer 3: TaskSubmitter trait bridge

  1. crates/gateway/src/task/submitter.rs β€” Change return types to ChainIoError, add simulate_batch_create_and_respond()

Layer 4: Batch submitter (retry logic)

  1. crates/gateway/src/rpc/api/batch_submitter.rs β€” Rewrite flush logic to use structured errors, remove string matching

Layer 5: Config

  1. crates/gateway/src/rpc/api/batch_submitter.rs β€” Change defaults: max_retries=2

Layer 6: Mock impls (10 total)

  1. All TaskSubmitter test impls β€” add simulate_batch_create_and_respond(), update return types

Step-by-Step Implementation

Step 1: New ChainIoError variants (crates/chainio/src/error.rs)

Add these variants to the existing enum:

/// Task already exists on-chain (idempotent β€” not an error)
#[error("Task {task_id} already exists on-chain")]
TaskAlreadyExists { task_id: B256 },

/// Task already responded to on-chain (idempotent β€” not an error)
#[error("Task {task_id} already responded")]
TaskAlreadyResponded { task_id: B256 },

/// Contract reverted with a known error selector
#[error("Contract reverted: {name} (0x{selector})")]
ContractRevert {
    selector: String,  // 4-byte hex, e.g. "2e98c533"
    name: String,      // e.g. "TaskAlreadyExists"
    raw_data: Vec<u8>, // full revert data for downstream decoding
},

/// Per-item failure within a batch, with selector-decoded reason
#[derive(Debug, Clone)]
pub enum BatchItemError {
    TaskAlreadyExists,
    TaskAlreadyResponded,
    ContractRevert { selector: String, name: String },
    Unknown { raw: Vec<u8> },
}

Update is_transient() to also match TransactionTimeout:

pub fn is_transient(&self) -> bool {
    matches!(
        self,
        ChainIoError::RpcError(_)
            | ChainIoError::AlloyProviderError(_)
            | ChainIoError::TransactionTimeout { .. }
    )
}

Step 2: Exhaustive selector classification (crates/chainio/src/avs/errors.rs)

Add classify_batch_item_revert(data: &[u8]) -> BatchItemError:

pub fn classify_batch_item_revert(data: &[u8]) -> BatchItemError {
    if data.len() < 4 {
        return BatchItemError::Unknown { raw: data.to_vec() };
    }
    let selector: [u8; 4] = data[..4].try_into().unwrap();
    match selector {
        [0x2e, 0x98, 0xc5, 0x33] => BatchItemError::TaskAlreadyExists,
        [0x68, 0x90, 0x5d, 0xff] => BatchItemError::TaskAlreadyResponded,
        _ => {
            let name = decode_error_selector(selector)
                .unwrap_or_else(|| format!("0x{}", hex::encode(selector)));
            BatchItemError::ContractRevert {
                selector: hex::encode(selector),
                name,
            }
        }
    }
}

Where decode_error_selector is the existing function in writer.rs that maps selectors to names. We'll make it pub and move the
selector-to-name mapping to errors.rs.

Step 3: Implement parse_batch_partial_failure() (crates/chainio/src/avs/writer.rs)

Replace the stub at line 795 with real implementation using as_revert_data():

fn parse_batch_partial_failure(
    error: &alloy::contract::Error,
) -> Option<Vec<BatchFailedItem>> {
    let revert_data = error.as_revert_data()?;
    if revert_data.len() < 4 {
        return None;
    }
    let selector: [u8; 4] = revert_data[..4].try_into().ok()?;
    // BatchPartialFailure selector: 0x19ed9977
    if selector != [0x19, 0xed, 0x99, 0x77] {
        return None;
    }
    // Decode: BatchPartialFailure(FailedItem[] failures)
    // FailedItem = (uint256 index, bytes32 taskId, bytes reason)
    // Use alloy ABI decoding
    use alloy::sol_types::SolType;
    type FailedItemTuple = alloy::sol_types::sol_data::Array<(
        alloy::sol_types::sol_data::Uint<256>,
        alloy::sol_types::sol_data::FixedBytes<32>,
        alloy::sol_types::sol_data::Bytes,
    )>;
    let decoded = FailedItemTuple::abi_decode(&revert_data[4..], true).ok()?;
    let items = decoded
        .into_iter()
        .map(|(index, task_id, reason)| BatchFailedItem {
            index: index.to::<usize>(),
            task_id: B256::from(task_id),
            reason: reason.into(),
        })
        .collect();
    Some(items)
}

On error path in batch_create_and_respond_to_tasks, use it to return ChainIoError::BatchPartialFailure:

Err(e) => {
    if let Some(failures) = parse_batch_partial_failure(&e) {
        return Err(ChainIoError::BatchPartialFailure { failures });
    }
    // Classify the top-level revert
    if let Some(revert_data) = e.as_revert_data() {
        let item_err = errors::classify_batch_item_revert(&revert_data);
        // Convert to appropriate ChainIoError variant
        return Err(match item_err { ... });
    }
    Err(ChainIoError::ContractError(e))
}

Step 4: Add simulate_batch_create_and_respond() to AvsWriter

New method using .call() (eth_call) instead of .send():

pub async fn simulate_batch_create_and_respond(
    &self,
    batch_contract_addr: Address,
    tasks: Vec,
    responses: Vec,
    signature_data_array: Vec,
) -> Result<(), ChainIoError> {
    // Same ABI encoding as batch_create_and_respond_to_tasks
    // but use .call().await instead of .send().await
    contract
        .batchCreateAndRespondToTasks(tasks, responses, sigs)
        .call()
        .await
        .map(|_| ())
        .map_err(|e| {
            if let Some(failures) = parse_batch_partial_failure(&e) {
                return ChainIoError::BatchPartialFailure { failures };
            }
            ChainIoError::ContractError(e)
        })
}

Step 5: Reduce BATCH_RECEIPT_TIMEOUT (crates/chainio/src/avs/writer.rs)

const BATCH_RECEIPT_TIMEOUT: Duration = Duration::from_secs(30); // was 60

Step 6: Change TaskSubmitter trait (crates/gateway/src/task/submitter.rs)

Change batch method return types from Result (anyhow) to Result<TransactionReceipt, ChainIoError>:

async fn batch_create_and_respond_to_tasks(
    &self,
    batch_contract_addr: Address,
    tasks: Vec,
    responses: Vec,
    signature_data_array: Vec,
) -> Result<TransactionReceipt, ChainIoError>;

async fn simulate_batch_create_and_respond(
    &self,
    batch_contract_addr: Address,
    tasks: Vec,
    responses: Vec,
    signature_data_array: Vec,
) -> Result<(), ChainIoError>;

// Also change these existing methods:
async fn batch_create_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
async fn batch_respond_to_tasks(...) -> Result<TransactionReceipt, ChainIoError>;

Remove the format!() error wrapping in the AvsWriter impl β€” return ChainIoError directly.

Step 7: Rewrite flush_combined_batch_with_retry() (crates/gateway/src/rpc/api/batch_submitter.rs)

The core rewrite. Replace string-matching error classification with pattern matching on ChainIoError variants.

Change max_retries default from 3 to 2:
fn default_batch_max_retries() -> u32 { 2 } // was 3

New flush logic (pseudocode):

loop {
    match task_submitter.batch_create_and_respond_to_tasks(...).await {
        Ok(receipt) => {
            // Success β€” notify all callers
            for (item, task, ..) in built_items {
                let _ = item.result_tx.send(Ok((receipt.clone(), task)));
            }
            return;
        }
        Err(ChainIoError::BatchPartialFailure { failures }) => {
            // Structured per-item failures β€” classify each
            let failed_map: HashMap<usize, &BatchFailedItem> = ...;
            let mut healthy = vec![];
            for (i, item) in built_items.into_iter().enumerate() {
                if let Some(failure) = failed_map.get(&i) {
                    let classification = classify_batch_item_revert(&failure.reason);
                    match classification {
                        BatchItemError::TaskAlreadyExists
                        | BatchItemError::TaskAlreadyResponded => {
                            // Idempotent success
                            let _ = item.0.result_tx.send(Ok((dummy_receipt(), item.1)));
                        }
                        _ => {
                            // Non-recoverable β€” fail this item
                            let _ = item.0.result_tx.send(Err(GatewayError::Internal(...)));
                        }
                    }
                } else {
                    healthy.push(item);
                }
            }
            if healthy.is_empty() { return; }
            built_items = healthy;
            attempt += 1;
            if attempt > max_retries { fail_all; return; }
            continue; // retry healthy remainder
        }
        // NOTE: match the receipt timeout BEFORE the generic transient guard β€”
        // is_transient() also matches TransactionTimeout, and the timeout path
        // needs simulation rather than a blind retry.
        Err(ChainIoError::TransactionTimeout { .. }) => {
            // Receipt timeout β€” simulate to classify
            match task_submitter.simulate_batch_create_and_respond(...).await {
                Ok(()) => {
                    // Nothing on-chain yet β€” safe to retry
                    attempt += 1;
                    if attempt > max_retries { fail_all; return; }
                    continue;
                }
                Err(ChainIoError::BatchPartialFailure { failures }) => {
                    // Some items landed β€” classify each (same as above)
                    ...
                }
                Err(e) => {
                    // Simulation itself failed β€” treat as transient
                    attempt += 1;
                    if attempt > max_retries { fail_all; return; }
                    continue;
                }
            }
        }
        Err(e) if e.is_transient() => {
            attempt += 1;
            if attempt > max_retries { fail_all; return; }
            tokio::time::sleep(backoff).await;
            backoff = min(backoff * 2, Duration::from_secs(5));
            continue;
        }
        Err(e) => {
            // Unknown non-transient error β€” try simulation to salvage
            match task_submitter.simulate_batch_create_and_respond(...).await {
                // Same classification as timeout path
                ...
            }
        }
    }
}
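A runnable sketch of the transient-retry backoff used in the loop above: delays double from a base value and cap at 5s, with max_retries = 2 attempts. The 3s base below is an assumed value and backoff_schedule is an illustrative helper, not the crate's:

```rust
use std::time::Duration;

/// Compute the sleep schedule for a transient-error retry budget:
/// each delay doubles the previous one, clamped to `cap`.
fn backoff_schedule(base: Duration, cap: Duration, max_retries: u32) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut d = base;
    for _ in 0..max_retries {
        delays.push(d.min(cap));
        d = (d * 2).min(cap); // mirrors: backoff = min(backoff * 2, 5s)
    }
    delays
}

fn main() {
    // With a 3s base, the two retries sleep 3s then 5s (capped).
    let delays = backoff_schedule(Duration::from_secs(3), Duration::from_secs(5), 2);
    assert_eq!(delays, vec![Duration::from_secs(3), Duration::from_secs(5)]);
    println!("ok");
}
```

The cap keeps the whole transient path bounded: even at max_retries the loop never sleeps longer than 5s between attempts.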

Remove:

  • extract_failed_indices() β€” replaced by direct pattern match on BatchPartialFailure
  • identify_combined_poison_items() β€” replaced by simulate_batch_create_and_respond()
  • is_transient_error() helper β€” replaced by ChainIoError::is_transient()
  • All error_str.contains(...) string matching

Step 8: Update identify_combined_poison_items β†’ simulation via eth_call

Replace the current implementation that sends REAL single-item batch transactions with eth_call simulation:

async fn simulate_and_classify_items(
    items: Vec,
    task_submitter: &Arc,
    batch_contract_addr: Address,
) -> (Vec, Vec, Vec) {
    // (healthy, poison, idempotent)
    // For each item, call simulate_batch_create_and_respond with a single-item batch
    // Classify the result using BatchItemError
}

Step 9: Return TransactionTimeout from writer on receipt timeout

Currently the timeout path in batch_create_and_respond_to_tasks returns a generic error. Change to:

Err(_elapsed) => {
    Err(ChainIoError::TransactionTimeout { timeout_secs: BATCH_RECEIPT_TIMEOUT.as_secs() })
}

Step 10: Update all mock TaskSubmitter implementations (10 impls)

Add simulate_batch_create_and_respond returning Ok(()) (default: all items would succeed).
Change return types on batch methods from anyhow Result to Result<TransactionReceipt, ChainIoError>.

Impls to update:

  1. SuccessSubmitter β€” batch_submitter.rs tests
  2. FailingSubmitter β€” batch_submitter.rs tests
  3. PoisonDetectingSubmitter β€” batch_submitter.rs tests
  4. StallTaskSubmitter β€” tx_worker.rs tests
  5. MockAvsWriter β€” chain_routing.rs tests
  6. MockAvsWriter β€” common/mod.rs tests
  7. MockAvsWriter β€” send_task.rs tests
  8. DummySubmitter β€” secrets_encrypted.rs tests
  9. RecordingTaskSubmitter β€” create_task_proof_cid.rs tests
  10. AvsWriter production impl β€” submitter.rs

Step 11: Update batch_submitter tests

Update existing tests to verify new error handling:

  • test_batch_create_success β†’ verify structured receipt propagation
  • test_batch_create_with_poison_items β†’ verify BatchPartialFailure decoding and strip
  • test_batch_create_panic_isolation β†’ unchanged (panic handling)
  • test_batch_submit_multiple_items β†’ verify idempotent success for TaskAlreadyExists

Add new tests:

  • test_batch_timeout_then_simulate β€” receipt timeout β†’ simulation classifies items
  • test_batch_partial_failure_decoding β€” verify BatchPartialFailure selector decoding
  • test_batch_max_retries_2 β€” verify max 2 retries (not 3)

Verification

  1. make fmt && make clippy β€” no warnings
  2. cargo test -p newton-prover-chainio --lib β€” writer/error tests pass
  3. cargo test -p newton-prover-gateway --lib β€” batch_submitter tests pass
  4. make gateway_e2e β€” single-chain E2E
  5. make e2e scenario=basic β€” unified E2E (production config path)

@denniswon
Copy link
Copy Markdown
Author

β˜… Insight ─────────────────────────────────────
Here's what this implementation achieves across the full error pipeline:

  1. No more string matching: Every error classification uses typed ChainIoError variants or 4-byte selector matching via
    classify_batch_item_revert(). Zero contains() calls on error strings.
  2. No more stub functions: parse_batch_partial_failure() now uses as_revert_data() to extract and ABI-decode FailedItem[] from
    BatchPartialFailure reverts. extract_failed_indices() (the empty stub) is deleted entirely.
  3. Simulation uses eth_call, not real txs: simulate_and_classify() calls simulate_batch_create_and_respond() which uses .call()
    (read-only) instead of .send(). No more wasting gas on poison detection.
  4. Reduced retry budget: max_retries defaults to 2 (was 3), BATCH_RECEIPT_TIMEOUT is 30s (was 60s). Worst-case cascade: 30s Γ— 2 =
    60s, down from 60s Γ— 3 = 180s.
    ─────────────────────────────────────────────────

Here's a summary of all changes made:

Layer 1: Error types (crates/chainio/src/error.rs)

  • Added BatchItemError enum with selector-based classification
  • Added ChainIoError variants: TaskAlreadyExists, TaskAlreadyResponded, ContractRevert, TransactionTimeout, RetriesExhausted,
    TransactionReverted
  • Added is_transient() (includes TransactionTimeout) and is_idempotent() methods

Layer 2: Error classification (crates/chainio/src/avs/errors.rs)

  • Added selectors module with byte-array constants
  • Added classify_batch_item_revert() β€” exhaustive 4-byte selector classification
  • Added classify_top_level_revert() β€” top-level revert to ChainIoError variant

Layer 3: Writer (crates/chainio/src/avs/writer.rs)

  • BATCH_RECEIPT_TIMEOUT: 60s β†’ 30s
  • Implemented parse_batch_partial_failure() (was stub returning None)
  • Added simulate_batch_create_and_respond() using .call() (eth_call)
  • Fixed all batch method error paths with typed ChainIoError variants

Layer 4: TaskSubmitter trait (crates/gateway/src/task/submitter.rs)

  • Batch methods return Result<, ChainIoError> (was Result<> via GatewayError)
  • Added simulate_batch_create_and_respond() to trait + AvsWriter impl

Layer 5: Batch submitter (crates/gateway/src/rpc/api/batch_submitter.rs)

  • max_retries default: 3 β†’ 2
  • Removed is_transient_error(), extract_failed_indices(), strip_combined_by_indices(), strip_by_indices()
  • Rewrote flush_combined_batch_with_retry() with typed ChainIoError pattern matching
  • Rewrote flush_respond_batch() similarly
  • Replaced identify_combined_poison_items() (real txs) with simulate_and_classify() (eth_call)

Layer 6: Mock impls (6 test files)

  • Updated all 9 mock TaskSubmitter impls with new return types and simulate_batch_create_and_respond()
