@denniswon
Created March 26, 2026 22:34
Plan: High-Throughput Batch Transaction Submission
❯ Load Test Report:
Configuration:
- Users: 10
- Spawn rate: 10 users/sec (all spawn immediately)
- Pacing: 15 seconds between requests per user
- Duration: 5 minutes
- Effective throughput: ~10 requests every 15 seconds (~40 req/min)
HTTP Layer — all 200 requests succeeded, median 2.8s, p95 3.5s
On-chain Submit (RPC → NewTaskCreated)
- 141/200 seen (70.5%)
- p50: 50s · p95: 84s · p99: 94s
End-to-End (RPC → on-chain response)
- 82/200 completed (41%)
- p50: 138s · p95: 192s · p99: 204s
Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain — only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx submission layer.
there is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the on-chain task creation and task response submission flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/
⎿  Read crates/gateway/src/rpc/api/tx_worker.rs (597 lines)
⎿  Listed directory .claude/
⏺ This is a critical performance issue. The load test data tells a clear story: the gateway's HTTP layer is fine (sub-4s), but the single-threaded TransactionSubmissionWorker is creating a fatal bottleneck — 59% of tasks never complete end-to-end.
Let me deeply explore the architecture before proposing solutions.
⏺ 3 Explore agents finished (ctrl+o to expand)
├─ Explore tx submission flow · 35 tool uses · 114.9k tokens
│  ⎿ Done
├─ Explore AvsWriter nonce mgmt · 22 tool uses · 118.9k tokens
│  ⎿ Done
└─ Explore gateway config and flow · 29 tool uses · 106.3k tokens
⎿  Done
❯ there is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the on-chain task creation and task response submission flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/ /clarify-plan
⏺ Skill(/clarify-plan)
⎿  Successfully loaded skill
⏺ Good — I've completed a deep exploration of the gateway's transaction submission architecture. Here's what I found before asking questions:
★ Insight ─────────────────────────────────────
The core bottleneck is architectural, not configurational. The TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask AND respondToTask) through a single sequential loop — one tokio task, one mpsc channel, one-at-a-time processing. Each transaction blocks on get_receipt() (12-60s waiting for block confirmation). At 40 req/min the load test generates 80 txs/min, but the worker drains only ~2-4 txs/min, so the queue is fundamentally oversubscribed by 20-40x.
─────────────────────────────────────────────────
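The 20-40x figure is just the ratio of arrival rate to drain rate; a minimal runnable check of that arithmetic, using only numbers from the load test (function name illustrative):

```rust
// Sanity check of the oversubscription figure: 40 req/min, 2 txs per
// task, and a drain rate of 2-4 txs/min (one blocking get_receipt()
// per tx at 12-60s each).
fn oversubscription(arrival_txs_per_min: f64, drain_txs_per_min: f64) -> f64 {
    arrival_txs_per_min / drain_txs_per_min
}

fn main() {
    let arrival = 40.0 * 2.0; // 80 txs/min entering the queue
    println!("best-case drain:  {}x", oversubscription(arrival, 4.0));
    println!("worst-case drain: {}x", oversubscription(arrival, 2.0));
}
```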
Here are my first three questions:
1. Which chain was this load test running against, and what's the target chain for production?
Block time matters enormously here — Base mainnet (2s blocks) vs Sepolia (12s) vs Base Sepolia (2s) changes the theoretical max throughput by 6x. And the nonce pipelining strategy differs by chain's mempool behavior.
2. Is the single-signer constraint (one ECDSA key for both createNewTask and respondToTask) a hard requirement, or can the gateway operate with multiple signer keys?
The current serialization exists solely to prevent nonce conflicts on a shared signer. If we can use N signers (e.g., a pool of hot wallets), we get N× parallelism trivially. If not, we need nonce pipelining (submit tx N+1 before tx N confirms), which is more complex but still achievable.
3. What's the target throughput and latency you need?
The load test hit 40 req/min and broke at 41% completion. Are we aiming for 40 req/min sustained with >99% completion? 100 req/min? 1000? And what's the acceptable e2e latency — is the current 2.8s HTTP response fine (since on-chain is async), or do callers need on-chain confirmation within a specific window?
✻ Brewed for 2m 59s
❯ 1. sepolia, for a prod environment with 10 gateway replicas and 10 task generator signer private keys, 1 for each 2. not a hard requirement I think, but please review the code base (including the Solidity contract code for createNewTask and respondToTask to confirm my understanding): the gateway can operate with multiple signer keys, but all task generators have to be configured onchain (createNewTask and respondToTask are guarded with the onlyTaskGenerator guard). If my understanding is not correct though, then nonce pipelining would be needed 3. we are targeting at least 100 RPS (100 requests per second). Acceptable e2e latency for the offchain response is around 3 seconds or lower, ideally around 2 seconds. For onchain, we have more room, but we definitely want onchain e2e to happen within 1 min of request submission to the gateway, ideally within 15-30 seconds (on Ethereum L1/Sepolia)
⏺ 2 Explore agents finished (ctrl+o to expand)
├─ Explore TaskManager access control · 21 tool uses · 90.8k tokens
│  ⎿ Done
└─ Explore nonce pipelining feasibility · 28 tool uses · 109.1k tokens
⎿  Done
⏺ Excellent findings. Let me synthesize what we now know before asking the next batch of questions.
★ Insight ─────────────────────────────────────
Multiple task generators are fully supported on-chain. The OperatorRegistry uses EnumerableSet.AddressSet — both createNewTask and respondToTask use onlyTaskGenerator, and the Sepolia config already has 2 generators configured. The old aggregator role is deprecated. This means the multi-signer parallelism path is wide open — no nonce pipelining complexity needed.
The math for 100 RPS on Sepolia (12s blocks): Each task needs 2 txs (create + respond). At 100 RPS, that's 200 txs/sec, or 2,400 txs per 12s block. Each signer can land ~1 tx per block optimistically, or ~5 txs/block with nonce pipelining, so with N signers doing round-robin we need enough signers to absorb the throughput. Sepolia's block gas limit and mempool behavior also constrain this.
─────────────────────────────────────────────────
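To make the scale concrete, here is the same arithmetic as a runnable sketch, using the assumed figures above (100 tasks/sec target, 2 txs per task, 12s blocks, ~5 txs per signer per block with pipelining):

```rust
// Fleet math: how many signers would individual (non-batched)
// submission need on Sepolia? All inputs are assumptions from the
// discussion, not measurements.
fn signers_needed(
    tasks_per_sec: u64,
    txs_per_task: u64,
    block_time_s: u64,
    txs_per_signer_per_block: u64,
) -> u64 {
    (tasks_per_sec * txs_per_task * block_time_s).div_ceil(txs_per_signer_per_block)
}

fn main() {
    // Even with aggressive nonce pipelining (5 txs/signer/block),
    // absorbing 2,400 txs per block needs 480 signers.
    println!("{}", signers_needed(100, 2, 12, 5));
}
```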
Batch 2 — Architecture & Deployment Constraints:
4. You mentioned 10 gateway replicas with 10 task generator signer keys. Is each replica currently assigned exactly 1 signer key, or do all 10 replicas share the same key today?
This determines whether we're solving a within-replica parallelism problem (single replica needs multiple signers) or a cross-replica coordination problem (multiple replicas competing for the same nonce space). The 10-key setup suggests you've already anticipated this — I want to confirm the current deployment topology.
5. For the 100 RPS target — is that 100 newt_createTask calls/sec, or 100 total RPC calls/sec across all methods?
At 100 task creations/sec on Sepolia (12s blocks), we'd need to land ~1,200 createNewTask + ~1,200 respondToTask = 2,400 txs per 12-second block. That's aggressive for a single chain. Are you planning to batch tasks, or is each RPC call = 1
on-chain task?
6. Is the respondToTask submission latency-critical, or only createNewTask?
Currently both flow through the same sequential queue. If respondToTask can be more relaxed (e.g., batched every few seconds), we could prioritize createNewTask in a fast lane and batch responses separately — significantly reducing queue contention.
✻ Cooked for 2m 12s
❯ 4. each replica is currently assigned exactly 1 signer key. But this can change if needed. For the deploy pipeline for stagef vs prod, you can refer to @~/projects/deploy 5. 100 newt_createTask calls/sec 6. Neither has a hard latency requirement, but for sure we don't want to incur too much delay or discrepancy between the offchain task/task response and onchain. And if possible, yes, we should explore a batching mechanism so that we can, for example, batch-submit task creation/response every interval (that will be our "block/batch" time). I initially made it the same sequential queue in order to ensure that, for a single task, the task response NEVER gets submitted if the task has not been created successfully. There is no difference between createNewTask vs respondToTask in terms of how latency critical they are
⎿  Listed directory ../deploy/
⏺ Great — the picture is becoming very clear. Let me ask one final batch to nail down the solution design.
★ Insight ─────────────────────────────────────
The math at 100 tasks/sec on Sepolia is extreme. 100 tasks/sec × 2 txs × 12s block time = 2,400 txs per block across the fleet. With 10 replicas at 1 signer each, that's 240 txs/block/replica. Individual transaction submission simply cannot scale to this — batching is mandatory. A BatchTaskManager contract registered as a task generator could batch N createNewTask calls into a single tx, reducing the problem by 1-2 orders of magnitude.
─────────────────────────────────────────────────
Batch 3 — Solution Constraints:
7. Is deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) acceptable?
This is the highest-leverage change. A contract registered as a task generator that batches 50-100 tasks per transaction would cut the tx count by 50-100x. Same for respondToTask. The alternative — submitting 240 individual txs per block per replica — is infeasible on any EVM chain regardless of nonce pipelining. Is adding a new contract to the deployment pipeline acceptable, or are we restricted to off-chain-only changes?
8. For the create→respond ordering invariant: is it acceptable if creates and responds are in different batches submitted sequentially (batch of creates first, then batch of responds a few seconds later)?
This preserves your invariant (no response without a successful create) while allowing independent batching windows. The tradeoff is slightly higher latency for the respond step (waits for the next batch interval).
9. What's the gas budget per transaction on Sepolia/mainnet for your task generator signers?
Batching 100 createNewTask calls in one tx could cost 20-50M gas. Sepolia's block gas limit is 30M. If individual createNewTask costs ~200-300K gas, we can fit ~100 per batch. But if it costs 500K+, we're limited to ~60. Do you know the
approximate gas cost of a single createNewTask and respondToTask today?
✻ Crunched for 39s
❯ 7. deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) is acceptable 8. yes, acceptable 9. I do not — is there a way to estimate?
⏺ Let me estimate gas costs from the existing test suite and contract code.
⏺ Bash(cd /Users/dennis.won/projects/newton-prover-avs/contracts && forge test --match-test "test_createTask\|test_respondToTask\|test_createNewTask" --gas-report -vv…)
⎿  Compiling 150 files with Solc 0.8.27
Solc 0.8.27 finished in 22.85s
Compiler run successful with warnings:
… +13 lines (ctrl+o to expand)
⎿  (timeout 2m)
⏺ Explore(Explore createNewTask gas usage)
⎿  Done (36 tool uses · 91.3k tokens · 59s)
(ctrl+o to expand)
⏺ Good — gas estimates from code analysis:
- createNewTask: ~100-150K gas
- respondToTask: ~400-800K gas (BLS verification dominates)
- Sepolia block gas limit: 30M
★ Insight ─────────────────────────────────────
Batching math is very favorable. At 150K gas/create, we can fit ~200 creates per batch tx. At 800K gas/respond, ~37 responds per batch. Per replica at 10 tasks/sec over a 12s block: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = ~5
txs per block per signer. That's trivially achievable. Batching turns a 240-tx-per-block problem into a 5-tx-per-block problem.
─────────────────────────────────────────────────
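The batching math above, as a runnable check (gas figures are the code-analysis estimates, not measured on-chain values):

```rust
// How many items fit in one batch tx, and how many batch txs a
// replica needs per block. 30M is Sepolia's block gas limit; 150K and
// 800K are the estimated per-item costs from the exploration above.
fn max_items_per_batch(block_gas_limit: u64, gas_per_item: u64) -> u64 {
    block_gas_limit / gas_per_item
}

fn main() {
    let creates_per_batch = max_items_per_batch(30_000_000, 150_000); // 200
    let responds_per_batch = max_items_per_batch(30_000_000, 800_000); // 37
    // Per replica per 12s block at 10 tasks/sec: 120 creates + 120 responds.
    let txs_per_block =
        120u64.div_ceil(creates_per_batch) + 120u64.div_ceil(responds_per_batch);
    println!("batch txs per block per replica: {txs_per_block}");
}
```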
I now have enough context to design the solution. Let me move to the Design phase.
⏺ Plan(Design batch tx submission plan)
⎿  Done (49 tool uses · 99.0k tokens · 3m 32s)
(ctrl+o to expand)
⏺ The plan from the design agent is comprehensive. Let me verify one critical detail before writing the final plan.
⎿  Loaded ../.claude/rules/solidity.md
Searched for 1 pattern, read 1 file (ctrl+o to expand)
⏺ Now I have the full picture. Let me read the rest of respondToTask to confirm the batch contract's parameter requirements, then write the final plan.
Read 1 file (ctrl+o to expand)
⏺ Good — I have the full picture. Now let me write the final plan.
⏺ Updated plan
⎿  /plan to preview
⏺ The plan is written. Here's the summary:
★ Insight ─────────────────────────────────────
The key insight is that batching transforms the problem entirely. Instead of 240 individual transactions competing for block space per replica per block (impossible), we get 5 batch transactions per block (trivial). The BatchTaskManager contract is the highest-leverage change — it reduces the on-chain tx count by ~50x. The Rust-side batch accumulator then collects tasks into time-windowed batches (2s intervals), and nonce pipelining ensures those 5 txs don't block sequentially on receipt confirmation. The entire solution is feature-gated behind batch_submitter.enabled, so the old sequential worker stays as fallback.
The ordering invariant (no respondToTask before createNewTask confirms) is preserved via a DashMap<TaskId, ()> shared between the create and respond accumulators — responds are held until their corresponding create is confirmed on-chain.
─────────────────────────────────────────────────
4 layers, 6 implementation steps:
1. BatchTaskManager.sol — batch N calls into 1 tx (Solidity, no Rust dependency)
2. AvsWriter batch methods — Rust bindings to call the batch contract
3. BatchTransactionSubmitter — replaces tx_worker.rs with time-windowed batch accumulation
4. Nonce pipelining — submit 5 batch txs concurrently, await receipts in parallel
5. Integration testing + load test at 100 tasks/sec
6. Multi-signer pool (only if needed — math says it won't be)
⏺ User approved Claude's plan
⎿  Plan saved to: ~/.claude/plans/valiant-giggling-moler.md · /plan to edit
Plan: High-Throughput Batch Transaction Submission
Context
Load testing at just 40 req/min revealed a critical bottleneck: TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask and respondToTask) through a single sequential loop, blocking on get_receipt() (12-60s per
tx on Sepolia). Only 41% of tasks completed end-to-end; 64.5% hit the 180s timeout.
Target: 100 tasks/sec sustained across 10 gateway replicas (10/sec/replica), with <3s offchain response and <30s on-chain e2e on Sepolia (12s blocks).
Root cause: The worker can process ~2-4 txs/min but needs 200 txs/sec. Throughput gap: ~3000x.
Solution: Four coordinated layers — a Solidity batch contract, a Rust batch accumulator, nonce pipelining, and an optional multi-signer pool.
---
Layer 1: BatchTaskManager Solidity Contract
What
A new standalone contract that batches N createNewTask and N respondToTask calls into single transactions. Registered as a task generator on OperatorRegistry.
Why
At 150K gas/create, one batch tx fits ~200 creates (30M block gas limit). At 800K gas/respond, ~37 per batch. Per replica per 12s block at 10 tasks/sec: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = 5 txs per block instead of
240. This is the highest-leverage change.
Design
Error semantics: try/catch per item (not revert-all). A single TaskAlreadyExists or TaskAlreadyResponded in a batch of 120 must not revert the other 119.
File: contracts/src/BatchTaskManager.sol
contract BatchTaskManager {
    INewtonProverTaskManager public immutable taskManager;

    // Per-item result events
    event BatchCreateResult(bytes32 indexed taskId, bool success, bytes reason);
    event BatchRespondResult(bytes32 indexed taskId, bool success, bytes reason);

    constructor(address _taskManager) {
        taskManager = INewtonProverTaskManager(_taskManager);
    }

    function batchCreateTasks(
        INewtonProverTaskManager.Task[] calldata tasks
    ) external returns (uint256 successCount) {
        for (uint256 i; i < tasks.length;) {
            try taskManager.createNewTask(tasks[i]) {
                emit BatchCreateResult(tasks[i].taskId, true, "");
                unchecked { ++successCount; }
            } catch (bytes memory reason) {
                emit BatchCreateResult(tasks[i].taskId, false, reason);
            }
            unchecked { ++i; }
        }
    }

    function batchRespondToTasks(
        INewtonProverTaskManager.Task[] calldata tasks,
        INewtonProverTaskManager.TaskResponse[] calldata responses,
        bytes[] calldata signatureDataArray
    ) external returns (uint256 successCount) {
        require(tasks.length == responses.length && responses.length == signatureDataArray.length);
        for (uint256 i; i < tasks.length;) {
            try taskManager.respondToTask(tasks[i], responses[i], signatureDataArray[i]) {
                emit BatchRespondResult(responses[i].taskId, true, "");
                unchecked { ++successCount; }
            } catch (bytes memory reason) {
                emit BatchRespondResult(responses[i].taskId, false, reason);
            }
            unchecked { ++i; }
        }
    }
}
Access control: The BatchTaskManager address is registered as a task generator via OperatorRegistry.addTaskGenerator(). The contract itself needs no auth modifier — only registered task generators can call taskManager.createNewTask (the contract is the msg.sender to TaskManager). Optionally add an onlyAuthorized mapping to prevent gas griefing from arbitrary callers.
Files to create/modify
┌─────────────────────────────────────────────────┬────────────────────────────────────────────────────────────┐
│ File                                            │ Action                                                     │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/src/BatchTaskManager.sol              │ Create — batch contract                                    │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/src/interfaces/IBatchTaskManager.sol  │ Create — interface for Rust bindings                       │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/test/BatchTaskManager.t.sol           │ Create — unit tests                                        │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/script/BatchTaskManagerDeployer.s.sol │ Create — deployment script following DeploymentLib pattern │
└─────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
Deployment steps
1. Deploy BatchTaskManager with existing TaskManager address
2. Call OperatorRegistry.addTaskGenerator(batchTaskManagerAddress)
3. Write deployment to script/deployments/batch-task-manager/{chainId}-{env}.json
4. Run make generate-bindings to auto-generate Rust types in crates/core/src/generated/
---
Layer 2: Batch Accumulator (Rust — replaces tx_worker.rs)
What
Replace the sequential TransactionSubmissionWorker with BatchTransactionSubmitter that accumulates tasks into time-windowed batches and submits via the BatchTaskManager contract.
Architecture
sync.rs ─────► CreateBatch Accumulator ──flush every 2s──► batchCreateTasks() on-chain
                      │
                      (tracks confirmations via DashMap)
                      │
sync.rs ─────► RespondBatch Accumulator ─flush after creates confirm──► batchRespondToTasks()
Two independent accumulator loops as background tokio tasks, connected by a shared DashMap<TaskId, ()> that tracks which creates have been confirmed on-chain.
Batch flush triggers
1. Timer: batch_interval_ms elapses (default 2000ms)
2. Size: the batch reaches max_batch_size
The two accumulator loops run and flush independently.
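The two triggers can be sketched as a plain struct (a single-threaded stand-in for the tokio loop; field names mirror the config, but the type and methods are illustrative):

```rust
use std::time::{Duration, Instant};

// Minimal accumulator with the two flush triggers: size and timer.
struct Accumulator<T> {
    buf: Vec<T>,
    max_batch_size: usize,
    batch_interval: Duration,
    last_flush: Instant,
}

impl<T> Accumulator<T> {
    fn new(max_batch_size: usize, batch_interval: Duration) -> Self {
        Self { buf: Vec::new(), max_batch_size, batch_interval, last_flush: Instant::now() }
    }

    // Size trigger: flush as soon as the batch is full.
    fn push(&mut self, item: T) -> Option<Vec<T>> {
        self.buf.push(item);
        (self.buf.len() >= self.max_batch_size).then(|| self.flush())
    }

    // Timer trigger: called on every interval tick.
    fn tick(&mut self) -> Option<Vec<T>> {
        (self.last_flush.elapsed() >= self.batch_interval && !self.buf.is_empty())
            .then(|| self.flush())
    }

    fn flush(&mut self) -> Vec<T> {
        self.last_flush = Instant::now();
        std::mem::take(&mut self.buf)
    }
}
```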
Ordering invariant
The respond accumulator holds items in a pending queue. On each flush tick, it moves items whose task_id appears in the confirmed_creates: DashMap<TaskId, ()> into the current respond batch. Items waiting longer than respond_wait_timeout_ms
(default 60s) are dropped with an error log.
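A single-threaded sketch of that gating, with a plain HashSet standing in for the shared DashMap (the type and method names here are illustrative, not the plan's actual API):

```rust
use std::collections::HashSet;

// Responds are released only once their create has confirmed;
// anything waiting past the timeout is dropped.
struct RespondGate {
    confirmed_creates: HashSet<u64>, // task ids whose create tx has confirmed
    pending: Vec<(u64, u64)>,        // (task_id, enqueued_at_ms)
    wait_timeout_ms: u64,
}

impl RespondGate {
    // Called on each flush tick: returns responds ready for this batch,
    // keeps unconfirmed ones, and drops anything past the wait timeout.
    fn drain_ready(&mut self, now_ms: u64) -> Vec<u64> {
        let confirmed = &self.confirmed_creates;
        let timeout = self.wait_timeout_ms;
        let mut ready = Vec::new();
        self.pending.retain(|&(task_id, enqueued_at_ms)| {
            if confirmed.contains(&task_id) {
                ready.push(task_id);
                false // move into the current respond batch
            } else if now_ms.saturating_sub(enqueued_at_ms) > timeout {
                false // create never confirmed; drop (with an error log in the real impl)
            } else {
                true // keep waiting for the create to confirm
            }
        });
        ready
    }
}
```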
Public interface (backward compatible)
BatchTransactionSubmitter exposes the same methods as TransactionSubmissionWorker:
- submit_create_task(task_id, params) -> oneshot::Receiver<Result<...>>
- submit_aggregated_response(task_id, task_request, service_response) -> Result<()>
- queue_capacity() -> usize / queue_depth() -> usize
This means sync.rs and handler/mod.rs only swap the type name.
Config
pub struct BatchSubmitterConfig {
    pub enabled: bool,                   // default: false (feature-gated rollout)
    pub max_create_batch_size: usize,    // default: 120
    pub max_respond_batch_size: usize,   // default: 30
    pub batch_interval_ms: u64,          // default: 2000
    pub respond_wait_timeout_ms: u64,    // default: 60000
    pub batch_contract_address: Address, // from deployment JSON
}
Files to create/modify
┌───────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────┐
│ File                                          │ Action                                                                           │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Create — new batch accumulator                                                   │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/mod.rs             │ Modify — add mod batch_submitter export                                          │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/config.rs                  │ Modify — add BatchSubmitterConfig to GatewayConfig                               │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/handler/mod.rs             │ Modify — instantiate BatchTransactionSubmitter when enabled (lines ~549, ~856)   │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs              │ Modify — add batch_create_tasks() and batch_respond_to_tasks() methods           │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/task/submitter.rs          │ Modify — add batch methods to TaskSubmitter trait (or create BatchTaskSubmitter) │
└───────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┘
Migration path
1. Phase A: Ship behind batch_submitter.enabled = false. Old TransactionSubmissionWorker remains default.
2. Phase B: Enable on stagef, load test. Flip default to true.
3. Phase C: Remove TransactionSubmissionWorker after production validation.
---
Layer 3: Nonce Pipelining
What
After batching reduces txs from 240/block to ~5/block, we still need those 5 txs submitted without blocking on each receipt sequentially. Split send() from get_receipt().
Design
In the batch flush path:
// Fire phase: submit all batch txs with incrementing nonces
let mut pending_txs = Vec::new();
for batch in batches {
    let pending = batch_contract.batchCreateTasks(batch).send().await?;
    pending_txs.push(pending);
}

// Await phase: concurrently wait for all receipts
let receipts = futures::future::join_all(
    pending_txs.into_iter().map(|tx| tx.get_receipt())
).await;
Alloy's nonce manager auto-increments on each send() call, so nonces [N, N+1, N+2, ...] are assigned correctly without manual tracking.
Recovery
If tx at nonce N fails after N+1 is in mempool: N+1 is stuck. The recovery path:
1. Detect the failure from the receipt
2. Re-submit N with bumped gas (replacement tx)
3. Or cancel N with a zero-value self-transfer and re-submit the batch
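For step 2, the replacement must exceed the node's minimum fee bump to be accepted into the mempool; a sketch assuming geth's default 10% price-bump threshold (other clients and pool configs may differ):

```rust
// Compute the minimum fee for a same-nonce replacement tx. The 10%
// floor is geth's default (--txpool.pricebump); treat it as an
// assumption, not a protocol guarantee.
fn bumped_fee(stuck_fee_wei: u128, bump_percent: u128) -> u128 {
    stuck_fee_wei + stuck_fee_wei * bump_percent / 100
}

fn main() {
    // A tx stuck at 20 gwei max fee needs at least 22 gwei to replace.
    println!("{}", bumped_fee(20_000_000_000, 10));
}
```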
Files to create/modify
┌───────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────┐
│ File                                          │ Action                                                               │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs              │ Modify — add send_without_receipt() variant for pipelined submission │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Modify — use pipelined send in flush logic                           │
└───────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘
---
Layer 4: Multi-Signer Pool (Optional β€” only if needed)
What
Round-robin across N signer keys per replica. Each signer has independent nonce space.
When to implement
Only if load testing after Layers 1-3 shows a single signer can't keep up. The math says 5 txs/block is well within single-signer capacity, so this is likely unnecessary.
Design
pub struct SignerPool {
    signers: Vec<Arc<AvsWriter>>,
    next: AtomicUsize,
}
All signer addresses call the same BatchTaskManager contract (which is the registered task generator). No on-chain changes needed beyond Layer 1.
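The round-robin rotation can be sketched generically (the real pool would hold Arc&lt;AvsWriter&gt; values; the constructor and method names here are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Round-robin signer selection: each call hands out the next signer,
// wrapping around the pool.
struct SignerPool<T> {
    signers: Vec<T>,
    next: AtomicUsize,
}

impl<T> SignerPool<T> {
    fn new(signers: Vec<T>) -> Self {
        assert!(!signers.is_empty(), "pool needs at least one signer");
        Self { signers, next: AtomicUsize::new(0) }
    }

    // fetch_add wraps on overflow, so the modulo keeps the index in
    // bounds even after usize::MAX selections.
    fn next_signer(&self) -> &T {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.signers.len();
        &self.signers[i]
    }
}
```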
Files
┌───────────────────────────────────┬────────────────────────────────────────────────────────┐
│ File                              │ Action                                                 │
├───────────────────────────────────┼────────────────────────────────────────────────────────┤
│ crates/gateway/src/signer_pool.rs │ Create (only if needed)                                │
├───────────────────────────────────┼────────────────────────────────────────────────────────┤
│ crates/gateway/src/config.rs      │ Modify — add additional_signers to TaskGeneratorConfig │
└───────────────────────────────────┴────────────────────────────────────────────────────────┘
---
Expected Performance
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric β”‚ Current β”‚ After Fix β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Offchain response β”‚ 2.8s (OK) β”‚ 2.8s (unchanged) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ On-chain task creation β”‚ p50=50s β”‚ ~14s (2s accumulate + 12s block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ On-chain e2e β”‚ p50=138s β”‚ ~28s (create confirm + 2s + 12s block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Completion rate β”‚ 41% β”‚ >99% β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Max throughput/replica β”‚ ~2-4 txs/min β”‚ 10 tasks/sec (120/block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Txs per block/replica β”‚ 240 (impossible) β”‚ 5 (trivial) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
---
Implementation Sequence
Step 1: BatchTaskManager contract (no Rust dependency)
1. Create contracts/src/BatchTaskManager.sol and interface
2. Create contracts/test/BatchTaskManager.t.sol with tests:
- All succeed, partial failure, auth checks, gas measurement
3. Create deployment script following DeploymentLib pattern
4. forge test && forge fmt
5. Run make generate-bindings
Step 2: AvsWriter batch methods (depends on Step 1 bindings)
1. Add batch_create_tasks() and batch_respond_to_tasks() to AvsWriter
2. Unit tests for the new methods
Step 3: BatchTransactionSubmitter (depends on Step 2)
1. Implement BatchTransactionSubmitter in batch_submitter.rs
2. Add BatchSubmitterConfig to GatewayConfig
3. Wire into handler/mod.rs behind enabled flag
4. Unit tests for accumulator logic, ordering invariant, flush triggers
Step 4: Nonce pipelining (depends on Step 3)
1. Add pipelined send to the flush path
2. Add nonce gap recovery logic
Step 5: Integration testing
1. Update E2E tests to exercise batch path
2. Load test: 100 tasks/sec sustained for 5 minutes
3. Verify ordering invariant
4. Verify partial batch failure handling
Step 6: Multi-signer pool (only if Step 5 load test shows need)
---
Verification
1. forge test β€” all BatchTaskManager contract tests pass
2. make clippy β€” no new warnings
3. make tests β€” existing unit tests unaffected (feature-gated)
4. make gateway_e2e β€” existing E2E still works with old worker (default)
5. make gateway_e2e with BATCH_SUBMITTER_ENABLED=true β€” E2E works with new batch path
6. Load test at 100 tasks/sec for 5 min: >99% completion, <30s on-chain e2e
---
Risks and Mitigations
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Risk β”‚ Mitigation β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Batch contract deployment fails β”‚ Old sequential worker remains as fallback (feature flag) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Partial batch failure loses tasks β”‚ try/catch per item; failures logged with task ID for retry β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Nonce gap from failed pipeline tx β”‚ Re-query nonce, re-submit remaining; fallback to sequential β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Respond submitted before create confirms β”‚ DashMap gating enforces ordering; items time out rather than submit early β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Gas estimation wrong for large batches β”‚ Conservative defaults (120 creates, 30 responds); configurable per env β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Offchain response latency regresses β”‚ Batch accumulator is completely decoupled from RPC response path β€” no impact β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
@denniswon
Copy link
Copy Markdown
Author

Good β€” so GATEWAY__TASK_GENERATOR__SIGNER__PRIVATE_KEY env var overrides the TOML config's gateway.task_generator.signer.private_key
via the __ separator convention. Each ECS service gets a different value from AWS Secrets Manager (TASK_GENERATOR_KEY_1 through
TASK_GENERATOR_KEY_10), injected as the same env var name.

This means each replica has its own independent nonce space. My original Root Cause 2 (nonce contention across replicas) is wrong. Let
me revise the analysis.


Revised Load Test Failure Analysis

Root Cause 1 (P0): Empty policy_params crashes ALL operator evaluations

This is unchanged and remains the dominant issue. Every quorum failure in the logs traces back to:

policy_params JSON is not valid policyParams object: EOF while parsing a value at line 1 column 0

At crates/chainio/src/identity_data.rs:364-368, empty policyParams bytes from the on-chain PolicyConfig are converted to "", which
serde_json::from_str can't parse. Both operators fail, quorum is 0%, every task fails.

The fix is a single guard before line 364:

if ctx.policy_params.is_empty() {
    info!("Empty policy_params, proceeding without identity data.");
    return Ok(None);
}

Root Cause 2 (Revised): Batch create transient errors β€” NOT nonce contention

Since each replica has its own signer key, the "batch create: transient error" pattern is not caused by cross-replica nonce races. The
actual cause is likely one of:

  1. RPC provider rate limiting or instability under load β€” 10 replicas simultaneously hitting Sepolia RPC endpoints (Tenderly gateway)
    could trigger rate limits (429) or internal errors (503)
  2. Per-replica nonce races between aggregator + batch submitter β€” within a single replica, the AvsWriter is shared via
    new_with_provider (shared nonce manager), but if there's a code path creating a second provider instance, intra-replica nonce conflict
    is possible
  3. Sepolia mempool congestion β€” under load testing, many batch txs from 10 different signers could cause mempool-level issues

The Datadog CSV doesn't show the actual error. The structured field error = %e is logged at WARN level (batch_submitter.rs:432), but
the CSV only captures the message template. To diagnose this precisely, we need the structured error field from Datadog β€” either via
the Datadog UI facets or by adding the error to the message string itself.
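The "add the error to the message string" option can be sketched as below. The function name is illustrative (the real call sites are in batch_submitter.rs); the point is that a structured field like `error = %e` can be dropped by a log pipeline that only keeps the message template, while text inlined into the message survives any sink.

```rust
// Sketch: inline the error into the message so pipelines that capture only
// the message template still surface it. Names are hypothetical.
fn batch_create_error_message(e: &str) -> String {
    // Before: warn!(error = %e, "batch create: transient error");
    // After: the error text is part of the message itself.
    format!("batch create: transient error: {e}")
}

fn main() {
    let msg = batch_create_error_message("429 Too Many Requests");
    assert!(msg.contains("429 Too Many Requests"));
    println!("{msg}");
}
```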

β˜… Insight ─────────────────────────────────────
The transient batch create errors are a secondary issue β€” they only affect on-chain task creation, not the evaluation path. Even if
batch creates succeed, the tasks would still fail quorum because of Root Cause 1 (empty policy_params). Fixing Root Cause 1 first is
critical because it blocks ALL task evaluations regardless of batch submission health.
─────────────────────────────────────────────────

Root Cause 3: Zombie aggregator tasks (360s timeout with 0 signatures)

Unchanged β€” these are the downstream effect of quorum failure. When centralized broadcast fails, the TaskAggregator slot sits idle for
360s. The fix is to cancel the aggregator immediately when quorum fails.
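The eager-cancel fix can be sketched as follows. It is a simplified model of the described aggregator, assuming hypothetical names patterned on the transcript (`task_states`, `task_response_receivers`, `initialize_task`, `cancel_task`), with `HashMap` standing in for `DashMap` and a unit value standing in for the oneshot receiver.

```rust
use std::collections::HashMap;

// Simplified model of the aggregator state described above.
struct AggregatorCore {
    task_states: HashMap<String, Vec<u8>>,        // task_id -> collected signatures
    task_response_receivers: HashMap<String, ()>, // stand-in for oneshot receivers
}

impl AggregatorCore {
    fn initialize_task(&mut self, task_id: &str) {
        self.task_states.insert(task_id.to_string(), Vec::new());
        self.task_response_receivers.insert(task_id.to_string(), ());
    }

    // Must remove from BOTH maps, or the slot leaks until the TTL sweep.
    fn cancel_task(&mut self, task_id: &str) {
        self.task_states.remove(task_id);
        self.task_response_receivers.remove(task_id);
    }
}

// Simulated broadcast that fails quorum.
fn broadcast(_task_id: &str) -> Result<(), &'static str> {
    Err("quorum not reached")
}

fn main() {
    let mut core = AggregatorCore {
        task_states: HashMap::new(),
        task_response_receivers: HashMap::new(),
    };
    core.initialize_task("task-1");
    if broadcast("task-1").is_err() {
        // Eager cleanup on the failure path instead of waiting ~360s for TTL.
        core.cancel_task("task-1");
    }
    assert!(core.task_states.is_empty());
    assert!(core.task_response_receivers.is_empty());
    println!("slots after failure: {}", core.task_states.len());
}
```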


Revised Priority

β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ # β”‚ Issue β”‚ Impact β”‚ Fix β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ P0 β”‚ Empty policy_params β†’ operator β”‚ 100% of tasks fail β”‚ Early return in identity_data.rs if β”‚
β”‚ β”‚ deserialization crash β”‚ quorum β”‚ policy_params.is_empty() β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ P1 β”‚ Batch create transient errors (unknown β”‚ ~30% of batch creates β”‚ Surface actual error in log message; investigate RPC β”‚
β”‚ β”‚ RPC cause) β”‚ fail β”‚ provider limits β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ P2 β”‚ Zombie aggregator slots (360s idle) β”‚ Memory/resource waste β”‚ Cancel aggregator on quorum failure β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The empty policy_params fix is a one-liner that eliminates every quorum failure in this log. The batch create issue needs better
observability (surface the actual error string) before we can solve it. Want me to implement the P0 fix now?

@denniswon
Copy link
Copy Markdown
Author

β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ # β”‚ Claim β”‚ Status β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1 β”‚ fetch_identity_data crashes on empty policy_params β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2 β”‚ Guard returns Ok(None) before JSON parse β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 3 β”‚ AggregatorCore DashMap fields β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 4 β”‚ cancel_task removes from both maps β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 5 β”‚ Both quorum failure paths call cancel_task β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 6 β”‚ Trait method added to AggregatorService β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 7 β”‚ 6 mock impls + 1 real impl = 7 total β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 8 β”‚ Error inlined in batch log message strings β”‚ VERIFIED β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

No false statements found. PR #492 is ready: https://github.com/newt-foundation/newton-prover-avs/pull/492

✻ Worked for 1m 21s

❯ /docs-sync

⏺ Let me run through the docs-sync phases.

⏺ Bash(git log --oneline $(git merge-base HEAD main)..HEAD)
⎿  4604667f fix: guard empty policy_params, surface batch errors, cancel zombie aggregators

⏺ Bash(git diff $(git merge-base HEAD main)..HEAD --stat)
⎿   crates/aggregator/src/core.rs | 10 ++++++++++
crates/chainio/src/identity_data.rs | 7 +++++++
crates/gateway/src/rpc/api/batch_submitter.rs | 13 +++++--------
… +8 lines (ctrl+o to expand)

⏺ Single commit, 10 files. The changes are: (1) empty policy_params guard, (2) batch error log formatting, (3) cancel_task trait method + impls. Let me check if any docs reference these areas.

⏺ Bash(git diff $(git merge-base HEAD main)..HEAD)
⎿  diff --git a/crates/aggregator/src/core.rs b/crates/aggregator/src/core.rs
index 792238fe..64dcd2b8 100644
--- a/crates/aggregator/src/core.rs
… +78 lines (ctrl+o to expand)

⏺ Now let me check if any documentation references these areas and needs updating.

Searched for 2 patterns (ctrl+o to expand)

⏺ The docs reference fetch_identity_data and policy_params but describe the flow correctly β€” they mention parsing policy_params for
identity_domain. The new guard is a defensive check that happens before that parsing and doesn't change the documented flow. No doc
updates needed for the identity path.

Now check if lessons.md should get a new entry β€” this is the kind of production gotcha that belongs there:

Searched for 2 patterns (ctrl+o to expand)

⏺ No existing entry for this. Let me add it to lessons.md and the Common Pitfalls table in CLAUDE.md. Let me also check the Common
Pitfalls table format:

Read 2 files (ctrl+o to expand)
⎿  Loaded .claude/CLAUDE.md
⎿  Loaded .claude/rules/CLAUDE.md

⏺ Good. I need to add two things:

  1. A new Common Pitfalls entry for empty policy_params
  2. A new lessons.md entry

Let me find the right insertion point in lessons.md for the identity/on-chain data section:

Searched for 1 pattern (ctrl+o to expand)

⏺ I'll add the new lesson under "Identity Domain Dispatch" (relevant section) and add a Common Pitfalls entry. Also, the "Adding a
method to a trait breaks all mock implementations" lesson at line 176 is directly relevant to what we just did β€” good that it's
already there.

Read 1 file (ctrl+o to expand)

⏺ Now apply both doc updates:

⏺ Update(.claude/rules/lessons.md)
⎿  Added 16 lines
368
369 New domains require changes in: libs/regorus (Rego extensions + feature flag), crates/core (re-export new type), crates/chainio (new IdentityDomain variant + deserialization arm), and crates/operator (if the Rego return type changes). See docs/IDENTITY_DOMAIN_GUIDE.md for the step-by-step checklist.
370
371 +### Empty policy_params (0x) crashes identity data processing
372 +
373 +When PolicyConfig.policyParams is empty bytes on-chain (common for policy clients without identity domains), fetch_identity_data tried to parse the empty string as JSON, failing with "EOF while parsing a value at line 1 column 0". This crashed every operator's identity processing for that task, producing 100% quorum failure.
374 +
375 +Prevention: Always guard Bytes fields from on-chain with an emptiness check before parsing. The contract returns 0x (empty) for unset bytes fields β€” never assume they contain valid encoded data.
376
377 ---
378
379 +## Resource Cleanup
380 +
381 +### Cancel BLS aggregation slots on broadcast failure
382 +
383 +initialize_task creates a DashMap entry in task_states and a oneshot receiver in task_response_receivers. If the subsequent broadcast fails (quorum not reached, operator errors), these slots become zombies waiting for signatures that will never arrive. They persist until the periodic cleanup_stale_tasks TTL sweep (default 6 minutes). Under load with frequent failures, this leaks memory proportional to failure rate x TTL.
384 +
385 +Prevention: Call aggregator.cancel_task(&task_id) in every error path that follows initialize_task. Both the two-phase and centralized broadcast error handlers in sync.rs must include this cleanup.
386 +
387 +---
388 +
389 ## Database Migration Gotchas
390
391 ### sqlx migration tracking must be consistent

⏺ Now add the Common Pitfalls entry to both CLAUDE.md files (root and .claude/CLAUDE.md must stay in sync per docs-sync rules):

⏺ Update(CLAUDE.md)
⎿  Added 1 line
252 | QuorumNotReached with no operator error details | Unified quorum broadcast silently drops operator RPC errors (policy eval failures) | Check operator_errors field in CreateTaskResponse; see operator.rs broadcast loops |
253 | BLS aggregation completes but no on-chain submission logs | Batch accumulator loop hung on get_receipt().await (tx dropped from mempool) | Restart gateway; BATCH_RECEIPT_TIMEOUT (60s) in writer.rs prevents recurrence |
254 | "batch_create_tasks: all items succeeded" but task not on-chain | get_receipt() returned receipt with status: false (on-chain revert) without error | receipt.status() check now returns CreateNewTaskCallFail with tx hash |
255 +| QuorumNotReached with identity deserialization error in operator logs | PolicyConfig.policyParams is empty bytes (0x) on-chain | fetch_identity_data now guards with is_empty() before JSON parse; policy clients without identity domains skip identity processing |
256
257 ## Multichain Configuration
258

⏺ Update(.claude/CLAUDE.md)
⎿  Added 1 line
252 | QuorumNotReached with no operator error details | Unified quorum broadcast silently drops operator RPC errors (policy eval failures) | Check operator_errors field in CreateTaskResponse; see operator.rs broadcast loops |
253 | BLS aggregation completes but no on-chain submission logs | Batch accumulator loop hung on get_receipt().await (tx dropped from mempool) | Restart gateway; BATCH_RECEIPT_TIMEOUT (60s) in writer.rs prevents recurrence |
254 | "batch_create_tasks: all items succeeded" but task not on-chain | get_receipt() returned receipt with status: false (on-chain revert) without error | receipt.status() check now returns CreateNewTaskCallFail with tx hash |
255 +| QuorumNotReached with identity deserialization error in operator logs | PolicyConfig.policyParams is empty bytes (0x) on-chain | fetch_identity_data now guards with is_empty() before JSON parse; policy clients without identity domains skip identity processing |
256
257 ## Multichain Configuration
258

⏺ Now verify comments in the changed code are accurate, and check for any dedup issues per docs-sync rules:

Searched for 1 pattern (ctrl+o to expand)

⏺ Let me verify that "6 minutes" claim β€” check the actual TTL value:

Searched for 2 patterns (ctrl+o to expand)

⏺ The TTL is task_response_window_block * block_time_ms * 2 / 1000. It's not a fixed 6 minutes β€” it's dynamic based on config. Let me
check typical values:

Searched for 1 pattern (ctrl+o to expand)

⏺ 30 blocks. For Sepolia (~12s block time): 30 * 12000 * 2 / 1000 = 720s = 12 minutes. For local anvil with block_time=1: 30 * 1000 * 2
/ 1000 = 60s. The "6 minutes" claim is wrong β€” it varies. Let me fix the doc comment to say "several minutes" instead of a specific
number:

⏺ Update(crates/gateway/src/task/aggregator.rs)
⎿  Added 1 line, removed 1 line
116 ///
117 /// Called on quorum failure or broadcast error to free the BLS aggregation state
118 /// that initialize_task created. Without this, zombie slots wait for signatures
119 - /// that will never arrive until TTL cleanup runs (default 6 minutes).
119 + /// that will never arrive until the periodic TTL cleanup sweep runs.
120 fn cancel_task(&self, task_id: &TaskId);
121
122 /// Remove stale tasks older than the given TTL.

⏺ Also fix the same claim in lessons.md:

⏺ Update(.claude/rules/lessons.md)
⎿  Added 1 line, removed 1 line
380
381 ### Cancel BLS aggregation slots on broadcast failure
382
383 -initialize_task creates a DashMap entry in task_states and a oneshot receiver in task_response_receivers. If the subsequent broadcast fails (quorum not reached, operator errors), these slots become zombies waiting for signatures that will never arrive. They persist until the periodic cleanup_stale_tasks TTL sweep (default 6 minutes). Under load with frequent failures, this leaks memory proportional to failure rate x TTL.
383 +initialize_task creates a DashMap entry in task_states and a oneshot receiver in task_response_receivers. If the subsequent broadcast fails (quorum not reached, operator errors), these slots become zombies waiting for signatures that will never arrive. They persist until the periodic cleanup_stale_tasks TTL sweep (TTL = task_response_window_block * block_time * 2, typically 60s local / 12min Sepolia). Under load with frequent failures, this leaks memory proportional to failure rate x TTL.
384
385 Prevention: Call aggregator.cancel_task(&task_id) in every error path that follows initialize_task. Both the two-phase and centralized broadcast error handlers in sync.rs must include this cleanup.
386

⏺ Good. Now amend the commit to include the doc updates (since they're part of the same logical change):

⏺ Bash(git diff --stat)
⎿   crates/gateway/src/task/aggregator.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

⏺ The CLAUDE.md and lessons.md changes aren't showing because they're gitignored (.claude/ is local-only per memory rules). The only
trackable change is the aggregator doc comment fix. Let me commit that:

⏺ Bash(git add crates/gateway/src/task/aggregator.rs && git commit -m "docs: fix cancel_task TTL claim in doc comment" && git push
origin HEAD 2>&1)
⎿  [fix/empty-policy-params-guard 28ea2d5f] docs: fix cancel_task TTL claim in doc comment
1 file changed, 1 insertion(+), 1 deletion(-)
remote:
… +6 lines (ctrl+o to expand)

⏺ Docs-sync complete. Summary:

Files modified:

  • crates/gateway/src/task/aggregator.rs β€” fixed "default 6 minutes" claim to accurate "periodic TTL cleanup sweep" (committed +
    pushed)
  • .claude/rules/lessons.md β€” added two new entries: "Empty policy_params crashes identity data processing" and "Cancel BLS aggregation
    slots on broadcast failure" (local-only)
  • CLAUDE.md + .claude/CLAUDE.md β€” added Common Pitfalls entry for empty policy_params quorum failure (local-only)

Key documentation changes:

  • New Common Pitfalls entry: QuorumNotReached with identity deserialization error when policyParams is empty 0x
  • New lessons learned: empty Bytes from on-chain must always be guarded before parsing; aggregation slots must be cancelled on
    broadcast failure
  • Fixed inaccurate TTL duration claim (was "6 minutes", actual value is config-dependent: 60s local, ~12min Sepolia)

No gaps or ambiguities remaining. The code comments in the diff are accurate and the "why" context is present where needed. No stale
comments were found.

✻ Baked for 2m 31s

❯ /commit-git

⏺ Bash(git status && git diff --stat)
⎿  On branch fix/empty-policy-params-guard
nothing to commit, working tree clean

⏺ Working tree is clean β€” everything is already committed and pushed. The .claude/ changes (lessons.md, CLAUDE.md) are local-only files
that aren't tracked by git (gitignored per project rules).

Current state:

@denniswon
Copy link
Copy Markdown
Author

The Dual Aggregation Path Problem

There are two completely independent ways BLS signatures get aggregated in this codebase:

Path A β€” Channel-based (designed for the watcher/event-driven path):

  1. initialize_task creates a DashMap entry in task_states and a oneshot::Receiver in task_response_receivers
  2. Operators submit signatures via the aggregator's JSON-RPC server (process_signed_task_response)
  3. The RPC handler puts each signature into the DashMap slot
  4. wait_for_aggregation blocks on the oneshot receiver until quorum is reached or timeout

Path B β€” Direct (used by the gateway RPC path β€” both two-phase and centralized):

  1. The gateway broadcasts newt_evaluateAndSign to all operators simultaneously
  2. Operators return signed responses directly in the HTTP response body
  3. The gateway collects these into a Vec
  4. aggregate_responses takes that Vec and aggregates the BLS signatures in one shot β€” no channel, no waiting

The problem: the gateway always calls initialize_task first (to set up operator tracking metadata), then uses Path B
(aggregate_responses). Path A's channel slot is created but never consumed. It sits in the DashMap waiting for signatures that will
never arrive via the RPC path, because the signatures already came back inline in the HTTP responses.

After ~360s, the periodic cleanup_stale_tasks sweep finds the slot, sees 0 signatures processed, and logs "task expired without
reaching quorum" β€” a completely misleading error for a task that actually completed successfully.

Now let me find the TTL calculation.

Searched for 1 pattern (ctrl+o to expand)

⏺ Now let me find where the TTL is calculated β€” the * 2 multiplier.

Searched for 1 pattern (ctrl+o to expand)

⏺ Found it at crates/gateway/src/handler/mod.rs:1417-1419:

// TTL is 2x task response window to ensure tasks are definitely expired
let task_ttl =
    std::time::Duration::from_secs((self.task_response_window_block * self.block_time_ms * 2) / 1000);

The * 2 was a safety margin because zombie slots might still be "in progress" when the cleanup runs. But now that we cancel slots
eagerly on both success and error paths, the only slots that survive to cleanup are truly orphaned (e.g., gateway crash mid-task).
The 2x buffer is unnecessary and just delays cleanup of any edge-case leftovers.
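The effect of dropping the multiplier can be checked with a one-line model of the quoted formula. The window and block-time values below are the ones already cited in this thread (30 blocks; 12s Sepolia blocks, 1s local anvil blocks).

```rust
// Sketch of the quoted TTL formula, with and without the *2 safety margin.
fn task_ttl_secs(window_blocks: u64, block_time_ms: u64, multiplier: u64) -> u64 {
    (window_blocks * block_time_ms * multiplier) / 1000
}

fn main() {
    // With the *2 margin (current code):
    assert_eq!(task_ttl_secs(30, 12_000, 2), 720); // Sepolia: 12 minutes
    assert_eq!(task_ttl_secs(30, 1_000, 2), 60);   // local anvil: 60s
    // Without it (proposed): orphaned slots are swept in half the time.
    assert_eq!(task_ttl_secs(30, 12_000, 1), 360); // Sepolia: 6 minutes
    println!("ok");
}
```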

Read 1 file (ctrl+o to expand)
⎿  Loaded crates/gateway/src/handler/CLAUDE.md

⏺ β˜… Insight ─────────────────────────────────────

  • The * 2 multiplier was a defensive buffer: "maybe the task is still being processed when the cleanup sweep runs." With eager
    cancel_task on all paths, the only slots reaching cleanup are truly abandoned (crash, network partition). The response window itself
    (task_response_window_block * block_time_ms) is already the maximum legitimate task lifetime.
  • Removing * 2 halves the cleanup TTL β€” e.g., from ~720s to ~360s on Sepolia (30 blocks * 12s * 2 β†’ 30 * 12). Any crash-orphaned
    slots get cleaned up faster.
    ─────────────────────────────────────────────────

@denniswon
Copy link
Copy Markdown
Author

⏺ SendTaskParams is defined in newton_prover_chainio::avs::writer. Let me find the struct definition there.

Searched for 1 pattern (ctrl+o to expand)

⏺ Now I have the full picture. Let me also check how batch_create_tasks in the writer uses the Task struct β€” specifically whether the
contract's createTask stores or validates taskCreatedBlock.

Read 1 file (ctrl+o to expand)
⎿  Loaded contracts/src/CLAUDE.md
⎿  Loaded contracts/src/libraries/CLAUDE.md

⏺ Now I have all the constraints. Let me also check the respond path β€” TaskResponseTooLate at line 178:

require(
    blockNumber < task.taskCreatedBlock + responseWindowBlock,
    TaskResponseTooLate(blockNumber, task.taskCreatedBlock, responseWindowBlock)
);

So the two on-chain time constraints are:

  1. Create: block.number - task.taskCreatedBlock <= taskCreationBufferWindow (10 blocks = 120s)
  2. Respond: block.number < task.taskCreatedBlock + taskResponseWindowBlock (30 blocks = 360s)

The respond deadline is relative to taskCreatedBlock. If we refresh taskCreatedBlock at flush time for creates, the respond window
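The two window checks, and how a refreshed taskCreatedBlock shifts the respond deadline, can be modeled as plain block-number arithmetic. The constants below are the ones quoted above (buffer = 10 blocks, response window = 30 blocks); function names are illustrative, not the contract's.

```rust
// Create is valid while the task is within the creation buffer window.
fn create_ok(current_block: u64, task_created_block: u64, buffer: u64) -> bool {
    current_block - task_created_block <= buffer
}

// Respond is valid strictly before taskCreatedBlock + responseWindowBlock.
fn respond_ok(current_block: u64, task_created_block: u64, window: u64) -> bool {
    current_block < task_created_block + window
}

fn main() {
    let created = 1_000;
    assert!(create_ok(1_010, created, 10));   // at the buffer edge: still valid
    assert!(!create_ok(1_011, created, 10));  // one block past: create rejected
    assert!(respond_ok(1_029, created, 30));  // last valid respond block
    assert!(!respond_ok(1_030, created, 30)); // window closed
    // Refreshing taskCreatedBlock at flush time shifts the respond deadline too:
    assert!(respond_ok(1_030, created + 5, 30));
    println!("ok");
}
```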
shifts forward too β€” buying more time for both.
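
The two constraints above can be mirrored as a quick Rust sketch β€” function names are invented for illustration; the real checks are Solidity `require`s in the TaskManager:

```rust
// Mirror of the two on-chain timing checks; names are illustrative only.

/// Create is valid while the task block is at most `buffer_window` blocks old.
fn create_in_window(current_block: u64, task_created_block: u64, buffer_window: u64) -> bool {
    current_block.saturating_sub(task_created_block) <= buffer_window
}

/// Respond must land strictly before the response window closes.
fn respond_in_window(current_block: u64, task_created_block: u64, response_window: u64) -> bool {
    current_block < task_created_block + response_window
}
```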

Now let me understand the critical constraint: BLS aggregation uses taskCreatedBlock for operator set snapshot. If we change it at
flush time, does BLS aggregation still work?

Read 1 file (ctrl+o to expand)

⏺ Now I have the full picture. Here's the critical constraint:

In single-chain mode: aggregator_block == task_created_block_u64 (line 305). BLS signatures are computed against the operator set at
this block. The on-chain respondToTask verifies BLS signatures at task.taskCreatedBlock. These must match.

This means we cannot just change taskCreatedBlock at flush time independently β€” we'd break the BLS signature verification because the
aggregator already used the original block for operator set snapshot.

But wait β€” the create path and respond path are independent batch pipelines. The create batch only needs taskCreatedBlock for the
create validation (block.number - taskCreatedBlock <= buffer). The respond path uses taskCreatedBlock from the on-chain task for BLS
verification. So if we refresh the block at create-flush time, the on-chain task stores the fresh block, and the respond path reads
it from on-chain.

The problem is that BLS aggregation happens before the create lands on-chain β€” using the original task_created_block_u64. But
respondToTask calls checkSignatures with task.taskCreatedBlock from the on-chain task... which would now be the refreshed block, not
the one BLS was aggregated at.

Let me verify this by reading the respond flow.

Searched for 1 pattern, read 1 file (ctrl+o to expand)

⏺ Now let me check verifyTaskResponse in the SourceTaskResponseHandler β€” that's where BLS checkSignatures is called with
task.taskCreatedBlock.

Searched for 4 patterns (ctrl+o to expand)

⏺ Now I have the complete picture. Line 154 is the key:

checkSignatures(message, task.quorumNumbers, uint32(task.taskCreatedBlock), nonSignerStakesAndSignature)

The on-chain respondToTask calls checkSignatures with task.taskCreatedBlock β€” the value stored when createNewTask was called. BLS
aggregation on the gateway also uses task_created_block_u64 as the aggregation block (line 305 of sync.rs in single-chain mode).

These must match. So we can't just change taskCreatedBlock at flush time β€” the BLS signatures would have been computed against a
different operator set snapshot.

Here's the revised approach:

β˜… Insight ─────────────────────────────────────
The constraint chain:

  1. sync.rs sets task_created_block = current_block - offset at RPC time
  2. BLS initialize_task + aggregate_responses use this same block for operator set snapshot
  3. SendTaskParams.task_created_block carries this to the batch queue
  4. On-chain createTask validates block.number - taskCreatedBlock <= bufferWindow
  5. On-chain respondToTask β†’ checkSignatures(msg, quorums, task.taskCreatedBlock, sigs) verifies BLS against same block

So taskCreatedBlock is locked at RPC time because BLS aggregation binds to it. We can't change it later.

The real fix must happen upstream: reduce the time between setting taskCreatedBlock and the batch tx landing on-chain. Options:

  • Smaller batch flush intervals (less queue wait)
  • Larger taskCreationBufferWindow on-chain (more slack)
  • Or both
    ─────────────────────────────────────────────────

You're right that this is fundamentally a timing problem, but I was wrong about Option A β€” we cannot refresh taskCreatedBlock at
flush time because BLS signatures are already computed against the original block. The on-chain checkSignatures call in respondToTask
uses task.taskCreatedBlock, which must match what the aggregator used.

The fix has to be a combination of:

  1. Reduce batch flush interval β€” the default 2000ms accumulation window is the biggest controllable delay. Under load, items can sit
    in the queue for up to 2s before flushing. Cutting this to 500ms or 1000ms directly reduces staleness.
  2. Increase taskCreationBufferWindow beyond 10 β€” given the pipeline: 2s queue + 12-24s mining + retries with backoff + retry mining,
    10 blocks (120s) is tight. Bumping to 20 blocks (240s) gives comfortable headroom.

For the respond path, task_response_window_block = 30 (360s from taskCreatedBlock). The respond pipeline adds: create confirmation
wait (up to 60s timeout) + respond batch flush (2s) + respond tx mining (12-24s) + possible retries. Worst case ~100s on top of the
create latency. Since the respond window starts from taskCreatedBlock (which is already stale by the time create lands), the
effective respond window is 360s - create_pipeline_latency. If create takes 100s of the 120s buffer, only 260s remain for respond β€”
still plenty with 30 blocks.
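
The budget arithmetic above can be sketched as a small helper (purely illustrative; numbers mirror the Sepolia discussion):

```rust
/// Remaining respond budget after the create pipeline has consumed part of it.
/// The respond window starts at taskCreatedBlock (fixed at RPC time), so every
/// second of create-pipeline latency eats directly into it.
fn effective_respond_window_s(window_blocks: u64, block_time_s: u64, create_latency_s: u64) -> u64 {
    (window_blocks * block_time_s).saturating_sub(create_latency_s)
}
```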

Want me to:

  1. Reduce default_batch_interval_ms from 2000 to 500?
  2. Increase taskCreationBufferWindow to 20 on-chain (cast send commands)?
  3. Both?

✻ Cogitated for 2m 19s

❯ let's do 500ms for batch flush interval, and keep creation buffer window to about 120 seconds for all chains (so on L1, 10 blocks,
on other chains like base, let's do 30). Also for is_transient_rpc_error, "// Contract reverts are never transient β€” the same calldata
against the same
// state will always produce the same revert.
if lower.contains("execution reverted") {
return false;
}" what is better way to detect contract reverts?

⏺ Good questions. Let me look at the error type structure first to design better revert detection.

Searched for 1 pattern, read 1 file (ctrl+o to expand)

⏺ Now let me check how alloy errors actually look when they come through β€” specifically the TransportError structure.

Searched for 4 patterns (ctrl+o to expand)

⏺ Now I have a clear picture. Let me check how alloy ContractError flows β€” whether we can do typed matching instead of string matching.

Read 1 file (ctrl+o to expand)

⏺ Good. Now I understand the full error flow. The issue is that by the time errors reach is_transient_rpc_error, they're already
stringified β€” the typed ChainIoError variants are lost. The gateway wraps them in GatewayError::Internal(format!("...")).

β˜… Insight ─────────────────────────────────────
Why string matching is the only option here:
The batch submitter calls task_submitter.batch_create_tasks(), which returns a Result whose error type is GatewayError.
But GatewayError::Internal(String) wraps the full error chain as a formatted string β€” the typed ChainIoError::ContractError is
already lost. We can't do typed pattern matching at the batch submitter level.

Better string detection approach:
Instead of just checking for "execution reverted", we should use a whitelist approach β€” only classify errors as transient if they
match known transient patterns AND don't match any revert indicators. The current code already does this in the right order (check
revert first, then transient patterns), but the revert detection needs to be more comprehensive. Alloy formats reverts in multiple
ways:

  • "execution reverted" β€” standard EVM revert
  • "revert" with hex selector data
  • JSON-RPC error code 3 (EIP-1474 execution error)
  • "error code: 3" in ErrorResp formatting
    ─────────────────────────────────────────────────
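
A self-contained sketch of the revert-first ordering described above β€” the pattern lists here are illustrative, not the full production set:

```rust
/// Revert-first classification: any revert indicator wins over transient
/// patterns, so a message like "Internal server error: ... execution reverted"
/// is never retried.
fn is_transient(error: &str) -> bool {
    let lower = error.to_lowercase();
    let revert_patterns = ["execution reverted", "error code: 3,", "contracterror", "revert 0x"];
    if revert_patterns.iter().any(|p| lower.contains(p)) {
        return false; // deterministic contract revert: retrying cannot help
    }
    let transient_patterns = ["nonce", "timeout", "connection", "429", "503", "rate limit"];
    transient_patterns.iter().any(|p| lower.contains(p))
}
```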

OK, here's the plan. Let me implement all three changes:

  1. Reduce default_batch_interval_ms from 2000 to 500
  2. Improve is_transient_rpc_error with more robust revert detection
  3. Update config JSONs with chain-specific task_creation_buffer_window (10 for L1, 30 for Base)

⏺ Bash(git checkout main && git pull origin main 2>&1 | tail -5)
⎿  Switched to branch 'main'
Your branch is up to date with 'origin/main'.
From github.com:newt-foundation/newton-prover-avs
… +2 lines (ctrl+o to expand)

⏺ Bash(git checkout -b fix/batch-flush-interval-and-revert-detection)
⎿  Switched to a new branch 'fix/batch-flush-interval-and-revert-detection'

⏺ Now let me make the changes. First, reduce the batch flush interval.

Read 1 file (ctrl+o to expand)

⏺ Update(crates/gateway/src/rpc/api/batch_submitter.rs)
⎿  Added 1 line, removed 1 line
86 30
87 }
88 fn default_batch_interval_ms() -> u64 {
89 - 2000
89 + 500
90 }
91 fn default_respond_wait_timeout_ms() -> u64 {
92 60000

⏺ Now improve is_transient_rpc_error in the errors module.

Read 1 file (ctrl+o to expand)

⏺ Good β€” PR #500 proposed an execution reverted guard here. Let me check whether it's already on main.

⏺ Bash(git log --oneline main -5)
⎿  549b74d1 Merge pull request #489 from newt-foundation/feat/kms-to-hpke-migration
68ca2517 Merge pull request #496 from newt-foundation/fix/cancel-zombie-aggregator-slots
2be8b8a3 fix: rename Blocklist→Blacklist enum variant and hash for consistency
… +2 lines (ctrl+o to expand)

⏺ PR #500 hasn't been merged yet. So the current is_transient_rpc_error on main doesn't have the execution reverted guard. Let me write
the improved version that supersedes PR #500's fix.

⏺ Update(crates/chainio/src/avs/errors.rs)
⎿  Added 23 lines
235 ///
236 /// This centralizes the pattern matching previously duplicated across
237 /// batch_submitter.rs, tx_worker.rs, and sync.rs.
238 +///
239 +/// Detection strategy: check for contract revert indicators first (non-transient),
240 +/// then check for known transient patterns. This ordering prevents false positives
241 +/// where a revert error message happens to contain a transient keyword (e.g.,
242 +/// "Internal server error: ... execution reverted" matching "server error").
243 pub fn is_transient_rpc_error(error_str: &str) -> bool {
244 let lower = error_str.to_lowercase();
245 +
246 + // Contract reverts are never transient β€” the same calldata against the same
247 + // on-chain state will always produce the same revert. Alloy formats these in
248 + // multiple ways depending on the error path:
249 + // - "execution reverted" β€” standard EVM revert from eth_call or eth_sendTransaction
250 + // - "revert" with hex selector β€” decoded revert from alloy ContractError
251 + // - "error code: 3" β€” EIP-1474 JSON-RPC execution error code
252 + // - "ContractError" β€” alloy's typed contract error variant
253 + let revert_patterns = [
254 + "execution reverted",
255 + "error code: 3,",
256 + "contracterror",
257 + "revert 0x",
258 + ];
259 + if revert_patterns.iter().any(|p| lower.contains(p)) {
260 + return false;
261 + }
262 +
263 let transient_patterns = [
264 "nonce",
265 "timeout",

⏺ Now update the tests for is_transient_rpc_error.

Read 1 file (ctrl+o to expand)

⏺ Update(crates/chainio/src/avs/errors.rs)
⎿  Added 16 lines, removed 1 line
430 }
431
432 #[test]
433 - fn test_is_transient_rpc_error() {
433 + fn test_is_transient_rpc_error_transient() {
434 assert!(is_transient_rpc_error("nonce too low"));
435 assert!(is_transient_rpc_error("connection refused"));
436 assert!(is_transient_rpc_error("HTTP 429 Too Many Requests"));
437 assert!(is_transient_rpc_error("HTTP 503 Service Unavailable"));
438 assert!(is_transient_rpc_error("request timeout after 30s"));
439 assert!(is_transient_rpc_error("rate limit exceeded"));
440 + }
441 +
442 + #[test]
443 + fn test_is_transient_rpc_error_non_transient() {
444 assert!(!is_transient_rpc_error("TaskAlreadyExists(0x1234)"));
445 assert!(!is_transient_rpc_error("revert 0x2e98c533"));
446 + // Contract reverts containing transient keywords must NOT be classified as transient
447 + assert!(!is_transient_rpc_error(
448 + "Internal server error: Failed to batch create tasks:
449 + Alloy Contract Error TransportError(ErrorResp(execution reverted))"
450 + ));
451 + assert!(!is_transient_rpc_error("execution reverted: TaskCreatedBlockTooOld"));
452 + // EIP-1474 error code 3 = execution error
453 + assert!(!is_transient_rpc_error(
454 + "ErrorResp { code: 3, message: \"execution reverted\", data: Some(\"0xeb790a32\") }"
455 + ));
456 + assert!(!is_transient_rpc_error("Alloy Contract Error ContractError(...)"));
457 }
458 }

⏺ Now update the config files. The user wants a ~120s buffer for all chains: 10 blocks on L1 (12s/block) and 30 blocks on Base
(~2s/block on both Base Sepolia 84532 and Base mainnet 8453). Let me check current values and update.

The stagef config already has task_creation_buffer_window: 10 for all chains. I need to set Base chains to 30.

⏺ Update(contracts/newton_prover_config.stagef.json)
⎿  Added 1 line, removed 1 line
133 "task_response_window_block": 30,
134 "task_challenge_window_block": 30,
135 "epoch_blocks": 7200,
136 - "task_creation_buffer_window": 10,
136 + "task_creation_buffer_window": 60,
137 "is_challenge_enabled": true,
138 "sp1_verifier": "0x0000000000000000000000000000000000000000",
139 "sp1_program_vkey": "0x00b622d22f774a19aaef4ee67535d9c0eb2368c987e2a34fa6b52bdb9502375b"

⏺ Wait β€” Base Sepolia has ~2s block time, so 120s / 2s = 60 blocks, not 30. Let me also update the prod config.

⏺ Update(contracts/newton_prover_config.prod.json)
⎿  Added 1 line, removed 1 line
105 "task_response_window_block": 30,
106 "task_challenge_window_block": 30,
107 "epoch_blocks": 7200,
108 - "task_creation_buffer_window": 10,
108 + "task_creation_buffer_window": 60,
109 "is_challenge_enabled": true,
110 "sp1_verifier": "0x0000000000000000000000000000000000000000",
111 "sp1_program_vkey": "0x00b622d22f774a19aaef4ee67535d9c0eb2368c987e2a34fa6b52bdb9502375b"
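
The per-chain block counts follow from simple ceiling division over the target wall-clock buffer. A hypothetical helper (the real values are hand-set in the config JSONs; this is only illustrative):

```rust
/// Convert a wall-clock buffer target into a per-chain block count.
/// Ceiling division so a chain with an awkward block time never under-budgets.
fn buffer_blocks(target_secs: u64, block_time_secs: u64) -> u64 {
    target_secs.div_ceil(block_time_secs)
}
```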

⏺ Now update the description for TASK_CREATED_BLOCK_TOO_OLD to reflect the new default.

⏺ Update(crates/chainio/src/avs/errors.rs)
⎿  Added 1 line, removed 1 line
43 pub const TASK_CREATED_BLOCK_TOO_OLD: ExternalError = ExternalError {
44 selector: [0xeb, 0x79, 0x0a, 0x32],
45 name: "TaskCreatedBlockTooOld",
46 - description: "taskCreatedBlock is older than currentBlock - bufferWindow (default: 2 blocks)",
46 + description: "taskCreatedBlock is older than currentBlock - bufferWindow (default: 10 blocks on L1)",
47 };
48
49 /// TaskResponseWindowPassed()

⏺ Now let me run the tests and clippy to verify.

Bash(cargo test -p newton-prover-chainio -- errors 2>&1 | tail -20)
⎿  Running… (27s)

@denniswon

PR created: https://github.com/newt-foundation/newton-prover-avs/pull/501

This supersedes PR #500 (the execution reverted guard alone). Summary of changes:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Change β”‚ File β”‚ Impact β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Batch flush interval 2000ms β†’ 500ms β”‚ batch_submitter.rs β”‚ Items spend 75% less time in queue before on-chain β”‚
β”‚ β”‚ β”‚ submission β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Revert-first detection in β”‚ errors.rs β”‚ Covers 4 alloy revert formats, prevents retry loops on β”‚
β”‚ is_transient_rpc_error β”‚ β”‚ contract reverts β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Base Sepolia buffer 10 β†’ 60 blocks β”‚ config.{stagef,prod}.json β”‚ ~120s buffer matching L1 budget (was 20s with 2s β”‚
β”‚ β”‚ β”‚ blocks) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

For the live Base Sepolia contracts, you'll still need the cast send commands to update the on-chain value (config only applies on
next deploy/upgrade):

Base Sepolia stagef

cast send 0xda128a87412be6f2161bb87200f0f40e89744bd9 \
  "updateTaskCreationBufferWindow(uint32)" 60 \
  --rpc-url $BASE_SEPOLIA_RPC_URL --private-key $OWNER_PRIVATE_KEY

Base Sepolia prod

cast send 0xa5e104ad7f09df5d9036d1e9ad60fada11140071 \
  "updateTaskCreationBufferWindow(uint32)" 60 \
  --rpc-url $BASE_SEPOLIA_RPC_URL --private-key $OWNER_PRIVATE_KEY

Sepolia L1 contracts already have 10 on-chain β€” no action needed there.

✻ Worked for 11m 48s

⏺ Background command "Run chainio error tests" completed (exit code 0)

⏺ Background task completed (the earlier test run). Already confirmed passing β€” no action needed.

@denniswon

❯ From this PR https://github.com/newt-foundation/newton-prover-avs/pull/476, we introduced batch task manager. remind me what this
comment by a reviewer was about: https://github.com/newt-foundation/newton-prover-avs/pull/476#discussion_r2998223796, and why our
batch task settlement of transactions are all or nothing mechanism. Also, analyze why we must handle create task transactions and task
response transactions as a separate batch tx, instead of createTask<>respondToTask per task be handled as an atomic unit (maybe we
can have a function on TaskManager that combines createTask and taskResponse to one?). Lastly, analyze if we are actually getting
"anything" out of retries mechanism for batch transactions (should we remove it?).

⏺ Let me pull up the PR comment and the relevant contract/Rust code to answer all four questions.

⏺ Bash(gh api repos/newt-foundation/newton-prover-avs/pulls/476/comments 2>/dev/null | jq '.[] | select(.id == 2998223796) | {user:
.user.login, body: .body, path: .pa…)
⎿  {
"user": "albertnbrown",
"body": "By emitting an event here and not reverting, we are paying for the reverting execution. We need a try/catch to mak
… +8 lines (ctrl+o to expand)

⏺ Bash(gh api repos/newt-foundation/newton-prover-avs/pulls/476/comments --paginate 2>/dev/null | jq '.[] | {id: .id, user:
.user.login, body: .body[0:500], path: .pat…)
⎿  {
"id": 2998223796,
"user": "albertnbrown",
… +47 lines (ctrl+o to expand)

⏺ Now let me read the contract and the batch submitter to build the full picture.

Read 2 files (ctrl+o to expand)
⎿  Loaded crates/CLAUDE.md
⎿  Loaded crates/gateway/CLAUDE.md

⏺ Let me also check the TaskManager to understand the createβ†’respond lifecycle constraints.

Searched for 2 patterns, read 1 file (ctrl+o to expand)
⎿  Loaded contracts/CLAUDE.md
⎿  Loaded contracts/src/CLAUDE.md
⎿  Loaded .claude/rules/solidity.md

⏺ Good, I have all the context. Let me now read the key part of sync.rs to understand the createβ†’respond lifecycle timing.

Searched for 1 pattern (ctrl+o to expand)

⏺ Now I have the full picture. Here's the analysis across all four questions.


  1. Albert's Review Comment (PR #476)

Albert (@albertnbrown) was reviewing BatchTaskManager.sol at the point where the original implementation emitted an event for each
failed item and continued processing. His concern:

β–Ž By emitting an event here and not reverting, we are paying for the reverting execution [gas cost of the failed createNewTask
try/catch] plus the event emission gas β€” and the caller still has to parse events offchain to figure out what failed.

His suggestion was: collect failures into an array, then revert the entire transaction with structured failure data
(BatchPartialFailure(FailedItem[])). The rationale:

  • The caller (Rust gateway) can simulate the batch via eth_call before sending to detect poison items
  • If a single item reverts, the gateway strips it and resubmits the healthy remainder
  • Reverting is cheaper than committing events for failures (no storage writes, no log bloom)
  • The revert data carries FailedItem(index, taskId, reason) β€” richer than events for debugging

This became the "all-or-nothing" design we have today.


  2. Why All-or-Nothing (Revert Entire Batch on Partial Failure)

The contract uses try/catch per item but reverts the entire tx if any item fails. This is intentional for three reasons:

β˜… Insight ─────────────────────────────────────
Nonce simplicity: One tx = one nonce. If the contract committed partial results (some items on-chain, others not), the Rust side
would need to track per-item on-chain status to know what landed. Under 10 gateway replicas with receipt timeouts, this is unreliable
β€” you can't tell from a timed-out receipt which items committed.

Structured error data: A revert payload can carry FailedItem[] with the exact revert bytes per item. If the contract committed
partial results, there's no clean Solidity mechanism to return "these 3 failed" in a successful receipt. Events could work but are
harder/more expensive to parse.

Idempotent convergence: The strip-and-retry loop always converges. TaskAlreadyExists means the item IS on-chain (from a prior
attempt). Strip it, retry the rest. Each retry has fewer items. The gas cost of re-executing healthy items is modest compared to the
complexity of partial-commit bookkeeping.
─────────────────────────────────────────────────

The alternative β€” partial commit (commit successful items, return failure list) β€” would require:

  • Per-item storage writes to track what committed within the batch
  • A way to return partial success data (Solidity has no "return data on partial success")
  • The Rust side to handle partial receipts, which is fragile under receipt timeouts

  3. Why Create and Respond Can't Be Atomic (No createAndRespond Function)

This is the most important architectural question. There are three fundamental reasons why createTask and respondToTask can't be
combined into a single atomic createTaskAndRespond():

A. The BLS Aggregation Happens Between Create and Respond

The lifecycle is:

createTask (on-chain) β†’ operators see NewTaskCreated event β†’
operators evaluate policy β†’ operators BLS-sign β†’
aggregator collects signatures β†’ checks quorum β†’
respondToTask (on-chain, with aggregated BLS signature)

respondToTask requires signatureData β€” the aggregated BLS signature from N operators weighted by stake. This doesn't exist at create
time. There's a multi-second window (typically 2-10s) between create and respond where operators independently evaluate and sign.

A combined function would need the BLS signature at the same time as task creation, which is impossible β€” you can't have operators
sign something that doesn't exist on-chain yet.

B. taskCreatedBlock Is Used for BLS Verification

respondToTask calls verifyTaskResponse which uses checkSignatures(referenceBlockNumber=taskCreatedBlock). This verifies the BLS
signature against the operator set snapshot at the block when the task was created. The task must exist on-chain (with its
taskCreatedBlock assigned) before operators can produce a valid BLS signature referencing that block.

If you tried to combine them in one tx, taskCreatedBlock would be block.number of the combined tx. But operators would have needed to
sign against this exact block number BEFORE the tx was mined β€” a circular dependency.

C. Challenge Window Semantics

respondToTask checks block.number < task.taskCreatedBlock + taskResponseWindowBlock. The response must arrive within a time window
after creation. If both happen atomically, the window check is trivially satisfied (same block), which defeats the purpose of giving
operators time to evaluate. More importantly, challengers monitor NewTaskCreated events to independently re-evaluate β€” combining
would eliminate the challenge observation window.

β˜… Insight ─────────────────────────────────────
The create/respond split isn't a design choice we can reconsider β€” it's a fundamental property of BLS-aggregated AVS systems. The
signature aggregation phase between create and respond IS the decentralized computation. Collapsing it would mean either (a) a single
party produces the "aggregated" signature (centralized, defeats the point), or (b) you'd need a precommit scheme where operators
sign a future task, which introduces new trust assumptions and complexity.

The batch optimization lives at the edges: batch N creates into one tx, batch M responds into one tx. But each individual
create→respond pair inherently spans multiple blocks.
─────────────────────────────────────────────────


  4. Are Retries Actually Helping? Should We Remove Them?

Let me look at what the load test data actually showed about retries.

Searched for 4 patterns (ctrl+o to expand)

⏺ Now I have the full timeline. Here's the complete analysis:


  4. Are Retries Actually Helping? (Analysis from Load Test Data)

The load test timeline tells a clear story:

16:44:19 β€” Batch 1: submitted (1 task) β†’ succeeded at 16:44:27 (8s)
16:44:49 β€” Batch 2: submitted (1 task) β†’ receipt timeout at 16:45:50 (60s)
16:45:50 β€” Retry 1: same items β†’ receipt timeout at 16:46:50 (60s)
16:46:50 β€” Retry 2: simulate individually β†’ ALL items = poison (TaskAlreadyExists) β†’ dropped
16:47:51 β€” Batch 3: new items accumulated β†’ receipt timeout at 16:48:52 (60s)
16:48:52 β€” Retry: simulate individually β†’ ALL items = poison β†’ dropped
16:49:52 β€” Batch 4: receipt timeout β†’ retry at 16:50:54 β†’ simulate β†’ ALL poison β†’ dropped

The pattern: One batch succeeds cleanly. Every subsequent batch times out on receipt, retries once (another 60s timeout), then
simulates individually β€” where every item is TaskAlreadyExists because the FIRST attempt actually DID land on-chain. The simulation
treats TaskAlreadyExists as poison and drops everything.

What retries accomplished in this load test

Zero successful recoveries from retries. The retry mechanism consumed 60s per attempt (receipt timeout), only to discover that the
original tx had already committed. The simulation fallback then misclassified every item as poison.

Why retries fail for batch creates specifically

The core issue is that get_receipt() timeout does NOT mean the tx failed β€” it means we don't know. On Sepolia with variable block
times (12-36s) and RPC load, receipt polling exceeds 60s regularly. The tx is likely in the mempool or already mined. Retrying
resubmits identical calldata with a new nonce, which:

  1. If the first tx landed: every item reverts with TaskAlreadyExists
  2. If the first tx is still pending: nonce conflict (the second tx either replaces or queues behind it)

Neither outcome is recoverable via retry.

β˜… Insight ─────────────────────────────────────
Retries make sense for truly transient errors (RPC connection refused, 503, nonce-too-low from a stale cache). These are instant
failures β€” no ambiguity about whether the tx landed.

Retries are counterproductive for receipt timeouts. A receipt timeout means the tx was successfully submitted to the RPC node but we
couldn't confirm inclusion. The tx is SchrΓΆdinger's transaction β€” it may or may not be on-chain. Retrying with fresh calldata creates
a race condition that always ends in TaskAlreadyExists or nonce contention.

The fix in PR #505 (treating TaskAlreadyExists as idempotent success) mitigates the damage β€” retries no longer drop items. But the
60s timeout per retry attempt is still pure waste.
─────────────────────────────────────────────────

Should we remove retries entirely?

No β€” but we should differentiate retry strategy by error type:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Error Type β”‚ Current Behavior β”‚ Recommended β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ RPC connection error β”‚ Retry with backoff β”‚ Keep β€” these recover β”‚
β”‚ (instant fail) β”‚ β”‚ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Nonce too low (instant β”‚ Retry with backoff β”‚ Keep β€” nonce refresh fixes it β”‚
β”‚ fail) β”‚ β”‚ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Receipt timeout (60s β”‚ Retry with backoff β”‚ Don't retry the batch. Instead, check on-chain status of each item, β”‚
β”‚ ambiguous) β”‚ (another 60s) β”‚ then resubmit only unconfirmed items β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Contract revert (instant β”‚ Simulate individually β”‚ Keep β€” strips poison β”‚
β”‚ fail) β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The ideal receipt-timeout handler would be:

  1. Receipt times out β†’ don't retry the whole batch
  2. For each item, query allTaskHashes[taskId] on-chain (a cheap eth_call)
  3. Items already on-chain β†’ insert into confirmed_creates, trigger respond
  4. Items NOT on-chain β†’ resubmit in a new batch

This avoids the 60s retry timeout entirely and handles the ambiguity directly. But this is a bigger refactor β€” for now, PR #505's
idempotent handling means retries at least converge correctly (items get classified as already_exists instead of poison), even if
they waste 60s doing it.
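
A sketch of that reconciliation step β€” `is_on_chain` stands in for the cheap per-task allTaskHashes eth_call, and all names are hypothetical:

```rust
/// Reconcile a timed-out batch against actual on-chain state instead of
/// blindly retrying. Returns (confirmed, to_resubmit): confirmed items go to
/// the respond path, unconfirmed items go into a fresh batch.
fn reconcile_receipt_timeout(
    batch: Vec<String>,
    is_on_chain: impl Fn(&str) -> bool,
) -> (Vec<String>, Vec<String>) {
    batch.into_iter().partition(|task_id| is_on_chain(task_id))
}
```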

Quantifying the waste

In the load test: 4 receipt timeouts Γ— 60s each = 240 seconds of pure retry overhead that produced zero new on-chain results. With PR
#505's fix, those 240 seconds would at least end with items correctly classified as idempotent success instead of dropped. But the
latency cost to the user (who's waiting on the sync RPC call with a 30s timeout) means those tasks already timed out from the
caller's perspective.

Summary

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Question β”‚ Answer β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Are retries helping today? β”‚ No β€” 0 successful recoveries in load test β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Should we remove retries? β”‚ No β€” keep for instant-fail transient errors (connection, nonce) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Should we retry receipt timeouts? β”‚ No β€” should query on-chain status instead of blind retry β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Does PR #505 help? β”‚ Yes β€” retries now converge correctly instead of dropping items β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Longer term fix? β”‚ Receipt timeout β†’ on-chain status check β†’ resubmit only missing items β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

✻ Sautéed for 2m 7s

❯ for createAndResponse --> im not saying we should change the protocol, but we can add a "helper" function that just makes it 1
transaction (with same args/params) instead of 2 (create vs respond). that won't help?

⏺ You're right β€” I was thinking about it wrong. Let me reconsider.

By the time the gateway queues items to the batch submitter, it already has everything: the task params, the BLS aggregated
signature, and the response. The offchain BLS aggregation is complete. The only reason create and respond are separate on-chain calls
is that respondToTask reads state written by createNewTask:

// respondToTask checks this:
require(TaskLib.taskHash(task) == allTaskHashes[taskId], ...)

But within a single EVM transaction, state written earlier in the same execution IS visible to later calls. So a helper like:

function createAndRespondToTask(
    Task calldata task,
    TaskResponse calldata taskResponse,
    bytes calldata signatureData
) external onlyAuthorized {
    taskManager.createNewTask(task); // writes allTaskHashes[taskId]
    taskManager.respondToTask(task, taskResponse, signatureData); // reads it
}

...would work. Both operations execute in the same block.number, and:

  • taskCreatedBlock < block.number β€” satisfied because gateway sets taskCreatedBlock = current_block - offset (offset >= 1)
  • taskCreatedBlock + taskCreationBufferWindow >= block.number β€” satisfied (same constraint as today)
  • checkSignatures(referenceBlockNumber = taskCreatedBlock) β€” the BLS signature was already computed offchain against this block
  • block.number <= taskCreatedBlock + taskResponseWindowBlock β€” trivially satisfied (same block)

β˜… Insight ─────────────────────────────────────
This is actually a significant architectural simplification. The entire confirmed_creates DashMap, the separate create/respond
accumulators, the ordering invariant, and the TaskAlreadyExists cascade β€” all of that complexity exists because create and respond
are separate transactions. Combining them eliminates the root cause of most failures we saw in load testing.

What it fixes:

  • No more ordering invariant: No need for confirmed_creates DashMap to gate respond flush
  • Half the nonce usage: 1 tx per task instead of 2 β†’ halves nonce contention
  • No receipt-timeout cascade: A timeout on the combined tx means we check on-chain status once, not twice
  • No TaskAlreadyExists β†’ respond blocked: The respond can't be orphaned from its create
  • ~30% gas savings: One base tx cost (21K) instead of two; one batch call overhead instead of two
    ─────────────────────────────────────────────────

The batched version would be:

function batchCreateAndRespondToTasks(
    Task[] calldata tasks,
    TaskResponse[] calldata responses,
    bytes[] calldata signatureDataArray
) external onlyAuthorized {
    require(tasks.length > 0, EmptyBatch());
    require(tasks.length == responses.length && responses.length == signatureDataArray.length, ArrayLengthMismatch());

    FailedItem[] memory failures = new FailedItem[](tasks.length);
    uint256 failCount;

    for (uint256 i; i < tasks.length;) {
        try this._createAndRespond(tasks[i], responses[i], signatureDataArray[i]) {}
        catch (bytes memory reason) {
            failures[failCount] = FailedItem(i, responses[i].taskId, reason);
            unchecked { ++failCount; }
        }
        unchecked { ++i; }
    }

    if (failCount > 0) {
        FailedItem[] memory trimmed = new FailedItem[](failCount);
        for (uint256 j; j < failCount;) { trimmed[j] = failures[j]; unchecked { ++j; } }
        revert BatchPartialFailure(trimmed);
    }
}

function _createAndRespond(
    Task calldata task,
    TaskResponse calldata taskResponse,
    bytes calldata signatureData
) external {
    require(msg.sender == address(this), Unauthorized()); // only callable via try/catch from batch
    taskManager.createNewTask(task);
    taskManager.respondToTask(task, taskResponse, signatureData);
}

On the Rust side, the BatchTransactionSubmitter simplifies dramatically β€” a single accumulator instead of two, no confirmed_creates
DashMap, no respond-waits-for-create ordering logic.

The only thing to watch: if createNewTask succeeds but respondToTask reverts within the _createAndRespond call, the entire item
reverts (including the create). That's actually better than today β€” you never get a zombie created-but-not-responded task. The item
either fully succeeds or fully fails.

Want me to implement this?

@denniswon
Author

 Combined Batch Create+Respond via BatchTaskManager

 Context

 Load testing on Sepolia (10 gateway replicas, 20 concurrent tasks) showed a cascading failure pattern: receipt timeouts on batch
 creates trigger retries, items already on-chain cause TaskAlreadyExists reverts, the simulation fallback misclassifies them as
 poison, and the respond accumulator is blocked because confirmed_creates never gets populated. Root cause: create and respond are
 separate transactions with a confirmed_creates DashMap ordering bridge between them.

 By the time the gateway queues items to the batch submitter, ALL data is ready β€” task params, BLS aggregated signature, and
 response. The offchain BLS aggregation is already complete. There is no protocol-level reason to submit create and respond as
 separate on-chain transactions. Within a single EVM transaction, state written by createNewTask (allTaskHashes[taskId]) is
 immediately visible to respondToTask.

 What This Eliminates

 - The confirmed_creates DashMap and its ordering invariant
 - The separate respond_accumulator_loop and respond_tx channel
 - The TaskAlreadyExists β†’ respond-blocked cascade (the primary load test failure)
 - Half the nonce usage (1 tx per task instead of 2)
 - The BatchRespondItem struct and submit_aggregated_response batch path
 - Receipt timeout retry cascade between create and respond

 Contract Constraints (Why This Works)

 - taskCreatedBlock < block.number β€” satisfied because gateway sets taskCreatedBlock = current_block - offset (offset >= 1)
 - taskCreatedBlock + taskCreationBufferWindow >= block.number β€” same constraint as today
 - checkSignatures(referenceBlockNumber = taskCreatedBlock) β€” BLS signature already computed offchain against this block
 - block.number <= taskCreatedBlock + taskResponseWindowBlock β€” trivially satisfied (same block)
 - allTaskHashes[taskId] written by createNewTask β€” visible within same EVM execution to respondToTask

 If createNewTask succeeds but respondToTask reverts within the same item, the entire item reverts (including the create). This is
 better than today β€” no zombie created-but-not-responded tasks.

 Files to Modify

 Solidity (contract + interface + bindings)

 1. contracts/src/BatchTaskManager.sol β€” Add batchCreateAndRespondToTasks + internal _createAndRespond
 2. contracts/src/interfaces/IBatchTaskManager.sol β€” Add interface method
 3. Run make generate-bindings to regenerate Rust ABI bindings

 Rust (gateway)

 4. crates/chainio/src/avs/writer.rs β€” Add batch_create_and_respond_to_tasks method on AvsWriter
 5. crates/gateway/src/task/submitter.rs β€” Add batch_create_and_respond_to_tasks to TaskSubmitter trait
 6. crates/gateway/src/rpc/api/batch_submitter.rs β€” Replace dual accumulators with single combined accumulator
 7. crates/gateway/src/rpc/api/sync.rs β€” Replace submit_create_task + submit_aggregated_response with single submit_task call

 Mock/test impls (add stub method)

 8. 8 TaskSubmitter impls β€” Add batch_create_and_respond_to_tasks stub
 9. Existing batch_submitter tests β€” Update for combined item type

 Implementation Steps

 Step 1: Add batchCreateAndRespondToTasks to BatchTaskManager contract

 File: contracts/src/BatchTaskManager.sol

 /// @inheritdoc IBatchTaskManager
 function batchCreateAndRespondToTasks(
     INewtonProverTaskManager.Task[] calldata tasks,
     INewtonProverTaskManager.TaskResponse[] calldata responses,
     bytes[] calldata signatureDataArray
 ) external onlyAuthorized {
     require(tasks.length > 0, EmptyBatch());
     require(
         tasks.length == responses.length && responses.length == signatureDataArray.length,
         ArrayLengthMismatch()
     );

     FailedItem[] memory failures = new FailedItem[](tasks.length);
     uint256 failCount;

     for (uint256 i; i < tasks.length;) {
         try this._createAndRespond(tasks[i], responses[i], signatureDataArray[i]) {}
         catch (bytes memory reason) {
             failures[failCount] = FailedItem(i, responses[i].taskId, reason);
             unchecked { ++failCount; }
         }
         unchecked { ++i; }
     }

     if (failCount > 0) {
         FailedItem[] memory trimmed = new FailedItem[](failCount);
         for (uint256 j; j < failCount;) {
             trimmed[j] = failures[j];
             unchecked { ++j; }
         }
         revert BatchPartialFailure(trimmed);
     }
 }

 /// @dev Internal helper called via this._createAndRespond() so try/catch works per item.
 ///      Must be external for try/catch but restricted to self-calls only.
 function _createAndRespond(
     INewtonProverTaskManager.Task calldata task,
     INewtonProverTaskManager.TaskResponse calldata taskResponse,
     bytes calldata signatureData
 ) external {
     require(msg.sender == address(this), Unauthorized());
     taskManager.createNewTask(task);
     taskManager.respondToTask(task, taskResponse, signatureData);
 }

 Update contracts/src/interfaces/IBatchTaskManager.sol to add:
 function batchCreateAndRespondToTasks(
     INewtonProverTaskManager.Task[] calldata tasks,
     INewtonProverTaskManager.TaskResponse[] calldata responses,
     bytes[] calldata signatureDataArray
 ) external;

 Step 2: Regenerate Rust bindings

 make generate-bindings

 This produces updated crates/core/src/generated/batch_task_manager.rs with the new function.

 Step 3: Add batch_create_and_respond_to_tasks to AvsWriter

 File: crates/chainio/src/avs/writer.rs

 Follow the same pattern as existing batch_create_tasks and batch_respond_to_tasks:
 - ABI-convert task/response types via encode/decode round-trip
 - Call contract.batchCreateAndRespondToTasks(tasks, responses, sigs).send().await
 - Wrap get_receipt() in tokio::time::timeout(BATCH_RECEIPT_TIMEOUT)
 - Check receipt.status()
 - On revert: attempt parse_batch_partial_failure()
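The receipt-timeout wrapping in the bullets above can be modeled with std channels. This is a simplified stand-in: the real code wraps an alloy-style get_receipt() future in tokio::time::timeout, and `BATCH_RECEIPT_TIMEOUT` here is an arbitrary illustrative value.

```rust
use std::sync::mpsc;
use std::time::Duration;

/// Simplified receipt wait: Ok(receipt) if it arrives within the deadline,
/// Err(()) on timeout β€” mirroring tokio::time::timeout around get_receipt().
fn wait_for_receipt(rx: mpsc::Receiver<String>, deadline: Duration) -> Result<String, ()> {
    rx.recv_timeout(deadline).map_err(|_| ())
}
```

The important behavior is that a timeout is an ambiguous outcome: the transaction may still have landed, which is exactly why the follow-up logic must check on-chain state rather than blindly retry.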

 Step 4: Add to TaskSubmitter trait

 File: crates/gateway/src/task/submitter.rs

 async fn batch_create_and_respond_to_tasks(
     &self,
     batch_contract_addr: Address,
     tasks: Vec<Task>,
     responses: Vec<TaskResponse>,
     signature_data_array: Vec<Bytes>,
 ) -> Result<TransactionReceipt>;

 Keep existing batch_create_tasks and batch_respond_to_tasks β€” they remain useful for edge cases (direct on-chain tasks from
 watcher path only need respond, legacy fallback).

 Step 5: Rewrite BatchTransactionSubmitter with single accumulator

 File: crates/gateway/src/rpc/api/batch_submitter.rs

 New combined item type:
 struct BatchCombinedItem {
     task_id: TaskId,
     params: SendTaskParams,                           // for building Task
     task_request: TaskRequest,                        // for building respond params
     service_response: BlsAggregationServiceResponse,  // aggregated BLS data
     result_tx: oneshot::Sender<Result<(TransactionReceipt, Task)>>,
     queued_at: Instant,
 }

 New public API β€” single method replaces two:
 pub async fn submit_task(
     &self,
     task_id: TaskId,
     params: SendTaskParams,
     task_request: TaskRequest,
     service_response: BlsAggregationServiceResponse,
 ) -> Result<Receiver<Result<(TransactionReceipt, Task)>>>

 Single accumulator loop replaces create_accumulator_loop + respond_accumulator_loop:
 - Collects items, flushes on deadline (max age) or batch size
 - At flush time: calls aggregator.build_respond_params() for each item to get (Task, TaskResponse, Bytes) triples
 - Submits via task_submitter.batch_create_and_respond_to_tasks()
 - On success: sends Ok((receipt, task)) through each item's oneshot
 - On failure: same strip-and-retry as current flush_create_batch_with_retry
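The flush policy described above (flush on batch size or max age, whichever comes first) can be sketched independently of the channel plumbing. `Accumulator` and its thresholds are illustrative, not the gateway's real types.

```rust
use std::time::{Duration, Instant};

/// Minimal accumulator: flush when the batch reaches `max_size` or the
/// oldest queued item has waited at least `max_age` (checked each tick).
struct Accumulator<T> {
    items: Vec<(T, Instant)>,
    max_size: usize,
    max_age: Duration,
}

impl<T> Accumulator<T> {
    fn new(max_size: usize, max_age: Duration) -> Self {
        Self { items: Vec::new(), max_size, max_age }
    }

    fn push(&mut self, item: T) {
        self.items.push((item, Instant::now()));
    }

    /// Drains and returns the batch if either flush condition holds.
    fn maybe_flush(&mut self, now: Instant) -> Option<Vec<T>> {
        let oldest_expired = self
            .items
            .first()
            .map_or(false, |(_, queued)| now.duration_since(*queued) >= self.max_age);
        if self.items.len() >= self.max_size || oldest_expired {
            Some(self.items.drain(..).map(|(item, _)| item).collect())
        } else {
            None
        }
    }
}
```

This shape keeps flush timing deterministic: a single deadline per batch, with no second accumulator to coordinate against.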

 Remove:
 - confirmed_creates: Arc<DashMap<TaskId, ()>> β€” no longer needed
 - respond_tx: mpsc::Sender<BatchRespondItem> β€” no longer needed
 - respond_accumulator_loop() β€” replaced by combined loop
 - flush_respond_batch() β€” logic merged into combined flush
 - BatchRespondItem struct
 - submit_aggregated_response() method on BatchTransactionSubmitter
 - submit_create_task() method β€” replaced by submit_task()
 - respond_notify channel between create and respond loops
 - mark_task_created() β€” no longer needed (was for watcher path pre-registration)

 Keep:
 - flush_create_batch_with_retry logic (reuse for combined flush, rename to flush_batch_with_retry)
 - identify_poison_items (adapts to combined items)
 - Panic isolation via catch_unwind
 - Deadline-based flush timing

 Step 6: Update sync.rs caller

 File: crates/gateway/src/rpc/api/sync.rs

 Current flow (two-step):
 Line 240-249: submit_create_task(task_id, params) β†’ Receiver
 Line 614:     tokio::spawn background task
 Line 626:     await task_creation_rx (wait for create receipt)
 Line 745-747: batch.submit_aggregated_response(task_id, task_request, service_response)

 New flow (single-step):
 After BLS aggregation completes:
   batch.submit_task(task_id, params, task_request, service_response) β†’ Receiver
   tokio::spawn background task:
     await result_rx (wait for combined receipt)
     On success: log, done
     On TaskAlreadyExists+TaskAlreadyResponded: idempotent success
     On error: cancel_task, cleanup

 The submit_task call moves AFTER BLS aggregation (around line 560-580), not at line 240 where submit_create_task currently lives.
 This is because service_response isn't available until aggregation completes.

 The early submit_create_task at line 240 was there so the create could be pipelined with BLS aggregation. With the combined
 approach, the create waits for BLS aggregation to complete first. The added latency is minimal: BLS aggregation takes 2-5s, and
 the batch accumulator adds ~1s max age. The overall task latency increases by at most the batch flush interval (1s), but we save
 one entire transaction and eliminate the cascading failure mode.

 Step 7: Handle watcher path (direct on-chain tasks)

 The watcher path (handler/mod.rs:watcher_event_handler) creates tasks directly on-chain (not via gateway RPC). These tasks already
  exist on-chain β€” they only need respond, not create. The current code calls mark_task_created() to pre-register in
 confirmed_creates.

 With the combined approach, the watcher path should continue using batch_respond_to_tasks (respond-only). Keep
 submit_aggregated_response specifically for the watcher path, using the existing batch_respond_to_tasks contract method. This
 means the respond accumulator loop stays but ONLY serves watcher-originated tasks.

 Alternative: if watcher volume is low, use aggregator.submit_aggregated_response() directly (non-batched) for watcher tasks. This
 avoids maintaining a second accumulator for a rare path.

 Step 8: Update mock TaskSubmitter implementations (8 impls)

 Add batch_create_and_respond_to_tasks returning Ok(dummy_receipt) or appropriate test response to all 8 impls:
 - SuccessSubmitter (batch_submitter.rs test)
 - FailingSubmitter (batch_submitter.rs test)
 - PoisonDetectingSubmitter (batch_submitter.rs test)
 - StallTaskSubmitter (tx_worker.rs test)
 - MockAvsWriter (chain_routing.rs test)
 - MockAvsWriter (common/mod.rs test)
 - MockAvsWriter (send_task.rs test)
 - DummySubmitter (secrets_encrypted.rs test)

 Step 9: Update batch_submitter tests

 Existing tests (test_batch_create_success, test_batch_create_with_poison_items, test_batch_create_panic_isolation,
 test_batch_submit_multiple_items) use submit_create_task. Update to use submit_task with the combined item type. Test scenarios:
 - Happy path: all items create+respond in one tx
 - Poison item: one item fails create (stripped, rest succeed)
 - TaskAlreadyExists: prior attempt landed (treated as idempotent success)
 - Panic isolation: panic in build_respond_params doesn't kill accumulator

 Failure Handling

 Same three-category model, now applied to combined items:

 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚     Category      β”‚                        Selector / Pattern                        β”‚                 Action                 β”‚
 β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 β”‚ Idempotent        β”‚ TaskAlreadyExists (0x2e98c533) or TaskAlreadyResponded           β”‚ Strip from batch, send success to      β”‚
 β”‚ success           β”‚ (0x68905dff)                                                     β”‚ caller                                 β”‚
 β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 β”‚ Non-recoverable   β”‚ TaskMismatch, BLS verification failure, ABI decode failure       β”‚ Strip from batch, send error to caller β”‚
 β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 β”‚ Transient         β”‚ RPC timeout, nonce conflict, connection error                    β”‚ Retry whole remaining batch with       β”‚
 β”‚                   β”‚                                                                  β”‚ backoff                                β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

 Key difference from current design: if createNewTask succeeds but respondToTask fails within the same _createAndRespond call, the
 entire item reverts including the create. This is better β€” no zombie tasks.
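The three-category model in the table maps onto a simple partition at flush time. The enum and classifier below are illustrative; the real code derives the category from revert selectors and transport errors.

```rust
/// Illustrative outcome classification for one batch item.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Outcome {
    IdempotentSuccess, // TaskAlreadyExists / TaskAlreadyResponded
    NonRecoverable,    // TaskMismatch, BLS verification failure, ABI decode failure
    Transient,         // RPC timeout, nonce conflict, connection error
}

/// Split a failed batch: idempotent items complete successfully, poison items
/// error out to their callers, and only transient items are retried.
fn partition_batch<T: Clone>(items: &[(T, Outcome)]) -> (Vec<T>, Vec<T>, Vec<T>) {
    let mut succeed = Vec::new();
    let mut fail = Vec::new();
    let mut retry = Vec::new();
    for (item, outcome) in items {
        match outcome {
            Outcome::IdempotentSuccess => succeed.push(item.clone()),
            Outcome::NonRecoverable => fail.push(item.clone()),
            Outcome::Transient => retry.push(item.clone()),
        }
    }
    (succeed, fail, retry)
}
```

The point of the partition is that a retry batch never contains an item whose fate is already known β€” the cause of the poison cascades seen in the load test.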

 Migration Path

 This is a contract change, so it requires:
 1. Deploy updated BatchTaskManager on all chains (Sepolia, Base Sepolia, mainnet when ready)
 2. The old batchCreateTasks and batchRespondToTasks methods remain β€” backward compatible
 3. Rust gateway uses the new method when available, falls back to separate calls if the contract doesn't have it

 For initial rollout, we can deploy the new contract first, then merge the Rust changes.

 Verification

 1. forge test β€” Solidity tests for new batchCreateAndRespondToTasks method
 2. make generate-bindings β€” Rust bindings regenerate cleanly
 3. make fmt && make clippy β€” no warnings
 4. cargo test -p newton-prover-gateway --lib β€” all gateway unit tests pass
 5. make gateway_e2e β€” single-chain E2E
 6. make e2e scenario=basic β€” unified E2E (production config path)

@denniswon
Author

Load Test Analysis

Consensus Mode Segmentation

All 15 tasks used centralized mode (centralized_unified_quorum). Zero two-phase consensus or threshold tasks were in this load test.
So the failures are entirely in the batch submission pipeline, not consensus-mode-specific.

Timeline Reconstruction

Phase 1: Success (23:24:55 – 23:26:39) β€” 9 tasks across 7 batch flushes, all succeed

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Batch Flush β”‚ Items β”‚ Time to Confirm β”‚ Outcome β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:24:55 β”‚ 2 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:09 β”‚ 1 β”‚ 22s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:31 β”‚ 1 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:40 β”‚ 1 β”‚ 13s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:09 β”‚ 2 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:25 β”‚ 1 β”‚ 15s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:39 β”‚ 1 β”‚ ... β”‚ cascade starts here β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Phase 2: Cascade Begins (23:26:39 – 23:26:54) β€” Batch 8 (1 item)

  1. 23:26:39 β€” Batch 8 flushed (1 item for task 0xaf9d...)
  2. 23:26:54 β€” TransportError(NullResp) β€” RPC returned null (likely Sepolia node hiccup)
  3. Retry β†’ gas estimation call β†’ contract reverts (task already on-chain from the first attempt that got NullResp)
  4. could not parse revert data β€” parse_batch_partial_failure returns None (the known TODO)
  5. Simulation detects idempotent success β†’ 2 items stripped as "already on-chain"
  6. No healthy items remaining β†’ batch complete. These 2 tasks actually succeeded (idempotent).

Phase 3: Fatal Cascade (23:26:54 – 23:30:57) β€” Batch 9 (1 item) and Batch 10 (5 items)

Batch 9 (1 item, flushed 23:26:54):

  • 23:27:54 β€” Receipt timeout (60s) β€” tx likely dropped from mempool
  • Classified as transient β†’ retry
  • 23:28:55 β€” Receipt timeout again (60s)
  • Classified as transient β†’ retry
  • 23:28:55 β€” gas estimation reverts (item on-chain from prior attempt)
  • could not parse revert data β†’ simulation
  • Simulation classifies as poison (not idempotent) β†’ 1 "combined create+respond failed"
  • Batch 10 (5 items) flushed at 23:28:56

Batch 10 (5 items, flushed 23:28:56):

  • 23:29:56 β€” Receipt timeout (60s)
  • Classified as transient β†’ retry
  • 23:29:56 β€” gas estimation reverts
  • could not parse revert data β†’ simulation
  • 23:30:57 β€” ALL 5 items classified as poison β†’ 5 "combined create+respond failed"

Phase 4: Zombie Aggregators (23:30:51 – 23:34:06) β€” All 15 BLS aggregator slots expire

All 15 tasks had BLS aggregation slots that were never cancelled. They expired after 360s (6 min), each having processed 0
signatures. The "failed to send task expired" error on 6 of them means the oneshot::Receiver was already dropped (the caller in
sync.rs moved on after batch failure).

Root Causes (Ordered by Severity)

  1. parse_batch_partial_failure always returns None (writer.rs:795)
    The as_revert_data() API exists and is already used at writer.rs:1091, but the batch parser is a TODO stub. This means every contract
    revert during retry falls through to simulation, which is slow and unreliable.

  2. Simulation misclassifies TaskAlreadyExists inside BatchPartialFailure as poison
    When identify_combined_poison_items tests a single item, the contract wraps TaskAlreadyExists inside BatchPartialFailure. The
    is_task_already_exists() check sees BatchPartialFailure, not TaskAlreadyExists, and classifies the item as poison instead of
    idempotent.

  3. Receipt timeout (60s) is classified as transient and retried with same items
    After timeout, the first tx may have already landed. Retrying the same batch guarantees a contract revert. The system should
    pre-check on-chain state before retrying after a timeout.

  4. BLS aggregator slots are never cancelled on batch failure
    The cancel_task call exists in sync.rs error paths, but the batch submitter communicates failure via the oneshot channel. If the
    sync.rs caller has already timed out or dropped the receiver, cancel_task is never called. The slots linger for 360s consuming
    memory.

  5. Second batch (5 items) flushed while prior batch is still retrying
    The accumulator loop flushes new items even though the prior batch is stuck in the 60s receipt timeout loop. These 5 items enter the
    same failing retry cascade.
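Root causes 1 and 2 both reduce to inspecting revert selectors, including selectors nested inside a BatchPartialFailure payload. A byte-level sketch follows; the two task selectors are taken from the failure-handling table above, but the BatchPartialFailure selector and the payload scan are placeholders β€” real code should ABI-decode the FailedItem[] array rather than scan raw bytes.

```rust
// Selectors from the failure-handling table; the batch selector is a placeholder.
const TASK_ALREADY_EXISTS: [u8; 4] = [0x2e, 0x98, 0xc5, 0x33];
const TASK_ALREADY_RESPONDED: [u8; 4] = [0x68, 0x90, 0x5d, 0xff];
const BATCH_PARTIAL_FAILURE: [u8; 4] = [0xde, 0xad, 0xbe, 0xef]; // placeholder, not the real selector

/// True if the revert data signals idempotent success, either at the top
/// level or nested inside a BatchPartialFailure payload. A real
/// implementation would ABI-decode FailedItem[] instead of scanning bytes.
fn is_idempotent_revert(data: &[u8]) -> bool {
    if data.len() < 4 {
        return false;
    }
    if data[..4] == TASK_ALREADY_EXISTS || data[..4] == TASK_ALREADY_RESPONDED {
        return true;
    }
    if data[..4] == BATCH_PARTIAL_FAILURE {
        // Look for a nested idempotent selector anywhere in the payload.
        return data[4..]
            .windows(4)
            .any(|w| w == TASK_ALREADY_EXISTS || w == TASK_ALREADY_RESPONDED);
    }
    false
}
```

This is the check the simulation path was missing: it saw BatchPartialFailure at the top level and never looked inside for TaskAlreadyExists.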

Quantitative Summary

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric β”‚ Value β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Total tasks β”‚ 15 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Successfully on-chain β”‚ 9 (60%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Idempotent success (recovered) β”‚ 2 (13%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Failed (misclassified as poison) β”‚ 6 (40%) β€” all recoverable β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Zombie aggregator slots β”‚ 15 (100%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Receipt timeouts β”‚ 4 (across 2 batch retry cycles) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Time wasted in retry loops β”‚ ~4 min β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The core issue hasn't changed from the prior session: implementing parse_batch_partial_failure would have prevented 100% of the
failures. The 6 failed tasks were all on-chain already β€” they were just misidentified as poison because the revert data couldn't be
decoded.

Want me to proceed with implementing the fixes? The priority order would be:

  1. Implement parse_batch_partial_failure (the single-function fix that unblocks the whole chain)
  2. Fix simulation to detect TaskAlreadyExists nested inside BatchPartialFailure
  3. Add pre-check on-chain state before retry after receipt timeout
  4. Ensure cancel_task is called from the batch submitter on final failure

@denniswon
Author

❯ we need to fix all these issues. proceed. But we need much more fundamental fix to prevent Time wasted in retry loops (not
acceptable). we should leave NO TODOs at all in regards to this problem related items. Review this part of the codebase for onchain
transaction submission handling from the first principles and perform a thorough analysis on the architecture/design/code/logic before
implementing. /grill-me

⏺ Skill(grill-me)
⎿  Successfully loaded skill

⏺ Let me start by thoroughly reading the current batch submission pipeline to understand the architecture from first principles before
asking questions.

⏺ Explore(Read batch submission pipeline)
⎿  Done (56 tool uses · 143.1k tokens · 2m 16s)
(ctrl+o to expand)

⏺ Good, I have the full architecture. Now let me read the exact code for the critical paths to ground my understanding.

Read 2 files (ctrl+o to expand)
⎿  Loaded crates/CLAUDE.md
⎿  Loaded crates/chainio/src/CLAUDE.md
⎿  Loaded crates/chainio/src/avs/CLAUDE.md
⎿  Loaded crates/gateway/CLAUDE.md
⎿  Loaded crates/gateway/src/CLAUDE.md
⎿  Loaded crates/gateway/src/rpc/CLAUDE.md
⎿  Loaded crates/gateway/src/rpc/api/CLAUDE.md

⏺ Now let me read the identify_combined_poison_items function and the error type flow.

Read 2 files (ctrl+o to expand)

⏺ Now let me check how the error flows across the trait boundary β€” the TaskSubmitter wraps ChainIoError into GatewayError, which is the
key information loss point.

Read 2 files (ctrl+o to expand)
⎿  Loaded crates/gateway/src/task/CLAUDE.md

⏺ Now I have the complete picture of the architecture. Here's my first question:


Question 1: The TaskSubmitter trait erases ChainIoError into GatewayError::Internal(String). Should we change the trait's return type
to preserve structured errors?

The fundamental design flaw is here in submitter.rs:105-114:

async fn batch_create_and_respond_to_tasks(...) -> Result<TransactionReceipt, GatewayError> {
    AvsWriter::batch_create_and_respond_to_tasks(...)
        .await
        .map_err(|e| GatewayError::Internal(format!("Failed to batch create and respond: {e}")))
}

This .map_err(format!(...)) destroys ALL structured error data. The ChainIoError::BatchPartialFailure { failures } with per-item
indices and reasons becomes the string "Failed to batch create and respond: Batch partial failure: 3 items failed". Then
extract_failed_indices in batch_submitter.rs tries to recover that data from the string β€” and obviously can't.

My recommendation: Change the TaskSubmitter trait's batch methods to return Result<TransactionReceipt, ChainIoError> directly instead
of Result<TransactionReceipt, GatewayError>. The batch submitter is the only consumer of these methods, and it needs the structured
error data. This is a one-layer change β€” batch_submitter.rs already imports ChainIoError types via newton_prover_chainio. The
GatewayError wrapping would move to the callers that actually need it (sync.rs), not happen at the trait boundary.

The alternative is to implement parse_batch_partial_failure in writer.rs so the structured data is decoded before it crosses the
trait boundary β€” but that only solves one error type. The trait-level fix solves all current and future structured error needs.
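To make the information loss concrete, here is a minimal self-contained sketch (the enum is a hypothetical stand-in, not the real ChainIoError definition) of what the format! erasure destroys versus what a structured return type preserves:

```rust
// Hypothetical stand-in for the real ChainIoError; only the shape matters here.
#[derive(Debug)]
enum ChainIoError {
    BatchPartialFailure { failed_indices: Vec<usize> },
    Rpc(String),
}

// What the trait boundary does today: collapse everything into a string.
fn erase(e: ChainIoError) -> String {
    format!("Failed to batch create and respond: {e:?}")
}

// What the proposed boundary allows: recover per-item indices directly.
fn failed_indices(e: &ChainIoError) -> Option<&[usize]> {
    match e {
        ChainIoError::BatchPartialFailure { failed_indices } => Some(failed_indices.as_slice()),
        _ => None,
    }
}

fn main() {
    // The string form can only be recovered by fragile substring matching...
    let err = ChainIoError::BatchPartialFailure { failed_indices: vec![1, 3] };
    assert!(erase(err).contains("BatchPartialFailure"));
    // ...while the enum form hands the batch submitter the exact indices.
    let err = ChainIoError::BatchPartialFailure { failed_indices: vec![1, 3] };
    assert_eq!(failed_indices(&err), Some(&[1usize, 3][..]));
}
```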

Which approach do you prefer?

✻ Crunched for 3m 5s

❯ agree with your recommendation

⏺ Question 2: Receipt timeout currently returns ChainIoError::CreateNewTaskCallFail (a string). After a timeout, the tx may have landed
on-chain. Should the batch submitter check on-chain state before retrying?

Currently at writer.rs:759-763:
let receipt = timeout(Self::BATCH_RECEIPT_TIMEOUT, pending.get_receipt())
    .await
    .map_err(|_| ChainIoError::CreateNewTaskCallFail {
        reason: "batch create+respond receipt timed out after 60s".to_string(),
    })?

This is classified as transient by is_transient_rpc_error (matches "timeout"), so the batch submitter retries the same items. But the
tx may have already been mined β€” the receipt poll just didn't catch it. The retry sends a duplicate tx, which reverts with
BatchPartialFailure(TaskAlreadyExists), triggering the broken simulation fallback.

My recommendation: Two changes working together:

  1. Return a distinct error variant β€” ChainIoError::TransactionTimeout { timeout_secs: 60 } (already exists in the enum but unused).
    This lets the batch submitter handle timeouts differently from other transient errors.
  2. In the batch submitter, on receipt timeout: instead of blind retry, call identify_combined_poison_items immediately. This
    simulates each item as a single-item batch. Items already on-chain get TaskAlreadyExists β†’ idempotent success. Items not on-chain
    pass simulation β†’ healthy, re-batched and retried. This eliminates the entire 60sΓ—3 retry cascade (the "4 minutes wasted" from the
    load test).

The alternative is adding an on-chain eth_call check per item (e.g., querying allTaskHashes[taskId]), but simulation already does
this implicitly and handles edge cases like partially-landed batches. The simulation path just needs the TaskAlreadyExists detection
to work β€” which it will once we fix the parse_batch_partial_failure at the writer level.

Does this approach make sense, or do you want a direct on-chain state check instead?

❯ what is your recommendation? is simulation an overkill compared to just eth call?

⏺ Good pushback. Let me think about this concretely.

Simulation = call batch_create_and_respond_to_tasks([single_item]) per item. This is a full contract call that tries to execute
create+respond, consuming significant gas estimation compute on the RPC node. For N items, that's N separate eth_calls, each encoding
the full batch contract ABI with task/response/signature data.

Direct on-chain check = call taskManager.allTaskHashes(taskId) per item. This is a single storage slot read β€” trivially cheap, ~200
gas equivalent, returns bytes32(0) if not created or the hash if created.

For the combined path, if allTaskHashes(taskId) != bytes32(0), the task was created AND responded (atomic), so it's idempotent
success. If it's zero, the task never landed β€” safe to retry.

My revised recommendation: Direct on-chain check via allTaskHashes. It's:

  • 10-100x cheaper per item than simulation
  • Semantically precise (answers exactly "did this task land?")
  • No false positives from unrelated revert reasons
  • Can check all items in a single multicall or sequential loop

We'd add a method to TaskSubmitter like is_task_on_chain(task_id) -> bool that reads allTaskHashes, and call it on receipt timeout
before deciding whether to retry or declare idempotent success.
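A minimal sketch of that timeout path (TaskId and the hash type are simplified stand-ins; the real read would go through the generated allTaskHashes binding rather than a pre-fetched slice):

```rust
type TaskId = u64;        // simplified stand-in for the real task identifier
type TaskHash = [u8; 32]; // allTaskHashes value: zero means "never landed"

/// Partition batch items after a receipt timeout: a nonzero stored hash means
/// the atomic create+respond already landed (idempotent success); a zero hash
/// means the item never landed and is safe to re-batch and retry.
fn partition_after_timeout(items: &[(TaskId, TaskHash)]) -> (Vec<TaskId>, Vec<TaskId>) {
    let (landed, retry): (Vec<_>, Vec<_>) = items
        .iter()
        .copied()
        .partition(|(_, hash)| *hash != [0u8; 32]);
    (
        landed.into_iter().map(|(id, _)| id).collect(),
        retry.into_iter().map(|(id, _)| id).collect(),
    )
}

fn main() {
    let mut mined = [0u8; 32];
    mined[31] = 0xab;
    let (landed, retry) = partition_after_timeout(&[(1, mined), (2, [0u8; 32])]);
    assert_eq!(landed, vec![1]);
    assert_eq!(retry, vec![2]);
}
```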

Simulation stays as the last-resort fallback only for non-timeout contract reverts where we can't parse BatchPartialFailure (which
should be rare once we implement the parser).

❯ sounds good

⏺ Question 3: The identify_combined_poison_items simulation sends actual on-chain transactions (not eth_call). Each "simulation" is a
real batch_create_and_respond_to_tasks call that goes through .send().await and get_receipt(). Is this intentional?

Looking at batch_submitter.rs:648-655:
let sim_result = task_submitter
    .batch_create_and_respond_to_tasks(
        batch_contract_addr,
        vec![task.clone()],
        vec![response.clone()],
        vec![sig.clone()],
    )
    .await;

This calls the real AvsWriter::batch_create_and_respond_to_tasks, which calls .send().await β€” submitting an actual transaction, not a
read-only eth_call. If the task is NOT on-chain, the "simulation" actually creates+responds to it on-chain as a side effect. If the
task IS on-chain, the .send() call reverts during gas estimation (before tx submission), so no tx is sent.

This means:

  • For items that are already on-chain: reverts at gas estimation β†’ correctly classified
  • For healthy items: actually submits a real tx β†’ succeeds, but now you have N individual txs instead of one batch

My recommendation: This is actually fine functionally (items get submitted either way), but it's wasteful β€” defeats the purpose of
batching. However, since we're moving to direct allTaskHashes checks for receipt timeouts, simulation will only be reached for
non-timeout contract reverts where parse_batch_partial_failure fails. Once we implement the parser, simulation becomes a truly rare
last-resort fallback.

So: keep simulation as-is for now, but it should almost never be reached after our fixes. The priority path becomes: parse structured
error β†’ strip β†’ retry. Timeout path becomes: check allTaskHashes β†’ classify β†’ retry healthy only.

Agree with keeping simulation as the rare fallback, or do you want to convert it to use eth_call instead of real transactions?

❯ which one is better? which one do you suggest?

⏺ Converting to eth_call is strictly better. Here's why:

Real tx submission (current):

  • Healthy items get submitted as individual txs (N txs instead of 1 batch β€” wastes gas and nonce space)
  • Each goes through the 60s receipt timeout path, potentially cascading further
  • If 2 of 5 items are poison, the 3 healthy ones get submitted individually rather than re-batched

eth_call simulation (proposed):

  • Zero on-chain side effects β€” purely diagnostic
  • Fast (no mempool, no receipt wait β€” just gas estimation)
  • After classification, healthy items stay in the batch for a single re-batched submission
  • No nonce consumption

My recommendation: Convert identify_combined_poison_items to use eth_call via a new TaskSubmitter method like
simulate_batch_create_and_respond(single_item) -> Result<(), ChainIoError>. Under the hood, this calls .call().await instead of
.send().await on the contract instance. The classification logic stays identical β€” we just use the return value to determine healthy
vs poison vs already-exists without actually submitting.

This also eliminates a subtle bug: if simulation succeeds for a "healthy" item (submitting a real tx), and then the re-batched retry
also includes it, you get a TaskAlreadyExists from the item you just accidentally submitted during simulation.

❯ sounds good

⏺ Question 4: The batch accumulator loop flushes new items while a prior batch is stuck in the retry/timeout loop. Should we block
accumulation during flush, or decouple them?

From the load test: Batch 9 (1 item) entered a 60sΓ—2 receipt timeout loop. Meanwhile, 5 new items completed BLS aggregation and
queued up. When batch 9 finally failed at 23:28:56, those 5 items were immediately flushed as batch 10 β€” which then hit the same
cascading timeout pattern.

The current architecture runs the accumulator and flush in the same tokio::spawn task. While flush_combined_batch_with_retry is
await-ing receipt timeouts (up to 180s with 3 retries), the mpsc channel buffers incoming items. When flush returns, the loop
immediately drains the channel and flushes the next batch.

My recommendation: This is actually the correct design β€” we should NOT block accumulation. Items need somewhere to queue while flush
is in progress. The real fix is making flush fast:

  1. Receipt timeout β†’ immediate allTaskHashes check (seconds, not 60s retry)
  2. parse_batch_partial_failure works β†’ instant strip-and-retry (no simulation)
  3. Simulation (rare fallback) uses eth_call β†’ fast, no receipt wait

With these fixes, the worst-case flush time drops from 180s+ (3Γ— receipt timeout) to ~5-10s (one timeout + quick on-chain check + one
retry). The accumulator loop stays as-is β€” the problem was never the architecture, it was the slow error recovery paths.

One small addition: we should cap the number of items that accumulate during a flush to prevent unbounded growth. The channel is
already bounded (max_create_batch_size * 2 = 240), which provides backpressure. That's sufficient.

@denniswon

❯ await-ing receipt timeouts (up to 180s with 3 retries) --> this is way too long. definitely remove and retries should be just 2 for
batches. agreed overall, but double evaluate your suggestion/recommendation

⏺ You're right β€” let me re-evaluate.

The 60s receipt timeout itself is the problem. On Sepolia, block time is 12s. A legitimate transaction should be mined within 1-2
blocks (12-24s). If we haven't gotten a receipt in 30s, the tx is almost certainly either:

  1. Dropped from mempool (nonce replaced, gas too low)
  2. Mined but the RPC node is lagging

Waiting 60s before checking is just dead time. And with 3 retries, that's potentially 180s of waiting for nothing.

My revised recommendation:

  1. Reduce BATCH_RECEIPT_TIMEOUT to 30s β€” still generous for Sepolia (2.5 blocks). On L2s like Base (2s block time), 30s is 15 blocks
    β€” more than enough.
  2. Max retries = 2 for batch methods (down from 3). After the first timeout, we check allTaskHashes. If items are on-chain β†’
    idempotent success. If not β†’ one more retry with fresh nonce. If that also fails, items are failed out.
  3. On receipt timeout, don't retry the batch blindly. The sequence becomes:
    - Timeout fires after 30s
    - Check allTaskHashes for each item in the batch (~1-2s)
    - Items on-chain β†’ idempotent success
    - Items NOT on-chain β†’ re-batch and retry once (attempt 2)
    - If attempt 2 also times out β†’ fail remaining items

Worst case: 30s timeout + 2s check + 30s retry = ~62s. Current worst case: 180s+. That's a 3x improvement, and the common case (items
landed but receipt timed out) resolves in ~32s.

The max_retries = 2 config applies to the batch submitter config, not the writer-level timeout. The writer always does a single
attempt with timeout β€” retries are orchestrated by the batch submitter.
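As a sketch, the agreed budget reduces to a small decision function (constants mirror the numbers above; NextStep is a hypothetical name, not an existing type):

```rust
const BATCH_RECEIPT_TIMEOUT_SECS: u64 = 30; // down from 60s
const MAX_BATCH_ATTEMPTS: u32 = 2;          // down from 3

#[derive(Debug, PartialEq)]
enum NextStep {
    IdempotentSuccess, // task found on-chain after the timeout
    Retry,             // not on-chain, retry budget remains
    FailItem,          // not on-chain, budget exhausted
}

/// Decide what to do with one batch item after a receipt timeout.
/// `attempt` is 1-based: attempt 1 may retry once; attempt 2 fails out.
fn after_receipt_timeout(attempt: u32, on_chain: bool) -> NextStep {
    if on_chain {
        NextStep::IdempotentSuccess
    } else if attempt < MAX_BATCH_ATTEMPTS {
        NextStep::Retry
    } else {
        NextStep::FailItem
    }
}

fn main() {
    // Worst case: 30s timeout + ~2s check + 30s retry = ~62s, vs 180s+ before.
    assert!(BATCH_RECEIPT_TIMEOUT_SECS * MAX_BATCH_ATTEMPTS as u64 + 2 <= 62);
    assert_eq!(after_receipt_timeout(1, true), NextStep::IdempotentSuccess);
    assert_eq!(after_receipt_timeout(1, false), NextStep::Retry);
    assert_eq!(after_receipt_timeout(2, false), NextStep::FailItem);
}
```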

❯ sounds good.

⏺ Question 5: The parse_batch_partial_failure implementation. The as_revert_data() API is already proven at writer.rs:1091. Should we
decode BatchPartialFailure at the writer level (before the trait boundary) or at the batch submitter level?

Looking at the two options:

Option A: Decode at writer level (current stub location, writer.rs:795):

  • parse_batch_partial_failure uses error.as_revert_data() to get raw bytes
  • Checks for 0x19ed9977 selector, decodes FailedItem[] via BatchPartialFailure::abi_decode_raw
  • Returns Some(Vec) with per-item index, task_id, and nested revert reason
  • The existing Err(ChainIoError::BatchPartialFailure { failures }) path already works β€” it just never gets populated because the
    parser returns None

Option B: Decode at batch submitter level:

  • Would require passing the raw alloy::contract::Error through the trait somehow
  • Trait returns Result<_, ChainIoError> (per Question 1), but ContractError is already a variant

My recommendation: Option A β€” decode at the writer level. This is the natural location because:

  1. The writer has direct access to alloy::contract::Error and as_revert_data()
  2. The ChainIoError::BatchPartialFailure variant already exists with the right fields
  3. With the Question 1 fix (trait returns ChainIoError), the batch submitter gets the structured BatchPartialFailure { failures }
    directly
  4. extract_failed_indices in batch_submitter becomes trivial: just match on the ChainIoError::BatchPartialFailure variant and extract
    indices

The nested revert reason bytes inside each FailedItem.reason should also be decoded where possible β€” check for TaskAlreadyExists
(0x2e98c533), TaskAlreadyResponded (0x68905dff), and other known selectors. This lets strip_combined_by_indices classify each
stripped item correctly (idempotent vs poison) without simulation.

@denniswon

Excellent point. Let me think about this more carefully.

The root of the cascading failure was ambiguity in error classification. When the system couldn't definitively identify an error, it
fell back to heuristics (string matching, simulation) that guessed wrong. We need to eliminate guessing entirely.

The definitive approach: Every error that crosses the writer β†’ batch submitter boundary must be exhaustively classified at the source
β€” the writer level β€” where we have full access to the raw error type.

Here's the concrete design:

At writer.rs, the batch_create_and_respond_to_tasks method currently has two exit paths:

  1. .send().await succeeds β†’ pending.get_receipt() β†’ either success or timeout
  2. .send().await fails β†’ ContractError (gas estimation revert)

For path 2, instead of the current parse_batch_partial_failure || fallback to ContractError, we do exhaustive error decoding:

let revert_data = error.as_revert_data();
match revert_data {
    None => ChainIoError::ContractError(error), // not a revert (RPC/transport error)
    Some(data) if data.is_empty() => ChainIoError::CreateNewTaskCallFail {
        reason: "empty revert (assembly-level failure)",
    },
    Some(data) if data.len() < 4 => ChainIoError::CreateNewTaskCallFail {
        reason: "malformed revert data (< 4 bytes)",
    },
    Some(data) => {
        let selector: [u8; 4] = data[..4].try_into().unwrap();
        match selector {
            BatchPartialFailure::SELECTOR => /* decode FailedItem[], return BatchPartialFailure */,
            TaskAlreadyExists::SELECTOR => /* return new variant TaskAlreadyExists */,
            TaskAlreadyResponded::SELECTOR => /* return new variant TaskAlreadyResponded */,
            TaskCreatedBlockTooOld::SELECTOR => /* return specific variant */,
            _ => /* return ContractRevert { selector, data } β€” preserves raw bytes */,
        }
    }
}

Key principle: the selector is always available and always definitive. We decode what we recognize, and for anything we don't
recognize, we preserve the raw selector + data in a new ChainIoError::ContractRevert { selector: [u8; 4], data: Bytes } variant. No
string matching. No guessing. No fallback to simulation for classification.
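A self-contained sketch of this dispatch (the selectors are the ones quoted in this discussion; the FailedItem[] ABI decode is elided, and the enum is illustrative rather than the final ChainIoError shape):

```rust
// Selectors quoted earlier in this thread.
const BATCH_PARTIAL_FAILURE: [u8; 4] = [0x19, 0xed, 0x99, 0x77];
const TASK_ALREADY_EXISTS: [u8; 4] = [0x2e, 0x98, 0xc5, 0x33];
const TASK_ALREADY_RESPONDED: [u8; 4] = [0x68, 0x90, 0x5d, 0xff];

#[derive(Debug, PartialEq)]
enum Decoded {
    BatchPartialFailure(Vec<u8>), // payload still needs FailedItem[] decoding
    TaskAlreadyExists,
    TaskAlreadyResponded,
    UnknownRevert { selector: [u8; 4] },
    MalformedRevert,
}

/// Exhaustive, selector-first classification of raw revert data.
/// No string matching: unrecognized selectors are preserved, never guessed at.
fn decode_revert(data: &[u8]) -> Decoded {
    if data.len() < 4 {
        return Decoded::MalformedRevert;
    }
    let selector: [u8; 4] = data[..4].try_into().expect("length checked above");
    match selector {
        BATCH_PARTIAL_FAILURE => Decoded::BatchPartialFailure(data[4..].to_vec()),
        TASK_ALREADY_EXISTS => Decoded::TaskAlreadyExists,
        TASK_ALREADY_RESPONDED => Decoded::TaskAlreadyResponded,
        other => Decoded::UnknownRevert { selector: other },
    }
}

fn main() {
    assert_eq!(decode_revert(&[0x2e, 0x98, 0xc5, 0x33]), Decoded::TaskAlreadyExists);
    assert_eq!(decode_revert(&[]), Decoded::MalformedRevert);
    assert_eq!(
        decode_revert(&[0xde, 0xad, 0xbe, 0xef]),
        Decoded::UnknownRevert { selector: [0xde, 0xad, 0xbe, 0xef] }
    );
}
```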

Then the BatchPartialFailure payload itself contains FailedItem[] where each item has a reason: bytes β€” that's the nested revert from
each individual _createAndRespond call. We apply the same selector-based decoding to each nested reason:

for failure in &failures {
    let nested_selector = &failure.reason[..4];
    match nested_selector {
        TaskAlreadyExists::SELECTOR => classification = Idempotent,
        TaskAlreadyResponded::SELECTOR => classification = Idempotent,
        _ => classification = Poison(selector, data),
    }
}

This way the batch submitter never needs to string-match, simulate, or guess. Every item is classified definitively by its 4-byte
selector at the writer level.

For receipt timeout (path 1): we already agreed β€” direct allTaskHashes check, not guessing.

Simulation becomes truly dead-letter: only reachable if somehow as_revert_data() returns None for a contract revert (which would be
an alloy bug, not our code). We can log a warning and still fall through to eth_call-based simulation as the absolute last resort,
but it should never fire in practice.

Does this exhaustive selector-based decoding approach address your concern?

@denniswon

Question 6: We need new ChainIoError variants for the exhaustive classification. Here's the proposed enum expansion:

Currently the batch path can produce these ChainIoError variants:

  • CreateNewTaskCallFail { reason: String } β€” catch-all string bucket
  • BatchPartialFailure { failures } β€” structured but never populated
  • ContractError(alloy::contract::Error) β€” raw alloy error
  • AlloyProviderError β€” provider-level error

Proposed additions:

// New variants in ChainIoError:

/// Task already exists on-chain (idempotent β€” prior attempt landed)
TaskAlreadyExists { task_id: Option<B256> },

/// Task already responded (idempotent β€” prior create+respond landed)
TaskAlreadyResponded { task_id: Option<B256> },

/// Contract reverted with a recognized but non-recoverable error
ContractRevert { selector: [u8; 4], data: Bytes },

/// Receipt wait timed out β€” tx may or may not be on-chain
ReceiptTimeout { timeout_secs: u64 },

And update BatchPartialFailure.failures items to carry a classification enum instead of raw bytes:

pub enum BatchItemFailure {
    AlreadyExists,
    AlreadyResponded,
    Revert { selector: [u8; 4], data: Bytes },
    Unknown { data: Bytes },
}

pub struct BatchFailedItem {
    pub index: usize,
    pub task_id: B256,
    pub failure: BatchItemFailure,
}

This means extract_failed_indices in the batch submitter becomes a simple pattern match β€” no string parsing, no simulation. Each
failed item is already classified.

My recommendation: This is the right set of variants. ReceiptTimeout replaces the current CreateNewTaskCallFail { reason: "...timed
out..." } string hack. TaskAlreadyExists/TaskAlreadyResponded surface directly instead of being detected via string matching.
ContractRevert catches everything else with preserved data.

The task_id on TaskAlreadyExists/TaskAlreadyResponded is Option<B256> because when a whole batch reverts with these, we may
not know which specific task β€” but for BatchPartialFailure nested reasons, we always know the task_id from the FailedItem.
Agree with these variants?

@denniswon

Good question. Let me evaluate both approaches for check_tasks_on_chain.

Sequential reads vs Multicall

Context: After a receipt timeout on a batch of N items, we need to check allTaskHashes[taskId] for each item to determine which tasks
actually landed on-chain.

Option A: Sequential reads

for item in &items {
    let hash = task_manager.allTaskHashes(item.task_id).call().await?;
    landed.push(hash != B256::ZERO);
}

Pros:

  • Simple, no extra dependency
  • Each call is independent β€” partial failure doesn't lose results
  • alloy already has the generated binding for allTaskHashes

Cons:

  • N RPC round-trips (one per task in the batch)
  • At batch size 120, that's 120 sequential calls after a timeout β€” adding 10-30s latency on Sepolia
  • Compounds the timeout problem we're trying to fix

Option B: Multicall3

let multicall = Multicall3::new(provider, MULTICALL3_ADDR);
let calls: Vec<_> = items.iter().map(|item| Call3 {
    target: task_manager_addr,
    callData: allTaskHashesCall { taskId: item.task_id }.abi_encode(),
    allowFailure: true,
}).collect();
let results = multicall.aggregate3(calls).call().await?;

Pros:

  • Single RPC round-trip regardless of batch size
  • 120 items checked in ~200ms instead of ~30s
  • Multicall3 is deployed at a deterministic address on all EVM chains (including anvil with --auto-impersonate)

Cons:

  • Needs Multicall3 address constant (but it's 0xcA11bde05977b3631167028862bE2a173976CA11 on every chain)
  • Slightly more complex ABI encoding
  • If Multicall3 isn't deployed (unlikely β€” it's on 99%+ of chains), entire check fails

Option C: eth_call batch simulation (middle ground)

// Simulate the entire batch via eth_call β€” if it succeeds, nothing is on-chain yet.
// If it reverts with BatchPartialFailure, decode which items exist.
let result = batch_contract
    .batchCreateAndRespondToTasks(tasks, responses, sigs)
    .call()
    .await;

This is what we already planned for simulate_batch_create_and_respond. It doubles as the on-chain check β€” items that revert with
TaskAlreadyExists are confirmed on-chain.

Pros:

  • Single RPC call
  • Reuses existing contract ABI (no Multicall3 dependency)
  • Gives us BOTH information: what's on-chain AND what would succeed if resubmitted

Cons:

  • Heavier than a simple storage read (runs full EVM execution per item)
  • Gas estimation overhead (~400K per item Γ— 120 = 48M gas simulated)

My recommendation: Option C (eth_call simulation)

Here's why β€” we already need simulate_batch_create_and_respond for the poison detection path. After a receipt timeout, the question
isn't just "which tasks landed?" β€” it's "which tasks can be resubmitted?" The eth_call simulation answers both in one call:

  • Items that revert with TaskAlreadyExists (0x2e98c533) β†’ confirmed on-chain, send idempotent success
  • Items that revert with TaskAlreadyResponded (0x68905dff) β†’ confirmed on-chain AND responded, send idempotent success
  • Items that succeed β†’ not on-chain, safe to retry
  • Items that revert with other errors β†’ poison, strip them

This means we don't need a separate check_tasks_on_chain method at all. The receipt timeout path becomes:

timeout β†’ simulate_batch(remaining_items) β†’ classify each item β†’ retry/succeed/fail

One method, one RPC call, complete information.
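A sketch of the per-item dispatch this implies (SimOutcome and Action are hypothetical names summarizing what the eth_call simulation would report per item; they are not existing types in the crate):

```rust
#[derive(Debug, PartialEq)]
enum SimOutcome {
    WouldSucceed,         // item is not on-chain; resubmission would pass
    TaskAlreadyExists,    // revert selector 0x2e98c533
    TaskAlreadyResponded, // revert selector 0x68905dff
    OtherRevert([u8; 4]), // any other selector: genuinely poison
}

#[derive(Debug, PartialEq)]
enum Action {
    Retry,             // keep in the re-batched submission
    IdempotentSuccess, // already landed; notify the caller as success
    StripAsPoison,     // remove from the batch and fail the item
}

/// Map one simulated item outcome to the batch submitter's action.
fn action_for(outcome: &SimOutcome) -> Action {
    match outcome {
        SimOutcome::WouldSucceed => Action::Retry,
        SimOutcome::TaskAlreadyExists | SimOutcome::TaskAlreadyResponded => {
            Action::IdempotentSuccess
        }
        SimOutcome::OtherRevert(_) => Action::StripAsPoison,
    }
}

fn main() {
    assert_eq!(action_for(&SimOutcome::WouldSucceed), Action::Retry);
    assert_eq!(action_for(&SimOutcome::TaskAlreadyExists), Action::IdempotentSuccess);
    assert_eq!(
        action_for(&SimOutcome::OtherRevert([0xde, 0xad, 0xbe, 0xef])),
        Action::StripAsPoison
    );
}
```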

β˜… Insight ─────────────────────────────────────
The key insight is that eth_call simulation is strictly more informative than allTaskHashes reads. A storage read tells you "this
task exists on-chain." A simulation tells you "this task exists on-chain AND here's exactly what would happen if you tried to
resubmit it." Since we need the simulation for poison detection anyway, using it for the timeout path too eliminates an entire method
from the trait surface.
─────────────────────────────────────────────────

So the revised TaskSubmitter trait change is:

// Add ONE new method (not two):
async fn simulate_batch_create_and_respond(
    &self,
    batch_contract_addr: Address,
    // element types match the existing batch_create_and_respond_to_tasks signature
    tasks: Vec<_>,
    responses: Vec<_>,
    signature_data_array: Vec<_>,
) -> Result<(), ChainIoError>;
// Returns Ok(()) if all would succeed, BatchPartialFailure if some fail,
// or other ChainIoError for RPC issues

// Change return type on existing method:
async fn batch_create_and_respond_to_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
// (was Result<TransactionReceipt, GatewayError>)

Does this approach make sense, or do you still want the separate check_tasks_on_chain method?

⏺ Waiting for your response on Q7 β€” whether Option C (single simulate_batch_create_and_respond method that handles both timeout
recovery and poison detection) works, or if you'd prefer separate methods.

✻ Cooked for 1m 23s

@denniswon

⏺ Plan is written. Let me exit plan mode so you can review and approve before implementation begins.

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Ready to code?

Here is Claude's plan:
β•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œ
Harden Batch Transaction Submission Pipeline

Context

PR #507 (feat/batch-create-and-respond) merged and deployed to stagef+prod. Load testing (15 tasks, centralized mode, single gateway)
showed 60% success (9/15) with 6 recoverable tasks misclassified as poison. Root cause chain:

  1. TransportError(NullResp) on batch 8 triggers retry
  2. Receipt timeout (60s x 3 retries = 180s) blocks accumulator
  3. parse_batch_partial_failure() returns None (stub) β€” falls to simulation
  4. Simulation sends REAL transactions (not eth_call), misclassifies TaskAlreadyExists inside BatchPartialFailure as poison
  5. 6 tasks that were actually on-chain get failed to callers
  6. 15 zombie BLS aggregator slots linger 360s

Every failure in this cascade is recoverable. The protocol is correct β€” the error handling pipeline loses information at each layer.

Design Decisions (from grill-me interview)

β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ # β”‚ Decision β”‚ Rationale β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1 β”‚ TaskSubmitter returns Result<_, ChainIoError> β”‚ Stop losing structured error data via format!() β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2 β”‚ Receipt timeout β†’ eth_call simulation β”‚ Determine which items landed on-chain without blind β”‚
β”‚ β”‚ β”‚ retry β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 3 β”‚ Poison detection via eth_call (not real txs) β”‚ Current simulation sends real transactions β€” wasteful β”‚
β”‚ β”‚ β”‚ and dangerous β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 4 β”‚ Keep accumulator architecture, timeout=30s, max_retries=2 β”‚ Accumulator is sound; reduce wasted time in retry loops β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 5 β”‚ Decode BatchPartialFailure at writer level β”‚ Use as_revert_data() (proven at writer.rs:1091) β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 6 β”‚ Exhaustive selector-based decoding β”‚ No string matching, no guessing, no fallback defaults β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 7 β”‚ Single simulate_batch method for both timeout recovery and β”‚ One eth_call gives complete classification β”‚
β”‚ β”‚ poison detection β”‚ β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Files to Modify

Layer 1: Error types and classification

  1. crates/chainio/src/error.rs β€” Add new ChainIoError variants, expand is_transient()
  2. crates/chainio/src/avs/errors.rs β€” Add exhaustive selector constants and classify_batch_item_error()

Layer 2: Writer (on-chain interaction)

  1. crates/chainio/src/avs/writer.rs β€” Implement parse_batch_partial_failure(), reduce timeout, add
    simulate_batch_create_and_respond()

Layer 3: TaskSubmitter trait bridge

  1. crates/gateway/src/task/submitter.rs β€” Change return types to ChainIoError, add simulate_batch_create_and_respond()

Layer 4: Batch submitter (retry logic)

  1. crates/gateway/src/rpc/api/batch_submitter.rs β€” Rewrite flush logic to use structured errors, remove string matching

Layer 5: Config

  1. crates/gateway/src/rpc/api/batch_submitter.rs β€” Change defaults: max_retries=2

Layer 6: Mock impls (10 total)

  1. All TaskSubmitter test impls β€” add simulate_batch_create_and_respond(), update return types

Step-by-Step Implementation

Step 1: New ChainIoError variants (crates/chainio/src/error.rs)

Add these variants to the existing ChainIoError enum, plus a new per-item BatchItemError enum:

/// Task already exists on-chain (idempotent — not an error)
#[error("Task {task_id} already exists on-chain")]
TaskAlreadyExists { task_id: B256 },

/// Task already responded to on-chain (idempotent — not an error)
#[error("Task {task_id} already responded")]
TaskAlreadyResponded { task_id: B256 },

/// Contract reverted with a known error selector
#[error("Contract reverted: {name} (0x{selector})")]
ContractRevert {
    selector: String,  // 4-byte hex, e.g. "2e98c533"
    name: String,      // e.g. "TaskAlreadyExists"
    raw_data: Vec<u8>, // full revert data for downstream decoding
},

/// Per-item failure within a batch, with selector-decoded reason
#[derive(Debug, Clone)]
pub enum BatchItemError {
    TaskAlreadyExists,
    TaskAlreadyResponded,
    ContractRevert { selector: String, name: String },
    Unknown { raw: Vec<u8> },
}

Update is_transient() to also match TransactionTimeout:

pub fn is_transient(&self) -> bool {
    matches!(
        self,
        ChainIoError::RpcError(_)
            | ChainIoError::AlloyProviderError(_)
            | ChainIoError::TransactionTimeout { .. }
    )
}

Step 2: Exhaustive selector classification (crates/chainio/src/avs/errors.rs)

Add classify_batch_item_revert(data: &[u8]) -> BatchItemError:

pub fn classify_batch_item_revert(data: &[u8]) -> BatchItemError {
    if data.len() < 4 {
        return BatchItemError::Unknown { raw: data.to_vec() };
    }
    let selector: [u8; 4] = data[..4].try_into().unwrap();
    match selector {
        [0x2e, 0x98, 0xc5, 0x33] => BatchItemError::TaskAlreadyExists,
        [0x68, 0x90, 0x5d, 0xff] => BatchItemError::TaskAlreadyResponded,
        _ => {
            let name = decode_error_selector(selector)
                .unwrap_or_else(|| format!("0x{}", hex::encode(selector)));
            BatchItemError::ContractRevert {
                selector: hex::encode(selector),
                name,
            }
        }
    }
}

Where decode_error_selector is the existing function in writer.rs that maps selectors to names. We'll make it pub and move the
selector-to-name mapping to errors.rs.
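Because the classification is pure byte matching, it can be unit-tested with no chain types at all. A dependency-free sketch covering the two idempotent selectors (the ContractRevert branch is omitted here, since decode_error_selector and hex live in other modules):

```rust
// Simplified stand-in for the plan's BatchItemError (ContractRevert omitted).
#[derive(Debug, PartialEq)]
enum BatchItemError {
    TaskAlreadyExists,
    TaskAlreadyResponded,
    Unknown { raw: Vec<u8> },
}

// Same 4-byte selector gate as classify_batch_item_revert().
fn classify(data: &[u8]) -> BatchItemError {
    if data.len() < 4 {
        return BatchItemError::Unknown { raw: data.to_vec() };
    }
    let selector: [u8; 4] = data[..4].try_into().unwrap();
    match selector {
        [0x2e, 0x98, 0xc5, 0x33] => BatchItemError::TaskAlreadyExists,
        [0x68, 0x90, 0x5d, 0xff] => BatchItemError::TaskAlreadyResponded,
        _ => BatchItemError::Unknown { raw: data.to_vec() },
    }
}

fn main() {
    // Revert data is the selector followed by ABI-encoded arguments.
    assert_eq!(
        classify(&[0x2e, 0x98, 0xc5, 0x33, 0xaa]),
        BatchItemError::TaskAlreadyExists
    );
    assert_eq!(
        classify(&[0x68, 0x90, 0x5d, 0xff]),
        BatchItemError::TaskAlreadyResponded
    );
    // Anything shorter than 4 bytes is Unknown, never a guess.
    assert_eq!(classify(&[0x2e]), BatchItemError::Unknown { raw: vec![0x2e] });
    println!("ok");
}
```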

Step 3: Implement parse_batch_partial_failure() (crates/chainio/src/avs/writer.rs)

Replace the stub at line 795 with real implementation using as_revert_data():

fn parse_batch_partial_failure(
    error: &alloy::contract::Error,
) -> Option<Vec<BatchFailedItem>> {
    let revert_data = error.as_revert_data()?;
    if revert_data.len() < 4 {
        return None;
    }
    let selector: [u8; 4] = revert_data[..4].try_into().ok()?;
    // BatchPartialFailure selector: 0x19ed9977
    if selector != [0x19, 0xed, 0x99, 0x77] {
        return None;
    }
    // Decode: BatchPartialFailure(FailedItem[] failures)
    // FailedItem = (uint256 index, bytes32 taskId, bytes reason)
    // Use alloy ABI decoding
    use alloy::sol_types::SolType;
    type FailedItemTuple = alloy::sol_types::sol_data::Array<(
        alloy::sol_types::sol_data::Uint<256>,
        alloy::sol_types::sol_data::FixedBytes<32>,
        alloy::sol_types::sol_data::Bytes,
    )>;
    let decoded = FailedItemTuple::abi_decode(&revert_data[4..], true).ok()?;
    let items = decoded
        .into_iter()
        .map(|(index, task_id, reason)| BatchFailedItem {
            index: index.to::<usize>(),
            task_id: B256::from(task_id),
            reason: reason.into(),
        })
        .collect();
    Some(items)
}

On error path in batch_create_and_respond_to_tasks, use it to return ChainIoError::BatchPartialFailure:

Err(e) => {
    if let Some(failures) = parse_batch_partial_failure(&e) {
        return Err(ChainIoError::BatchPartialFailure { failures });
    }
    // Classify the top-level revert
    if let Some(revert_data) = e.as_revert_data() {
        let item_err = errors::classify_batch_item_revert(&revert_data);
        // Convert to appropriate ChainIoError variant
        return Err(match item_err { ... });
    }
    Err(ChainIoError::ContractError(e))
}

Step 4: Add simulate_batch_create_and_respond() to AvsWriter

New method using .call() (eth_call) instead of .send():

pub async fn simulate_batch_create_and_respond(
    &self,
    batch_contract_addr: Address,
    tasks: Vec<Task>,                        // element types were lost in rendering;
    responses: Vec<TaskResponse>,            // Task/TaskResponse/SignatureData are stand-ins
    signature_data_array: Vec<SignatureData>,
) -> Result<(), ChainIoError> {
    // Same ABI encoding as batch_create_and_respond_to_tasks
    // but use .call().await instead of .send().await
    contract.batchCreateAndRespondToTasks(tasks, responses, sigs)
        .call()
        .await
        .map(|_| ())
        .map_err(|e| {
            if let Some(failures) = parse_batch_partial_failure(&e) {
                return ChainIoError::BatchPartialFailure { failures };
            }
            ChainIoError::ContractError(e)
        })
}

Step 5: Reduce BATCH_RECEIPT_TIMEOUT (crates/chainio/src/avs/writer.rs)

const BATCH_RECEIPT_TIMEOUT: Duration = Duration::from_secs(30); // was 60

Step 6: Change TaskSubmitter trait (crates/gateway/src/task/submitter.rs)

Change batch method return types from Result (anyhow) to Result<TransactionReceipt, ChainIoError>:

async fn batch_create_and_respond_to_tasks(
    &self,
    batch_contract_addr: Address,
    tasks: Vec<Task>,                        // element types were lost in rendering;
    responses: Vec<TaskResponse>,            // Task/TaskResponse/SignatureData are stand-ins
    signature_data_array: Vec<SignatureData>,
) -> Result<TransactionReceipt, ChainIoError>;

async fn simulate_batch_create_and_respond(
    &self,
    batch_contract_addr: Address,
    tasks: Vec<Task>,
    responses: Vec<TaskResponse>,
    signature_data_array: Vec<SignatureData>,
) -> Result<(), ChainIoError>;

// Also change these existing methods:
async fn batch_create_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
async fn batch_respond_to_tasks(...) -> Result<TransactionReceipt, ChainIoError>;

Remove the format!() error wrapping in the AvsWriter impl β€” return ChainIoError directly.

Step 7: Rewrite flush_combined_batch_with_retry() (crates/gateway/src/rpc/api/batch_submitter.rs)

The core rewrite. Replace string-matching error classification with pattern matching on ChainIoError variants.

Change max_retries default from 3 to 2:
fn default_batch_max_retries() -> u32 { 2 } // was 3

New flush logic (pseudocode):

loop {
    match task_submitter.batch_create_and_respond_to_tasks(...).await {
        Ok(receipt) => {
            // Success — notify all callers
            for (item, task, ..) in built_items {
                let _ = item.result_tx.send(Ok((receipt.clone(), task)));
            }
            return;
        }
        Err(ChainIoError::BatchPartialFailure { failures }) => {
            // Structured per-item failures — classify each
            let failed_map: HashMap<usize, &BatchFailedItem> = ...;
            let mut healthy = vec![];
            for (i, item) in built_items.into_iter().enumerate() {
                if let Some(failure) = failed_map.get(&i) {
                    let classification = classify_batch_item_revert(&failure.reason);
                    match classification {
                        BatchItemError::TaskAlreadyExists
                        | BatchItemError::TaskAlreadyResponded => {
                            // Idempotent success
                            let _ = item.0.result_tx.send(Ok((dummy_receipt(), item.1)));
                        }
                        _ => {
                            // Non-recoverable — fail this item
                            let _ = item.0.result_tx.send(Err(GatewayError::Internal(...)));
                        }
                    }
                } else {
                    healthy.push(item);
                }
            }
            if healthy.is_empty() { return; }
            built_items = healthy;
            attempt += 1;
            if attempt > max_retries { fail_all; return; }
            continue; // retry healthy remainder
        }
        // NOTE: this arm must come before the generic transient arm below,
        // because is_transient() also matches TransactionTimeout and would
        // otherwise shadow it.
        Err(ChainIoError::TransactionTimeout { .. }) => {
            // Receipt timeout — simulate to classify
            match task_submitter.simulate_batch_create_and_respond(...).await {
                Ok(()) => {
                    // Nothing on-chain yet — safe to retry
                    attempt += 1;
                    if attempt > max_retries { fail_all; return; }
                    continue;
                }
                Err(ChainIoError::BatchPartialFailure { failures }) => {
                    // Some items landed — classify each (same as above)
                    ...
                }
                Err(e) => {
                    // Simulation itself failed — treat as transient
                    attempt += 1;
                    if attempt > max_retries { fail_all; return; }
                    continue;
                }
            }
        }
        Err(e) if e.is_transient() => {
            attempt += 1;
            if attempt > max_retries { fail_all; return; }
            tokio::time::sleep(backoff).await;
            backoff = min(backoff * 2, Duration::from_secs(5));
            continue;
        }
        Err(e) => {
            // Unknown non-transient error — try simulation to salvage
            match task_submitter.simulate_batch_create_and_respond(...).await {
                // Same classification as timeout path
                ...
            }
        }
    }
}
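The transient-arm backoff above is deterministic and easy to check in isolation. A std-only sketch; the 500 ms starting value is an assumption for illustration, not taken from the gateway config:

```rust
use std::time::Duration;

// Produce the sleep schedule for `max_retries` transient retries,
// doubling each time and capping at 5 seconds, as in the pseudocode.
fn backoff_schedule(initial: Duration, max_retries: u32) -> Vec<Duration> {
    let cap = Duration::from_secs(5);
    let mut backoff = initial;
    let mut schedule = Vec::new();
    for _ in 0..max_retries {
        schedule.push(backoff);
        backoff = std::cmp::min(backoff * 2, cap);
    }
    schedule
}

fn main() {
    // With max_retries = 2 the worker sleeps at most twice before failing all.
    let schedule = backoff_schedule(Duration::from_millis(500), 2);
    assert_eq!(
        schedule,
        vec![Duration::from_millis(500), Duration::from_secs(1)]
    );
    println!("{:?}", schedule);
}
```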

Remove:

  • extract_failed_indices() β€” replaced by direct pattern match on BatchPartialFailure
  • identify_combined_poison_items() β€” replaced by simulate_batch_create_and_respond()
  • is_transient_error() helper β€” replaced by ChainIoError::is_transient()
  • All error_str.contains(...) string matching

Step 8: Update identify_combined_poison_items β†’ simulation via eth_call

Replace the current implementation that sends REAL single-item batch transactions with eth_call simulation:

async fn simulate_and_classify_items(
    items: Vec<BuiltItem>,  // item type lost in rendering; BuiltItem is a stand-in name
    task_submitter: &Arc<dyn TaskSubmitter>,
    batch_contract_addr: Address,
) -> (Vec<BuiltItem>, Vec<BuiltItem>, Vec<BuiltItem>) {
    // (healthy, poison, idempotent)
    // For each item, call simulate_batch_create_and_respond with a single-item batch
    // Classify the result using BatchItemError
}
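The three-way split this helper performs is just a partition over per-item simulation results; a dependency-free sketch with simplified stand-in types:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Classification {
    Healthy,    // simulation succeeded; safe to resubmit
    Idempotent, // TaskAlreadyExists / TaskAlreadyResponded; treat as success
    Poison,     // any other revert; fail the caller and drop the item
}

// Split items into (healthy, poison, idempotent) buckets by classification.
fn partition<T>(items: Vec<(T, Classification)>) -> (Vec<T>, Vec<T>, Vec<T>) {
    let (mut healthy, mut poison, mut idempotent) = (vec![], vec![], vec![]);
    for (item, class) in items {
        match class {
            Classification::Healthy => healthy.push(item),
            Classification::Poison => poison.push(item),
            Classification::Idempotent => idempotent.push(item),
        }
    }
    (healthy, poison, idempotent)
}

fn main() {
    let items = vec![
        ("task-a", Classification::Healthy),
        ("task-b", Classification::Idempotent),
        ("task-c", Classification::Poison),
        ("task-d", Classification::Healthy),
    ];
    let (healthy, poison, idempotent) = partition(items);
    assert_eq!(healthy, vec!["task-a", "task-d"]);
    assert_eq!(poison, vec!["task-c"]);
    assert_eq!(idempotent, vec!["task-b"]);
}
```

Only the healthy bucket is resubmitted; idempotent items resolve their callers immediately with a success, and poison items fail fast instead of burning the retry budget.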

Step 9: Return TransactionTimeout from writer on receipt timeout

Currently the timeout path in batch_create_and_respond_to_tasks returns a generic error. Change to:

Err(_elapsed) => {
    Err(ChainIoError::TransactionTimeout { timeout_secs: BATCH_RECEIPT_TIMEOUT.as_secs() })
}

Step 10: Update all mock TaskSubmitter implementations (10 impls)

Add simulate_batch_create_and_respond returning Ok(()) (default: all items would succeed).
Change return types on batch methods from the anyhow-style Result to Result<TransactionReceipt, ChainIoError>.

Impls to update:

  1. SuccessSubmitter β€” batch_submitter.rs tests
  2. FailingSubmitter β€” batch_submitter.rs tests
  3. PoisonDetectingSubmitter β€” batch_submitter.rs tests
  4. StallTaskSubmitter β€” tx_worker.rs tests
  5. MockAvsWriter β€” chain_routing.rs tests
  6. MockAvsWriter β€” common/mod.rs tests
  7. MockAvsWriter β€” send_task.rs tests
  8. DummySubmitter β€” secrets_encrypted.rs tests
  9. RecordingTaskSubmitter β€” create_task_proof_cid.rs tests
  10. AvsWriter production impl β€” submitter.rs
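The mock update pattern can be shown with a flattened trait; the real TaskSubmitter is async and uses alloy types, so every name and signature below is a simplified stand-in:

```rust
// Stand-in error type; the real trait uses ChainIoError.
type MockError = String;

trait TaskSubmitterLike {
    // Default for mocks: simulation reports that every item would succeed.
    fn simulate_batch_create_and_respond(&self) -> Result<(), MockError> {
        Ok(())
    }
}

// Most mocks only need the default.
struct SuccessSubmitter;
impl TaskSubmitterLike for SuccessSubmitter {}

// A mock that reports a failure, for exercising the poison-detection path.
struct FailingSubmitter;
impl TaskSubmitterLike for FailingSubmitter {
    fn simulate_batch_create_and_respond(&self) -> Result<(), MockError> {
        Err("BatchPartialFailure".to_string())
    }
}

fn main() {
    assert!(SuccessSubmitter.simulate_batch_create_and_respond().is_ok());
    assert!(FailingSubmitter.simulate_batch_create_and_respond().is_err());
    println!("ok");
}
```

A default method keeps 9 of the 10 impls to a one-line change; only mocks that exercise failure paths override it.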

Step 11: Update batch_submitter tests

Update existing tests to verify new error handling:

  • test_batch_create_success β†’ verify structured receipt propagation
  • test_batch_create_with_poison_items β†’ verify BatchPartialFailure decoding and strip
  • test_batch_create_panic_isolation β†’ unchanged (panic handling)
  • test_batch_submit_multiple_items β†’ verify idempotent success for TaskAlreadyExists

Add new tests:

  • test_batch_timeout_then_simulate β€” receipt timeout β†’ simulation classifies items
  • test_batch_partial_failure_decoding β€” verify BatchPartialFailure selector decoding
  • test_batch_max_retries_2 β€” verify max 2 retries (not 3)

Verification

  1. make fmt && make clippy β€” no warnings
  2. cargo test -p newton-prover-chainio --lib β€” writer/error tests pass
  3. cargo test -p newton-prover-gateway --lib β€” batch_submitter tests pass
  4. make gateway_e2e β€” single-chain E2E
  5. make e2e scenario=basic β€” unified E2E (production config path)

@denniswon
Author

β˜… Insight ─────────────────────────────────────
Here's what this implementation achieves across the full error pipeline:

  1. No more string matching: Every error classification uses typed ChainIoError variants or 4-byte selector matching via
    classify_batch_item_revert(). Zero contains() calls on error strings.
  2. No more stub functions: parse_batch_partial_failure() now uses as_revert_data() to extract and ABI-decode FailedItem[] from
    BatchPartialFailure reverts. extract_failed_indices() (the empty stub) is deleted entirely.
  3. Simulation uses eth_call, not real txs: simulate_and_classify() calls simulate_batch_create_and_respond() which uses .call()
    (read-only) instead of .send(). No more wasting gas on poison detection.
  4. Reduced retry budget: max_retries defaults to 2 (was 3), BATCH_RECEIPT_TIMEOUT is 30s (was 60s). Worst-case cascade: 30s Γ— 2 =
    60s, down from 60s Γ— 3 = 180s.
    ─────────────────────────────────────────────────

Here's a summary of all changes made:

Layer 1: Error types (crates/chainio/src/error.rs)

  • Added BatchItemError enum with selector-based classification
  • Added ChainIoError variants: TaskAlreadyExists, TaskAlreadyResponded, ContractRevert, TransactionTimeout, RetriesExhausted,
    TransactionReverted
  • Added is_transient() (includes TransactionTimeout) and is_idempotent() methods

Layer 2: Error classification (crates/chainio/src/avs/errors.rs)

  • Added selectors module with byte-array constants
  • Added classify_batch_item_revert() β€” exhaustive 4-byte selector classification
  • Added classify_top_level_revert() β€” top-level revert to ChainIoError variant

Layer 3: Writer (crates/chainio/src/avs/writer.rs)

  • BATCH_RECEIPT_TIMEOUT: 60s β†’ 30s
  • Implemented parse_batch_partial_failure() (was stub returning None)
  • Added simulate_batch_create_and_respond() using .call() (eth_call)
  • Fixed all batch method error paths with typed ChainIoError variants

Layer 4: TaskSubmitter trait (crates/gateway/src/task/submitter.rs)

  • Batch methods return Result<TransactionReceipt, ChainIoError> (was an anyhow-style Result via GatewayError)
  • Added simulate_batch_create_and_respond() to trait + AvsWriter impl

Layer 5: Batch submitter (crates/gateway/src/rpc/api/batch_submitter.rs)

  • max_retries default: 3 β†’ 2
  • Removed is_transient_error(), extract_failed_indices(), strip_combined_by_indices(), strip_by_indices()
  • Rewrote flush_combined_batch_with_retry() with typed ChainIoError pattern matching
  • Rewrote flush_respond_batch() similarly
  • Replaced identify_combined_poison_items() (real txs) with simulate_and_classify() (eth_call)

Layer 6: Mock impls (6 test files)

  • Updated all 9 mock TaskSubmitter impls with new return types and simulate_batch_create_and_respond()
