Skip to content

Instantly share code, notes, and snippets.

@denniswon
Created March 26, 2026 22:34
Show Gist options
  • Select an option

  • Save denniswon/f1fd0c07b48ad4c1908ae1e7cb0c5195 to your computer and use it in GitHub Desktop.

Select an option

Save denniswon/f1fd0c07b48ad4c1908ae1e7cb0c5195 to your computer and use it in GitHub Desktop.
Plan: High-Throughput Batch Transaction Submission
❯ Load Test Report:
Configuration:
- Users: 10
- Spawn rate: 10 users/sec (all spawn immediately)
- Pacing: 15 seconds between requests per user
- Duration: 5 minutes
- Effective throughput: ~10 requests every 15 seconds (~40 req/min)
HTTP Layer β€” all 200 requests succeeded, median 2.8s, p95 3.5s
On-chain Submit (RPC β†’ NewTaskCreated)
- 141/200 seen (70.5%)
- p50: 50s Β· p95: 84s Β· p99: 94s
End-to-End (RPC β†’ on-chain response)
- 82/200 completed (41%)
- p50: 138s Β· p95: 192s Β· p99: 204s
Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain β€” only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx
submission layer.
there is a serious reliability issue with transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs Deep analyze the architectural structure of @create/gateway focusing on the onchain task creation and task response submission
flow and explore approaches to massively improve this reliability issue from the first principle. Follow guides and instructions from @.claude/
⎿  Read crates/gateway/src/rpc/api/tx_worker.rs (597 lines)
⎿  Listed directory .claude/
⏺ This is a critical performance issue. The load test data tells a clear story: the gateway's HTTP layer is fine (sub-4s), but the single-threaded TransactionSubmissionWorker is creating a fatal bottleneck β€” 59% of tasks never complete
end-to-end.
Let me deeply explore the architecture before proposing solutions.
⏺ 3 Explore agents finished (ctrl+o to expand)
β”œβ”€ Explore tx submission flow Β· 35 tool uses Β· 114.9k tokens
β”‚ ⎿ Β Done
β”œβ”€ Explore AvsWriter nonce mgmt Β· 22 tool uses Β· 118.9k tokens
β”‚ ⎿ Β Done
└─ Explore gateway config and flow Β· 29 tool uses Β· 106.3k tokens
⎿  Done
❯ Load Test Report:
Configuration:
- Users: 10
- Spawn rate: 10 users/sec (all spawn immediately)
- Pacing: 15 seconds between requests per user
- Duration: 5 minutes
- Effective throughput: ~10 requests every 15 seconds (~40 req/min)
HTTP Layer β€” all 200 requests succeeded, median 2.8s, p95 3.5s
On-chain Submit (RPC β†’ NewTaskCreated)
- 141/200 seen (70.5%)
- p50: 50s Β· p95: 84s Β· p99: 94s
End-to-End (RPC β†’ on-chain response)
- 82/200 completed (41%)
- p50: 138s Β· p95: 192s Β· p99: 204s
Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain β€” only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx
submission layer.
there is a serious reliability issue with transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs Deep analyze the architectural structure of @create/gateway focusing on the onchain task creation and task response submission
flow and explore approaches to massively improve this reliability issue from the first principle. Follow guides and instructions from @.claude/ /clarify-plan
⏺ Skill(/clarify-plan)
⎿  Successfully loaded skill
⏺ Good β€” I've completed a deep exploration of the gateway's transaction submission architecture. Here's what I found before asking questions:
β˜… Insight ─────────────────────────────────────
The core bottleneck is architectural, not configurational. The TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask AND respondToTask) through a single sequential loop β€” one tokio task, one mpsc channel,
one-at-a-time processing. Each transaction blocks on get_receipt() (12-60s waiting for block confirmation). At 40 req/min generating 80 txs/min, but draining at ~2-4 txs/min, the queue is fundamentally oversubscribed by 20-40x.
─────────────────────────────────────────────────
Here are my first three questions:
1. Which chain was this load test running against, and what's the target chain for production?
Block time matters enormously here β€” Base mainnet (2s blocks) vs Sepolia (12s) vs Base Sepolia (2s) changes the theoretical max throughput by 6x. And the nonce pipelining strategy differs by chain's mempool behavior.
2. Is the single-signer constraint (one ECDSA key for both createNewTask and respondToTask) a hard requirement, or can the gateway operate with multiple signer keys?
The current serialization exists solely to prevent nonce conflicts on a shared signer. If we can use N signers (e.g., a pool of hot wallets), we get NΓ— parallelism trivially. If not, we need nonce pipelining (submit tx N+1 before tx N
confirms), which is more complex but still achievable.
3. What's the target throughput and latency you need?
The load test hit 40 req/min and broke at 41% completion. Are we aiming for 40 req/min sustained with >99% completion? 100 req/min? 1000? And what's the acceptable e2e latency β€” is the current 2.8s HTTP response fine (since on-chain is async),
or do callers need on-chain confirmation within a specific window?
✻ Brewed for 2m 59s
❯ 1. sepolia, for prod environment with 10 gateway replicas with 10 task generator signer private keys, 1 for each 2. not a hard requirement I think, but please review the code base (including the solidity contracts code for createNewTask and
respondToTask to confirm my understanding), gateway can operate with multiple signer keys but all task generators have to be configured onchain (createNewTask, and responseToTask guarded with onlyTaskGenerator guard). If my understanding is not
correct though, then nonce pipelining would be needed 3. we are targeting at least 100 RPS (100 requests per second). Acceptable e2e latency for offchain response is around 3 seconds or lower, not longer, ideally around 2 seconds. For onchain,
we have more room, but definitely we want onchain e2e to happen by 1 min of request submission to gateway, ideally within 15-30 seconds (on Ethereum L1/sepolia)
⏺ 2 Explore agents finished (ctrl+o to expand)
β”œβ”€ Explore TaskManager access control Β· 21 tool uses Β· 90.8k tokens
β”‚ ⎿ Β Done
└─ Explore nonce pipelining feasibility Β· 28 tool uses Β· 109.1k tokens
⎿  Done
⏺ Excellent findings. Let me synthesize what we now know before asking the next batch of questions.
β˜… Insight ─────────────────────────────────────
Multiple task generators are fully supported on-chain. The OperatorRegistry uses EnumerableSet.AddressSet β€” both createNewTask and respondToTask use onlyTaskGenerator, and Sepolia config already has 2 generators configured. The old aggregator
role is deprecated. This means the multi-signer parallelism path is wide open β€” no nonce pipelining complexity needed.
The math for 100 RPS on Sepolia (12s blocks): Each task needs 2 txs (create + respond). At 100 RPS, that's 200 txs/sec. With 12s block time, each signer can pipeline ~1 tx per block optimistically, or ~5 txs/block with nonce pipelining. With N
signers doing round-robin, we need enough signers to absorb the throughput. But Sepolia's block gas limit and mempool behavior also constrain this.
─────────────────────────────────────────────────
Batch 2 β€” Architecture & Deployment Constraints:
4. You mentioned 10 gateway replicas with 10 task generator signer keys. Is each replica currently assigned exactly 1 signer key, or do all 10 replicas share the same key today?
This determines whether we're solving a within-replica parallelism problem (single replica needs multiple signers) or a cross-replica coordination problem (multiple replicas competing for the same nonce space). The 10-key setup suggests you've
already anticipated this β€” I want to confirm the current deployment topology.
5. For the 100 RPS target β€” is that 100 newt_createTask calls/sec, or 100 total RPC calls/sec across all methods?
At 100 task creations/sec on Sepolia (12s blocks), we'd need to land ~1,200 createNewTask + ~1,200 respondToTask = 2,400 txs per 12-second block. That's aggressive for a single chain. Are you planning to batch tasks, or is each RPC call = 1
on-chain task?
6. Is the respondToTask submission latency-critical, or only createNewTask?
Currently both flow through the same sequential queue. If respondToTask can be more relaxed (e.g., batched every few seconds), we could prioritize createNewTask in a fast lane and batch responses separately β€” significantly reducing queue
contention.
✻ Cooked for 2m 12s
❯ 4. each replica currently assigned exactly 1 signer key. But this can change if needed. For deploy pipeline for stagef vs prod, you can refer to @~/projects/deploy 5. 100 newt_createTask calls/sec 6. Neither has definitely latency requirement
to happen, but for sure we dont want to incur too much delay or discrepancies between offchain task/task response vs onchain. And if possible, yes, we should explore batching mechanism so that we can for example batch submit task
creation/response every interval (like that will be our "block/batch" time). 6. I initially made it same sequential queue in order to ensure that for a single task, task response NEVER gets submitted if the task has not been created
successfully. There is no difference between createNewTask vs respondToTask in terms of being how much latency critical
⎿  Listed directory ../deploy/
⏺ Great β€” the picture is becoming very clear. Let me ask one final batch to nail down the solution design.
β˜… Insight ─────────────────────────────────────
The math at 100 tasks/sec on Sepolia is extreme. 100 tasks/sec Γ— 2 txs Γ— 12s block time = 2,400 txs per block across the fleet. With 10 replicas at 1 signer each, that's 240 txs/block/replica. Individual transaction submission simply cannot
scale to this β€” batching is mandatory. A BatchTaskManager contract registered as a task generator could batch N createNewTask calls into a single tx, reducing the problem by 1-2 orders of magnitude.
─────────────────────────────────────────────────
Batch 3 β€” Solution Constraints:
7. Is deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) acceptable?
This is the highest-leverage change. A contract registered as a task generator that batches 50-100 tasks per transaction would cut the tx count by 50-100x. Same for respondToTask. The alternative β€” submitting 240 individual txs per block per
replica β€” is infeasible on any EVM chain regardless of nonce pipelining. Is adding a new contract to the deployment pipeline acceptable, or are we restricted to off-chain-only changes?
8. For the create→respond ordering invariant: is it acceptable if creates and responds are in different batches submitted sequentially (batch of creates first, then batch of responds a few seconds later)?
This preserves your invariant (no response without a successful create) while allowing independent batching windows. The tradeoff is slightly higher latency for the respond step (waits for the next batch interval).
9. What's the gas budget per transaction on Sepolia/mainnet for your task generator signers?
Batching 100 createNewTask calls in one tx could cost 20-50M gas. Sepolia's block gas limit is 30M. If individual createNewTask costs ~200-300K gas, we can fit ~100 per batch. But if it costs 500K+, we're limited to ~60. Do you know the
approximate gas cost of a single createNewTask and respondToTask today?
✻ Crunched for 39s
❯ 7. deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) acceptable 8. yes acceptable 9. I do not, is there a way to estimate?
⏺ Let me estimate gas costs from the existing test suite and contract code.
⏺ Bash(cd /Users/dennis.won/projects/newton-prover-avs/contracts && forge test --match-test "test_createTask\|test_respondToTask\|test_createNewTask" --gas-report -vv…)
⎿  Compiling 150 files with Solc 0.8.27
Solc 0.8.27 finished in 22.85s
Compiler run successful with warnings:
… +13 lines (ctrl+o to expand)
⎿  (timeout 2m)
⏺ Explore(Explore createNewTask gas usage)
⎿  Done (36 tool uses · 91.3k tokens · 59s)
(ctrl+o to expand)
⏺ Good β€” gas estimates from code analysis:
- createNewTask: ~100-150K gas
- respondToTask: ~400-800K gas (BLS verification dominates)
- Sepolia block gas limit: 30M
β˜… Insight ─────────────────────────────────────
Batching math is very favorable. At 150K gas/create, we can fit ~200 creates per batch tx. At 800K gas/respond, ~37 responds per batch. Per replica at 10 tasks/sec over a 12s block: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = ~5
txs per block per signer. That's trivially achievable. Batching turns a 240-tx-per-block problem into a 5-tx-per-block problem.
─────────────────────────────────────────────────
I now have enough context to design the solution. Let me move to the Design phase.
⏺ Plan(Design batch tx submission plan)
⎿  Done (49 tool uses · 99.0k tokens · 3m 32s)
(ctrl+o to expand)
⏺ The plan from the design agent is comprehensive. Let me verify one critical detail before writing the final plan.
⎿  Loaded ../.claude/rules/solidity.md
Searched for 1 pattern, read 1 file (ctrl+o to expand)
⏺ Now I have the full picture. Let me read the rest of respondToTask to confirm the batch contract's parameter requirements, then write the final plan.
Read 1 file (ctrl+o to expand)
⏺ Good β€” I have the full picture. Now let me write the final plan.
⏺ Updated plan
⎿  /plan to preview
⏺ The plan is written. Here's the summary:
β˜… Insight ─────────────────────────────────────
The key insight is that batching transforms the problem entirely. Instead of 240 individual transactions competing for block space per replica per block (impossible), we get 5 batch transactions per block (trivial). The BatchTaskManager
contract is the highest-leverage change β€” it reduces the on-chain tx count by ~50x. The Rust-side batch accumulator then collects tasks into time-windowed batches (2s intervals), and nonce pipelining ensures those 5 txs don't block
sequentially on receipt confirmation. The entire solution is feature-gated behind batch_submitter.enabled, so the old sequential worker stays as fallback.
The ordering invariant (no respondToTask before createNewTask confirms) is preserved via a DashMap<TaskId, ()> shared between the create and respond accumulators β€” responds are held until their corresponding create is confirmed on-chain.
─────────────────────────────────────────────────
4 layers, 6 implementation steps:
1. BatchTaskManager.sol β€” batch N calls into 1 tx (Solidity, no Rust dependency)
2. AvsWriter batch methods β€” Rust bindings to call the batch contract
3. BatchTransactionSubmitter β€” replaces tx_worker.rs with time-windowed batch accumulation
4. Nonce pipelining β€” submit 5 batch txs concurrently, await receipts in parallel
5. Integration testing + load test at 100 tasks/sec
6. Multi-signer pool (only if needed β€” math says it won't be)
⏺ User approved Claude's plan
⎿  Plan saved to: ~/.claude/plans/valiant-giggling-moler.md · /plan to edit
Plan: High-Throughput Batch Transaction Submission
Context
Load testing at just 40 req/min revealed a critical bottleneck: TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask and respondToTask) through a single sequential loop, blocking on get_receipt() (12-60s per
tx on Sepolia). Only 41% of tasks completed end-to-end; 64.5% hit the 180s timeout.
Target: 100 tasks/sec sustained across 10 gateway replicas (10/sec/replica), with <3s offchain response and <30s on-chain e2e on Sepolia (12s blocks).
Root cause: The worker can process ~2-4 txs/min but needs 200 txs/sec. Throughput gap: ~3000x.
Solution: Four coordinated layers β€” a Solidity batch contract, a Rust batch accumulator, nonce pipelining, and an optional multi-signer pool.
---
Layer 1: BatchTaskManager Solidity Contract
What
A new standalone contract that batches N createNewTask and N respondToTask calls into single transactions. Registered as a task generator on OperatorRegistry.
Why
At 150K gas/create, one batch tx fits ~200 creates (30M block gas limit). At 800K gas/respond, ~37 per batch. Per replica per 12s block at 10 tasks/sec: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = 5 txs per block instead of
240. This is the highest-leverage change.
Design
Error semantics: try/catch per item (not revert-all). A single TaskAlreadyExists or TaskAlreadyResponded in a batch of 120 must not revert the other 119.
File: contracts/src/BatchTaskManager.sol
contract BatchTaskManager {
INewtonProverTaskManager public immutable taskManager;
// Per-item result events
event BatchCreateResult(bytes32 indexed taskId, bool success, bytes reason);
event BatchRespondResult(bytes32 indexed taskId, bool success, bytes reason);
constructor(address _taskManager) {
taskManager = INewtonProverTaskManager(_taskManager);
}
function batchCreateTasks(
INewtonProverTaskManager.Task[] calldata tasks
) external returns (uint256 successCount) {
for (uint256 i; i < tasks.length;) {
try taskManager.createNewTask(tasks[i]) {
emit BatchCreateResult(tasks[i].taskId, true, "");
unchecked { ++successCount; }
} catch (bytes memory reason) {
emit BatchCreateResult(tasks[i].taskId, false, reason);
}
unchecked { ++i; }
}
}
function batchRespondToTasks(
INewtonProverTaskManager.Task[] calldata tasks,
INewtonProverTaskManager.TaskResponse[] calldata responses,
bytes[] calldata signatureDataArray
) external returns (uint256 successCount) {
require(tasks.length == responses.length && responses.length == signatureDataArray.length);
for (uint256 i; i < tasks.length;) {
try taskManager.respondToTask(tasks[i], responses[i], signatureDataArray[i]) {
emit BatchRespondResult(responses[i].taskId, true, "");
unchecked { ++successCount; }
} catch (bytes memory reason) {
emit BatchRespondResult(responses[i].taskId, false, reason);
}
unchecked { ++i; }
}
}
}
Access control: The BatchTaskManager address is registered as a task generator via OperatorRegistry.addTaskGenerator(). The contract itself needs no auth modifier β€” only registered task generators can call taskManager.createNewTask (the
contract is the msg.sender to TaskManager). Optionally add an onlyAuthorized mapping to prevent gas griefing from arbitrary callers.
Files to create/modify
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ File β”‚ Action β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ contracts/src/BatchTaskManager.sol β”‚ Create β€” batch contract β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ contracts/src/interfaces/IBatchTaskManager.sol β”‚ Create β€” interface for Rust bindings β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ contracts/test/BatchTaskManager.t.sol β”‚ Create β€” unit tests β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ contracts/script/BatchTaskManagerDeployer.s.sol β”‚ Create β€” deployment script following DeploymentLib pattern β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Deployment steps
1. Deploy BatchTaskManager with existing TaskManager address
2. Call OperatorRegistry.addTaskGenerator(batchTaskManagerAddress)
3. Write deployment to script/deployments/batch-task-manager/{chainId}-{env}.json
4. Run make generate-bindings to auto-generate Rust types in crates/core/src/generated/
---
Layer 2: Batch Accumulator (Rust β€” replaces tx_worker.rs)
What
Replace the sequential TransactionSubmissionWorker with BatchTransactionSubmitter that accumulates tasks into time-windowed batches and submits via the BatchTaskManager contract.
Architecture
sync.rs ─────► CreateBatch Accumulator ──flush every 2s──► batchCreateTasks() on-chain
β”‚
(tracks confirmations via DashMap)
β”‚
sync.rs ─────► RespondBatch Accumulator ─flush after creates confirm──► batchRespondToTasks()
Two independent accumulator loops as background tokio tasks, connected by a shared DashMap<TaskId, ()> that tracks which creates have been confirmed on-chain.
Batch flush triggers
1. Timer: batch_interval_ms (default 2000ms β€” one batch per ~1 block on Sepolia)
2. Size: batch reaches max_batch_size
3. Both accumulator loops run independently
Ordering invariant
The respond accumulator holds items in a pending queue. On each flush tick, it moves items whose task_id appears in the confirmed_creates: DashMap<TaskId, ()> into the current respond batch. Items waiting longer than respond_wait_timeout_ms
(default 60s) are dropped with an error log.
Public interface (backward compatible)
BatchTransactionSubmitter exposes the same methods as TransactionSubmissionWorker:
- submit_create_task(task_id, params) -> oneshot::Receiver<Result<...>>
- submit_aggregated_response(task_id, task_request, service_response) -> Result<()>
- queue_capacity() -> usize / queue_depth() -> usize
This means sync.rs and handler/mod.rs only swap the type name.
Config
pub struct BatchSubmitterConfig {
pub enabled: bool, // default: false (feature-gated rollout)
pub max_create_batch_size: usize, // default: 120
pub max_respond_batch_size: usize, // default: 30
pub batch_interval_ms: u64, // default: 2000
pub respond_wait_timeout_ms: u64, // default: 60000
pub batch_contract_address: Address, // from deployment JSON
}
Files to create/modify
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ File β”‚ Action β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/rpc/api/batch_submitter.rs β”‚ Create β€” new batch accumulator β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/rpc/api/mod.rs β”‚ Modify β€” add mod batch_submitter export β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/config.rs β”‚ Modify β€” add BatchSubmitterConfig to GatewayConfig β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/handler/mod.rs β”‚ Modify β€” instantiate BatchTransactionSubmitter when enabled (lines ~549, ~856) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/chainio/src/avs/writer.rs β”‚ Modify β€” add batch_create_tasks() and batch_respond_to_tasks() methods β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/task/submitter.rs β”‚ Modify β€” add batch methods to TaskSubmitter trait (or create BatchTaskSubmitter trait) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Migration path
1. Phase A: Ship behind batch_submitter.enabled = false. Old TransactionSubmissionWorker remains default.
2. Phase B: Enable on stagef, load test. Flip default to true.
3. Phase C: Remove TransactionSubmissionWorker after production validation.
---
Layer 3: Nonce Pipelining
What
After batching reduces txs from 240/block to ~5/block, we still need those 5 txs submitted without blocking on each receipt sequentially. Split send() from get_receipt().
Design
In the batch flush path:
// Fire phase: submit all batch txs with incrementing nonces
let pending_txs: Vec<PendingTransaction> = vec![];
for batch in batches {
let pending = batch_contract.batchCreateTasks(batch).send().await?;
pending_txs.push(pending);
}
// Await phase: concurrently wait for all receipts
let receipts = futures::future::join_all(
pending_txs.into_iter().map(|tx| tx.get_receipt())
).await;
Alloy's nonce manager auto-increments on each send() call, so nonces [N, N+1, N+2, ...] are assigned correctly without manual tracking.
Recovery
If tx at nonce N fails after N+1 is in mempool: N+1 is stuck. The recovery path:
1. Detect the failure from the receipt
2. Re-submit N with bumped gas (replacement tx)
3. Or cancel N with a zero-value self-transfer and re-submit the batch
Files to create/modify
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ File β”‚ Action β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/chainio/src/avs/writer.rs β”‚ Modify β€” add send_without_receipt() variant for pipelined submission β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/rpc/api/batch_submitter.rs β”‚ Modify β€” use pipelined send in flush logic β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
---
Layer 4: Multi-Signer Pool (Optional β€” only if needed)
What
Round-robin across N signer keys per replica. Each signer has independent nonce space.
When to implement
Only if load testing after Layers 1-3 shows a single signer can't keep up. The math says 5 txs/block is well within single-signer capacity, so this is likely unnecessary.
Design
pub struct SignerPool {
signers: Vec<Arc<AvsWriter>>,
next: AtomicUsize,
}
All signer addresses call the same BatchTaskManager contract (which is the registered task generator). No on-chain changes needed beyond Layer 1.
Files
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ File β”‚ Action β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/signer_pool.rs β”‚ Create (only if needed) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/config.rs β”‚ Modify β€” add additional_signers to TaskGeneratorConfig β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
---
Expected Performance
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric β”‚ Current β”‚ After Fix β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Offchain response β”‚ 2.8s (OK) β”‚ 2.8s (unchanged) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ On-chain task creation β”‚ p50=50s β”‚ ~14s (2s accumulate + 12s block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ On-chain e2e β”‚ p50=138s β”‚ ~28s (create confirm + 2s + 12s block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Completion rate β”‚ 41% β”‚ >99% β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Max throughput/replica β”‚ ~2-4 txs/min β”‚ 10 tasks/sec (120/block) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Txs per block/replica β”‚ 240 (impossible) β”‚ 5 (trivial) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
---
Implementation Sequence
Step 1: BatchTaskManager contract (no Rust dependency)
1. Create contracts/src/BatchTaskManager.sol and interface
2. Create contracts/test/BatchTaskManager.t.sol with tests:
- All succeed, partial failure, auth checks, gas measurement
3. Create deployment script following DeploymentLib pattern
4. forge test && forge fmt
5. Run make generate-bindings
Step 2: AvsWriter batch methods (depends on Step 1 bindings)
1. Add batch_create_tasks() and batch_respond_to_tasks() to AvsWriter
2. Unit tests for the new methods
Step 3: BatchTransactionSubmitter (depends on Step 2)
1. Implement BatchTransactionSubmitter in batch_submitter.rs
2. Add BatchSubmitterConfig to GatewayConfig
3. Wire into handler/mod.rs behind enabled flag
4. Unit tests for accumulator logic, ordering invariant, flush triggers
Step 4: Nonce pipelining (depends on Step 3)
1. Add pipelined send to the flush path
2. Add nonce gap recovery logic
Step 5: Integration testing
1. Update E2E tests to exercise batch path
2. Load test: 100 tasks/sec sustained for 5 minutes
3. Verify ordering invariant
4. Verify partial batch failure handling
Step 6: Multi-signer pool (only if Step 5 load test shows need)
---
Verification
1. forge test β€” all BatchTaskManager contract tests pass
2. make clippy β€” no new warnings
3. make tests β€” existing unit tests unaffected (feature-gated)
4. make gateway_e2e β€” existing E2E still works with old worker (default)
5. make gateway_e2e with BATCH_SUBMITTER_ENABLED=true β€” E2E works with new batch path
6. Load test at 100 tasks/sec for 5 min: >99% completion, <30s on-chain e2e
---
Risks and Mitigations
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Risk β”‚ Mitigation β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Batch contract deployment fails β”‚ Old sequential worker remains as fallback (feature flag) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Partial batch failure loses tasks β”‚ try/catch per item; failures logged with task ID for retry β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Nonce gap from failed pipeline tx β”‚ Re-query nonce, re-submit remaining; fallback to sequential β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Respond submitted before create confirms β”‚ DashMap gating enforces ordering; items time out rather than submit early β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Gas estimation wrong for large batches β”‚ Conservative defaults (120 creates, 30 responds); configurable per env β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Offchain response latency regresses β”‚ Batch accumulator is completely decoupled from RPC response path β€” no impact β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
@denniswon
Copy link
Copy Markdown
Author


Remaining Steps to Deploy BatchTaskManager to Stagef/Prod

What PR #476 gives you (code-complete, ready to merge)

  • BatchTaskManager.sol contract + tests
  • BatchTaskManagerDeployer.s.sol deployment script
  • Rust bindings, AvsWriter batch methods, BatchTransactionSubmitter
  • Gateway wiring (opt-in via batch_submitter config)

What still needs to happen after merge

Step 1: Deploy BatchTaskManager contract on each chain

The deployer script (BatchTaskManagerDeployer.s.sol) handles everything in one tx:

  1. Deploys BatchTaskManager(taskManagerAddr, deployer)
  2. Registers it as a task generator on OperatorRegistry
  3. Authorizes all taskGenerator addresses from config as batch callers

Per chain:

Sepolia (source chain)

DEPLOYMENT_ENV=stagef forge script script/BatchTaskManagerDeployer.s.sol
--rpc-url $SEPOLIA_RPC_URL --private-key $DEPLOYER_KEY --broadcast -vvvv

Base Sepolia (destination chain)

DEPLOYMENT_ENV=stagef forge script script/BatchTaskManagerDeployer.s.sol
--rpc-url $BASE_SEPOLIA_RPC_URL --private-key $DEPLOYER_KEY --broadcast -vvvv

This writes output to script/deployments/batch-task-manager/{chainId}-stagef.json.

β˜… Insight ─────────────────────────────────────
The taskGenerator array in newton_prover_config.stagef.json already has the gateway signer addresses for each chain
(e.g., two addresses for Sepolia: 0xD450... and 0xd116...). The deployer reads these and calls addAuthorized() for each,
so every gateway replica can submit batch txs. No manual authorization step needed.
─────────────────────────────────────────────────

Step 2: Configure gateway env vars in deploy repo

Add to the ECS task definition (in newton-prover-avs-deploy):

GATEWAY__BATCH_SUBMITTER__ENABLED=true
GATEWAY__BATCH_SUBMITTER__BATCH_CONTRACT_ADDRESS=<address from step 1>

All other config fields have sane defaults (120 max create batch, 30 max respond batch, 2s interval, 3 retries). The
gateway falls back to the legacy TransactionSubmissionWorker when batch_submitter is absent or enabled=false.

Step 3: Deploy updated gateway binary

The deploy trigger auto-deploys to stagef on merge to main. No separate action needed β€” the CDK pipeline picks up both
the new binary and new env vars.

Step 4: Verify on stagef

Run a load test similar to the one in the gist (10 users, 15s pacing, 5 min):

  • Expect on-chain completion rate to jump from ~41% β†’ >95%
  • Expect e2e latency to drop from ~138s p50 β†’ <30s target

Step 5: Prod deployment

  1. Deploy BatchTaskManager on mainnet chains:
    DEPLOYMENT_ENV=prod forge script script/BatchTaskManagerDeployer.s.sol
    --rpc-url $MAINNET_RPC_URL --private-key $DEPLOYER_KEY --broadcast -vvvv
  2. Add same env vars to prod ECS config
  3. Manual workflow_dispatch with deploy_prod: true (reminder: prod deploys are never automatic)

What does NOT need integration with other Newton contracts

  • No upgrade to NewtonProverTaskManager β€” BatchTaskManager is a standalone wrapper that calls the existing TaskManager's
    createNewTask and respondToTask
  • No changes to ChallengeVerifier, ServiceManager, or cross-chain contracts β€” batch submission is purely a gateway-side
    optimization
  • No transporter or challenger changes β€” they don't submit tasks

Summary table

β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step β”‚ Action β”‚ Blocking? β”‚ Who β”‚
β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Merge PR #476 β”‚ Yes β”‚ Dennis (pending 0xkr8os β”‚
β”‚ β”‚ β”‚ β”‚ re-approve) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1 β”‚ Deploy BatchTaskManager on Sepolia + Base Sepolia β”‚ Yes β”‚ Dennis (forge script) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2 β”‚ Add GATEWAY__BATCH_SUBMITTER__* env vars to deploy β”‚ Yes β”‚ Dennis (deploy repo PR) β”‚
β”‚ β”‚ repo β”‚ β”‚ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 3 β”‚ Gateway deploys to stagef β”‚ Auto on β”‚ CI β”‚
β”‚ β”‚ β”‚ merge β”‚ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 4 β”‚ Load test on stagef β”‚ Yes β”‚ Dennis β”‚
β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 5 β”‚ Deploy contract + gateway to prod β”‚ Manual β”‚ Dennis β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The contract deployment is the only on-chain step. Everything else is config and binary deployment.

@denniswon
Copy link
Copy Markdown
Author


Implementation Plan

  1. Document the config architecture pattern

Add to CLAUDE.md the three-tier config pattern:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Source β”‚ What goes here β”‚ Examples β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Deployment JSONs β”‚ Addresses of contracts we β”‚ newtonProverTaskManager, batchTaskManager, β”‚
β”‚ (script/deployments/) β”‚ deploy and manage β”‚ operatorRegistry β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Config JSONs β”‚ External/configurable values β”‚ aggregator_addr (EOA), taskGenerator (EOAs), β”‚
β”‚ (newton_prover_config.{env}.json) β”‚ we don't manage β”‚ task_response_window_block, feature flags β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Environment variables β”‚ Secrets and runtime-only β”‚ PRIVATE_KEY, DATABASE_URL, REDIS_URL β”‚
β”‚ β”‚ config β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  1. Add batchTaskManager to DeploymentLib.DeploymentData
  • Add field to DeploymentData struct in DeploymentLib.sol
  • Use readAddressOr pattern so existing JSONs don't break
  • BatchTaskManagerDeployer writes to newton-prover/{chainId}-{env}.json (not a separate directory)
  • Update existing deployment JSONs to include the field
  1. Integrate into deploy-avs.sh
  • Post-step after core AVS contracts: deploy BatchTaskManager if not already deployed
  • On upgrade=true: skip BatchTaskManager if address already exists in deployment JSON
  • Register as task generator on OperatorRegistry, authorize configured taskGenerator addresses
  • Support batch=false to skip (for debugging)
  1. Gateway config refactor
  • Remove batch_contract_address from BatchSubmitterConfig
  • Gateway reads batchTaskManager from deployment JSON / NewtonAvsConfig at startup
  • batch_submitter.enabled defaults to true
  • Keep batch=false / GATEWAY__BATCH_SUBMITTER__ENABLED=false as escape hatch for legacy path
  1. Integration tests default to batch mode
  • All E2E test infra deploys BatchTaskManager as part of setup
  • Gateway starts with batch submitter enabled by default
  • Add batch=false flag to test targets for legacy path testing
  • Update make gateway (local dev) to include batch contract in deploy-all
  1. CI cache version bump
  • Bump anvil state cache version so CI picks up fresh state with BatchTaskManager
  1. Makefile target
  • Add deploy-batch-task-manager standalone target (for manual one-off deploys on stagef/prod)
  • deploy-all calls it automatically after deploy-avs

Want me to proceed with implementation? This is a substantial set of changes across Solidity deploy scripts, Rust gateway
config, Makefile, E2E test infra, and CLAUDE.md. I'd suggest tackling them in the order listed above.

@denniswon
Copy link
Copy Markdown
Author

The batchTaskManager address (0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900) IS present in newton-cross-chain/84532-stagef.json, so the
batch submitter should be initialized. The issue is likely that the on-chain submission is silently failing.

β˜… Insight ─────────────────────────────────────
Your Datadog filter is searching by task ID (l393a98f3a585a7cb754dcfa6b). The batch submitter's flush logs ("flushing create batch",
"flushing respond batch") don't include individual task IDs β€” they batch multiple tasks together. So these logs won't appear in a
task-ID-filtered search. Try removing the task ID filter and searching for "flushing" or "batch" to see if the on-chain submissions
are happening and/or failing.
─────────────────────────────────────────────────

That said, the most likely reason all submissions are failing is that the BatchTaskManager contract isn't registered as a
taskGenerator on the NewtonProverTaskManager. Let me verify.

Searched for 2 patterns (ctrl+o to expand)

⏺ Now let me check how the BatchTaskManager contract calls the TaskManager β€” specifically whether msg.sender to createNewTask would be
the BatchTaskManager contract address.

Read 1 file (ctrl+o to expand)

⏺ There it is. The BatchTaskManager contract calls taskManager.createNewTask(tasks[i]) at line 77 and taskManager.respondToTask(...) at
line 118. When it does, msg.sender to the TaskManager is the BatchTaskManager contract address, not the gateway EOA.

The TaskManager's onlyTaskGenerator modifier at NewtonProverTaskManagerShared.sol:30-36 checks:
IOperatorRegistry(operatorRegistry).isTaskGenerator(msg.sender)

If the BatchTaskManager contract (0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900) was not registered as a taskGenerator in the
OperatorRegistry on Base Sepolia, every batch call reverts with OnlyTaskGenerator. The gateway's EOA was already registered, but the
new BatchTaskManager contract is a separate address.

β˜… Insight ─────────────────────────────────────
This is a classic delegation pattern issue. Before PR #476, the gateway EOA called createNewTask/respondToTask directly on the
TaskManager (EOA β†’ TaskManager). Now with batching, the call chain is EOA β†’ BatchTaskManager β†’ TaskManager. The TaskManager sees
msg.sender = BatchTaskManager contract, not the gateway EOA. The onlyTaskGenerator check fails because the BatchTaskManager address
was never added to the OperatorRegistry's task generator allowlist.

The batch contract's NatSpec even says: "This contract is registered as a task generator on OperatorRegistry" (line 9) β€” but this
registration is a deployment step that may have been missed.
─────────────────────────────────────────────────

Two issues to fix

  1. Register BatchTaskManager as a taskGenerator (critical β€” blocking all submissions)

On Base Sepolia (84532), call OperatorRegistry.addTaskGenerator()

cast send <OPERATOR_REGISTRY_ADDR>
"addTaskGenerator(address)"
0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900
--rpc-url <BASE_SEPOLIA_RPC>
--private-key <ADMIN_KEY>

The OperatorRegistry address for stagef on Base Sepolia is 0xfd779c14ca9e4e2275fe11ff7b80742bf3e0da4f (from the deployment JSON).

Also register the gateway EOA as authorized on the BatchTaskManager itself (line 31 onlyAuthorized modifier):

cast send 0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900
"addAuthorized(address)"
<GATEWAY_TASK_GENERATOR_EOA>
--rpc-url <BASE_SEPOLIA_RPC>
--private-key <BATCH_TASK_MANAGER_OWNER_KEY>

  1. Policy schema mismatch (the original error from dd.csv)

The max_daily_spend schema validation failure is a separate, pre-existing issue β€” the IPFS schema expects max_daily_spend at root but
the on-chain params nest it under agents.

. Fix by updating either the schema or the params.

Why you see no error logs in Datadog

The batch submitter's flush errors would appear at error! level but potentially under a different service tag or without the task ID
you're filtering on. Try searching Datadog for:

  • "flushing create batch" or "batch create" without a task ID filter
  • "OnlyTaskGenerator" in error messages
  • Error/Warn levels in the newton-prover-gateway service around the same timeframe

@denniswon
Copy link
Copy Markdown
Author

Can you check Datadog for these (no task ID filter, full time range):

  1. "batch create accumulator cancelled" or "batch respond accumulator cancelled"
  2. "batch transaction submitter started" β€” when was the last occurrence?
  3. Any Error level logs between 12:39 and 15:43 from newton-prover-gateway

Immediate fix: restart the gateway service. This will restart the batch accumulator loops.

Root cause to investigate: what killed the accumulator. The flush_create_batch_with_retry has catch_unwind (line 390-392) to protect
against panics during flush, but a panic elsewhere in the loop (e.g., during channel receive or timeout handling) would silently kill
the spawned task.

✻ Sautéed for 1m 10s

❯ [Image #7] I dont see "batch create accumulator cancelled" or "batch respond accumulator cancelled". no error logs from Error level
logs between 12:39 and 12:43 from newton-prover-gateway
⎿  [Image #7]

⏺ This confirms it. The batch submitter started at 11:15 and was working at 12:38-12:39. Then it silently died β€” no cancellation, no
error, no crash log.

The accumulator loops run inside tokio::spawn. When a spawned task panics, the panic goes to stderr β€” not to structured
tracing/Datadog unless a custom panic hook is configured. This is why there are zero Error logs.

Immediate fix

Restart the gateway. This will restart the batch accumulator loops and the pending creates will start flowing again.

Code fix needed

The accumulator loops need panic protection around the entire loop body, not just during flush. Currently catch_unwind only wraps the
flush_create_batch_with_retry call (line 390), but a panic anywhere else in the loop (channel receive, item processing, timer tick
handling) silently kills the spawned task.

Read 1 file (ctrl+o to expand)

⏺ Looking at this code, I now think the issue isn't a panic β€” it's a hung RPC call. The accumulator loop is likely stuck inside
flush_create_batch_with_retry at line 344, which calls:

// writer.rs:503-507
let send_result = contract.batchCreateTasks(batch_tasks).send().await;
// ...
let receipt = pending.get_receipt().await // ← can hang forever

If a batch tx was sent but the receipt never arrives (tx dropped from mempool, gas too low, RPC node stall), get_receipt().await
blocks indefinitely. The accumulator loop is stuck waiting, all subsequent items queue up in the channel but are never processed.

This explains everything:

  • No error/panic logs (nothing failed β€” it's just waiting)
  • No cancellation logs (loop didn't exit)
  • Creates queued successfully (channel has capacity)
  • But no flush ever fires (loop is blocked on the stuck receipt)

Immediate fix

Restart the gateway.

Code fix

The batch_create_tasks call needs a timeout. Want me to add one? It would wrap the send+receipt in tokio::time::timeout:

// In AvsWriter::batch_create_tasks, wrap the receipt wait
let receipt = tokio::time::timeout(
Duration::from_secs(60),
pending.get_receipt()
).await
.map_err(|_| ChainIoError::CreateNewTaskCallFail {
reason: "batch create receipt timed out after 60s".into()
})??;

Also need to check receipt.status() β€” per lessons.md, get_receipt() doesn't propagate on-chain reverts.

@denniswon
Copy link
Copy Markdown
Author

β˜… Insight ─────────────────────────────────────
What these changes fix:

  1. Receipt timeout (60s) β€” get_receipt().await can hang indefinitely if a tx is dropped from the mempool. The timeout ensures the
    accumulator loop recovers and continues processing subsequent batches instead of silently dying.
  2. Receipt status check β€” Per lessons.md, .get_receipt() returns Ok even for on-chain reverts. Without checking receipt.status(), the
    batch submitter would mark failed creates as "confirmed" in the DashMap, then the respond accumulator would try to submit responses
    for tasks that don't actually exist on-chain.
  3. Improved logging β€” Respond submissions now log at INFO level (was DEBUG), making successful on-chain submissions visible in
    Datadog. The batch summary log includes success/fail counts for quick health monitoring.
    ─────────────────────────────────────────────────

Summary of changes

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ File β”‚ Change β”‚ Why β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/chainio/src/avs/writer.rs β”‚ 60s timeout on get_receipt() in both β”‚ Prevents accumulator loop from β”‚
β”‚ β”‚ batch_create_tasks and batch_respond_to_tasks β”‚ hanging forever on dropped txs β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/chainio/src/avs/writer.rs β”‚ receipt.status() check after receipt β”‚ Detects on-chain reverts that β”‚
β”‚ β”‚ β”‚ get_receipt() silently swallows β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/chainio/src/avs/writer.rs β”‚ Log tx_hash on success β”‚ Makes on-chain txs traceable in β”‚
β”‚ β”‚ β”‚ Datadog β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/rpc/api/batch_submitter.rs β”‚ Panic payload in error log (?panic_info) β”‚ Makes panic root cause visible β”‚
β”‚ β”‚ β”‚ in Datadog β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/rpc/api/batch_submitter.rs β”‚ Respond success log promoted to INFO β”‚ Makes successful responds β”‚
β”‚ β”‚ β”‚ visible in Datadog β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ crates/gateway/src/rpc/api/batch_submitter.rs β”‚ Batch flush summary with success/fail counts β”‚ Quick health monitoring per β”‚
β”‚ β”‚ β”‚ flush cycle β”‚
└───────────────────────────────────────────────┴───────────────────────────────────────────────┴─────────────────────────

@denniswon
Copy link
Copy Markdown
Author

⏺ You're right β€” I was wrong about the duplicate task ID hypothesis. Here's why:

β˜… Insight ─────────────────────────────────────
Each RPC request hits exactly ONE gateway (ALB routing), which generates a unique task_id via keccak256(Uuid). The request.id comes
from the JSON-RPC id field β€” if the client provides a UUID, it's used as seed; otherwise Uuid::new_v4() generates a random one. Either
way, each gateway processes a different request with a different task ID. There is no task ID duplication across replicas.
─────────────────────────────────────────────────

So my TaskAlreadyResponded diagnosis was wrong. Let me reconsider what actually causes it. The error was on the respond path, not the
create path. Let me re-examine the exact error.

Searched for 2 patterns (ctrl+o to expand)

⏺ Good catch β€” you're right. The task ID 0x917f8ba3... appears in BOTH the TaskAlreadyResponded revert AND the aggregator expiry on this
same gateway-1. So this isn't a cross-replica race. Let me trace the exact timeline for this one task.

β˜… Insight ─────────────────────────────────────
The TaskAlreadyResponded revert and the "processed 0 signatures" expiry are for the same task ID on the same gateway. This means:

  1. The respond attempt at 17:12:52 failed because the task was already responded to (by a previous successful submission, likely from
    the pre-batch-submitter path or an earlier batch)
  2. The aggregator still had the task registered and it expired at 17:17:37 β€” 5 minutes later β€” with 0 signatures

This suggests the respond is going through the legacy individual send_aggregated_response path (not batched), and the aggregator task
tracking is orphaned β€” the respond happened but the aggregator was never notified of completion.
─────────────────────────────────────────────────

Looking at the flush_respond_batch code (line 704-710), we see:

// Submit each response via the aggregator's existing individual submission path.
// The aggregator handles respondToTask encoding, BLS signature data assembly,
// and on-chain submission via AvsWriter.send_aggregated_response().
//
// TODO: batch respondToTask via BatchTaskManager.batchRespondToTasks() once
// we extract Task + TaskResponse + signatureData from the aggregator's internal
// state without triggering individual on-chain submission.

The respond path is NOT batched. Each respond item goes through aggregator.submit_aggregated_response() individually β€” the same
sequential on-chain submission as before. The batch submitter only batches creates, not responds. The TODO comment confirms this is
known.

So the TaskAlreadyResponded error means: a previous respond for this task already landed on-chain (either from a prior batch flush or
a retry), and the current attempt is a duplicate. This could happen if:

  1. A respond submission succeeded on-chain but the gateway timed out waiting for the receipt (60s timeout) β†’ marked as failed β†’
    re-queued β†’ second attempt hits TaskAlreadyResponded
  2. Multiple respond flushes contain the same task (no dedup on the respond queue)

Revised diagnosis:

The two core issues are:

  1. Batch create receipt timeouts (60s) causing the create pipeline to stall for ~180s (3 retries), which cascades into respond items
    timing out
  2. Respond path is not batched β€” it still submits individually through aggregator.submit_aggregated_response(), which itself has a 60s
    per-attempt timeout. Under load, this serializes respond submissions the same way the old TransactionSubmissionWorker did
  3. The "0 signatures" aggregator expirations β€” these are likely orphaned aggregator tasks from previous failed cycles. The aggregator
    registers a task when BLS collection starts (6-minute expiry), but if the respond fails, the aggregator entry is never cleaned up and
    eventually expires

The TaskAlreadyResponded is a symptom of retry-after-timeout: the tx landed but the receipt was lost β†’ retry β†’ contract says "already
done." The fix should treat TaskAlreadyResponded as idempotent success rather than an error.

Want me to dig into the respond submission path to confirm this theory and scope the fix?

@denniswon
Copy link
Copy Markdown
Author


β˜… Insight ─────────────────────────────────────
The respond path has TWO layers of retry with timeout, creating a cascading retry storm that produces TaskAlreadyResponded reverts.
─────────────────────────────────────────────────

Respond Submission Path β€” Full Trace

Here's the call chain when a respond is submitted:

BatchTransactionSubmitter::flush_respond_batch() [batch_submitter.rs:689]
└─ aggregator.submit_aggregated_response() [task/aggregator.rs:459]
└─ AggregatorCore::submit_aggregated_response() [aggregator/core.rs:974]
└─ avs_writer.send_aggregated_response() [chainio/avs/writer.rs:370]
└─ crate::tx::send_with_retries() [chainio/tx.rs:34]
└─ call_builder.send().await?.get_receipt().await [writer.rs:421-422]

The Problem: Two Independent Retry+Timeout Layers

Layer 1: send_with_retries (tx.rs:34)

  • Timeout: 60s per attempt (TX_SUBMISSION_TIMEOUT_SECS = 60)
  • Retries: 3 attempts (MAX_RETRY_ATTEMPTS = 3)
  • Retry delay: 300ms between attempts
  • On timeout: Retries with same nonce (no nonce bump)
  • Total worst case: ~180s (3 Γ— 60s)

Layer 2: Batch submitter respond accumulator (batch_submitter.rs:634)

  • Timeout: 60s (respond_wait_timeout_ms = 60000) β€” but this is for waiting on create confirmation, not the respond tx itself
  • No retry at this layer for respond failures β€” flush_respond_batch just logs warn and moves on

The TaskAlreadyResponded Root Cause β€” Confirmed

Here's exactly what happens:

  1. Attempt 1: send_with_retries calls call_builder.send().await β†’ tx is broadcast to mempool β†’ get_receipt().await starts polling
  2. Tx lands on-chain β€” respondToTask succeeds, task is marked as responded in the contract
  3. Receipt polling times out at 60s β€” RPC node is slow under load, or mempool state is inconsistent
  4. send_with_retries sees timeout β†’ treats as retryable (line 172-196 in tx.rs)
  5. Attempt 2: send_with_retries calls call_builder.send().await again β†’ broadcasts a NEW tx for the same respondToTask(task,
    taskResponse, signatureData)
  6. Contract reverts with TaskAlreadyResponded(taskId) (selector 0x68905dff) β€” because the task was already responded in step 2
  7. is_retryable_error returns false for this revert (it's not TransportError or PendingTransactionError) β†’ stops retrying

The critical issue is at writer.rs:421-422:

match call_builder.send().await {
Ok(pending_tx) => pending_tx.get_receipt().await.map_err(...)
// ^^^^^ NO timeout wrapper here! Unlike batch_create_tasks which has BATCH_RECEIPT_TIMEOUT

The respond path's send_aggregated_response does NOT have the BATCH_RECEIPT_TIMEOUT wrapper that was added in PR #487 for batch
creates/responds. Instead, it relies on send_with_retries's outer 60s timeout, which wraps BOTH the send() AND the get_receipt()
together. This means:

  • If send() takes 5s and get_receipt() times out at 55s, the tx DID land on-chain
  • The outer timeout fires, send_with_retries retries with a fresh tx
  • The fresh tx hits TaskAlreadyResponded

Impact on Production Load Test

From the Datadog logs, we saw:

  • "respond batch item failed via aggregator" β€” these are the retry failures propagating up
  • TaskAlreadyResponded(0x917f8ba3...) β€” the contract correctly rejecting duplicate responds
  • The same task ID appearing in both the revert and an aggregator expiry β€” because the first attempt's success was invisible (receipt
    was lost to timeout)

Three Fixes (Ordered by Impact)

Fix 1: Treat TaskAlreadyResponded as idempotent success (immediate, low risk)

In AggregatorCore::submit_aggregated_response (core.rs:1117), catch the specific revert and treat it as success:

match result {
Ok(receipt) => { /* existing success path / }
Err(e) => {
// TaskAlreadyResponded means our first attempt succeeded but we lost the receipt
if is_task_already_responded_error(&e) {
warn!(task_id = %task_id, "TaskAlreadyResponded β€” first attempt likely succeeded, treating as success");
// Clean up task state same as success path
self.task_states.remove(&task_id);
// Return a synthetic "success" or convert to Ok
return Ok(/
... */);
}
// ... existing error path
}
}

This prevents the error from propagating up to flush_respond_batch and being logged as a failure.

Fix 2: Add get_receipt() timeout inside send_aggregated_response (medium, prevents root cause)

Same pattern as batch_create_tasks (writer.rs:513-526) β€” wrap get_receipt() in tokio::time::timeout:

// writer.rs:421-422, change from:
Ok(pending_tx) => pending_tx.get_receipt().await.map_err(ChainIoError::AlloyProviderError),

// to:
Ok(pending_tx) => {
match tokio::time::timeout(
Self::BATCH_RECEIPT_TIMEOUT,
pending_tx.get_receipt()
).await {
Ok(Ok(receipt)) => Ok(receipt),
Ok(Err(e)) => Err(ChainIoError::AlloyProviderError(e)),
Err(_) => Err(ChainIoError::TransactionTimeout {
timeout_secs: Self::BATCH_RECEIPT_TIMEOUT.as_secs(),
}),
}
}

This makes the timeout explicit and classifiable, but alone doesn't prevent the retry from hitting TaskAlreadyResponded.

Fix 3: Make send_with_retries nonce-aware for timeout retries (longer term)

After a timeout, the tx may have landed. Before retrying, check eth_getTransactionCount("pending") vs
eth_getTransactionCount("latest"). If they match, the tx was mined β€” query for the receipt instead of retrying.


Recommended Approach

Fix 1 + Fix 2 together β€” they're complementary:

  • Fix 2 prevents get_receipt() from hanging indefinitely inside the individual respond path (same protection batch creates already
    have)
  • Fix 1 handles the case where timeout DID fire and retry hit the contract β€” treats TaskAlreadyResponded as idempotent success

Fix 3 is a deeper refactor of send_with_retries that would benefit all tx paths but is higher risk.

β˜… Insight ─────────────────────────────────────
The asymmetry is key: batch_create_tasks (writer.rs:513) and batch_respond_to_tasks (writer.rs:631) both got the BATCH_RECEIPT_TIMEOUT
wrapper in PR #487. But send_aggregated_response (writer.rs:421) β€” the individual respond path that flush_respond_batch actually
calls β€” was missed. The respond path goes through the old send_with_retries which has its own 60s timeout, but that timeout wraps
send() + get_receipt() together, making it impossible to distinguish "tx sent but receipt lost" from "tx never sent."
─────────────────────────────────────────────────

Want me to implement Fix 1 + Fix 2?

@denniswon
Copy link
Copy Markdown
Author

What's actually happening

When batch_submitter.enabled = true (which is the production default):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Path β”‚ Batched? β”‚ Code path β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Creates β”‚ Yes β€” via β”‚ flush_create_batch β†’ task_submitter.batch_create_tasks() β†’ β”‚
β”‚ β”‚ BatchTaskManager.batchCreateTasks() β”‚ AvsWriter::batch_create_tasks() β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”‚ β”‚ flush_respond_batch β†’ aggregator.submit_aggregated_response() β†’ β”‚
β”‚ Responds β”‚ No β€” still individual txs β”‚ AggregatorCore::submit_aggregated_response() β†’ β”‚
β”‚ β”‚ β”‚ AvsWriter::send_aggregated_response() β†’ send_with_retries() β†’ individual β”‚
β”‚ β”‚ β”‚ respondToTask β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The TransactionSubmissionWorker (legacy sequential path) is not used when batch submitter is enabled. But
AvsWriter::send_aggregated_response() IS still used β€” flush_respond_batch calls it for each respond item individually in a loop.

The TODO at batch_submitter.rs:708 is about batching responds via BatchTaskManager.batchRespondToTasks() β€” that's a separate future
optimization, not what this PR does.

What this PR actually fixes

The problem is specifically in AvsWriter::send_aggregated_response() (writer.rs:370), which is the individual respond tx path that
flush_respond_batch calls. That method was missing the get_receipt() timeout wrapper. Under load:

  1. send() succeeds (tx broadcast to mempool)
  2. get_receipt() hangs or times out via send_with_retries' outer 60s timeout
  3. send_with_retries retries β†’ new respondToTask tx β†’ contract reverts TaskAlreadyResponded

The batch_create_tasks() and batch_respond_to_tasks() methods on AvsWriter already had the BATCH_RECEIPT_TIMEOUT wrapper from PR #487
β€” but those are the batch contract methods, not the individual send_aggregated_response.

PR description accuracy

The PR description is correct in substance β€” send_aggregated_response was missing the timeout, and the fix adds it. But it could be
more precise about why this path is still used even when batch submitter is enabled.

Let me update the PR description:

⏺ Bash(gh pr edit 488 --body "$(cat <<'EOF'
Production load testing surfaced TaskAlreadyResponded reverts and respond submission failures on all 10 gateway replicas.…)
⎿  https://github.com/newt-foundation/newton-prover-avs/pull/488

⏺ PR description updated. To answer your questions directly:

The TODO (batchRespondToTasks) β€” not implemented in this PR. flush_respond_batch still submits each respond individually through
AvsWriter::send_aggregated_response(). Batching responds via BatchTaskManager.batchRespondToTasks() is a separate optimization that
would reduce tx count from N to 1 per batch, but requires extracting Task + TaskResponse + signatureData from the aggregator's
internal state without triggering individual on-chain submission β€” hence the TODO.

send_aggregated_response usage when batch submitter is enabled β€” yes, it IS still used. The batch submitter only batches creates.
Responds flow through flush_respond_batch β†’ aggregator.submit_aggregated_response() β†’ individual respondToTask txs. The
TransactionSubmissionWorker (legacy) is not used, but AvsWriter::send_aggregated_response() is.

The problem this PR fixes β€” the timeout gap on that individual respond path, plus graceful handling of the resulting
TaskAlreadyResponded when retry discovers the first attempt already succeeded.

@denniswon
Copy link
Copy Markdown
Author

Here is Claude's plan:
β•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œ
Batch Respond via BatchTaskManager

Context

PR #488 on branch fix/respond-timeout-retry-storm already fixes the respond retry storm (receipt timeout, idempotent
TaskAlreadyResponded handling, reduced retries). But the respond path still submits N individual respondToTask transactions even when
BatchTransactionSubmitter is enabled β€” only creates are batched. Under load testing with 10 gateway replicas on Sepolia, individual
respond txs don't scale. This plan batches responds via BatchTaskManager.batchRespondToTasks() to match the create path.

Key Insight

AggregatorCore::submit_aggregated_response does two things: (1) builds respond params (task lookup, TaskRequest→Task conversion,
signature data generation) and (2) submits on-chain. We need to split these so flush_respond_batch can build params for all items,
then submit in one batch tx.

Files to Modify

  1. crates/gateway/src/task/aggregator.rs β€” Add build_respond_params to AggregatorService trait + Aggregator impl
  2. crates/gateway/src/task/submitter.rs β€” Add batch_respond_to_tasks to TaskSubmitter trait + AvsWriter impl
  3. crates/gateway/src/rpc/api/batch_submitter.rs β€” Rewrite flush_respond_batch to use batch submission
  4. Mock/test impls (6 AggregatorService impls, 8 TaskSubmitter impls) β€” Add stub methods

Implementation Steps

Step 1: Add build_respond_params to AggregatorService trait

File: crates/gateway/src/task/aggregator.rs

Add to the trait (after build_certificate_signature_data):

/// Build the (Task, BindingTaskResponse, signature_data Bytes) triple needed for
/// on-chain respond submission, WITHOUT submitting the transaction.
async fn build_respond_params(
&self,
task_request: TaskRequest,
service_response: BlsAggregationServiceResponse,
) -> Result<(BindingTask, BindingTaskResponse, alloy::primitives::Bytes)>;

Step 2: Implement build_respond_params on Aggregator

File: crates/gateway/src/task/aggregator.rs (impl block starting ~line 598)

Extract the build logic from Aggregator::submit_aggregated_response (lines 459-497):

  • Look up task_response from core.task_states (lines 467-482)
  • Convert TaskRequest β†’ Task (line 485-486)
  • Delegate signature data building to AggregatorCore β€” reuse the branching logic from core.rs:994-1030 (source vs dest chain)

The existing submit_aggregated_response should be refactored to call build_respond_params then submit, keeping backward compat.

For signature data: call self.core.create_signature_data() (sync, source chain) or self.core.create_certificate_signature_data()
(async, dest chain) based on self.core.is_destination_chain. These are the same methods already called inside
AggregatorCore::submit_aggregated_response at lines 994-1030.

Important: After building params, clean up task state (core.task_states.remove(&task_id)) β€” currently done inside
AggregatorCore::submit_aggregated_response at lines 1135/1269.

Step 3: Add batch_respond_to_tasks to TaskSubmitter trait

File: crates/gateway/src/task/submitter.rs

/// Batch-respond to multiple tasks via BatchTaskManager contract.
async fn batch_respond_to_tasks(
&self,
batch_contract_addr: Address,
tasks: Vec,
responses: Vec,
signature_data_array: Vec,
) -> Result;

Impl for AvsWriter: delegate to AvsWriter::batch_respond_to_tasks (already exists at writer.rs:564), converting ChainIoError to
GatewayError.

Step 4: Rewrite flush_respond_batch

File: crates/gateway/src/rpc/api/batch_submitter.rs

Replace the per-item loop (lines 714-746) with:

  1. Call aggregator.build_respond_params(item.task_request, item.service_response) for each item, collecting (Task, TaskResponse,
    Bytes) triples. Log and skip items that fail to build (same pattern as flush_create_batch_with_retry skipping ABI decode failures).
  2. Call task_submitter.batch_respond_to_tasks(batch_contract_addr, tasks, responses, sig_data) with the collected vectors.
  3. Handle errors with the same strip-and-retry pattern as flush_create_batch_with_retry:
  • BatchPartialFailure: strip failed indices, retry healthy remainder
  • Transient errors: exponential backoff retry of whole batch
  • TaskAlreadyResponded in failure items: treat as success (idempotent)
  • Max retries exceeded: log and drop remaining items
  1. Wrap in AssertUnwindSafe().catch_unwind() for panic isolation (same as create path).

Update signature: flush_respond_batch needs batch_contract_address: Address passed through from respond_accumulator_loop. The batch
contract address is already available in BatchTransactionSubmitter::new() β€” thread it to the loop.

Step 5: Update respond_accumulator_loop to pass batch contract address

File: crates/gateway/src/rpc/api/batch_submitter.rs

The respond_accumulator_loop (line ~587) spawns sub-batches via join_all. Each sub-batch calls flush_respond_batch. Pass
batch_contract_address (already stored on BatchTransactionSubmitter) through to flush_respond_batch.

Step 6: Update mock AggregatorService implementations (6 impls)

Add build_respond_params returning Err(...) or a dummy triple to:

  • NoOpAggregator (batch_submitter.rs:778)
  • StallAggregator (tx_worker.rs:434)
  • MockAggregator (chain_routing.rs:47)
  • MockAggregator (common/mod.rs:43)
  • MockAggregator (send_task.rs:312)
  • DummyAggregator (secrets_encrypted.rs:388)

Step 7: Update mock TaskSubmitter implementations (8 impls)

Add batch_respond_to_tasks returning Ok(dummy_receipt) or appropriate test response to:

  • SuccessSubmitter (batch_submitter.rs:876)
  • FailingSubmitter (batch_submitter.rs:902)
  • PoisonDetectingSubmitter (batch_submitter.rs:931)
  • StallTaskSubmitter (tx_worker.rs:514)
  • MockAvsWriter (chain_routing.rs:227)
  • MockAvsWriter (common/mod.rs:223)
  • MockAvsWriter (send_task.rs:493)
  • DummySubmitter (secrets_encrypted.rs:497)

Verification

  1. make fmt && make clippy β€” no warnings
  2. cargo test -p newton-prover-gateway β€” all gateway unit tests pass (including batch_submitter tests)
  3. make tests β€” full unit test suite
  4. Existing E2E targets (make gateway_e2e, make e2e scenario=basic) cover the respond path end-to-end
    β•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œ

@denniswon
Copy link
Copy Markdown
Author

❯ one question: when does the build_respond_params gets called? as soon as aggregated response come in/is ready? or when the batch
flush actually happens?

⏺ Good question β€” this is a key design decision.

When the batch flush happens (right before batch_respond_to_tasks), not when the aggregated response arrives.

Here's why:

  1. build_respond_params reads from core.task_states β€” the task response and BLS aggregation data live there. If we build params
    eagerly at queue time, we'd need to either (a) clone all that data into BatchRespondItem upfront, or (b) remove it from task_states
    early, which breaks any concurrent code that might still reference it.
  2. Destination chain signature data is async and expensive β€” create_certificate_signature_data makes RPC calls to query operator
    states and build merkle proofs. Doing this at flush time means we batch the RPC calls closer to submission, reducing the window where
    on-chain state could change between building and submitting.
  3. Matches the create path pattern β€” flush_create_batch_with_retry receives BatchCreateItem with lightweight params and does the heavy
    lifting (ABI encoding, contract call) at flush time.

The flow is:

BLS aggregation completes
β†’ submit_aggregated_response() queues BatchRespondItem (lightweight: task_id, task_request, service_response)
β†’ respond accumulator collects items for batch_interval_ms (default 2s)
β†’ flush_respond_batch() fires:
1. build_respond_params() for each item ← heavy work here
2. batch_respond_to_tasks() with collected triples

One tradeoff: if build_respond_params fails for one item at flush time, we skip it and batch the rest. The alternative (build eagerly,
fail early) would surface errors sooner but complicates BatchRespondItem with large pre-built data.

Want me to proceed with this approach, or would you prefer eager building at queue time?

❯ the reliability/correctness is most important, while not sacrificing on the efficiency/load too much. which way to you recommend?

⏺ I recommend flush-time building (the current plan). Here's the reasoning:

Correctness advantage: task_states is the single source of truth for task responses. Building at flush time means we read from it
exactly once, at the moment we need the data. If we build eagerly, we'd need to handle the case where task_states gets cleaned up
between queue time and flush time (the cleanup_stale_tasks TTL runs periodically). With flush-time building, a missing task_state is a
straightforward skip-and-log.

Reliability advantage: The strip-and-retry pattern works cleanly when build and submit are in the same scope. If build fails for item
3 of 10, we skip it and batch the other 9. If we built eagerly and stored pre-built params in BatchRespondItem, a build failure at
queue time would need different error handling (reject from the channel? store a sentinel?). Keeping it in one place is simpler to
reason about.

Efficiency is fine: The batch interval is 2 seconds. Building params for ~20-120 items (typical batch) takes milliseconds for source
chain (sync create_signature_data) and maybe 100-200ms total for destination chain (async RPC calls, but these could be parallelized
with join_all if needed). This is well within the flush window.

The one risk β€” task_states entry missing at flush time β€” is handled the same way as ABI decode failures in the create path: log a
warning, skip the item, batch the rest. The task response already landed in BLS aggregation, so the data should be there. If it's not,
something else went wrong and retrying won't help.

Let me update the plan and exit.

@denniswon
Copy link
Copy Markdown
Author

the reliability/correctness is most important, while not sacrificing on the efficiency/load too much. which way to you recommend?

⏺ I recommend flush-time building (the current plan). Here's the reasoning:

Correctness advantage: task_states is the single source of truth for task responses. Building at flush time means we read from it
exactly once, at the moment we need the data. If we build eagerly, we'd need to handle the case where task_states gets cleaned up
between queue time and flush time (the cleanup_stale_tasks TTL runs periodically). With flush-time building, a missing task_state is a
straightforward skip-and-log.

Reliability advantage: The strip-and-retry pattern works cleanly when build and submit are in the same scope. If build fails for item
3 of 10, we skip it and batch the other 9. If we built eagerly and stored pre-built params in BatchRespondItem, a build failure at
queue time would need different error handling (reject from the channel? store a sentinel?). Keeping it in one place is simpler to
reason about.

Efficiency is fine: The batch interval is 2 seconds. Building params for ~20-120 items (typical batch) takes milliseconds for source
chain (sync create_signature_data) and maybe 100-200ms total for destination chain (async RPC calls, but these could be parallelized
with join_all if needed). This is well within the flush window.

The one risk β€” task_states entry missing at flush time β€” is handled the same way as ABI decode failures in the create path: log a
warning, skip the item, batch the rest. The task response already landed in BLS aggregation, so the data should be there. If it's not,
something else went wrong and retrying won't help.

Let me update the plan and exit.

Read 1 file (ctrl+o to expand)

⏺ The plan already reflects flush-time building. Let me add a brief note making the timing decision explicit, then exit.

⏺ Updated plan
⎿  /plan to preview
⎿  Tool use rejected

⏺ The strip-and-retry pattern handles partial failures. Here's how it works end-to-end:

On-Chain (Solidity)

BatchTaskManager.batchRespondToTasks() wraps each individual respondToTask in a try/catch. If item 3 out of 10 reverts (e.g.,
TaskAlreadyResponded, TaskMismatch), the contract:

  1. Records item 3 in a FailedItem[] array with its index and revert reason
  2. Continues processing items 4-10
  3. If ANY items failed, reverts the entire tx with BatchPartialFailure(failures) β€” so it's atomic: either all succeed or the whole tx
    reverts with failure details

Rust Side (flush_respond_batch)

The retry loop (same pattern as flush_create_batch_with_retry):

attempt 1: submit [items 0-9]
β†’ BatchPartialFailure([{index: 3, reason: "TaskAlreadyResponded"}, {index: 7, reason: "TaskMismatch"}])

For each failed item:
  - TaskAlreadyResponded β†’ idempotent success, remove from retry set
  - TaskMismatch β†’ non-recoverable, log error, remove from retry set
  - Transient error β†’ keep in retry set

attempt 2: submit [items 0,1,2,4,5,6,8,9] (stripped 3 and 7)
β†’ Ok (all succeed)

Three failure categories:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Category β”‚ Examples β”‚ Action β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Idempotent success β”‚ TaskAlreadyResponded β”‚ Strip from batch, count as success β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Non-recoverable β”‚ TaskMismatch, ABI decode failure β”‚ Strip from batch, log error, drop β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Transient β”‚ RPC timeout, nonce conflict, connection error β”‚ Retry whole remaining batch with exponential backoff β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Bounded retries

max_retries (default 3) caps the retry loop. After exhausting retries, remaining items are logged and dropped. The catch_unwind
wrapper ensures even a panic in the retry loop doesn't kill the accumulator.

What about the create→respond ordering?

The confirmed_creates DashMap already enforces this β€” a respond item only flushes after its corresponding create is confirmed
on-chain. So by the time we batch-respond, we know the tasks exist on-chain. The main failure mode is TaskAlreadyResponded from a
previous gateway replica that already responded (expected under multi-replica load).

Want me to add more detail to the plan about the failure handling, or is this clear enough to proceed?

@denniswon
Copy link
Copy Markdown
Author

⏺ Now I can see the contract clearly. The "all or nothing" framing is actually wrong β€” let me correct that.

The contract does process all items (try/catch per item), but if ANY fail, it reverts the entire transaction with
BatchPartialFailure(failures). This means successful items in the same batch are also rolled back. The Rust side then strips the
failed items and retries just the healthy ones.

This is a deliberate design β€” here's why:

Why revert-all instead of commit-partial:

  1. Atomicity simplifies nonce management: One tx = one nonce. If the contract committed partial results, the Rust side would need to
    track which items landed and which didn't β€” hard to do reliably under multi-replica load where receipts can be lost.
  2. Failed items carry diagnostic data: The revert payload contains FailedItem[] with indices and revert reasons. If the contract
    committed partial results, there's no standard way to return "these 3 failed" alongside a successful receipt (events could work but
    are harder to parse and not guaranteed to be read).
  3. Strip-and-retry is cheap: The Rust side just removes failed indices from the vectors and resubmits. The gas cost of re-executing
    the successful items is modest compared to the alternative of complex partial-commit bookkeeping in Solidity.
  4. Idempotent failures self-heal: TaskAlreadyResponded means the task IS on-chain from a previous attempt. Stripping it and retrying
    the rest always converges β€” the set of "truly failed" items shrinks each retry.

Let me update the plan with this detailed failure handling section.

⏺ Updated plan
⎿  /plan to preview

⏺ β˜… Insight ─────────────────────────────────────
The revert-all-on-partial-failure pattern is a pragmatic Solidity design for batch operations. The EVM doesn't support "return success
with side-channel failure data" β€” a tx either commits all state changes or reverts all. By collecting failures in memory and
reverting with structured data, the contract gives the caller exactly what it needs to retry intelligently, while keeping the on-chain
logic simple and atomic.
─────────────────────────────────────────────────

Plan updated with the detailed failure handling section. Ready for your approval.

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Ready to code?

Here is Claude's plan:
β•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œ
Batch Respond via BatchTaskManager

Context

PR #488 on branch fix/respond-timeout-retry-storm already fixes the respond retry storm (receipt timeout, idempotent
TaskAlreadyResponded handling, reduced retries). But the respond path still submits N individual respondToTask transactions even when
BatchTransactionSubmitter is enabled β€” only creates are batched. Under load testing with 10 gateway replicas on Sepolia, individual
respond txs don't scale. This plan batches responds via BatchTaskManager.batchRespondToTasks() to match the create path.

Key Insight

AggregatorCore::submit_aggregated_response does two things: (1) builds respond params (task lookup, TaskRequest→Task conversion,
signature data generation) and (2) submits on-chain. We need to split these so flush_respond_batch can build params for all items,
then submit in one batch tx.

Design Decision: Build at Flush Time

build_respond_params is called at flush time (inside flush_respond_batch), NOT when the aggregated response is queued. Reasons:

  • Correctness: task_states is the single source of truth. Reading at flush time avoids stale/missing data from early extraction or TTL
    cleanup between queue and flush.
  • Simplicity: Build failures are handled in the same scope as submit failures (skip item, batch the rest) β€” no sentinel values or
    separate error paths needed in the channel.
  • Efficiency: Building params for a typical batch (20-120 items) takes milliseconds (source chain) or ~100-200ms (destination chain
    async RPC). Well within the 2s flush interval.
  • Risk mitigation: If task_states is missing at flush time, the item is skipped with a warning β€” same pattern as ABI decode failures
    in the create path.

Files to Modify

  1. crates/gateway/src/task/aggregator.rs β€” Add build_respond_params to AggregatorService trait + Aggregator impl
  2. crates/gateway/src/task/submitter.rs β€” Add batch_respond_to_tasks to TaskSubmitter trait + AvsWriter impl
  3. crates/gateway/src/rpc/api/batch_submitter.rs β€” Rewrite flush_respond_batch to use batch submission
  4. Mock/test impls (6 AggregatorService impls, 8 TaskSubmitter impls) β€” Add stub methods

Implementation Steps

Step 1: Add build_respond_params to AggregatorService trait

File: crates/gateway/src/task/aggregator.rs

Add to the trait (after build_certificate_signature_data):

/// Build the (Task, BindingTaskResponse, signature_data Bytes) triple needed for
/// on-chain respond submission, WITHOUT submitting the transaction.
async fn build_respond_params(
&self,
task_request: TaskRequest,
service_response: BlsAggregationServiceResponse,
) -> Result<(BindingTask, BindingTaskResponse, alloy::primitives::Bytes)>;

Step 2: Implement build_respond_params on Aggregator

File: crates/gateway/src/task/aggregator.rs (impl block starting ~line 598)

Extract the build logic from Aggregator::submit_aggregated_response (lines 459-497):

  • Look up task_response from core.task_states (lines 467-482)
  • Convert TaskRequest β†’ Task (line 485-486)
  • Delegate signature data building to AggregatorCore β€” reuse the branching logic from core.rs:994-1030 (source vs dest chain)

The existing submit_aggregated_response should be refactored to call build_respond_params then submit, keeping backward compat.

For signature data: call self.core.create_signature_data() (sync, source chain) or self.core.create_certificate_signature_data()
(async, dest chain) based on self.core.is_destination_chain. These are the same methods already called inside
AggregatorCore::submit_aggregated_response at lines 994-1030.

Important: After building params, clean up task state (core.task_states.remove(&task_id)) β€” currently done inside
AggregatorCore::submit_aggregated_response at lines 1135/1269.

Step 3: Add batch_respond_to_tasks to TaskSubmitter trait

File: crates/gateway/src/task/submitter.rs

/// Batch-respond to multiple tasks via BatchTaskManager contract.
async fn batch_respond_to_tasks(
&self,
batch_contract_addr: Address,
tasks: Vec,
responses: Vec,
signature_data_array: Vec,
) -> Result;

Impl for AvsWriter: delegate to AvsWriter::batch_respond_to_tasks (already exists at writer.rs:564), converting ChainIoError to
GatewayError.

Step 4: Rewrite flush_respond_batch

File: crates/gateway/src/rpc/api/batch_submitter.rs

Replace the per-item loop (lines 714-746) with:

  1. Call aggregator.build_respond_params(item.task_request, item.service_response) for each item, collecting (Task, TaskResponse,
    Bytes) triples. Log and skip items that fail to build (same pattern as flush_create_batch_with_retry skipping ABI decode failures).
  2. Call task_submitter.batch_respond_to_tasks(batch_contract_addr, tasks, responses, sig_data) with the collected vectors.
  3. Handle errors with the same strip-and-retry pattern as flush_create_batch_with_retry:
  • BatchPartialFailure: strip failed indices, retry healthy remainder
  • Transient errors: exponential backoff retry of whole batch
  • TaskAlreadyResponded in failure items: treat as success (idempotent)
  • Max retries exceeded: log and drop remaining items
  1. Wrap in AssertUnwindSafe().catch_unwind() for panic isolation (same as create path).

Update signature: flush_respond_batch needs batch_contract_address: Address passed through from respond_accumulator_loop. The batch
contract address is already available in BatchTransactionSubmitter::new() β€” thread it to the loop.

Step 5: Update respond_accumulator_loop to pass batch contract address

File: crates/gateway/src/rpc/api/batch_submitter.rs

The respond_accumulator_loop (line ~587) spawns sub-batches via join_all. Each sub-batch calls flush_respond_batch. Pass
batch_contract_address (already stored on BatchTransactionSubmitter) through to flush_respond_batch.

Step 6: Update mock AggregatorService implementations (6 impls)

Add build_respond_params returning Err(...) or a dummy triple to:

  • NoOpAggregator (batch_submitter.rs:778)
  • StallAggregator (tx_worker.rs:434)
  • MockAggregator (chain_routing.rs:47)
  • MockAggregator (common/mod.rs:43)
  • MockAggregator (send_task.rs:312)
  • DummyAggregator (secrets_encrypted.rs:388)

Step 7: Update mock TaskSubmitter implementations (8 impls)

Add batch_respond_to_tasks returning Ok(dummy_receipt) or appropriate test response to:

  • SuccessSubmitter (batch_submitter.rs:876)
  • FailingSubmitter (batch_submitter.rs:902)
  • PoisonDetectingSubmitter (batch_submitter.rs:931)
  • StallTaskSubmitter (tx_worker.rs:514)
  • MockAvsWriter (chain_routing.rs:227)
  • MockAvsWriter (common/mod.rs:223)
  • MockAvsWriter (send_task.rs:493)
  • DummySubmitter (secrets_encrypted.rs:497)

Failure Handling: Strip-and-Retry

Why the Contract Reverts the Entire Batch on Partial Failure

BatchTaskManager.batchRespondToTasks() (BatchTaskManager.sol:103-140) wraps each respondToTask in try/catch. If ANY item reverts, the
contract collects all failures into FailedItem[] and reverts the entire tx with BatchPartialFailure(trimmed). Successful items in the
same batch are rolled back too.

This revert-all design is intentional:

  1. Atomicity simplifies nonce management: One tx = one nonce. If the contract committed partial results, the Rust side would need to
    track which items landed vs didn't β€” unreliable under multi-replica load where receipts can timeout.
  2. Failed items carry diagnostic data: The revert payload contains FailedItem(index, taskId, reason) with the exact revert bytes. If
    the contract committed partial results, there's no clean way to return "these 3 failed" in a successful receipt. Events could work but
    are harder to parse.
  3. Strip-and-retry is cheap: Re-executing the healthy items costs modest gas compared to the complexity of partial-commit bookkeeping
    in Solidity. The batch is already smaller each retry.
  4. Idempotent failures self-heal: TaskAlreadyResponded means the task IS already on-chain (from a previous replica or timeout retry).
    Stripping it and retrying the rest always converges.

Rust-Side Retry Loop

The flush_respond_batch retry loop mirrors flush_create_batch_with_retry:

items = [all built (Task, TaskResponse, Bytes) triples]

for attempt in 0..max_retries:
result = task_submitter.batch_respond_to_tasks(batch_addr, tasks, responses, sigs)

 match result:
     Ok(receipt) β†’ done, all items succeeded

     Err(BatchPartialFailure(failures)) β†’
         for each failure:
             if reason is TaskAlreadyResponded β†’ count as success, strip from batch
             if reason is non-recoverable      β†’ log error, strip from batch
             if reason is transient            β†’ keep in batch for retry
         if batch is now empty β†’ done
         else β†’ exponential backoff, continue loop

     Err(transient_error) β†’
         exponential backoff, retry entire batch

     Err(non_transient_error) β†’
         log, drop all remaining items, break

 if max_retries exceeded β†’ log remaining items as dropped

Three Failure Categories

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Category β”‚ Selector / Pattern β”‚ Action β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Idempotent success β”‚ TaskAlreadyResponded (0x68905dff) β”‚ Strip from batch, count as success β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Non-recoverable β”‚ TaskMismatch, ABI decode failure, unknown selector β”‚ Strip from batch, log error, drop β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Transient β”‚ RPC timeout, nonce conflict, connection error β”‚ Retry whole remaining batch with backoff β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Panic Isolation

The entire retry loop is wrapped in AssertUnwindSafe(...).catch_unwind() so a panic in signature data building or ABI encoding doesn't
kill the accumulator loop. On panic, the batch is logged and dropped, and the loop continues processing future batches.

Create→Respond Ordering Invariant

The confirmed_creates DashMap ensures respond items only flush after their corresponding creates are confirmed on-chain. By the time
flush_respond_batch runs, the tasks exist on-chain. The primary failure mode under load is TaskAlreadyResponded from another gateway
replica that already responded β€” handled as idempotent success.

Verification

  1. make fmt && make clippy β€” no warnings
  2. cargo test -p newton-prover-gateway β€” all gateway unit tests pass (including batch_submitter tests)
  3. make tests β€” full unit test suite
  4. Existing E2E targets (make gateway_e2e, make e2e scenario=basic) cover the respond path end-to-end

@denniswon
Copy link
Copy Markdown
Author

❯ yes more detailed to the plan about the failure handling. But another question: why is all or nothing approach better? explain in the
plan aswell

⏺ Good question. Let me think about this carefully.

The "all or nothing" isn't actually a choice we made β€” it's how the Solidity batchRespondToTasks contract works. Let me verify this is
actually the case.

Searched for 1 pattern, read 1 file (ctrl+o to expand)
⎿  Loaded contracts/CLAUDE.md
⎿  Loaded contracts/src/CLAUDE.md
⎿  Loaded .claude/rules/solidity.md

⏺ Now I can see the contract clearly. The "all or nothing" framing is actually wrong β€” let me correct that.

The contract does process all items (try/catch per item), but if ANY fail, it reverts the entire transaction with
BatchPartialFailure(failures). This means successful items in the same batch are also rolled back. The Rust side then strips the
failed items and retries just the healthy ones.

This is a deliberate design β€” here's why:

Why revert-all instead of commit-partial:

  1. Atomicity simplifies nonce management: One tx = one nonce. If the contract committed partial results, the Rust side would need to
    track which items landed and which didn't β€” hard to do reliably under multi-replica load where receipts can be lost.
  2. Failed items carry diagnostic data: The revert payload contains FailedItem[] with indices and revert reasons. If the contract
    committed partial results, there's no standard way to return "these 3 failed" alongside a successful receipt (events could work but
    are harder to parse and not guaranteed to be read).
  3. Strip-and-retry is cheap: The Rust side just removes failed indices from the vectors and resubmits. The gas cost of re-executing
    the successful items is modest compared to the alternative of complex partial-commit bookkeeping in Solidity.
  4. Idempotent failures self-heal: TaskAlreadyResponded means the task IS on-chain from a previous attempt. Stripping it and retrying
    the rest always converges β€” the set of "truly failed" items shrinks each retry.

Let me update the plan with this detailed failure handling section.

⏺ Updated plan
⎿  /plan to preview

⏺ β˜… Insight ─────────────────────────────────────
The revert-all-on-partial-failure pattern is a pragmatic Solidity design for batch operations. The EVM doesn't support "return success
with side-channel failure data" β€” a tx either commits all state changes or reverts all. By collecting failures in memory and
reverting with structured data, the contract gives the caller exactly what it needs to retry intelligently, while keeping the on-chain
logic simple and atomic.
─────────────────────────────────────────────────

Plan updated with the detailed failure handling section. Ready for your approval.
⎿  Tool use rejected

⏺ Good instinct to question this. Let me think through the options:

Option 1: Log and Drop (current plan)

  • Failed items are logged with full context (task_id, error reason) and discarded
  • Simple, no additional infrastructure
  • Downside: permanently lost responds β€” the task stays unresponded on-chain

Option 2: Dead Letter Queue (DLQ)

  • Failed items are pushed to a secondary bounded channel/queue for manual inspection or delayed retry
  • Could be in-memory (mpsc::channel) or Redis list for persistence across restarts
  • Downside: adds complexity, and the question is β€” who retries from the DLQ? A human? Another loop?

Option 3: Re-queue to the respond accumulator

  • Failed transient items go back into respond_tx channel for the next batch cycle
  • Needs a retry_count field on BatchRespondItem to prevent infinite loops
  • Downside: risk of poisoning future batches if the failure is semi-permanent

My recommendation: Option 3 (re-queue) for transient errors, log-and-drop for non-recoverable

Here's why:

  • A DLQ is overkill at this stage. The main consumers of a DLQ would be operators monitoring it β€” but we don't have that operational
    infrastructure yet. A Redis DLQ also adds a new failure mode (Redis down β†’ DLQ writes fail β†’ now what?).
  • Re-queuing transient failures back to the respond channel is cheap and self-healing. The item gets another chance in the next batch
    cycle (2s later). Add retry_count: u8 to BatchRespondItem and cap it (e.g., 3 re-queues). After that, log-and-drop.
  • Non-recoverable failures (TaskMismatch, ABI decode errors) should never be re-queued β€” they'll fail every time. Log with full
    context so Datadog alerts can fire.
  • TaskAlreadyResponded is already handled as success β€” not a concern here.

The practical impact of a dropped respond: the task stays unresponded on-chain until the challenge window expires. For load testing
this is acceptable. For production, we'd add Datadog alerting on the "respond batch item permanently dropped" log pattern, and the
TaskAlreadyResponded idempotency means most "failures" under multi-replica load are actually successes.

Want me to update the plan with the re-queue approach for transient errors?

@denniswon
Copy link
Copy Markdown
Author

let's just put a TODO in the code for dead letter queue consideration

⏺ Makes sense β€” keep it simple now, add a TODO for future DLQ consideration. Let me update the plan.

⏺ Updated plan
⎿  /plan to preview

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Ready to code?

Here is Claude's plan:
β•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œ
Batch Respond via BatchTaskManager

Context

PR #488 on branch fix/respond-timeout-retry-storm already fixes the respond retry storm (receipt timeout, idempotent
TaskAlreadyResponded handling, reduced retries). But the respond path still submits N individual respondToTask transactions even when
BatchTransactionSubmitter is enabled β€” only creates are batched. Under load testing with 10 gateway replicas on Sepolia, individual
respond txs don't scale. This plan batches responds via BatchTaskManager.batchRespondToTasks() to match the create path.

Key Insight

AggregatorCore::submit_aggregated_response does two things: (1) builds respond params (task lookup, TaskRequest→Task conversion,
signature data generation) and (2) submits on-chain. We need to split these so flush_respond_batch can build params for all items,
then submit in one batch tx.

Design Decision: Build at Flush Time

build_respond_params is called at flush time (inside flush_respond_batch), NOT when the aggregated response is queued. Reasons:

  • Correctness: task_states is the single source of truth. Reading at flush time avoids stale/missing data from early extraction or TTL
    cleanup between queue and flush.
  • Simplicity: Build failures are handled in the same scope as submit failures (skip item, batch the rest) β€” no sentinel values or
    separate error paths needed in the channel.
  • Efficiency: Building params for a typical batch (20-120 items) takes milliseconds (source chain) or ~100-200ms (destination chain
    async RPC). Well within the 2s flush interval.
  • Risk mitigation: If task_states is missing at flush time, the item is skipped with a warning β€” same pattern as ABI decode failures
    in the create path.

Files to Modify

  1. crates/gateway/src/task/aggregator.rs β€” Add build_respond_params to AggregatorService trait + Aggregator impl
  2. crates/gateway/src/task/submitter.rs β€” Add batch_respond_to_tasks to TaskSubmitter trait + AvsWriter impl
  3. crates/gateway/src/rpc/api/batch_submitter.rs β€” Rewrite flush_respond_batch to use batch submission
  4. Mock/test impls (6 AggregatorService impls, 8 TaskSubmitter impls) β€” Add stub methods

Implementation Steps

Step 1: Add build_respond_params to AggregatorService trait

File: crates/gateway/src/task/aggregator.rs

Add to the trait (after build_certificate_signature_data):

/// Build the (Task, BindingTaskResponse, signature_data Bytes) triple needed for
/// on-chain respond submission, WITHOUT submitting the transaction.
async fn build_respond_params(
&self,
task_request: TaskRequest,
service_response: BlsAggregationServiceResponse,
) -> Result<(BindingTask, BindingTaskResponse, alloy::primitives::Bytes)>;

Step 2: Implement build_respond_params on Aggregator

File: crates/gateway/src/task/aggregator.rs (impl block starting ~line 598)

Extract the build logic from Aggregator::submit_aggregated_response (lines 459-497):

  • Look up task_response from core.task_states (lines 467-482)
  • Convert TaskRequest β†’ Task (line 485-486)
  • Delegate signature data building to AggregatorCore β€” reuse the branching logic from core.rs:994-1030 (source vs dest chain)

The existing submit_aggregated_response should be refactored to call build_respond_params then submit, keeping backward compat.

For signature data: call self.core.create_signature_data() (sync, source chain) or self.core.create_certificate_signature_data()
(async, dest chain) based on self.core.is_destination_chain. These are the same methods already called inside
AggregatorCore::submit_aggregated_response at lines 994-1030.

Important: After building params, clean up task state (core.task_states.remove(&task_id)) β€” currently done inside
AggregatorCore::submit_aggregated_response at lines 1135/1269.

Step 3: Add batch_respond_to_tasks to TaskSubmitter trait

File: crates/gateway/src/task/submitter.rs

/// Batch-respond to multiple tasks via BatchTaskManager contract.
async fn batch_respond_to_tasks(
&self,
batch_contract_addr: Address,
tasks: Vec,
responses: Vec,
signature_data_array: Vec,
) -> Result;

Impl for AvsWriter: delegate to AvsWriter::batch_respond_to_tasks (already exists at writer.rs:564), converting ChainIoError to
GatewayError.

Step 4: Rewrite flush_respond_batch

File: crates/gateway/src/rpc/api/batch_submitter.rs

Replace the per-item loop (lines 714-746) with:

  1. Call aggregator.build_respond_params(item.task_request, item.service_response) for each item, collecting (Task, TaskResponse,
    Bytes) triples. Log and skip items that fail to build (same pattern as flush_create_batch_with_retry skipping ABI decode failures).
  2. Call task_submitter.batch_respond_to_tasks(batch_contract_addr, tasks, responses, sig_data) with the collected vectors.
  3. Handle errors with the same strip-and-retry pattern as flush_create_batch_with_retry:
  • BatchPartialFailure: strip failed indices, retry healthy remainder
  • Transient errors: exponential backoff retry of whole batch
  • TaskAlreadyResponded in failure items: treat as success (idempotent)
  • Max retries exceeded: log and drop remaining items
  1. Wrap in AssertUnwindSafe().catch_unwind() for panic isolation (same as create path).

Update signature: flush_respond_batch needs batch_contract_address: Address passed through from respond_accumulator_loop. The batch
contract address is already available in BatchTransactionSubmitter::new() β€” thread it to the loop.

Step 5: Update respond_accumulator_loop to pass batch contract address

File: crates/gateway/src/rpc/api/batch_submitter.rs

The respond_accumulator_loop (line ~587) spawns sub-batches via join_all. Each sub-batch calls flush_respond_batch. Pass
batch_contract_address (already stored on BatchTransactionSubmitter) through to flush_respond_batch.

Step 6: Update mock AggregatorService implementations (6 impls)

Add build_respond_params returning Err(...) or a dummy triple to:

  • NoOpAggregator (batch_submitter.rs:778)
  • StallAggregator (tx_worker.rs:434)
  • MockAggregator (chain_routing.rs:47)
  • MockAggregator (common/mod.rs:43)
  • MockAggregator (send_task.rs:312)
  • DummyAggregator (secrets_encrypted.rs:388)

Step 7: Update mock TaskSubmitter implementations (8 impls)

Add batch_respond_to_tasks returning Ok(dummy_receipt) or appropriate test response to:

  • SuccessSubmitter (batch_submitter.rs:876)
  • FailingSubmitter (batch_submitter.rs:902)
  • PoisonDetectingSubmitter (batch_submitter.rs:931)
  • StallTaskSubmitter (tx_worker.rs:514)
  • MockAvsWriter (chain_routing.rs:227)
  • MockAvsWriter (common/mod.rs:223)
  • MockAvsWriter (send_task.rs:493)
  • DummySubmitter (secrets_encrypted.rs:497)

Failure Handling: Strip-and-Retry

Why the Contract Reverts the Entire Batch on Partial Failure

BatchTaskManager.batchRespondToTasks() (BatchTaskManager.sol:103-140) wraps each respondToTask in try/catch. If ANY item reverts, the
contract collects all failures into FailedItem[] and reverts the entire tx with BatchPartialFailure(trimmed). Successful items in the
same batch are rolled back too.

This revert-all design is intentional:

  1. Atomicity simplifies nonce management: One tx = one nonce. If the contract committed partial results, the Rust side would need to
    track which items landed vs didn't β€” unreliable under multi-replica load where receipts can timeout.
  2. Failed items carry diagnostic data: The revert payload contains FailedItem(index, taskId, reason) with the exact revert bytes. If
    the contract committed partial results, there's no clean way to return "these 3 failed" in a successful receipt. Events could work but
    are harder to parse.
  3. Strip-and-retry is cheap: Re-executing the healthy items costs modest gas compared to the complexity of partial-commit bookkeeping
    in Solidity. The batch is already smaller each retry.
  4. Idempotent failures self-heal: TaskAlreadyResponded means the task IS already on-chain (from a previous replica or timeout retry).
    Stripping it and retrying the rest always converges.

Rust-Side Retry Loop

The flush_respond_batch retry loop mirrors flush_create_batch_with_retry:

items = [all built (Task, TaskResponse, Bytes) triples]

for attempt in 0..max_retries:
result = task_submitter.batch_respond_to_tasks(batch_addr, tasks, responses, sigs)

 match result:
     Ok(receipt) β†’ done, all items succeeded

     Err(BatchPartialFailure(failures)) β†’
         for each failure:
             if reason is TaskAlreadyResponded β†’ count as success, strip from batch
             if reason is non-recoverable      β†’ log error, strip from batch
             if reason is transient            β†’ keep in batch for retry
         if batch is now empty β†’ done
         else β†’ exponential backoff, continue loop

     Err(transient_error) β†’
         exponential backoff, retry entire batch

     Err(non_transient_error) β†’
         log, drop all remaining items, break

 if max_retries exceeded β†’ log remaining items as dropped

After max retries, permanently dropped items are logged at error! level with full context (task_id, attempt count, last error). Add a
TODO comment for dead letter queue consideration:

// TODO: consider a dead letter queue (Redis list or bounded in-memory channel)
// for permanently failed respond items, enabling delayed retry or manual
// inspection instead of dropping. For now, Datadog alerting on
// "respond batch item permanently dropped" covers production visibility.

Three Failure Categories

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Category β”‚ Selector / Pattern β”‚ Action β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Idempotent success β”‚ TaskAlreadyResponded (0x68905dff) β”‚ Strip from batch, count as success β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Non-recoverable β”‚ TaskMismatch, ABI decode failure, unknown selector β”‚ Strip from batch, log error, drop β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Transient β”‚ RPC timeout, nonce conflict, connection error β”‚ Retry whole remaining batch with backoff β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Panic Isolation

The entire retry loop is wrapped in AssertUnwindSafe(...).catch_unwind() so a panic in signature data building or ABI encoding doesn't
kill the accumulator loop. On panic, the batch is logged and dropped, and the loop continues processing future batches.

Create→Respond Ordering Invariant

The confirmed_creates DashMap ensures respond items only flush after their corresponding creates are confirmed on-chain. By the time
flush_respond_batch runs, the tasks exist on-chain. The primary failure mode under load is TaskAlreadyResponded from another gateway
replica that already responded β€” handled as idempotent success.

Verification

  1. make fmt && make clippy β€” no warnings
  2. cargo test -p newton-prover-gateway β€” all gateway unit tests pass (including batch_submitter tests)
  3. make tests β€” full unit test suite
  4. Existing E2E targets (make gateway_e2e, make e2e scenario=basic) cover the respond path end-to-end

@denniswon
Copy link
Copy Markdown
Author

Now I have the full picture. Line 364-368 confirms it: policy_params is Bytes from the on-chain contract. When it's empty (0x),
String::from_utf8 produces "", then serde_json::from_str("") fails with "EOF while parsing a value at line 1 column 0". Line 369-374
already handles the case where identity_domain key is missing from the JSON β€” but it never gets there because the parse fails first.


Load Test Failure Analysis

The logs reveal three interrelated failure modes, but they all cascade from two root causes.

Root Cause 1: Empty policy_params crashes ALL operator evaluations

The dominant failure. Every single quorum failure in the logs shows both operators (0x03f2... on 0x84c2... and 0xfd8c... on 0x0b26...)
returning the same error:

Error processing identity data: Identity fetch error: Error deserializing
identity_domain: policy_params JSON is not valid policyParams object:
EOF while parsing a value at line 1 column 0

What's happening:

  1. The gateway calls get_policy_config_for_client() on-chain, which returns PolicyConfig { policyParams, expireAfter }
  2. policyParams is empty bytes (0x) for the policy client being load-tested
  3. This gets passed to operators in ConsensusCommitRequest.policy_config
  4. Operator calls fetch_identity_data() at crates/chainio/src/identity_data.rs:364
  5. String::from_utf8(empty_bytes) produces ""
  6. serde_json::from_str("") fails β€” serde can't parse an empty string as JSON
  7. Error propagates up as -32603 RPC error back to gateway
  8. Both operators fail β†’ quorum is 0% β†’ every task fails with "Quorum NOT reached"

The irony: Lines 369-374 already handle the "no identity domain" case gracefully by returning Ok(None). But the code never reaches
that check because parsing the empty string fails first.

This is a data validation gap, not a load issue. It would fail at 1 request/sec too. The load test just made it visible across many
tasks.

Root Cause 2: Nonce contention across 10 gateway replicas

All 10 gateway replicas (gateway-1 through gateway-10) share the same ECDSA signer key. Each replica has its own independent
CachedNonceManager. When multiple replicas flush batch creates simultaneously:

  1. All replicas query eth_getTransactionCount("pending") independently
  2. Multiple replicas get the same nonce
  3. Only one succeeds; others get "nonce too low"
  4. is_transient_error() matches "nonce" β†’ retries with backoff
  5. After 3 retries (100ms β†’ 200ms β†’ 400ms), it gives up: "max transient retries exceeded"

This explains the batch create: transient error β†’ max transient retries exceeded pattern across gateways 1, 4, 7, 9.

Root Cause 3 (Downstream): Zombie aggregator tasks with 0 signatures

The massive wave of "task expired after 360001 ms - processed 0 signatures" entries are downstream effects. When centralized broadcast
fails (quorum not reached due to Root Cause 1), the TaskAggregator still has an active aggregation slot waiting for BLS signatures.
No operator ever submits a signature, so it sits for the full 360s timeout before logging expiration. These are wasted resources, not
an independent problem.


Proposed Solutions

Fix 1 (Critical): Handle empty policy_params gracefully

β˜… Insight ─────────────────────────────────────
The fix is simple: if policy_params is empty bytes, treat it as "no policy params configured" and skip identity data processing β€” the
same way the code already handles missing identity_domain at line 371-374. This is a data boundary validation issue, not a logic
error.
─────────────────────────────────────────────────

In crates/chainio/src/identity_data.rs, add an early return before line 364:

// Empty policy_params means no policy config set β€” skip identity processing
if ctx.policy_params.is_empty() {
info!("Empty policy_params, proceeding without identity data.");
return Ok(None);
}

This is a one-line fix that eliminates the dominant failure mode.

Fix 2 (High): Per-replica signer keys to eliminate nonce contention

The nonce contention is a fundamental architecture problem when N replicas share one signer. Three options, from simplest to most
robust:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Approach β”‚ Complexity β”‚ Throughput β”‚ Description β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Per-replica signer β”‚ Low β”‚ N x current β”‚ Each replica gets its own registered task generator key. Deploy repo β”‚
β”‚ keys β”‚ β”‚ β”‚ assigns PRIVATE_KEY_1 through PRIVATE_KEY_N. β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Signer pool per β”‚ Medium β”‚ M x N x β”‚ Each replica round-robins M keys (the TODO in batch_submitter.rs:11-22). β”‚
β”‚ replica β”‚ β”‚ current β”‚ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Centralized nonce β”‚ High β”‚ Same as pool β”‚ Redis-backed atomic nonce counter. Over-engineered for current scale. β”‚
β”‚ coordinator β”‚ β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Per-replica signer keys is the right first step. It eliminates contention entirely with no code changes β€” just config/deploy changes.
Each key needs to be registered as a task generator on-chain.

Fix 3 (Medium): Cancel aggregator on quorum failure

When centralized broadcast fails with "Quorum NOT reached", the system should cancel the corresponding TaskAggregator immediately
rather than letting it sit for 360s. This eliminates the zombie aggregation slots consuming memory.

Fix 4 (Low): Log the actual transient error in batch create failure message

The WARN log at batch_submitter.rs:430 includes error = %e in the structured fields, but the Datadog CSV only shows the message string
"batch create: transient error, retrying with backoff" without the actual error. Either:

  • Include the error in the message itself: "batch create: transient error ({e}), retrying with backoff", or
  • Verify that structured fields are being indexed in Datadog

Impact Assessment

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Root Cause β”‚ Tasks Affected β”‚ Fix Priority β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Empty policy_params β”‚ 100% of quorum failures β”‚ P0 β€” fix immediately β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Nonce contention β”‚ ~30% of batch creates fail β”‚ P1 β€” per-replica keys β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Zombie aggregators β”‚ Resource waste, not data loss β”‚ P2 β€” cancel on failure β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The empty policy_params fix alone would resolve every quorum failure in this log. The nonce contention fix is needed separately for
batch create reliability at scale. Want me to implement Fix 1 now?

@denniswon
Copy link
Copy Markdown
Author

Good β€” so GATEWAY__TASK_GENERATOR__SIGNER__PRIVATE_KEY env var overrides the TOML config's gateway.task_generator.signer.private_key
via the __ separator convention. Each ECS service gets a different value from AWS Secrets Manager (TASK_GENERATOR_KEY_1 through
TASK_GENERATOR_KEY_10), injected as the same env var name.

This means each replica has its own independent nonce space. My original Root Cause 2 (nonce contention across replicas) is wrong. Let
me revise the analysis.


Revised Load Test Failure Analysis

Root Cause 1 (P0): Empty policy_params crashes ALL operator evaluations

This is unchanged and remains the dominant issue. Every quorum failure in the logs traces back to:

policy_params JSON is not valid policyParams object: EOF while parsing a value at line 1 column 0

At crates/chainio/src/identity_data.rs:364-368, empty policyParams bytes from the on-chain PolicyConfig are converted to "", which
serde_json::from_str can't parse. Both operators fail, quorum is 0%, every task fails.

The fix is a single guard before line 364:

if ctx.policy_params.is_empty() {
info!("Empty policy_params, proceeding without identity data.");
return Ok(None);
}

Root Cause 2 (Revised): Batch create transient errors β€” NOT nonce contention

Since each replica has its own signer key, the "batch create: transient error" pattern is not caused by cross-replica nonce races. The
actual cause is likely one of:

  1. RPC provider rate limiting or instability under load β€” 10 replicas simultaneously hitting Sepolia RPC endpoints (Tenderly gateway)
    could trigger rate limits (429) or internal errors (503)
  2. Per-replica nonce races between aggregator + batch submitter β€” within a single replica, the AvsWriter is shared via
    new_with_provider (shared nonce manager), but if there's a code path creating a second provider instance, intra-replica nonce conflict
    is possible
  3. Sepolia mempool congestion β€” under load testing, many batch txs from 10 different signers could cause mempool-level issues

The Datadog CSV doesn't show the actual error. The structured field error = %e is logged at WARN level (batch_submitter.rs:432), but
the CSV only captures the message template. To diagnose this precisely, we need the structured error field from Datadog β€” either via
the Datadog UI facets or by adding the error to the message string itself.

β˜… Insight ─────────────────────────────────────
The transient batch create errors are a secondary issue β€” they only affect on-chain task creation, not the evaluation path. Even if
batch creates succeed, the tasks would still fail quorum because of Root Cause 1 (empty policy_params). Fixing Root Cause 1 first is
critical because it blocks ALL task evaluations regardless of batch submission health.
─────────────────────────────────────────────────

Root Cause 3: Zombie aggregator tasks (360s timeout with 0 signatures)

Unchanged β€” these are the downstream effect of quorum failure. When centralized broadcast fails, the TaskAggregator slot sits idle for
360s. The fix is to cancel the aggregator immediately when quorum fails.


Revised Priority

β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ # β”‚ Issue β”‚ Impact β”‚ Fix β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ P0 β”‚ Empty policy_params β†’ operator β”‚ 100% of tasks fail β”‚ Early return in identity_data.rs if β”‚
β”‚ β”‚ deserialization crash β”‚ quorum β”‚ policy_params.is_empty() β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ P1 β”‚ Batch create transient errors (unknown β”‚ ~30% of batch creates β”‚ Surface actual error in log message; investigate RPC β”‚
β”‚ β”‚ RPC cause) β”‚ fail β”‚ provider limits β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ P2 β”‚ Zombie aggregator slots (360s idle) β”‚ Memory/resource waste β”‚ Cancel aggregator on quorum failure β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The empty policy_params fix is a one-liner that eliminates every quorum failure in this log. The batch create issue needs better
observability (surface the actual error string) before we can solve it. Want me to implement the P0 fix now?

@denniswon
Copy link
Copy Markdown
Author

β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ # β”‚ Claim β”‚ Status β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1 β”‚ fetch_identity_data crashes on empty policy_params β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2 β”‚ Guard returns Ok(None) before JSON parse β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 3 β”‚ AggregatorCore DashMap fields β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 4 β”‚ cancel_task removes from both maps β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 5 β”‚ Both quorum failure paths call cancel_task β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 6 β”‚ Trait method added to AggregatorService β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 7 β”‚ 6 mock impls + 1 real impl = 7 total β”‚ VERIFIED β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 8 β”‚ Error inlined in batch log message strings β”‚ VERIFIED β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

No false statements found. PR #492 is ready: https://github.com/newt-foundation/newton-prover-avs/pull/492

✻ Worked for 1m 21s

❯ /docs-sync

⏺ Let me run through the docs-sync phases.

⏺ Bash(git log --oneline $(git merge-base HEAD main)..HEAD)
⎿  4604667f fix: guard empty policy_params, surface batch errors, cancel zombie aggregators

⏺ Bash(git diff $(git merge-base HEAD main)..HEAD --stat)
⎿   crates/aggregator/src/core.rs | 10 ++++++++++
crates/chainio/src/identity_data.rs | 7 +++++++
crates/gateway/src/rpc/api/batch_submitter.rs | 13 +++++--------
… +8 lines (ctrl+o to expand)

⏺ Single commit, 10 files. The changes are: (1) empty policy_params guard, (2) batch error log formatting, (3) cancel_task trait method

  • impls. Let me check if any docs reference these areas.

⏺ Bash(git diff $(git merge-base HEAD main)..HEAD)
⎿  diff --git a/crates/aggregator/src/core.rs b/crates/aggregator/src/core.rs
index 792238fe..64dcd2b8 100644
--- a/crates/aggregator/src/core.rs
… +78 lines (ctrl+o to expand)

⏺ Now let me check if any documentation references these areas and needs updating.

Searched for 2 patterns (ctrl+o to expand)

⏺ The docs reference fetch_identity_data and policy_params but describe the flow correctly β€” they mention parsing policy_params for
identity_domain. The new guard is a defensive check that happens before that parsing and doesn't change the documented flow. No doc
updates needed for the identity path.

Now check if lessons.md should get a new entry β€” this is the kind of production gotcha that belongs there:

Searched for 2 patterns (ctrl+o to expand)

⏺ No existing entry for this. Let me add it to lessons.md and the Common Pitfalls table in CLAUDE.md. Let me also check the Common
Pitfalls table format:

Read 2 files (ctrl+o to expand)
⎿  Loaded .claude/CLAUDE.md
⎿  Loaded .claude/rules/CLAUDE.md

⏺ Good. I need to add two things:

  1. A new Common Pitfalls entry for empty policy_params
  2. A new lessons.md entry

Let me find the right insertion point in lessons.md for the identity/on-chain data section:

Searched for 1 pattern (ctrl+o to expand)

⏺ I'll add the new lesson under "Identity Domain Dispatch" (relevant section) and add a Common Pitfalls entry. Also, the "Adding a
method to a trait breaks all mock implementations" lesson at line 176 is directly relevant to what we just did β€” good that it's
already there.

Read 1 file (ctrl+o to expand)

⏺ Now apply both doc updates:

⏺ Update(.claude/rules/lessons.md)
⎿  Added 16 lines
368
369 New domains require changes in: libs/regorus (Rego extensions + feature flag), crates/core (re-export new type), crates/chainio (new IdentityDomain variant + deserialization arm), and crates/operator (if the Rego return type c
hanges). See docs/IDENTITY_DOMAIN_GUIDE.md for the step-by-step checklist.
370
371 +### Empty policy_params (0x) crashes identity data processing
372 +
373 +When PolicyConfig.policyParams is empty bytes on-chain (common for policy clients without identity domains), fetch_ +identity_data tried to parse the empty string as JSON, failing with "EOF while parsing a value at line 1 column 0". T
+his crashed every operator's identity processing for that task, producing 100% quorum failure.
374 +
375 +Prevention: Always guard Bytes fields from on-chain with an emptiness check before parsing. The contract returns
+ 0x (empty) for unset bytes fields β€” never assume they contain valid encoded data.
376 +
377 ---
378
379 +## Resource Cleanup
380 +
381 +### Cancel BLS aggregation slots on broadcast failure
382 +
383 +initialize_task creates a DashMap entry in task_states and a oneshot receiver in task_response_receivers. If the
+ subsequent broadcast fails (quorum not reached, operator errors), these slots become zombies waiting for signatures t
+hat will never arrive. They persist until the periodic cleanup_stale_tasks TTL sweep (default 6 minutes). Under load
+ with frequent failures, this leaks memory proportional to failure rate x TTL.
384 +
385 +Prevention: Call aggregator.cancel_task(&task_id) in every error path that follows initialize_task. Both the t
+wo-phase and centralized broadcast error handlers in sync.rs must include this cleanup.
386 +
387 +---
388 +
389 ## Database Migration Gotchas
390
391 ### sqlx migration tracking must be consistent

⏺ Now add the Common Pitfalls entry to both CLAUDE.md files (root and .claude/CLAUDE.md must stay in sync per docs-sync rules):

⏺ Update(CLAUDE.md)
⎿  Added 1 line
252 | QuorumNotReached with no operator error details | Unified quorum broadcast silently drops operator RPC errors (pol
icy eval failures) | Check operator_errors field in CreateTaskResponse; see operator.rs broadcast loops |
253 | BLS aggregation completes but no on-chain submission logs | Batch accumulator loop hung on get_receipt().await (tx
dropped from mempool) | Restart gateway; BATCH_RECEIPT_TIMEOUT (60s) in writer.rs prevents recurrence |
254 | "batch_create_tasks: all items succeeded" but task not on-chain | get_receipt() returned receipt with status: f alse (on-chain revert) without error | receipt.status() check now returns CreateNewTaskCallFail with tx hash |
255 +| QuorumNotReached with identity deserialization error in operator logs | PolicyConfig.policyParams is empty bytes
+ (0x) on-chain | fetch_identity_data now guards with is_empty() before JSON parse; policy clients without identi
+ty domains skip identity processing |
256
257 ## Multichain Configuration
258

⏺ Update(.claude/CLAUDE.md)
⎿  Added 1 line
252 | QuorumNotReached with no operator error details | Unified quorum broadcast silently drops operator RPC errors (pol
icy eval failures) | Check operator_errors field in CreateTaskResponse; see operator.rs broadcast loops |
253 | BLS aggregation completes but no on-chain submission logs | Batch accumulator loop hung on get_receipt().await (tx
dropped from mempool) | Restart gateway; BATCH_RECEIPT_TIMEOUT (60s) in writer.rs prevents recurrence |
254 | "batch_create_tasks: all items succeeded" but task not on-chain | get_receipt() returned receipt with status: f alse (on-chain revert) without error | receipt.status() check now returns CreateNewTaskCallFail with tx hash |
255 +| QuorumNotReached with identity deserialization error in operator logs | PolicyConfig.policyParams is empty bytes
+ (0x) on-chain | fetch_identity_data now guards with is_empty() before JSON parse; policy clients without identi
+ty domains skip identity processing |
256
257 ## Multichain Configuration
258

⏺ Now verify comments in the changed code are accurate, and check for any dedup issues per docs-sync rules:

Searched for 1 pattern (ctrl+o to expand)

⏺ Let me verify that "6 minutes" claim β€” check the actual TTL value:

Searched for 2 patterns (ctrl+o to expand)

⏺ The TTL is task_response_window_block * block_time_ms * 2 / 1000. It's not a fixed 6 minutes β€” it's dynamic based on config. Let me
check typical values:

Searched for 1 pattern (ctrl+o to expand)

⏺ 30 blocks. For Sepolia (~12s block time): 30 * 12000 * 2 / 1000 = 720s = 12 minutes. For local anvil with block_time=1: 30 * 1000 * 2
/ 1000 = 60s. The "6 minutes" claim is wrong β€” it varies. Let me fix the doc comment to say "several minutes" instead of a specific
number:

⏺ Update(crates/gateway/src/task/aggregator.rs)
⎿  Added 1 line, removed 1 line
116 ///
117 /// Called on quorum failure or broadcast error to free the BLS aggregation state
118 /// that initialize_task created. Without this, zombie slots wait for signatures
119 - /// that will never arrive until TTL cleanup runs (default 6 minutes).
119 + /// that will never arrive until the periodic TTL cleanup sweep runs.
120 fn cancel_task(&self, task_id: &TaskId);
121
122 /// Remove stale tasks older than the given TTL.

⏺ Also fix the same claim in lessons.md:

⏺ Update(.claude/rules/lessons.md)
⎿  Added 1 line, removed 1 line
380
381 ### Cancel BLS aggregation slots on broadcast failure
382
383 -initialize_task creates a DashMap entry in task_states and a oneshot receiver in task_response_receivers. If the
- subsequent broadcast fails (quorum not reached, operator errors), these slots become zombies waiting for signatures t
-hat will never arrive. They persist until the periodic cleanup_stale_tasks TTL sweep (default 6 minutes). Under load
- with frequent failures, this leaks memory proportional to failure rate x TTL.
383 +initialize_task creates a DashMap entry in task_states and a oneshot receiver in task_response_receivers. If the
+ subsequent broadcast fails (quorum not reached, operator errors), these slots become zombies waiting for signatures t
+hat will never arrive. They persist until the periodic cleanup_stale_tasks TTL sweep (TTL = task_response_window_bl +ock * block_time * 2, typically 60s local / 12min Sepolia). Under load with frequent failures, this leaks memory prop
+ortional to failure rate x TTL.
384
385 Prevention: Call aggregator.cancel_task(&task_id) in every error path that follows initialize_task. Both the t
wo-phase and centralized broadcast error handlers in sync.rs must include this cleanup.
386

⏺ Good. Now amend the commit to include the doc updates (since they're part of the same logical change):

⏺ Bash(git diff --stat)
⎿   crates/gateway/src/task/aggregator.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

⏺ The CLAUDE.md and lessons.md changes aren't showing because they're gitignored (.claude/ is local-only per memory rules). The only
trackable change is the aggregator doc comment fix. Let me commit that:

⏺ Bash(git add crates/gateway/src/task/aggregator.rs && git commit -m "docs: fix cancel_task TTL claim in doc comment" && git push
origin HEAD 2>&1)
⎿  [fix/empty-policy-params-guard 28ea2d5f] docs: fix cancel_task TTL claim in doc comment
1 file changed, 1 insertion(+), 1 deletion(-)
remote:
… +6 lines (ctrl+o to expand)

⏺ Docs-sync complete. Summary:

Files modified:

  • crates/gateway/src/task/aggregator.rs β€” fixed "default 6 minutes" claim to accurate "periodic TTL cleanup sweep" (committed +
    pushed)
  • .claude/rules/lessons.md β€” added two new entries: "Empty policy_params crashes identity data processing" and "Cancel BLS aggregation
    slots on broadcast failure" (local-only)
  • CLAUDE.md + .claude/CLAUDE.md β€” added Common Pitfalls entry for empty policy_params quorum failure (local-only)

Key documentation changes:

  • New Common Pitfalls entry: QuorumNotReached with identity deserialization error when policyParams is empty 0x
  • New lessons learned: empty Bytes from on-chain must always be guarded before parsing; aggregation slots must be cancelled on
    broadcast failure
  • Fixed inaccurate TTL duration claim (was "6 minutes", actual value is config-dependent: 60s local, ~12min Sepolia)

No gaps or ambiguities remaining. The code comments in the diff are accurate and the "why" context is present where needed. No stale
comments were found.

✻ Baked for 2m 31s

❯ /commit-git

⏺ Bash(git status && git diff --stat)
⎿  On branch fix/empty-policy-params-guard
nothing to commit, working tree clean

⏺ Working tree is clean β€” everything is already committed and pushed. The .claude/ changes (lessons.md, CLAUDE.md) are local-only files
that aren't tracked by git (gitignored per project rules).

Current state:

@denniswon
Copy link
Copy Markdown
Author

The Dual Aggregation Path Problem

There are two completely independent ways BLS signatures get aggregated in this codebase:

Path A β€” Channel-based (designed for the watcher/event-driven path):

  1. initialize_task creates a DashMap entry in task_states and a oneshot::Receiver in task_response_receivers
  2. Operators submit signatures via the aggregator's JSON-RPC server (process_signed_task_response)
  3. The RPC handler puts each signature into the DashMap slot
  4. wait_for_aggregation blocks on the oneshot receiver until quorum is reached or timeout

Path B β€” Direct (used by the gateway RPC path β€” both two-phase and centralized):

  1. The gateway broadcasts newt_evaluateAndSign to all operators simultaneously
  2. Operators return signed responses directly in the HTTP response body
  3. The gateway collects these into a Vec
  4. aggregate_responses takes that Vec and aggregates the BLS signatures in one shot β€” no channel, no waiting

The problem: the gateway always calls initialize_task first (to set up operator tracking metadata), then uses Path B
(aggregate_responses). Path A's channel slot is created but never consumed. It sits in the DashMap waiting for signatures that will
never arrive via the RPC path, because the signatures already came back inline in the HTTP responses.

After ~360s, the periodic cleanup_stale_tasks sweep finds the slot, sees 0 signatures processed, and logs "task expired without
reaching quorum" β€” a completely misleading error for a task that actually completed successfully.

Now let me find the TTL calculation.

Searched for 1 pattern (ctrl+o to expand)

⏺ Now let me find where the TTL is calculated β€” the * 2 multiplier.

Searched for 1 pattern (ctrl+o to expand)

⏺ Found it at crates/gateway/src/handler/mod.rs:1417-1419:

// TTL is 2x task response window to ensure tasks are definitely expired
let task_ttl =
std::time::Duration::from_secs((self.task_response_window_block * self.block_time_ms * 2) / 1000);

The * 2 was a safety margin because zombie slots might still be "in progress" when the cleanup runs. But now that we cancel slots
eagerly on both success and error paths, the only slots that survive to cleanup are truly orphaned (e.g., gateway crash mid-task).
The 2x buffer is unnecessary and just delays cleanup of any edge-case leftovers.

Read 1 file (ctrl+o to expand)
⎿  Loaded crates/gateway/src/handler/CLAUDE.md

⏺ β˜… Insight ─────────────────────────────────────

  • The * 2 multiplier was a defensive buffer: "maybe the task is still being processed when the cleanup sweep runs." With eager
    cancel_task on all paths, the only slots reaching cleanup are truly abandoned (crash, network partition). The response window itself
    (task_response_window_block * block_time_ms) is already the maximum legitimate task lifetime.
  • Removing * 2 halves the cleanup TTL β€” e.g., from ~360s to ~180s on Sepolia (30 blocks * 12s * 2 β†’ 30 * 12). Any crash-orphaned
    slots get cleaned up faster.
    ─────────────────────────────────────────────────

@denniswon
Copy link
Copy Markdown
Author

⏺ SendTaskParams is defined in newton_prover_chainio::avs::writer. Let me find the struct definition there.

Searched for 1 pattern (ctrl+o to expand)

⏺ Now I have the full picture. Let me also check how batch_create_tasks in the writer uses the Task struct β€” specifically whether the
contract's createTask stores or validates taskCreatedBlock.

Read 1 file (ctrl+o to expand)
⎿  Loaded contracts/src/CLAUDE.md
⎿  Loaded contracts/src/libraries/CLAUDE.md

⏺ Now I have all the constraints. Let me also check the respond path β€” TaskResponseTooLate at line 178:

require(
blockNumber < task.taskCreatedBlock + responseWindowBlock,
TaskResponseTooLate(blockNumber, task.taskCreatedBlock, responseWindowBlock)
);

So the two on-chain time constraints are:

  1. Create: block.number - task.taskCreatedBlock <= taskCreationBufferWindow (10 blocks = 120s)
  2. Respond: block.number < task.taskCreatedBlock + taskResponseWindowBlock (30 blocks = 360s)

The respond deadline is relative to taskCreatedBlock. If we refresh taskCreatedBlock at flush time for creates, the respond window
shifts forward too β€” buying more time for both.

Now let me understand the critical constraint: BLS aggregation uses taskCreatedBlock for operator set snapshot. If we change it at
flush time, does BLS aggregation still work?

Read 1 file (ctrl+o to expand)

⏺ Now I have the full picture. Here's the critical constraint:

In single-chain mode: aggregator_block == task_created_block_u64 (line 305). BLS signatures are computed against the operator set at
this block. The on-chain respondToTask verifies BLS signatures at task.taskCreatedBlock. These must match.

This means we cannot just change taskCreatedBlock at flush time independently β€” we'd break the BLS signature verification because the
aggregator already used the original block for operator set snapshot.

But wait β€” the create path and respond path are independent batch pipelines. The create batch only needs taskCreatedBlock for the
create validation (block.number - taskCreatedBlock <= buffer). The respond path uses taskCreatedBlock from the on-chain task for BLS
verification. So if we refresh the block at create-flush time, the on-chain task stores the fresh block, and the respond path reads
it from on-chain.

The problem is that BLS aggregation happens before the create lands on-chain β€” using the original task_created_block_u64. But
respondToTask calls checkSignatures with task.taskCreatedBlock from the on-chain task... which would now be the refreshed block, not
the one BLS was aggregated at.

Let me verify this by reading the respond flow.

Searched for 1 pattern, read 1 file (ctrl+o to expand)

⏺ Now let me check verifyTaskResponse in the SourceTaskResponseHandler β€” that's where BLS checkSignatures is called with
task.taskCreatedBlock.

Searched for 4 patterns (ctrl+o to expand)

⏺ Now I have the complete picture. Line 154 is the key:

checkSignatures(message, task.quorumNumbers, uint32(task.taskCreatedBlock), nonSignerStakesAndSignature)

The on-chain respondToTask calls checkSignatures with task.taskCreatedBlock β€” the value stored when createNewTask was called. BLS
aggregation on the gateway also uses task_created_block_u64 as the aggregation block (line 305 of sync.rs in single-chain mode).

These must match. So we can't just change taskCreatedBlock at flush time β€” the BLS signatures would have been computed against a
different operator set snapshot.

Here's the revised approach:

β˜… Insight ─────────────────────────────────────
The constraint chain:

  1. sync.rs sets task_created_block = current_block - offset at RPC time
  2. BLS initialize_task + aggregate_responses use this same block for operator set snapshot
  3. SendTaskParams.task_created_block carries this to the batch queue
  4. On-chain createTask validates block.number - taskCreatedBlock <= bufferWindow
  5. On-chain respondToTask β†’ checkSignatures(msg, quorums, task.taskCreatedBlock, sigs) verifies BLS against same block

So taskCreatedBlock is locked at RPC time because BLS aggregation binds to it. We can't change it later.

The real fix must happen upstream: reduce the time between setting taskCreatedBlock and the batch tx landing on-chain. Options:

  • Smaller batch flush intervals (less queue wait)
  • Larger taskCreationBufferWindow on-chain (more slack)
  • Or both
    ─────────────────────────────────────────────────

You're right that this is fundamentally a timing problem, but I was wrong about Option A β€” we cannot refresh taskCreatedBlock at
flush time because BLS signatures are already computed against the original block. The on-chain checkSignatures call in respondToTask
uses task.taskCreatedBlock, which must match what the aggregator used.

The fix has to be a combination of:

  1. Reduce batch flush interval β€” the default 2000ms accumulation window is the biggest controllable delay. Under load, items can sit
    in the queue for up to 2s before flushing. Cutting this to 500ms or 1000ms directly reduces staleness.
  2. Increase taskCreationBufferWindow beyond 10 β€” given the pipeline: 2s queue + 12-24s mining + retries with backoff + retry mining,
    10 blocks (120s) is tight. Bumping to 20 blocks (240s) gives comfortable headroom.

For the respond path, task_response_window_block = 30 (360s from taskCreatedBlock). The respond pipeline adds: create confirmation
wait (up to 60s timeout) + respond batch flush (2s) + respond tx mining (12-24s) + possible retries. Worst case ~100s on top of the
create latency. Since the respond window starts from taskCreatedBlock (which is already stale by the time create lands), the
effective respond window is 360s - create_pipeline_latency. If create takes 100s of the 120s buffer, only 260s remain for respond β€”
still plenty with 30 blocks.

Want me to:

  1. Reduce default_batch_interval_ms from 2000 to 500?
  2. Increase taskCreationBufferWindow to 20 on-chain (cast send commands)?
  3. Both?

✻ Cogitated for 2m 19s

❯ let's do 500ms for batch flush interval, and keep creation buffer window to about 120 seconds for all chains (so on L1, 10 blocks,
on other chains like base, let's do 30). Also for is_transient_rpc_error, "// Contract reverts are never transient β€” the same calldata
against the same
// state will always produce the same revert.
if lower.contains("execution reverted") {
return false;
}" what is better way to detect contract reverts?

⏺ Good questions. Let me look at the error type structure first to design better revert detection.

Searched for 1 pattern, read 1 file (ctrl+o to expand)

⏺ Now let me check how alloy errors actually look when they come through β€” specifically the TransportError structure.

Searched for 4 patterns (ctrl+o to expand)

⏺ Now I have a clear picture. Let me check how alloy ContractError flows β€” whether we can do typed matching instead of string matching.

Read 1 file (ctrl+o to expand)

⏺ Good. Now I understand the full error flow. The issue is that by the time errors reach is_transient_rpc_error, they're already
stringified β€” the typed ChainIoError variants are lost. The gateway wraps them in GatewayError::Internal(format!("...")).

β˜… Insight ─────────────────────────────────────
Why string matching is the only option here:
The batch submitter calls task_submitter.batch_create_tasks() which returns Result where errors are GatewayError.
But GatewayError::Internal(String) wraps the full error chain as a formatted string β€” the typed ChainIoError::ContractError is
already lost. We can't do typed pattern matching at the batch submitter level.

Better string detection approach:
Instead of just checking for "execution reverted", we should use a whitelist approach β€” only classify errors as transient if they
match known transient patterns AND don't match any revert indicators. The current code already does this in the right order (check
revert first, then transient patterns), but the revert detection needs to be more comprehensive. Alloy formats reverts in multiple
ways:

  • "execution reverted" β€” standard EVM revert
  • "revert" with hex selector data
  • JSON-RPC error code 3 (EIP-1474 execution error)
  • "error code: 3" in ErrorResp formatting
    ─────────────────────────────────────────────────

OK, here's the plan. Let me implement all three changes:

  1. Reduce default_batch_interval_ms from 2000 to 500
  2. Improve is_transient_rpc_error with more robust revert detection
  3. Update config JSONs with chain-specific task_creation_buffer_window (10 for L1, 30 for Base)

⏺ Bash(git checkout main && git pull origin main 2>&1 | tail -5)
⎿  Switched to branch 'main'
Your branch is up to date with 'origin/main'.
From github.com:newt-foundation/newton-prover-avs
… +2 lines (ctrl+o to expand)

⏺ Bash(git checkout -b fix/batch-flush-interval-and-revert-detection)
⎿  Switched to a new branch 'fix/batch-flush-interval-and-revert-detection'

⏺ Now let me make the changes. First, reduce the batch flush interval.

Read 1 file (ctrl+o to expand)

⏺ Update(crates/gateway/src/rpc/api/batch_submitter.rs)
⎿  Added 1 line, removed 1 line
86 30
87 }
88 fn default_batch_interval_ms() -> u64 {
89 - 2000
89 + 500
90 }
91 fn default_respond_wait_timeout_ms() -> u64 {
92 60000

⏺ Now improve is_transient_rpc_error in the errors module.

Read 1 file (ctrl+o to expand)

⏺ Good β€” I see PR #500's execution reverted guard was already merged. Let me check what's on main now.

⏺ Bash(git log --oneline main -5)
⎿  549b74d1 Merge pull request #489 from newt-foundation/feat/kms-to-hpke-migration
68ca2517 Merge pull request #496 from newt-foundation/fix/cancel-zombie-aggregator-slots
2be8b8a3 fix: rename Blocklist→Blacklist enum variant and hash for consistency
… +2 lines (ctrl+o to expand)

⏺ PR #500 hasn't been merged yet. So the current is_transient_rpc_error on main doesn't have the execution reverted guard. Let me write
the improved version that supersedes PR #500's fix.

⏺ Update(crates/chainio/src/avs/errors.rs)
⎿  Added 23 lines
235 ///
236 /// This centralizes the pattern matching previously duplicated across
237 /// batch_submitter.rs, tx_worker.rs, and sync.rs.
238 +///
239 +/// Detection strategy: check for contract revert indicators first (non-transient),
240 +/// then check for known transient patterns. This ordering prevents false positives
241 +/// where a revert error message happens to contain a transient keyword (e.g.,
242 +/// "Internal server error: ... execution reverted" matching "server error").
243 pub fn is_transient_rpc_error(error_str: &str) -> bool {
244 let lower = error_str.to_lowercase();
245 +
246 + // Contract reverts are never transient β€” the same calldata against the same
247 + // on-chain state will always produce the same revert. Alloy formats these in
248 + // multiple ways depending on the error path:
249 + // - "execution reverted" β€” standard EVM revert from eth_call or eth_sendTransaction
250 + // - "revert" with hex selector β€” decoded revert from alloy ContractError
251 + // - "error code: 3" β€” EIP-1474 JSON-RPC execution error code
252 + // - "ContractError" β€” alloy's typed contract error variant
253 + let revert_patterns = [
254 + "execution reverted",
255 + "error code: 3,",
256 + "contracterror",
257 + "revert 0x",
258 + ];
259 + if revert_patterns.iter().any(|p| lower.contains(p)) {
260 + return false;
261 + }
262 +
263 let transient_patterns = [
264 "nonce",
265 "timeout",

⏺ Now update the tests for is_transient_rpc_error.

Read 1 file (ctrl+o to expand)

⏺ Update(crates/chainio/src/avs/errors.rs)
⎿  Added 16 lines, removed 1 line
430 }
431
432 #[test]
433 - fn test_is_transient_rpc_error() {
433 + fn test_is_transient_rpc_error_transient() {
434 assert!(is_transient_rpc_error("nonce too low"));
435 assert!(is_transient_rpc_error("connection refused"));
436 assert!(is_transient_rpc_error("HTTP 429 Too Many Requests"));
437 assert!(is_transient_rpc_error("HTTP 503 Service Unavailable"));
438 assert!(is_transient_rpc_error("request timeout after 30s"));
439 assert!(is_transient_rpc_error("rate limit exceeded"));
440 + }
441 +
442 + #[test]
443 + fn test_is_transient_rpc_error_non_transient() {
444 assert!(!is_transient_rpc_error("TaskAlreadyExists(0x1234)"));
445 assert!(!is_transient_rpc_error("revert 0x2e98c533"));
446 + // Contract reverts containing transient keywords must NOT be classified as transient
447 + assert!(!is_transient_rpc_error(
448 + "Internal server error: Failed to batch create tasks:
449 + Alloy Contract Error TransportError(ErrorResp(execution reverted))"
450 + ));
451 + assert!(!is_transient_rpc_error("execution reverted: TaskCreatedBlockTooOld"));
452 + // EIP-1474 error code 3 = execution error
453 + assert!(!is_transient_rpc_error(
454 + "ErrorResp { code: 3, message: "execution reverted", data: Some("0xeb790a32") }"
455 + ));
456 + assert!(!is_transient_rpc_error("Alloy Contract Error ContractError(...)"));
457 }
458 }

⏺ Now update the config files. The user wants ~120s buffer for all chains: 10 blocks on L1 (12s/block), 30 blocks on Base (~2s/block at
84532, ~2s/block at 8453). Let me check current values and update.

The stagef config already has task_creation_buffer_window: 10 for all chains. I need to set Base chains to 30.

⏺ Update(contracts/newton_prover_config.stagef.json)
⎿  Added 1 line, removed 1 line
133 "task_response_window_block": 30,
134 "task_challenge_window_block": 30,
135 "epoch_blocks": 7200,
136 - "task_creation_buffer_window": 10,
136 + "task_creation_buffer_window": 60,
137 "is_challenge_enabled": true,
138 "sp1_verifier": "0x0000000000000000000000000000000000000000",
139 "sp1_program_vkey": "0x00b622d22f774a19aaef4ee67535d9c0eb2368c987e2a34fa6b52bdb9502375b"

⏺ Wait β€” Base Sepolia has ~2s block time, so 120s / 2s = 60 blocks, not 30. Let me also update the prod config.

⏺ Update(contracts/newton_prover_config.prod.json)
⎿  Added 1 line, removed 1 line
105 "task_response_window_block": 30,
106 "task_challenge_window_block": 30,
107 "epoch_blocks": 7200,
108 - "task_creation_buffer_window": 10,
108 + "task_creation_buffer_window": 60,
109 "is_challenge_enabled": true,
110 "sp1_verifier": "0x0000000000000000000000000000000000000000",
111 "sp1_program_vkey": "0x00b622d22f774a19aaef4ee67535d9c0eb2368c987e2a34fa6b52bdb9502375b"

⏺ Now update the description for TASK_CREATED_BLOCK_TOO_OLD to reflect the new default.

⏺ Update(crates/chainio/src/avs/errors.rs)
⎿  Added 1 line, removed 1 line
43 pub const TASK_CREATED_BLOCK_TOO_OLD: ExternalError = ExternalError {
44 selector: [0xeb, 0x79, 0x0a, 0x32],
45 name: "TaskCreatedBlockTooOld",
46 - description: "taskCreatedBlock is older than currentBlock - bufferWindow (default: 2 blocks)",
46 + description: "taskCreatedBlock is older than currentBlock - bufferWindow (default: 10 blocks on L1)",
47 };
48
49 /// TaskResponseWindowPassed()

⏺ Now let me run the tests and clippy to verify.

Bash(cargo test -p newton-prover-chainio -- errors 2>&1 | tail -20)
⎿ Β Running… (27s)

@denniswon
Copy link
Copy Markdown
Author

PR created: https://github.com/newt-foundation/newton-prover-avs/pull/501

This supersedes PR #500 (the execution reverted guard alone). Summary of changes:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Change β”‚ File β”‚ Impact β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Batch flush interval 2000ms β†’ 500ms β”‚ batch_submitter.rs β”‚ Items spend 75% less time in queue before on-chain β”‚
β”‚ β”‚ β”‚ submission β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Revert-first detection in β”‚ errors.rs β”‚ Covers 4 alloy revert formats, prevents retry loops on β”‚
β”‚ is_transient_rpc_error β”‚ β”‚ contract reverts β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Base Sepolia buffer 10 β†’ 60 blocks β”‚ config.{stagef,prod}.json β”‚ ~120s buffer matching L1 budget (was 20s with 2s β”‚
β”‚ β”‚ β”‚ blocks) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

For the live Base Sepolia contracts, you'll still need the cast send commands to update the on-chain value (config only applies on
next deploy/upgrade):

Base Sepolia stagef

cast send 0xda128a87412be6f2161bb87200f0f40e89744bd9
"updateTaskCreationBufferWindow(uint32)" 60
--rpc-url $BASE_SEPOLIA_RPC_URL --private-key $OWNER_PRIVATE_KEY

Base Sepolia prod

cast send 0xa5e104ad7f09df5d9036d1e9ad60fada11140071
"updateTaskCreationBufferWindow(uint32)" 60
--rpc-url $BASE_SEPOLIA_RPC_URL --private-key $OWNER_PRIVATE_KEY

Sepolia L1 contracts already have 10 on-chain β€” no action needed there.

✻ Worked for 11m 48s

⏺ Background command "Run chainio error tests" completed (exit code 0)

⏺ Background task completed (the earlier test run). Already confirmed passing β€” no action needed.

@denniswon
Copy link
Copy Markdown
Author

❯ From this PR https://github.com/newt-foundation/newton-prover-avs/pull/476, we introduced batch task manager. remind me what this
comment by a reviewer was about: https://github.com/newt-foundation/newton-prover-avs/pull/476#discussion_r2998223796, and why our
batch task settlement of transactions are all or nothing mechanism. Also, analyze why we must handle create task transactions and task
response transactions as a separate batch tx, instead of createTask<>respondToTask per task be handled as an atomic unit (maybe we
can have a function on TaskManager that combines createTask and taskResponse to one?). Lastly, analyze if we are actually getting
"anything" out of retries mechanism for batch transactions (should we remove it?).

⏺ Let me pull up the PR comment and the relevant contract/Rust code to answer all four questions.

⏺ Bash(gh api repos/newt-foundation/newton-prover-avs/pulls/476/comments 2>/dev/null | jq '.[] | select(.id == 2998223796) | {user:
.user.login, body: .body, path: .pa…)
⎿  {
"user": "albertnbrown",
"body": "By emitting an event here and not reverting, we are paying for the reverting execution. We need a try/catch to mak
… +8 lines (ctrl+o to expand)

⏺ Bash(gh api repos/newt-foundation/newton-prover-avs/pulls/476/comments --paginate 2>/dev/null | jq '.[] | {id: .id, user:
.user.login, body: .body[0:500], path: .pat…)
⎿  {
"id": 2998223796,
"user": "albertnbrown",
… +47 lines (ctrl+o to expand)

⏺ Now let me read the contract and the batch submitter to build the full picture.

Read 2 files (ctrl+o to expand)
⎿  Loaded crates/CLAUDE.md
⎿  Loaded crates/gateway/CLAUDE.md

⏺ Let me also check the TaskManager to understand the createβ†’respond lifecycle constraints.

Searched for 2 patterns, read 1 file (ctrl+o to expand)
⎿  Loaded contracts/CLAUDE.md
⎿  Loaded contracts/src/CLAUDE.md
⎿  Loaded .claude/rules/solidity.md

⏺ Good, I have all the context. Let me now read the key part of sync.rs to understand the createβ†’respond lifecycle timing.

Searched for 1 pattern (ctrl+o to expand)

⏺ Now I have the full picture. Here's the analysis across all four questions.


  1. Albert's Review Comment (PR #476)

Albert (@albertnbrown) was reviewing BatchTaskManager.sol at the point where the original implementation emitted an event for each
failed item and continued processing. His concern:

β–Ž By emitting an event here and not reverting, we are paying for the reverting execution [gas cost of the failed createNewTask
try/catch] plus the event emission gas β€” and the caller still has to parse events offchain to figure out what failed.

His suggestion was: collect failures into an array, then revert the entire transaction with structured failure data
(BatchPartialFailure(FailedItem[])). The rationale:

  • The caller (Rust gateway) can simulate the batch via eth_call before sending to detect poison items
  • If a single item reverts, the gateway strips it and resubmits the healthy remainder
  • Reverting is cheaper than committing events for failures (no storage writes, no log bloom)
  • The revert data carries FailedItem(index, taskId, reason) β€” richer than events for debugging

This became the "all-or-nothing" design we have today.


  1. Why All-or-Nothing (Revert Entire Batch on Partial Failure)

The contract uses try/catch per item but reverts the entire tx if any item fails. This is intentional for three reasons:

β˜… Insight ─────────────────────────────────────
Nonce simplicity: One tx = one nonce. If the contract committed partial results (some items on-chain, others not), the Rust side
would need to track per-item on-chain status to know what landed. Under 10 gateway replicas with receipt timeouts, this is unreliable
β€” you can't tell from a timed-out receipt which items committed.

Structured error data: A revert payload can carry FailedItem[] with the exact revert bytes per item. If the contract committed
partial results, there's no clean Solidity mechanism to return "these 3 failed" in a successful receipt. Events could work but are
harder/more expensive to parse.

Idempotent convergence: The strip-and-retry loop always converges. TaskAlreadyExists means the item IS on-chain (from a prior
attempt). Strip it, retry the rest. Each retry has fewer items. The gas cost of re-executing healthy items is modest compared to the
complexity of partial-commit bookkeeping.
─────────────────────────────────────────────────

The alternative β€” partial commit (commit successful items, return failure list) β€” would require:

  • Per-item storage writes to track what committed within the batch
  • A way to return partial success data (Solidity has no "return data on partial success")
  • The Rust side to handle partial receipts, which is fragile under receipt timeouts

  1. Why Create and Respond Can't Be Atomic (No createAndRespond Function)

This is the most important architectural question. There are three fundamental reasons why createTask and respondToTask can't be
combined into a single atomic createTaskAndRespond():

A. The BLS Aggregation Happens Between Create and Respond

The lifecycle is:

createTask (on-chain) β†’ operators see NewTaskCreated event β†’
operators evaluate policy β†’ operators BLS-sign β†’
aggregator collects signatures β†’ checks quorum β†’
respondToTask (on-chain, with aggregated BLS signature)

respondToTask requires signatureData β€” the aggregated BLS signature from N operators weighted by stake. This doesn't exist at create
time. There's a multi-second window (typically 2-10s) between create and respond where operators independently evaluate and sign.

A combined function would need the BLS signature at the same time as task creation, which is impossible β€” you can't have operators
sign something that doesn't exist on-chain yet.

B. taskCreatedBlock Is Used for BLS Verification

respondToTask calls verifyTaskResponse which uses checkSignatures(referenceBlockNumber=taskCreatedBlock). This verifies the BLS
signature against the operator set snapshot at the block when the task was created. The task must exist on-chain (with its
taskCreatedBlock assigned) before operators can produce a valid BLS signature referencing that block.

If you tried to combine them in one tx, taskCreatedBlock would be block.number of the combined tx. But operators would have needed to
sign against this exact block number BEFORE the tx was mined β€” a circular dependency.

C. Challenge Window Semantics

respondToTask checks block.number <= task.taskCreatedBlock + taskResponseWindowBlock. The response must arrive within a time window
after creation. If both happen atomically, the window check is trivially satisfied (same block), which defeats the purpose of giving
operators time to evaluate. More importantly, challengers monitor NewTaskCreated events to independently re-evaluate β€” combining
would eliminate the challenge observation window.

β˜… Insight ─────────────────────────────────────
The create/respond split isn't a design choice we can reconsider β€” it's a fundamental property of BLS-aggregated AVS systems. The
signature aggregation phase between create and respond IS the decentralized computation. Collapsing it would mean either (a) a single
party produces the "aggregated" signature (centralized, defeats the point), or (b) you'd need a precommit scheme where operators
sign a future task, which introduces new trust assumptions and complexity.

The batch optimization lives at the edges: batch N creates into one tx, batch M responds into one tx. But each individual
create→respond pair inherently spans multiple blocks.
─────────────────────────────────────────────────


  1. Are Retries Actually Helping? Should We Remove Them?

Let me look at what the load test data actually showed about retries.

Searched for 4 patterns (ctrl+o to expand)

⏺ Now I have the full timeline. Here's the complete analysis:


  1. Are Retries Actually Helping? (Analysis from Load Test Data)

The load test timeline tells a clear story:

16:44:19 β€” Batch 1: submitted (1 task) β†’ succeeded at 16:44:27 (8s)
16:44:49 β€” Batch 2: submitted (1 task) β†’ receipt timeout at 16:45:50 (60s)
16:45:50 β€” Retry 1: same items β†’ receipt timeout at 16:46:50 (60s)
16:46:50 β€” Retry 2: simulate individually β†’ ALL items = poison (TaskAlreadyExists) β†’ dropped
16:47:51 β€” Batch 3: new items accumulated β†’ receipt timeout at 16:48:52 (60s)
16:48:52 β€” Retry: simulate individually β†’ ALL items = poison β†’ dropped
16:49:52 β€” Batch 4: receipt timeout β†’ retry at 16:50:54 β†’ simulate β†’ ALL poison β†’ dropped

The pattern: One batch succeeds cleanly. Every subsequent batch times out on receipt, retries once (another 60s timeout), then
simulates individually β€” where every item is TaskAlreadyExists because the FIRST attempt actually DID land on-chain. The simulation
treats TaskAlreadyExists as poison and drops everything.

What retries accomplished in this load test

Zero successful recoveries from retries. The retry mechanism consumed 60s per attempt (receipt timeout), only to discover that the
original tx had already committed. The simulation fallback then misclassified every item as poison.

Why retries fail for batch creates specifically

The core issue is that get_receipt() timeout does NOT mean the tx failed β€” it means we don't know. On Sepolia with variable block
times (12-36s) and RPC load, receipt polling exceeds 60s regularly. The tx is likely in the mempool or already mined. Retrying
resubmits identical calldata with a new nonce, which:

  1. If the first tx landed: every item reverts with TaskAlreadyExists
  2. If the first tx is still pending: nonce conflict (the second tx either replaces or queues behind it)

Neither outcome is recoverable via retry.

β˜… Insight ─────────────────────────────────────
Retries make sense for truly transient errors (RPC connection refused, 503, nonce-too-low from a stale cache). These are instant
failures β€” no ambiguity about whether the tx landed.

Retries are counterproductive for receipt timeouts. A receipt timeout means the tx was successfully submitted to the RPC node but we
couldn't confirm inclusion. The tx is SchrΓΆdinger's transaction β€” it may or may not be on-chain. Retrying with fresh calldata creates
a race condition that always ends in TaskAlreadyExists or nonce contention.

The fix in PR #505 (treating TaskAlreadyExists as idempotent success) mitigates the damage β€” retries no longer drop items. But the
60s timeout per retry attempt is still pure waste.
─────────────────────────────────────────────────

Should we remove retries entirely?

No β€” but we should differentiate retry strategy by error type:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Error Type β”‚ Current Behavior β”‚ Recommended β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ RPC connection error β”‚ Retry with backoff β”‚ Keep β€” these recover β”‚
β”‚ (instant fail) β”‚ β”‚ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Nonce too low (instant β”‚ Retry with backoff β”‚ Keep β€” nonce refresh fixes it β”‚
β”‚ fail) β”‚ β”‚ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Receipt timeout (60s β”‚ Retry with backoff β”‚ Don't retry the batch. Instead, check on-chain status of each item, β”‚
β”‚ ambiguous) β”‚ (another 60s) β”‚ then resubmit only unconfirmed items β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Contract revert (instant β”‚ Simulate individually β”‚ Keep β€” strips poison β”‚
β”‚ fail) β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The ideal receipt-timeout handler would be:

  1. Receipt times out β†’ don't retry the whole batch
  2. For each item, query allTaskHashes[taskId] on-chain (a cheap eth_call)
  3. Items already on-chain β†’ insert into confirmed_creates, trigger respond
  4. Items NOT on-chain β†’ resubmit in a new batch

This avoids the 60s retry timeout entirely and handles the ambiguity directly. But this is a bigger refactor β€” for now, PR #505's
idempotent handling means retries at least converge correctly (items get classified as already_exists instead of poison), even if
they waste 60s doing it.

Quantifying the waste

In the load test: 4 receipt timeouts Γ— 60s each = 240 seconds of pure retry overhead that produced zero new on-chain results. With PR
#505's fix, those 240 seconds would at least end with items correctly classified as idempotent success instead of dropped. But the
latency cost to the user (who's waiting on the sync RPC call with a 30s timeout) means those tasks already timed out from the
caller's perspective.

Summary

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Question β”‚ Answer β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Are retries helping today? β”‚ No β€” 0 successful recoveries in load test β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Should we remove retries? β”‚ No β€” keep for instant-fail transient errors (connection, nonce) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Should we retry receipt timeouts? β”‚ No β€” should query on-chain status instead of blind retry β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Does PR #505 help? β”‚ Yes β€” retries now converge correctly instead of dropping items β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Longer term fix? β”‚ Receipt timeout β†’ on-chain status check β†’ resubmit only missing items β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

✻ Sautéed for 2m 7s

❯ for createAndResponse --> im not saying we should change the protocol, but we can add a "helper" function that just makes it 1
transaction (with same args/params) instead of 2 (create vs respond). that won't help?

⏺ You're right β€” I was thinking about it wrong. Let me reconsider.

By the time the gateway queues items to the batch submitter, it already has everything: the task params, the BLS aggregated
signature, and the response. The offchain BLS aggregation is complete. The only reason create and respond are separate on-chain calls
is because respondToTask reads state written by createNewTask:

// respondToTask checks this:
require(TaskLib.taskHash(task) == allTaskHashes[taskId], ...)

But within a single EVM transaction, state written earlier in the same execution IS visible to later calls. So a helper like:

function createAndRespondToTask(
Task calldata task,
TaskResponse calldata taskResponse,
bytes calldata signatureData
) external onlyAuthorized {
taskManager.createNewTask(task); // writes allTaskHashes[taskId]
taskManager.respondToTask(task, taskResponse, signatureData); // reads it
}

...would work. Both operations execute in the same block.number, and:

  • taskCreatedBlock < block.number β€” satisfied because gateway sets taskCreatedBlock = current_block - offset (offset >= 1)
  • taskCreatedBlock + taskCreationBufferWindow >= block.number β€” satisfied (same constraint as today)
  • checkSignatures(referenceBlockNumber = taskCreatedBlock) β€” the BLS signature was already computed offchain against this block
  • block.number <= taskCreatedBlock + taskResponseWindowBlock β€” trivially satisfied (same block)

β˜… Insight ─────────────────────────────────────
This is actually a significant architectural simplification. The entire confirmed_creates DashMap, the separate create/respond
accumulators, the ordering invariant, and the TaskAlreadyExists cascade β€” all of that complexity exists because create and respond
are separate transactions. Combining them eliminates the root cause of most failures we saw in load testing.

What it fixes:

  • No more ordering invariant: No need for confirmed_creates DashMap to gate respond flush
  • Half the nonce usage: 1 tx per task instead of 2 β†’ halves nonce contention
  • No receipt-timeout cascade: A timeout on the combined tx means we check on-chain status once, not twice
  • No TaskAlreadyExists β†’ respond blocked: The respond can't be orphaned from its create
  • ~30% gas savings: One base tx cost (21K) instead of two; one batch call overhead instead of two
    ─────────────────────────────────────────────────

The batched version would be:

function batchCreateAndRespondToTasks(
Task[] calldata tasks,
TaskResponse[] calldata responses,
bytes[] calldata signatureDataArray
) external onlyAuthorized {
require(tasks.length > 0, EmptyBatch());
require(tasks.length == responses.length && responses.length == signatureDataArray.length, ArrayLengthMismatch());

  FailedItem[] memory failures = new FailedItem[](tasks.length);
  uint256 failCount;

  for (uint256 i; i < tasks.length;) {
      try this._createAndRespond(tasks[i], responses[i], signatureDataArray[i]) {}
      catch (bytes memory reason) {
          failures[failCount] = FailedItem(i, responses[i].taskId, reason);
          unchecked { ++failCount; }
      }
      unchecked { ++i; }
  }

  if (failCount > 0) {
      FailedItem[] memory trimmed = new FailedItem[](failCount);
      for (uint256 j; j < failCount;) { trimmed[j] = failures[j]; unchecked { ++j; } }
      revert BatchPartialFailure(trimmed);
  }

}

function _createAndRespond(
Task calldata task,
TaskResponse calldata taskResponse,
bytes calldata signatureData
) external {
require(msg.sender == address(this), Unauthorized()); // only callable via try/catch from batch
taskManager.createNewTask(task);
taskManager.respondToTask(task, taskResponse, signatureData);
}

On the Rust side, the BatchTransactionSubmitter simplifies dramatically β€” a single accumulator instead of two, no confirmed_creates
DashMap, no respond-waits-for-create ordering logic.

The only thing to watch: if createNewTask succeeds but respondToTask reverts within the _createAndRespond call, the entire item
reverts (including the create). That's actually better than today β€” you never get a zombie created-but-not-responded task. The item
either fully succeeds or fully fails.

Want me to implement this?

@denniswon
Copy link
Copy Markdown
Author

 Combined Batch Create+Respond via BatchTaskManager

 Context

 Load testing on Sepolia (10 gateway replicas, 20 concurrent tasks) showed a cascading failure pattern: receipt timeouts on batch
 creates trigger retries, items already on-chain cause TaskAlreadyExists reverts, the simulation fallback misclassifies them as
 poison, and the respond accumulator is blocked because confirmed_creates never gets populated. Root cause: create and respond are
 separate transactions with a confirmed_creates DashMap ordering bridge between them.

 By the time the gateway queues items to the batch submitter, ALL data is ready β€” task params, BLS aggregated signature, and
 response. The offchain BLS aggregation is already complete. There is no protocol-level reason to submit create and respond as
 separate on-chain transactions. Within a single EVM transaction, state written by createNewTask (allTaskHashes[taskId]) is
 immediately visible to respondToTask.

 What This Eliminates

 - The confirmed_creates DashMap and its ordering invariant
 - The separate respond_accumulator_loop and respond_tx channel
 - The TaskAlreadyExists β†’ respond-blocked cascade (the primary load test failure)
 - Half the nonce usage (1 tx per task instead of 2)
 - The BatchRespondItem struct and submit_aggregated_response batch path
 - Receipt timeout retry cascade between create and respond

 Contract Constraints (Why This Works)

 - taskCreatedBlock < block.number β€” satisfied because gateway sets taskCreatedBlock = current_block - offset (offset >= 1)
 - taskCreatedBlock + taskCreationBufferWindow >= block.number β€” same constraint as today
 - checkSignatures(referenceBlockNumber = taskCreatedBlock) β€” BLS signature already computed offchain against this block
 - block.number <= taskCreatedBlock + taskResponseWindowBlock β€” trivially satisfied (same block)
 - allTaskHashes[taskId] written by createNewTask β€” visible within same EVM execution to respondToTask

 If createNewTask succeeds but respondToTask reverts within the same item, the entire item reverts (including the create). This is
 better than today β€” no zombie created-but-not-responded tasks.

 Files to Modify

 Solidity (contract + interface + bindings)

 1. contracts/src/BatchTaskManager.sol β€” Add batchCreateAndRespondToTasks + internal _createAndRespond
 2. contracts/src/interfaces/IBatchTaskManager.sol β€” Add interface method
 3. Run make generate-bindings to regenerate Rust ABI bindings

 Rust (gateway)

 4. crates/chainio/src/avs/writer.rs β€” Add batch_create_and_respond_to_tasks method on AvsWriter
 5. crates/gateway/src/task/submitter.rs β€” Add batch_create_and_respond_to_tasks to TaskSubmitter trait
 6. crates/gateway/src/rpc/api/batch_submitter.rs β€” Replace dual accumulators with single combined accumulator
 7. crates/gateway/src/rpc/api/sync.rs β€” Replace submit_create_task + submit_aggregated_response with single submit_task call

 Mock/test impls (add stub method)

 8. 8 TaskSubmitter impls β€” Add batch_create_and_respond_to_tasks stub
 9. Existing batch_submitter tests β€” Update for combined item type

 Implementation Steps

 Step 1: Add batchCreateAndRespondToTasks to BatchTaskManager contract

 File: contracts/src/BatchTaskManager.sol

 /// @inheritdoc IBatchTaskManager
 function batchCreateAndRespondToTasks(
     INewtonProverTaskManager.Task[] calldata tasks,
     INewtonProverTaskManager.TaskResponse[] calldata responses,
     bytes[] calldata signatureDataArray
 ) external onlyAuthorized {
     require(tasks.length > 0, EmptyBatch());
     require(
         tasks.length == responses.length && responses.length == signatureDataArray.length,
         ArrayLengthMismatch()
     );

     FailedItem[] memory failures = new FailedItem[](tasks.length);
     uint256 failCount;

     for (uint256 i; i < tasks.length;) {
         try this._createAndRespond(tasks[i], responses[i], signatureDataArray[i]) {}
         catch (bytes memory reason) {
             failures[failCount] = FailedItem(i, responses[i].taskId, reason);
             unchecked { ++failCount; }
         }
         unchecked { ++i; }
     }

     if (failCount > 0) {
         FailedItem[] memory trimmed = new FailedItem[](failCount);
         for (uint256 j; j < failCount;) {
             trimmed[j] = failures[j];
             unchecked { ++j; }
         }
         revert BatchPartialFailure(trimmed);
     }
 }

 /// @dev Internal helper called via this._createAndRespond() so try/catch works per item.
 ///      Must be external for try/catch but restricted to self-calls only.
 function _createAndRespond(
     INewtonProverTaskManager.Task calldata task,
     INewtonProverTaskManager.TaskResponse calldata taskResponse,
     bytes calldata signatureData
 ) external {
     require(msg.sender == address(this), Unauthorized());
     taskManager.createNewTask(task);
     taskManager.respondToTask(task, taskResponse, signatureData);
 }

 Update contracts/src/interfaces/IBatchTaskManager.sol to add:
 function batchCreateAndRespondToTasks(
     INewtonProverTaskManager.Task[] calldata tasks,
     INewtonProverTaskManager.TaskResponse[] calldata responses,
     bytes[] calldata signatureDataArray
 ) external;

 Step 2: Regenerate Rust bindings

 make generate-bindings

 This produces updated crates/core/src/generated/batch_task_manager.rs with the new function.

 Step 3: Add batch_create_and_respond_to_tasks to AvsWriter

 File: crates/chainio/src/avs/writer.rs

 Follow the same pattern as existing batch_create_tasks and batch_respond_to_tasks:
 - ABI-convert task/response types via encode/decode round-trip
 - Call contract.batchCreateAndRespondToTasks(tasks, responses, sigs).send().await
 - Wrap get_receipt() in tokio::time::timeout(BATCH_RECEIPT_TIMEOUT)
 - Check receipt.status()
 - On revert: attempt parse_batch_partial_failure()

 Step 4: Add to TaskSubmitter trait

 File: crates/gateway/src/task/submitter.rs

 async fn batch_create_and_respond_to_tasks(
     &self,
     batch_contract_addr: Address,
     tasks: Vec<Task>,
     responses: Vec<TaskResponse>,
     signature_data_array: Vec<Bytes>,
 ) -> Result<TransactionReceipt>;

 Keep existing batch_create_tasks and batch_respond_to_tasks β€” they remain useful for edge cases (direct on-chain tasks from
 watcher path only need respond, legacy fallback).

 Step 5: Rewrite BatchTransactionSubmitter with single accumulator

 File: crates/gateway/src/rpc/api/batch_submitter.rs

 New combined item type:
 struct BatchCombinedItem {
     task_id: TaskId,
     params: SendTaskParams,                           // for building Task
     task_request: TaskRequest,                        // for building respond params
     service_response: BlsAggregationServiceResponse,  // aggregated BLS data
     result_tx: oneshot::Sender<Result<(TransactionReceipt, Task)>>,
     queued_at: Instant,
 }

 New public API β€” single method replaces two:
 pub async fn submit_task(
     &self,
     task_id: TaskId,
     params: SendTaskParams,
     task_request: TaskRequest,
     service_response: BlsAggregationServiceResponse,
 ) -> Result<Receiver<Result<(TransactionReceipt, Task)>>>

 Single accumulator loop replaces create_accumulator_loop + respond_accumulator_loop:
 - Collects items, flushes on deadline (max age) or batch size
 - At flush time: calls aggregator.build_respond_params() for each item to get (Task, TaskResponse, Bytes) triples
 - Submits via task_submitter.batch_create_and_respond_to_tasks()
 - On success: sends Ok((receipt, task)) through each item's oneshot
 - On failure: same strip-and-retry as current flush_create_batch_with_retry

 Remove:
 - confirmed_creates: Arc<DashMap<TaskId, ()>> β€” no longer needed
 - respond_tx: mpsc::Sender<BatchRespondItem> β€” no longer needed
 - respond_accumulator_loop() β€” replaced by combined loop
 - flush_respond_batch() β€” logic merged into combined flush
 - BatchRespondItem struct
 - submit_aggregated_response() method on BatchTransactionSubmitter
 - submit_create_task() method β€” replaced by submit_task()
 - respond_notify channel between create and respond loops
 - mark_task_created() β€” no longer needed (was for watcher path pre-registration)

 Keep:
 - flush_create_batch_with_retry logic (reuse for combined flush, rename to flush_batch_with_retry)
 - identify_poison_items (adapts to combined items)
 - Panic isolation via catch_unwind
 - Deadline-based flush timing

 Step 6: Update sync.rs caller

 File: crates/gateway/src/rpc/api/sync.rs

 Current flow (two-step):
 Line 240-249: submit_create_task(task_id, params) β†’ Receiver
 Line 614:     tokio::spawn background task
 Line 626:     await task_creation_rx (wait for create receipt)
 Line 745-747: batch.submit_aggregated_response(task_id, task_request, service_response)

 New flow (single-step):
 After BLS aggregation completes:
   batch.submit_task(task_id, params, task_request, service_response) β†’ Receiver
   tokio::spawn background task:
     await result_rx (wait for combined receipt)
     On success: log, done
     On TaskAlreadyExists+TaskAlreadyResponded: idempotent success
     On error: cancel_task, cleanup

 The submit_task call moves AFTER BLS aggregation (around line 560-580), not at line 240 where submit_create_task currently lives.
 This is because service_response isn't available until aggregation completes.

 The early submit_create_task at line 240 was there so the create could be pipelined with BLS aggregation. With the combined
 approach, the create waits for BLS aggregation to complete first. The added latency is minimal: BLS aggregation takes 2-5s, and
 the batch accumulator adds ~1s max age. The overall task latency increases by at most the batch flush interval (1s), but we save
 one entire transaction and eliminate the cascading failure mode.

 Step 7: Handle watcher path (direct on-chain tasks)

 The watcher path (handler/mod.rs:watcher_event_handler) creates tasks directly on-chain (not via gateway RPC). These tasks already
  exist on-chain β€” they only need respond, not create. The current code calls mark_task_created() to pre-register in
 confirmed_creates.

 With the combined approach, the watcher path should continue using batch_respond_to_tasks (respond-only). Keep
 submit_aggregated_response specifically for the watcher path, using the existing batch_respond_to_tasks contract method. This
 means the respond accumulator loop stays but ONLY serves watcher-originated tasks.

 Alternative: if watcher volume is low, use aggregator.submit_aggregated_response() directly (non-batched) for watcher tasks. This
 avoids maintaining a second accumulator for a rare path.

 Step 8: Update mock TaskSubmitter implementations (8 impls)

 Add batch_create_and_respond_to_tasks returning Ok(dummy_receipt) or appropriate test response to all 8 impls:
 - SuccessSubmitter (batch_submitter.rs test)
 - FailingSubmitter (batch_submitter.rs test)
 - PoisonDetectingSubmitter (batch_submitter.rs test)
 - StallTaskSubmitter (tx_worker.rs test)
 - MockAvsWriter (chain_routing.rs test)
 - MockAvsWriter (common/mod.rs test)
 - MockAvsWriter (send_task.rs test)
 - DummySubmitter (secrets_encrypted.rs test)

 Step 9: Update batch_submitter tests

 Existing tests (test_batch_create_success, test_batch_create_with_poison_items, test_batch_create_panic_isolation,
 test_batch_submit_multiple_items) use submit_create_task. Update to use submit_task with the combined item type. Test scenarios:
 - Happy path: all items create+respond in one tx
 - Poison item: one item fails create (stripped, rest succeed)
 - TaskAlreadyExists: prior attempt landed (treated as idempotent success)
 - Panic isolation: panic in build_respond_params doesn't kill accumulator

 Failure Handling

 Same three-category model, now applied to combined items:

 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚     Category      β”‚                        Selector / Pattern                        β”‚                 Action                 β”‚
 β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 β”‚ Idempotent        β”‚ TaskAlreadyExists (0x2e98c533) or TaskAlreadyResponded           β”‚ Strip from batch, send success to      β”‚
 β”‚ success           β”‚ (0x68905dff)                                                     β”‚ caller                                 β”‚
 β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 β”‚ Non-recoverable   β”‚ TaskMismatch, BLS verification failure, ABI decode failure       β”‚ Strip from batch, send error to caller β”‚
 β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 β”‚ Transient         β”‚ RPC timeout, nonce conflict, connection error                    β”‚ Retry whole remaining batch with       β”‚
 β”‚                   β”‚                                                                  β”‚ backoff                                β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

 Key difference from current design: if createNewTask succeeds but respondToTask fails within the same _createAndRespond call, the
 entire item reverts including the create. This is better β€” no zombie tasks.

 Migration Path

 This is a contract change, so it requires:
 1. Deploy updated BatchTaskManager on all chains (Sepolia, Base Sepolia, mainnet when ready)
 2. The old batchCreateTasks and batchRespondToTasks methods remain β€” backward compatible
 3. Rust gateway uses the new method when available, falls back to separate calls if the contract doesn't have it

 For initial rollout, we can deploy the new contract first, then merge the Rust changes.

 Verification

 1. forge test β€” Solidity tests for new batchCreateAndRespondToTasks method
 2. make generate-bindings β€” Rust bindings regenerate cleanly
 3. make fmt && make clippy β€” no warnings
 4. cargo test -p newton-prover-gateway --lib β€” all gateway unit tests pass
 5. make gateway_e2e β€” single-chain E2E
 6. make e2e scenario=basic β€” unified E2E (production config path)

@denniswon
Copy link
Copy Markdown
Author

Load Test Analysis

Consensus Mode Segmentation

All 15 tasks used centralized mode (centralized_unified_quorum). Zero two-phase consensus or threshold tasks were in this load test.
So the failures are entirely in the batch submission pipeline, not consensus-mode-specific.

Timeline Reconstruction

Phase 1: Success (23:24:55 – 23:26:39) β€” 9 tasks across 7 batch flushes, all succeed

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Batch Flush β”‚ Items β”‚ Time to Confirm β”‚ Outcome β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:24:55 β”‚ 2 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:09 β”‚ 1 β”‚ 22s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:31 β”‚ 1 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:25:40 β”‚ 1 β”‚ 13s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:09 β”‚ 2 β”‚ 8s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:25 β”‚ 1 β”‚ 15s β”‚ all succeeded β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 23:26:39 β”‚ 1 β”‚ ... β”‚ cascade starts here β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Phase 2: Cascade Begins (23:26:39 – 23:26:54) β€” Batch 8 (1 item)

  1. 23:26:39 β€” Batch 8 flushed (1 item for task 0xaf9d...)
  2. 23:26:54 β€” TransportError(NullResp) β€” RPC returned null (likely Sepolia node hiccup)
  3. Retry β†’ gas estimation call β†’ contract reverts (task already on-chain from the first attempt that got NullResp)
  4. could not parse revert data β€” parse_batch_partial_failure returns None (the known TODO)
  5. Simulation detects idempotent success β†’ 2 items stripped as "already on-chain"
  6. No healthy items remaining β†’ batch complete. These 2 tasks actually succeeded (idempotent).

Phase 3: Fatal Cascade (23:26:54 – 23:30:57) β€” Batch 9 (1 item) and Batch 10 (5 items)

Batch 9 (1 item, flushed 23:26:54):

  • 23:27:54 β€” Receipt timeout (60s) β€” tx likely dropped from mempool
  • Classified as transient β†’ retry
  • 23:28:55 β€” Receipt timeout again (60s)
  • Classified as transient β†’ retry
  • 23:28:55 β€” gas estimation reverts (item on-chain from prior attempt)
  • could not parse revert data β†’ simulation
  • Simulation classifies as poison (not idempotent) β†’ 1 "combined create+respond failed"
  • Batch 10 (5 items) flushed at 23:28:56

Batch 10 (5 items, flushed 23:28:56):

  • 23:29:56 β€” Receipt timeout (60s)
  • Classified as transient β†’ retry
  • 23:29:56 β€” gas estimation reverts
  • could not parse revert data β†’ simulation
  • 23:30:57 β€” ALL 5 items classified as poison β†’ 5 "combined create+respond failed"

Phase 4: Zombie Aggregators (23:30:51 – 23:34:06) β€” All 15 BLS aggregator slots expire

All 15 tasks had BLS aggregation slots that were never cancelled. They expire after 360s (6 min), each showing processed 0
signatures. The failed to send task expired error on 6 of them means the oneshot::Receiver was already dropped (the caller in sync.rs
moved on after batch failure).

Root Causes (Ordered by Severity)

  1. parse_batch_partial_failure always returns None (writer.rs:795)
    The as_revert_data() API exists and is already used at writer.rs:1091, but the batch parser is a TODO stub. This means every contract
    revert during retry falls through to simulation, which is slow and unreliable.

  2. Simulation misclassifies TaskAlreadyExists inside BatchPartialFailure as poison
    When identify_combined_poison_items tests a single item, the contract wraps TaskAlreadyExists inside BatchPartialFailure. The
    is_task_already_exists() check sees BatchPartialFailure, not TaskAlreadyExists, and classifies the item as poison instead of
    idempotent.

  3. Receipt timeout (60s) is classified as transient and retried with same items
    After timeout, the first tx may have already landed. Retrying the same batch guarantees a contract revert. The system should
    pre-check on-chain state before retrying after a timeout.

  4. BLS aggregator slots are never cancelled on batch failure
    The cancel_task call exists in sync.rs error paths, but the batch submitter communicates failure via the oneshot channel. If the
    sync.rs caller has already timed out or dropped the receiver, cancel_task is never called. The slots linger for 360s consuming
    memory.

  5. Second batch (5 items) flushed while prior batch is still retrying
    The accumulator loop flushes new items even though the prior batch is stuck in the 60s receipt timeout loop. These 5 items enter the
    same failing retry cascade.

Quantitative Summary

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric β”‚ Value β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Total tasks β”‚ 15 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Successfully on-chain β”‚ 9 (60%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Idempotent success (recovered) β”‚ 2 (13%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Failed (misclassified as poison) β”‚ 6 (40%) β€” all recoverable β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Zombie aggregator slots β”‚ 15 (100%) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Receipt timeouts β”‚ 4 (across 2 batch retry cycles) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Time wasted in retry loops β”‚ ~4 min β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The core issue hasn't changed from the prior session: implementing parse_batch_partial_failure would have prevented 100% of the
failures. The 6 failed tasks were all on-chain already β€” they were just misidentified as poison because the revert data couldn't be
decoded.

Want me to proceed with implementing the fixes? The priority order would be:

  1. Implement parse_batch_partial_failure (the single-function fix that unblocks the whole chain)
  2. Fix simulation to detect TaskAlreadyExists nested inside BatchPartialFailure
  3. Add pre-check on-chain state before retry after receipt timeout
  4. Ensure cancel_task is called from the batch submitter on final failure

@denniswon
Copy link
Copy Markdown
Author

❯ we need to fix all these issues. proceed. But we need much more fundamental fix to prevent Time wasted in retry loops (not
acceptable). we should leave NO TODOs at all in regards to this problem related items. Review this part of the codebase for onchain
transaction submission handling from the first principles and perform a thorough analysis on the architecture/design/code/logic before
implementing. /grill-me

⏺ Skill(grill-me)
⎿  Successfully loaded skill

⏺ Let me start by thoroughly reading the current batch submission pipeline to understand the architecture from first principles before
asking questions.

⏺ Explore(Read batch submission pipeline)
⎿  Done (56 tool uses · 143.1k tokens · 2m 16s)
(ctrl+o to expand)

⏺ Good, I have the full architecture. Now let me read the exact code for the critical paths to ground my understanding.

Read 2 files (ctrl+o to expand)
⎿  Loaded crates/CLAUDE.md
⎿  Loaded crates/chainio/src/CLAUDE.md
⎿  Loaded crates/chainio/src/avs/CLAUDE.md
⎿  Loaded crates/gateway/CLAUDE.md
⎿  Loaded crates/gateway/src/CLAUDE.md
⎿  Loaded crates/gateway/src/rpc/CLAUDE.md
⎿  Loaded crates/gateway/src/rpc/api/CLAUDE.md

⏺ Now let me read the identify_combined_poison_items function and the error type flow.

Read 2 files (ctrl+o to expand)

⏺ Now let me check how the error flows across the trait boundary β€” the TaskSubmitter wraps ChainIoError into GatewayError, which is the
key information loss point.

Read 2 files (ctrl+o to expand)
⎿  Loaded crates/gateway/src/task/CLAUDE.md

⏺ Now I have the complete picture of the architecture. Here's my first question:


Question 1: The TaskSubmitter trait erases ChainIoError into GatewayError::Internal(String). Should we change the trait's return type
to preserve structured errors?

The fundamental design flaw is here in submitter.rs:105-114:

async fn batch_create_and_respond_to_tasks(...) -> Result {
AvsWriter::batch_create_and_respond_to_tasks(...)
.await
.map_err(|e| GatewayError::Internal(format!("Failed to batch create and respond: {e}")))
}

This .map_err(format!(...)) destroys ALL structured error data. The ChainIoError::BatchPartialFailure { failures } with per-item
indices and reasons becomes the string "Failed to batch create and respond: Batch partial failure: 3 items failed". Then
extract_failed_indices in batch_submitter.rs tries to recover that data from the string β€” and obviously can't.

My recommendation: Change the TaskSubmitter trait's batch methods to return Result<TransactionReceipt, ChainIoError> directly instead
of Result<TransactionReceipt, GatewayError>. The batch submitter is the only consumer of these methods, and it needs the structured
error data. This is a one-layer change β€” batch_submitter.rs already imports ChainIoError types via newton_prover_chainio. The
GatewayError wrapping would move to the callers that actually need it (sync.rs), not happen at the trait boundary.

The alternative is to implement parse_batch_partial_failure in writer.rs so the structured data is decoded before it crosses the
trait boundary β€” but that only solves one error type. The trait-level fix solves all current and future structured error needs.

Which approach do you prefer?

✻ Crunched for 3m 5s

❯ agree with your recommendation

⏺ Question 2: Receipt timeout currently returns ChainIoError::CreateNewTaskCallFail (a string). After a timeout, the tx may have landed
on-chain. Should the batch submitter check on-chain state before retrying?

Currently at writer.rs:759-763:
let receipt = timeout(Self::BATCH_RECEIPT_TIMEOUT, pending.get_receipt())
.await
.map_err(|_| ChainIoError::CreateNewTaskCallFail {
reason: "batch create+respond receipt timed out after 60s".to_string(),
})?

This is classified as transient by is_transient_rpc_error (matches "timeout"), so the batch submitter retries the same items. But the
tx may have already been mined β€” the receipt poll just didn't catch it. The retry sends a duplicate tx, which reverts with
BatchPartialFailure(TaskAlreadyExists), triggering the broken simulation fallback.

My recommendation: Two changes working together:

  1. Return a distinct error variant β€” ChainIoError::TransactionTimeout { timeout_secs: 60 } (already exists in the enum but unused).
    This lets the batch submitter handle timeouts differently from other transient errors.
  2. In the batch submitter, on receipt timeout: instead of blind retry, call identify_combined_poison_items immediately. This
    simulates each item as a single-item batch. Items already on-chain get TaskAlreadyExists β†’ idempotent success. Items not on-chain
    pass simulation β†’ healthy, re-batched and retried. This eliminates the entire 60sΓ—3 retry cascade (the "4 minutes wasted" from the
    load test).

The alternative is adding an on-chain eth_call check per item (e.g., querying allTaskHashes[taskId]), but simulation already does
this implicitly and handles edge cases like partially-landed batches. The simulation path just needs the TaskAlreadyExists detection
to work β€” which it will once we fix the parse_batch_partial_failure at the writer level.

Does this approach make sense, or do you want a direct on-chain state check instead?

❯ what is your recommendation? is simulation an overkill compared to just eth call?

⏺ Good pushback. Let me think about this concretely.

Simulation = call batch_create_and_respond_to_tasks([single_item]) per item. This is a full contract call that tries to execute
create+respond, consuming significant gas estimation compute on the RPC node. For N items, that's N separate eth_calls, each encoding
the full batch contract ABI with task/response/signature data.

Direct on-chain check = call taskManager.allTaskHashes(taskId) per item. This is a single storage slot read β€” trivially cheap, ~200
gas equivalent, returns bytes32(0) if not created or the hash if created.

For the combined path, if allTaskHashes(taskId) != bytes32(0), the task was created AND responded (atomic), so it's idempotent
success. If it's zero, the task never landed β€” safe to retry.

My revised recommendation: Direct on-chain check via allTaskHashes. It's:

  • 10-100x cheaper per item than simulation
  • Semantically precise (answers exactly "did this task land?")
  • No false positives from unrelated revert reasons
  • Can check all items in a single multicall or sequential loop

We'd add a method to TaskSubmitter like is_task_on_chain(task_id) -> bool that reads allTaskHashes, and call it on receipt timeout
before deciding whether to retry or declare idempotent success.

Simulation stays as the last-resort fallback only for non-timeout contract reverts where we can't parse BatchPartialFailure (which
should be rare once we implement the parser).

❯ sounds good

⏺ Question 3: The identify_combined_poison_items simulation sends actual on-chain transactions (not eth_call). Each "simulation" is a
real batch_create_and_respond_to_tasks call that goes through .send().await and get_receipt(). Is this intentional?

Looking at batch_submitter.rs:648-655:
let sim_result = task_submitter
.batch_create_and_respond_to_tasks(
batch_contract_addr,
vec![task.clone()],
vec![response.clone()],
vec![sig.clone()],
)
.await;

This calls the real AvsWriter::batch_create_and_respond_to_tasks, which calls .send().await β€” submitting an actual transaction, not a
read-only eth_call. If the task is NOT on-chain, the "simulation" actually creates+responds to it on-chain as a side effect. If the
task IS on-chain, the .send() call reverts during gas estimation (before tx submission), so no tx is sent.

This means:

  • For items that are already on-chain: reverts at gas estimation β†’ correctly classified
  • For healthy items: actually submits a real tx β†’ succeeds, but now you have N individual txs instead of one batch

My recommendation: This is actually fine functionally (items get submitted either way), but it's wasteful β€” defeats the purpose of
batching. However, since we're moving to direct allTaskHashes checks for receipt timeouts, simulation will only be reached for
non-timeout contract reverts where parse_batch_partial_failure fails. Once we implement the parser, simulation becomes a truly rare
last-resort fallback.

So: keep simulation as-is for now, but it should almost never be reached after our fixes. The priority path becomes: parse structured
error β†’ strip β†’ retry. Timeout path becomes: check allTaskHashes β†’ classify β†’ retry healthy only.

Agree with keeping simulation as the rare fallback, or do you want to convert it to use eth_call instead of real transactions?

❯ which one is better? which one do you suggest?

⏺ Converting to eth_call is strictly better. Here's why:

Real tx submission (current):

  • Healthy items get submitted as individual txs (N txs instead of 1 batch β€” wastes gas and nonce space)
  • Each goes through the 60s receipt timeout path, potentially cascading further
  • If 2 of 5 items are poison, the 3 healthy ones get submitted individually rather than re-batched

eth_call simulation (proposed):

  • Zero on-chain side effects β€” purely diagnostic
  • Fast (no mempool, no receipt wait β€” just gas estimation)
  • After classification, healthy items stay in the batch for a single re-batched submission
  • No nonce consumption

My recommendation: Convert identify_combined_poison_items to use eth_call via a new TaskSubmitter method like
simulate_batch_create_and_respond(single_item) -> Result<(), ChainIoError>. Under the hood, this calls .call().await instead of
.send().await on the contract instance. The classification logic stays identical β€” we just use the return value to determine healthy
vs poison vs already-exists without actually submitting.

This also eliminates a subtle bug: if simulation succeeds for a "healthy" item (submitting a real tx), and then the re-batched retry
also includes it, you get a TaskAlreadyExists from the item you just accidentally submitted during simulation.

❯ sounds good

⏺ Question 4: The batch accumulator loop flushes new items while a prior batch is stuck in the retry/timeout loop. Should we block
accumulation during flush, or decouple them?

From the load test: Batch 9 (1 item) entered a 60sΓ—2 receipt timeout loop. Meanwhile, 5 new items completed BLS aggregation and
queued up. When batch 9 finally failed at 23:28:56, those 5 items were immediately flushed as batch 10 β€” which then hit the same
cascading timeout pattern.

The current architecture runs the accumulator and flush in the same tokio::spawn task. While flush_combined_batch_with_retry is
await-ing receipt timeouts (up to 180s with 3 retries), the mpsc channel buffers incoming items. When flush returns, the loop
immediately drains the channel and flushes the next batch.

My recommendation: This is actually the correct design β€” we should NOT block accumulation. Items need somewhere to queue while flush
is in progress. The real fix is making flush fast:

  1. Receipt timeout β†’ immediate allTaskHashes check (seconds, not 60s retry)
  2. parse_batch_partial_failure works β†’ instant strip-and-retry (no simulation)
  3. Simulation (rare fallback) uses eth_call β†’ fast, no receipt wait

With these fixes, the worst-case flush time drops from 180s+ (3Γ— receipt timeout) to ~5-10s (one timeout + quick on-chain check + one
retry). The accumulator loop stays as-is β€” the problem was never the architecture, it was the slow error recovery paths.

One small addition: we should cap the number of items that accumulate during a flush to prevent unbounded growth. The channel is
already bounded (max_create_batch_size * 2 = 240), which provides backpressure. That's sufficient.

@denniswon
Copy link
Copy Markdown
Author

❯ await-ing receipt timeouts (up to 180s with 3 retries) --> this is way too long. definitely remove and retries should be just 2 for
batches. agreed overall, but double evaluate your suggestion/recommendation

⏺ You're right β€” let me re-evaluate.

The 60s receipt timeout itself is the problem. On Sepolia, block time is 12s. A legitimate transaction should be mined within 1-2
blocks (12-24s). If we haven't gotten a receipt in 30s, the tx is almost certainly either:

  1. Dropped from mempool (nonce replaced, gas too low)
  2. Mined but the RPC node is lagging

Waiting 60s before checking is just dead time. And with 3 retries, that's potentially 180s of waiting for nothing.

My revised recommendation:

  1. Reduce BATCH_RECEIPT_TIMEOUT to 30s β€” still generous for Sepolia (2.5 blocks). On L2s like Base (2s block time), 30s is 15 blocks
    β€” more than enough.
  2. Max retries = 2 for batch methods (down from 3). After the first timeout, we check allTaskHashes. If items are on-chain β†’
    idempotent success. If not β†’ one more retry with fresh nonce. If that also fails, items are failed out.
  3. On receipt timeout, don't retry the batch blindly. The sequence becomes:
    - Timeout fires after 30s
    - Check allTaskHashes for each item in the batch (~1-2s)
    - Items on-chain β†’ idempotent success
    - Items NOT on-chain β†’ re-batch and retry once (attempt 2)
    - If attempt 2 also times out β†’ fail remaining items

Worst case: 30s timeout + 2s check + 30s retry = ~62s. Current worst case: 180s+. That's a 3x improvement, and the common case (items
landed but receipt timed out) resolves in ~32s.

The max_retries = 2 config applies to the batch submitter config, not the writer-level timeout. The writer always does a single
attempt with timeout β€” retries are orchestrated by the batch submitter.

❯ sounds good.

⏺ Question 5: The parse_batch_partial_failure implementation. The as_revert_data() API is already proven at writer.rs:1091. Should we
decode BatchPartialFailure at the writer level (before the trait boundary) or at the batch submitter level?

Looking at the two options:

Option A: Decode at writer level (current stub location, writer.rs:795):

  • parse_batch_partial_failure uses error.as_revert_data() to get raw bytes
  • Checks for 0x19ed9977 selector, decodes FailedItem[] via BatchPartialFailure::abi_decode_raw
  • Returns Some(Vec) with per-item index, task_id, and nested revert reason
  • The existing Err(ChainIoError::BatchPartialFailure { failures }) path already works β€” it just never gets populated because the
    parser returns None

Option B: Decode at batch submitter level:

  • Would require passing the raw alloy::contract::Error through the trait somehow
  • Trait returns Result<_, ChainIoError> (per Question 1), but ContractError is already a variant

My recommendation: Option A β€” decode at the writer level. This is the natural location because:

  1. The writer has direct access to alloy::contract::Error and as_revert_data()
  2. The ChainIoError::BatchPartialFailure variant already exists with the right fields
  3. With the Question 1 fix (trait returns ChainIoError), the batch submitter gets the structured BatchPartialFailure { failures }
    directly
  4. extract_failed_indices in batch_submitter becomes trivial: just match on the ChainIoError::BatchPartialFailure variant and extract
    indices

The nested revert reason bytes inside each FailedItem.reason should also be decoded where possible β€” check for TaskAlreadyExists
(0x2e98c533), TaskAlreadyResponded (0x68905dff), and other known selectors. This lets strip_combined_by_indices classify each
stripped item correctly (idempotent vs poison) without simulation.

@denniswon
Copy link
Copy Markdown
Author

Excellent point. Let me think about this more carefully.

The root of the cascading failure was ambiguity in error classification. When the system couldn't definitively identify an error, it
fell back to heuristics (string matching, simulation) that guessed wrong. We need to eliminate guessing entirely.

The definitive approach: Every error that crosses the writer β†’ batch submitter boundary must be exhaustively classified at the source
β€” the writer level β€” where we have full access to the raw error type.

Here's the concrete design:

At writer.rs, the batch_create_and_respond_to_tasks method currently has two exit paths:

  1. .send().await succeeds β†’ pending.get_receipt() β†’ either success or timeout
  2. .send().await fails β†’ ContractError (gas estimation revert)

For path 2, instead of the current parse_batch_partial_failure || fallback to ContractError, we do exhaustive error decoding:

let revert_data = error.as_revert_data();
match revert_data {
None => ChainIoError::ContractError(e), // not a revert (RPC/transport error)
Some(data) if data.is_empty() => ChainIoError::CreateNewTaskCallFail {
reason: "empty revert (assembly-level failure)"
},
Some(data) if data.len() < 4 => ChainIoError::CreateNewTaskCallFail {
reason: "malformed revert data (< 4 bytes)"
},
Some(data) => {
let selector: [u8; 4] = data[..4].try_into().unwrap();
match selector {
BatchPartialFailure::SELECTOR => // decode FailedItem[], return BatchPartialFailure
TaskAlreadyExists::SELECTOR => // return new variant TaskAlreadyExists
TaskAlreadyResponded::SELECTOR => // return new variant TaskAlreadyResponded
TaskCreatedBlockTooOld::SELECTOR => // return specific variant
_ => // return ContractRevert { selector, data } β€” preserves raw bytes
}
}
}

Key principle: the selector is always available and always definitive. We decode what we recognize, and for anything we don't
recognize, we preserve the raw selector + data in a new ChainIoError::ContractRevert { selector: [u8; 4], data: Bytes } variant. No
string matching. No guessing. No fallback to simulation for classification.

Then the BatchPartialFailure payload itself contains FailedItem[] where each item has a reason: bytes β€” that's the nested revert from
each individual _createAndRespond call. We apply the same selector-based decoding to each nested reason:

for failure in &failures {
let nested_selector = failure.reason[..4];
match nested_selector {
TaskAlreadyExists::SELECTOR => classification = Idempotent,
TaskAlreadyResponded::SELECTOR => classification = Idempotent,
_ => classification = Poison(selector, data),
}
}

This way the batch submitter never needs to string-match, simulate, or guess. Every item is classified definitively by its 4-byte
selector at the writer level.

For receipt timeout (path 1): we already agreed β€” direct allTaskHashes check, not guessing.

Simulation becomes truly dead-letter: only reachable if somehow as_revert_data() returns None for a contract revert (which would be
an alloy bug, not our code). We can log a warning and still fall through to eth_call-based simulation as the absolute last resort,
but it should never fire in practice.

Does this exhaustive selector-based decoding approach address your concern?

@denniswon
Copy link
Copy Markdown
Author

Question 6: We need new ChainIoError variants for the exhaustive classification. Here's the proposed enum expansion:

Currently the batch path can produce these ChainIoError variants:

  • CreateNewTaskCallFail { reason: String } β€” catch-all string bucket
  • BatchPartialFailure { failures } β€” structured but never populated
  • ContractError(alloy::contract::Error) β€” raw alloy error
  • AlloyProviderError β€” provider-level error

Proposed additions:

// New variants in ChainIoError:

/// Task already exists on-chain (idempotent β€” prior attempt landed)
TaskAlreadyExists { task_id: Option },

/// Task already responded (idempotent β€” prior create+respond landed)
TaskAlreadyResponded { task_id: Option },

/// Contract reverted with a recognized but non-recoverable error
ContractRevert { selector: [u8; 4], data: Bytes },

/// Receipt wait timed out β€” tx may or may not be on-chain
ReceiptTimeout { timeout_secs: u64 },

And update BatchPartialFailure.failures items to carry a classification enum instead of raw bytes:

pub enum BatchItemFailure {
AlreadyExists,
AlreadyResponded,
Revert { selector: [u8; 4], data: Bytes },
Unknown { data: Bytes },
}

pub struct BatchFailedItem {
pub index: usize,
pub task_id: B256,
pub failure: BatchItemFailure,
}

This means extract_failed_indices in the batch submitter becomes a simple pattern match β€” no string parsing, no simulation. Each
failed item is already classified.

My recommendation: This is the right set of variants. ReceiptTimeout replaces the current CreateNewTaskCallFail { reason: "...timed
out..." } string hack. TaskAlreadyExists/TaskAlreadyResponded surface directly instead of being detected via string matching.
ContractRevert catches everything else with preserved data.

The task_id: Option on TaskAlreadyExists/TaskAlreadyResponded is Option because when a whole batch reverts with these, we may
not know which specific task β€” but for BatchPartialFailure nested reasons, we always know the task_id from the FailedItem.

Agree with these variants?

@denniswon
Copy link
Copy Markdown
Author

Good question. Let me evaluate both approaches for check_tasks_on_chain.

Sequential reads vs Multicall

Context: After a receipt timeout on a batch of N items, we need to check allTaskHashes[taskId] for each item to determine which tasks
actually landed on-chain.

Option A: Sequential reads

for item in &items {
let hash = task_manager.allTaskHashes(item.task_id).call().await?;
landed.push(hash != B256::ZERO);
}

Pros:

  • Simple, no extra dependency
  • Each call is independent β€” partial failure doesn't lose results
  • alloy already has the generated binding for allTaskHashes

Cons:

  • N RPC round-trips (one per task in the batch)
  • At batch size 120, that's 120 sequential calls after a timeout β€” adding 10-30s latency on Sepolia
  • Compounds the timeout problem we're trying to fix

Option B: Multicall3

let multicall = Multicall3::new(provider, MULTICALL3_ADDR);
let calls: Vec<_> = items.iter().map(|item| Call3 {
target: task_manager_addr,
callData: allTaskHashesCall { taskId: item.task_id }.abi_encode(),
allowFailure: true,
}).collect();
let results = multicall.aggregate3(calls).call().await?;

Pros:

  • Single RPC round-trip regardless of batch size
  • 120 items checked in ~200ms instead of ~30s
  • Multicall3 is deployed at a deterministic address on all EVM chains (including anvil with --auto-impersonate)

Cons:

  • Needs Multicall3 address constant (but it's 0xcA11bde05977b3631167028862bE2a173976CA11 on every chain)
  • Slightly more complex ABI encoding
  • If Multicall3 isn't deployed (unlikely β€” it's on 99%+ of chains), entire check fails

Option C: eth_call batch simulation (middle ground)

// Simulate the entire batch via eth_call β€” if it succeeds, nothing is on-chain yet
// If it reverts with BatchPartialFailure, decode which items exist
let result = batch_contract
.batchCreateAndRespondToTasks(tasks, responses, sigs)
.call()
.await;

This is what we already planned for simulate_batch_create_and_respond. It doubles as the on-chain check β€” items that revert with
TaskAlreadyExists are confirmed on-chain.

Pros:

  • Single RPC call
  • Reuses existing contract ABI (no Multicall3 dependency)
  • Gives us BOTH information: what's on-chain AND what would succeed if resubmitted

Cons:

  • Heavier than a simple storage read (runs full EVM execution per item)
  • Gas estimation overhead (~400K per item Γ— 120 = 48M gas simulated)

My recommendation: Option C (eth_call simulation)

Here's why β€” we already need simulate_batch_create_and_respond for the poison detection path. After a receipt timeout, the question
isn't just "which tasks landed?" β€” it's "which tasks can be resubmitted?" The eth_call simulation answers both in one call:

  • Items that revert with TaskAlreadyExists (0x2e98c533) β†’ confirmed on-chain, send idempotent success
  • Items that revert with TaskAlreadyResponded (0x68905dff) β†’ confirmed on-chain AND responded, send idempotent success
  • Items that succeed β†’ not on-chain, safe to retry
  • Items that revert with other errors β†’ poison, strip them

This means we don't need a separate check_tasks_on_chain method at all. The receipt timeout path becomes:

timeout β†’ simulate_batch(remaining_items) β†’ classify each item β†’ retry/succeed/fail

One method, one RPC call, complete information.

β˜… Insight ─────────────────────────────────────
The key insight is that eth_call simulation is strictly more informative than allTaskHashes reads. A storage read tells you "this
task exists on-chain." A simulation tells you "this task exists on-chain AND here's exactly what would happen if you tried to
resubmit it." Since we need the simulation for poison detection anyway, using it for the timeout path too eliminates an entire method
from the trait surface.
─────────────────────────────────────────────────

So the revised TaskSubmitter trait change is:

// Add ONE new method (not two):
async fn simulate_batch_create_and_respond(
&self,
batch_contract_addr: Address,
tasks: Vec,
responses: Vec,
signature_data_array: Vec,
) -> Result<(), ChainIoError>;
// Returns Ok(()) if all would succeed, BatchPartialFailure if some fail,
// or other ChainIoError for RPC issues

// Change return type on existing method:
async fn batch_create_and_respond_to_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
// (was Result with anyhow)

Does this approach make sense, or do you still want the separate check_tasks_on_chain method?

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #42/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your response on Q7 β€” whether Option C (single simulate_batch_create_and_respond method that handles both timeout
recovery and poison detection) works, or if you'd prefer separate methods.

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #43/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Still waiting on your input for Q7 before proceeding with implementation.

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #44/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your response on Q7.

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #45/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your response.

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #46/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your input on Q7.

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #47/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your response on Q7.

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #48/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your response.

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #49/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your response on Q7.

⏺ Ran 3 stop hooks (ctrl+o to expand)
⎿  Stop hook error: [ECOMODE #50/50] Mode active. If all work is complete, run /oh-my-claudecode:cancel to cleanly exit
ecomode and clean up state files. If cancel fails, retry with /oh-my-claudecode:cancel --force. Otherwise, continue working.

⏺ Waiting for your response on Q7.

✻ Cooked for 1m 23s

@denniswon
Copy link
Copy Markdown
Author

⏺ Plan is written. Let me exit plan mode so you can review and approve before implementation begins.

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Ready to code?

Here is Claude's plan:
β•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œβ•Œ
Harden Batch Transaction Submission Pipeline

Context

PR #507 (feat/batch-create-and-respond) merged and deployed to stagef+prod. Load testing (15 tasks, centralized mode, single gateway)
showed 60% success (9/15) with 6 recoverable tasks misclassified as poison. Root cause chain:

  1. TransportError(NullResp) on batch 8 triggers retry
  2. Receipt timeout (60s x 3 retries = 180s) blocks accumulator
  3. parse_batch_partial_failure() returns None (stub) β€” falls to simulation
  4. Simulation sends REAL transactions (not eth_call), misclassifies TaskAlreadyExists inside BatchPartialFailure as poison
  5. 6 tasks that were actually on-chain get failed to callers
  6. 15 zombie BLS aggregator slots linger 360s

Every failure in this cascade is recoverable. The protocol is correct β€” the error handling pipeline loses information at each layer.

Design Decisions (from grill-me interview)

β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ # β”‚ Decision β”‚ Rationale β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1 β”‚ TaskSubmitter returns Result<_, ChainIoError> β”‚ Stop losing structured error data via format!() β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2 β”‚ Receipt timeout β†’ eth_call simulation β”‚ Determine which items landed on-chain without blind β”‚
β”‚ β”‚ β”‚ retry β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 3 β”‚ Poison detection via eth_call (not real txs) β”‚ Current simulation sends real transactions β€” wasteful β”‚
β”‚ β”‚ β”‚ and dangerous β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 4 β”‚ Keep accumulator architecture, timeout=30s, max_retries=2 β”‚ Accumulator is sound; reduce wasted time in retry loops β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 5 β”‚ Decode BatchPartialFailure at writer level β”‚ Use as_revert_data() (proven at writer.rs:1091) β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 6 β”‚ Exhaustive selector-based decoding β”‚ No string matching, no guessing, no fallback defaults β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 7 β”‚ Single simulate_batch method for both timeout recovery and β”‚ One eth_call gives complete classification β”‚
β”‚ β”‚ poison detection β”‚ β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Files to Modify

Layer 1: Error types and classification

  1. crates/chainio/src/error.rs β€” Add new ChainIoError variants, expand is_transient()
  2. crates/chainio/src/avs/errors.rs β€” Add exhaustive selector constants and classify_batch_item_error()

Layer 2: Writer (on-chain interaction)

  1. crates/chainio/src/avs/writer.rs β€” Implement parse_batch_partial_failure(), reduce timeout, add
    simulate_batch_create_and_respond()

Layer 3: TaskSubmitter trait bridge

  1. crates/gateway/src/task/submitter.rs β€” Change return types to ChainIoError, add simulate_batch_create_and_respond()

Layer 4: Batch submitter (retry logic)

  1. crates/gateway/src/rpc/api/batch_submitter.rs β€” Rewrite flush logic to use structured errors, remove string matching

Layer 5: Config

  1. crates/gateway/src/rpc/api/batch_submitter.rs β€” Change defaults: max_retries=2

Layer 6: Mock impls (10 total)

  1. All TaskSubmitter test impls β€” add simulate_batch_create_and_respond(), update return types

Step-by-Step Implementation

Step 1: New ChainIoError variants (crates/chainio/src/error.rs)

Add these variants to the existing enum:

/// Task already exists on-chain (idempotent β€” not an error)
#[error("Task {task_id} already exists on-chain")]
TaskAlreadyExists { task_id: B256 },

/// Task already responded to on-chain (idempotent β€” not an error)
#[error("Task {task_id} already responded")]
TaskAlreadyResponded { task_id: B256 },

/// Contract reverted with a known error selector
#[error("Contract reverted: {name} (0x{selector})")]
ContractRevert {
selector: String, // 4-byte hex, e.g. "2e98c533"
name: String, // e.g. "TaskAlreadyExists"
raw_data: Vec, // full revert data for downstream decoding
},

/// Per-item failure within a batch, with selector-decoded reason
#[derive(Debug, Clone)]
pub enum BatchItemError {
TaskAlreadyExists,
TaskAlreadyResponded,
ContractRevert { selector: String, name: String },
Unknown { raw: Vec },
}

Update is_transient() to also match TransactionTimeout:

pub fn is_transient(&self) -> bool {
matches!(
self,
ChainIoError::RpcError()
| ChainIoError::AlloyProviderError(
)
| ChainIoError::TransactionTimeout { .. }
)
}

Step 2: Exhaustive selector classification (crates/chainio/src/avs/errors.rs)

Add classify_batch_item_revert(data: &[u8]) -> BatchItemError:

pub fn classify_batch_item_revert(data: &[u8]) -> BatchItemError {
if data.len() < 4 {
return BatchItemError::Unknown { raw: data.to_vec() };
}
let selector: [u8; 4] = data[..4].try_into().unwrap();
match selector {
[0x2e, 0x98, 0xc5, 0x33] => BatchItemError::TaskAlreadyExists,
[0x68, 0x90, 0x5d, 0xff] => BatchItemError::TaskAlreadyResponded,
_ => {
let name = decode_error_selector(selector)
.unwrap_or_else(|| format!("0x{}", hex::encode(selector)));
BatchItemError::ContractRevert {
selector: hex::encode(selector),
name,
}
}
}
}

Where decode_error_selector is the existing function in writer.rs that maps selectors to names. We'll make it pub and move the
selector-to-name mapping to errors.rs.

Step 3: Implement parse_batch_partial_failure() (crates/chainio/src/avs/writer.rs)

Replace the stub at line 795 with real implementation using as_revert_data():

fn parse_batch_partial_failure(
error: &alloy::contract::Error,
) -> Option<Vec> {
let revert_data = error.as_revert_data()?;
if revert_data.len() < 4 {
return None;
}
let selector: [u8; 4] = revert_data[..4].try_into().ok()?;
// BatchPartialFailure selector: 0x19ed9977
if selector != [0x19, 0xed, 0x99, 0x77] {
return None;
}
// Decode: BatchPartialFailure(FailedItem[] failures)
// FailedItem = (uint256 index, bytes32 taskId, bytes reason)
// Use alloy ABI decoding
use alloy::sol_types::SolType;
type FailedItemTuple = alloy::sol_types::sol_data::Array<
(alloy::sol_types::sol_data::Uint<256>,
alloy::sol_types::sol_data::FixedBytes<32>,
alloy::sol_types::sol_data::Bytes)
>;
let decoded = FailedItemTuple::abi_decode(&revert_data[4..], true).ok()?;
let items = decoded.into_iter().map(|(index, task_id, reason)| {
BatchFailedItem {
index: index.to::(),
task_id: B256::from(task_id),
reason: reason.into(),
}
}).collect();
Some(items)
}

On error path in batch_create_and_respond_to_tasks, use it to return ChainIoError::BatchPartialFailure:

Err(e) => {
if let Some(failures) = parse_batch_partial_failure(&e) {
return Err(ChainIoError::BatchPartialFailure { failures });
}
// Classify the top-level revert
if let Some(revert_data) = e.as_revert_data() {
let item_err = errors::classify_batch_item_revert(&revert_data);
// Convert to appropriate ChainIoError variant
return Err(match item_err { ... });
}
Err(ChainIoError::ContractError(e))
}

Step 4: Add simulate_batch_create_and_respond() to AvsWriter

New method using .call() (eth_call) instead of .send():

pub async fn simulate_batch_create_and_respond(
&self,
batch_contract_addr: Address,
tasks: Vec,
responses: Vec,
signature_data_array: Vec,
) -> Result<(), ChainIoError> {
// Same ABI encoding as batch_create_and_respond_to_tasks
// but use .call().await instead of .send().await
contract.batchCreateAndRespondToTasks(tasks, responses, sigs)
.call()
.await
.map(|_| ())
.map_err(|e| {
if let Some(failures) = parse_batch_partial_failure(&e) {
return ChainIoError::BatchPartialFailure { failures };
}
ChainIoError::ContractError(e)
})
}

Step 5: Reduce BATCH_RECEIPT_TIMEOUT (crates/chainio/src/avs/writer.rs)

const BATCH_RECEIPT_TIMEOUT: Duration = Duration::from_secs(30); // was 60

Step 6: Change TaskSubmitter trait (crates/gateway/src/task/submitter.rs)

Change batch method return types from Result (anyhow) to Result<TransactionReceipt, ChainIoError>:

async fn batch_create_and_respond_to_tasks(
&self, batch_contract_addr: Address, tasks: Vec,
responses: Vec, signature_data_array: Vec,
) -> Result<TransactionReceipt, ChainIoError>;

async fn simulate_batch_create_and_respond(
&self, batch_contract_addr: Address, tasks: Vec,
responses: Vec, signature_data_array: Vec,
) -> Result<(), ChainIoError>;

// Also change these existing methods:
async fn batch_create_tasks(...) -> Result<TransactionReceipt, ChainIoError>;
async fn batch_respond_to_tasks(...) -> Result<TransactionReceipt, ChainIoError>;

Remove the format!() error wrapping in the AvsWriter impl β€” return ChainIoError directly.

Step 7: Rewrite flush_combined_batch_with_retry() (crates/gateway/src/rpc/api/batch_submitter.rs)

The core rewrite. Replace string-matching error classification with pattern matching on ChainIoError variants.

Change max_retries default from 3 to 2:
fn default_batch_max_retries() -> u32 { 2 } // was 3

New flush logic (pseudocode):

loop {
match task_submitter.batch_create_and_respond_to_tasks(...).await {
Ok(receipt) => {
// Success β€” notify all callers
for (item, task, ..) in built_items {
let _ = item.result_tx.send(Ok((receipt.clone(), task)));
}
return;
}
Err(ChainIoError::BatchPartialFailure { failures }) => {
// Structured per-item failures β€” classify each
let failed_map: HashMap<usize, &BatchFailedItem> = ...;
let mut healthy = vec![];
for (i, item) in built_items.into_iter().enumerate() {
if let Some(failure) = failed_map.get(&i) {
let classification = classify_batch_item_revert(&failure.reason);
match classification {
BatchItemError::TaskAlreadyExists
| BatchItemError::TaskAlreadyResponded => {
// Idempotent success
let _ = item.0.result_tx.send(Ok((dummy_receipt(), item.1)));
}
_ => {
// Non-recoverable β€” fail this item
let _ = item.0.result_tx.send(Err(GatewayError::Internal(...)));
}
}
} else {
healthy.push(item);
}
}
if healthy.is_empty() { return; }
built_items = healthy;
attempt += 1;
if attempt > max_retries { fail_all; return; }
continue; // retry healthy remainder
}
Err(e) if e.is_transient() => {
attempt += 1;
if attempt > max_retries { fail_all; return; }
tokio::time::sleep(backoff).await;
backoff = min(backoff * 2, Duration::from_secs(5));
continue;
}
Err(ChainIoError::TransactionTimeout { .. }) => {
// Receipt timeout β€” simulate to classify
match task_submitter.simulate_batch_create_and_respond(...).await {
Ok(()) => {
// Nothing on-chain yet β€” safe to retry
attempt += 1;
if attempt > max_retries { fail_all; return; }
continue;
}
Err(ChainIoError::BatchPartialFailure { failures }) => {
// Some items landed β€” classify each (same as above)
...
}
Err(e) => {
// Simulation itself failed β€” treat as transient
attempt += 1;
if attempt > max_retries { fail_all; return; }
continue;
}
}
}
Err(e) => {
// Unknown non-transient error β€” try simulation to salvage
match task_submitter.simulate_batch_create_and_respond(...).await {
// Same classification as timeout path
...
}
}
}
}

Remove:

  • extract_failed_indices() β€” replaced by direct pattern match on BatchPartialFailure
  • identify_combined_poison_items() β€” replaced by simulate_batch_create_and_respond()
  • is_transient_error() helper β€” replaced by ChainIoError::is_transient()
  • All error_str.contains(...) string matching

Step 8: Update identify_combined_poison_items β†’ simulation via eth_call

Replace the current implementation that sends REAL single-item batch transactions with eth_call simulation:

async fn simulate_and_classify_items(
items: Vec,
task_submitter: &Arc,
batch_contract_addr: Address,
) -> (Vec, Vec, Vec) {
// (healthy, poison, idempotent)
// For each item, call simulate_batch_create_and_respond with a single-item batch
// Classify the result using BatchItemError
}

Step 9: Return TransactionTimeout from writer on receipt timeout

Currently the timeout path in batch_create_and_respond_to_tasks returns a generic error. Change to:

Err(_elapsed) => {
Err(ChainIoError::TransactionTimeout { timeout_secs: BATCH_RECEIPT_TIMEOUT.as_secs() })
}

Step 10: Update all mock TaskSubmitter implementations (10 impls)

Add simulate_batch_create_and_respond returning Ok(()) (default: all items would succeed).
Change return types on batch methods from Result to Result<TransactionReceipt, ChainIoError>.

Impls to update:

  1. SuccessSubmitter β€” batch_submitter.rs tests
  2. FailingSubmitter β€” batch_submitter.rs tests
  3. PoisonDetectingSubmitter β€” batch_submitter.rs tests
  4. StallTaskSubmitter β€” tx_worker.rs tests
  5. MockAvsWriter β€” chain_routing.rs tests
  6. MockAvsWriter β€” common/mod.rs tests
  7. MockAvsWriter β€” send_task.rs tests
  8. DummySubmitter β€” secrets_encrypted.rs tests
  9. RecordingTaskSubmitter β€” create_task_proof_cid.rs tests
  10. AvsWriter production impl β€” submitter.rs

Step 11: Update batch_submitter tests

Update existing tests to verify new error handling:

  • test_batch_create_success β†’ verify structured receipt propagation
  • test_batch_create_with_poison_items β†’ verify BatchPartialFailure decoding and strip
  • test_batch_create_panic_isolation β†’ unchanged (panic handling)
  • test_batch_submit_multiple_items β†’ verify idempotent success for TaskAlreadyExists

Add new tests:

  • test_batch_timeout_then_simulate β€” receipt timeout β†’ simulation classifies items
  • test_batch_partial_failure_decoding β€” verify BatchPartialFailure selector decoding
  • test_batch_max_retries_2 β€” verify max 2 retries (not 3)

Verification

  1. make fmt && make clippy β€” no warnings
  2. cargo test -p newton-prover-chainio --lib β€” writer/error tests pass
  3. cargo test -p newton-prover-gateway --lib β€” batch_submitter tests pass
  4. make gateway_e2e β€” single-chain E2E
  5. make e2e scenario=basic β€” unified E2E (production config path)

@denniswon
Copy link
Copy Markdown
Author

β˜… Insight ─────────────────────────────────────
Here's what this implementation achieves across the full error pipeline:

  1. No more string matching: Every error classification uses typed ChainIoError variants or 4-byte selector matching via
    classify_batch_item_revert(). Zero contains() calls on error strings.
  2. No more stub functions: parse_batch_partial_failure() now uses as_revert_data() to extract and ABI-decode FailedItem[] from
    BatchPartialFailure reverts. extract_failed_indices() (the empty stub) is deleted entirely.
  3. Simulation uses eth_call, not real txs: simulate_and_classify() calls simulate_batch_create_and_respond() which uses .call()
    (read-only) instead of .send(). No more wasting gas on poison detection.
  4. Reduced retry budget: max_retries defaults to 2 (was 3), BATCH_RECEIPT_TIMEOUT is 30s (was 60s). Worst-case cascade: 30s Γ— 2 =
    60s, down from 60s Γ— 3 = 180s.
    ─────────────────────────────────────────────────

Here's a summary of all changes made:

Layer 1: Error types (crates/chainio/src/error.rs)

  • Added BatchItemError enum with selector-based classification
  • Added ChainIoError variants: TaskAlreadyExists, TaskAlreadyResponded, ContractRevert, TransactionTimeout, RetriesExhausted,
    TransactionReverted
  • Added is_transient() (includes TransactionTimeout) and is_idempotent() methods

Layer 2: Error classification (crates/chainio/src/avs/errors.rs)

  • Added selectors module with byte-array constants
  • Added classify_batch_item_revert() β€” exhaustive 4-byte selector classification
  • Added classify_top_level_revert() β€” top-level revert to ChainIoError variant

Layer 3: Writer (crates/chainio/src/avs/writer.rs)

  • BATCH_RECEIPT_TIMEOUT: 60s β†’ 30s
  • Implemented parse_batch_partial_failure() (was stub returning None)
  • Added simulate_batch_create_and_respond() using .call() (eth_call)
  • Fixed all batch method error paths with typed ChainIoError variants

Layer 4: TaskSubmitter trait (crates/gateway/src/task/submitter.rs)

  • Batch methods return Result<, ChainIoError> (was Result<> via GatewayError)
  • Added simulate_batch_create_and_respond() to trait + AvsWriter impl

Layer 5: Batch submitter (crates/gateway/src/rpc/api/batch_submitter.rs)

  • max_retries default: 3 β†’ 2
  • Removed is_transient_error(), extract_failed_indices(), strip_combined_by_indices(), strip_by_indices()
  • Rewrote flush_combined_batch_with_retry() with typed ChainIoError pattern matching
  • Rewrote flush_respond_batch() similarly
  • Replaced identify_combined_poison_items() (real txs) with simulate_and_classify() (eth_call)

Layer 6: Mock impls (6 test files)

  • Updated all 9 mock TaskSubmitter impls with new return types and simulate_batch_create_and_respond()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment