
@denniswon
Created March 26, 2026 22:34
Plan: High-Throughput Batch Transaction Submission
❯ Load Test Report:
Configuration:
- Users: 10
- Spawn rate: 10 users/sec (all spawn immediately)
- Pacing: 15 seconds between requests per user
- Duration: 5 minutes
- Effective throughput: ~10 requests every 15 seconds (~40 req/min)
HTTP Layer — all 200 requests succeeded, median 2.8s, p95 3.5s
On-chain Submit (RPC → NewTaskCreated)
- 141/200 seen (70.5%)
- p50: 50s · p95: 84s · p99: 94s
End-to-End (RPC → on-chain response)
- 82/200 completed (41%)
- p50: 138s · p95: 192s · p99: 204s
Timeouts: 129/200 tasks (64.5%) hit the 180s response timeout
Takeaway: Gateway is healthy (sub-4s HTTP). The bottleneck is on-chain — only 41% of tasks got a full response within 3 min. Median e2e for successful tasks was ~2.3 min with p50 submit latency at ~50s, pointing to backpressure at the tx
submission layer.
There is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the on-chain task creation and task response submission
flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/
⎿  Read crates/gateway/src/rpc/api/tx_worker.rs (597 lines)
⎿  Listed directory .claude/
⏺ This is a critical performance issue. The load test data tells a clear story: the gateway's HTTP layer is fine (sub-4s), but the single-threaded TransactionSubmissionWorker is creating a fatal bottleneck — 59% of tasks never complete
end-to-end.
Let me deeply explore the architecture before proposing solutions.
⏺ 3 Explore agents finished (ctrl+o to expand)
├─ Explore tx submission flow · 35 tool uses · 114.9k tokens
│ ⎿  Done
├─ Explore AvsWriter nonce mgmt · 22 tool uses · 118.9k tokens
│ ⎿  Done
└─ Explore gateway config and flow · 29 tool uses · 106.3k tokens
⎿  Done
❯ There is a serious reliability issue with the transaction submission worker @crates/gateway/src/rpc/api/tx_worker.rs. Deeply analyze the architectural structure of @crates/gateway, focusing on the on-chain task creation and task response submission
flow, and explore approaches to massively improve this reliability issue from first principles. Follow guides and instructions from @.claude/ /clarify-plan
⏺ Skill(/clarify-plan)
⎿  Successfully loaded skill
⏺ Good — I've completed a deep exploration of the gateway's transaction submission architecture. Here's what I found before asking questions:
★ Insight ─────────────────────────────────────
The core bottleneck is architectural, not configurational. The TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask AND respondToTask) through a single sequential loop — one tokio task, one mpsc channel,
one-at-a-time processing. Each transaction blocks on get_receipt() (12-60s waiting for block confirmation). At 40 req/min generating 80 txs/min, but draining at ~2-4 txs/min, the queue is fundamentally oversubscribed by 20-40x.
─────────────────────────────────────────────────
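The oversubscription claim follows directly from the load-test numbers. A minimal sketch of the queue arithmetic, using the figures quoted above (40 req/min producing ~80 txs/min, draining at the worker's low end of ~3 txs/min — all assumed from this report, not measured here):

```rust
// Back-of-the-envelope queue growth for the sequential worker.
// Arrival and drain rates are the rough figures from the load test above.
fn backlog_after(minutes: u64, arrival_tx_per_min: u64, drain_tx_per_min: u64) -> u64 {
    (arrival_tx_per_min.saturating_sub(drain_tx_per_min)) * minutes
}

fn main() {
    // 80 txs/min in, ~3 txs/min out: the queue only ever grows.
    let backlog = backlog_after(5, 80, 3);
    println!("backlog after 5 min: {backlog} txs"); // 385 txs
}
```

With the queue growing by hundreds of transactions over a 5-minute run, the 180s timeout failures are the expected outcome, not noise.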
Here are my first three questions:
1. Which chain was this load test running against, and what's the target chain for production?
Block time matters enormously here — Base mainnet (2s blocks) vs Sepolia (12s) vs Base Sepolia (2s) changes the theoretical max throughput by 6x. And the nonce pipelining strategy differs by chain's mempool behavior.
2. Is the single-signer constraint (one ECDSA key for both createNewTask and respondToTask) a hard requirement, or can the gateway operate with multiple signer keys?
The current serialization exists solely to prevent nonce conflicts on a shared signer. If we can use N signers (e.g., a pool of hot wallets), we get N× parallelism trivially. If not, we need nonce pipelining (submit tx N+1 before tx N
confirms), which is more complex but still achievable.
3. What's the target throughput and latency you need?
The load test hit 40 req/min and broke at 41% completion. Are we aiming for 40 req/min sustained with >99% completion? 100 req/min? 1000? And what's the acceptable e2e latency — is the current 2.8s HTTP response fine (since on-chain is async),
or do callers need on-chain confirmation within a specific window?
✻ Brewed for 2m 59s
❯ 1. Sepolia; for the prod environment, 10 gateway replicas with 10 task generator signer private keys, 1 for each. 2. Not a hard requirement, I think, but please review the codebase (including the Solidity contract code for createNewTask and
respondToTask) to confirm my understanding: the gateway can operate with multiple signer keys, but all task generators have to be configured on-chain (createNewTask and respondToTask are guarded with the onlyTaskGenerator guard). If my understanding is not
correct, then nonce pipelining would be needed. 3. We are targeting at least 100 RPS (100 requests per second). Acceptable e2e latency for the offchain response is around 3 seconds or lower, ideally around 2 seconds. For onchain,
we have more room, but we definitely want onchain e2e to happen within 1 min of request submission to the gateway, ideally within 15-30 seconds (on Ethereum L1/Sepolia)
⏺ 2 Explore agents finished (ctrl+o to expand)
├─ Explore TaskManager access control · 21 tool uses · 90.8k tokens
│ ⎿  Done
└─ Explore nonce pipelining feasibility · 28 tool uses · 109.1k tokens
⎿  Done
⏺ Excellent findings. Let me synthesize what we now know before asking the next batch of questions.
★ Insight ─────────────────────────────────────
Multiple task generators are fully supported on-chain. The OperatorRegistry uses EnumerableSet.AddressSet — both createNewTask and respondToTask use onlyTaskGenerator, and Sepolia config already has 2 generators configured. The old aggregator
role is deprecated. This means the multi-signer parallelism path is wide open — no nonce pipelining complexity needed.
The math for 100 RPS on Sepolia (12s blocks): Each task needs 2 txs (create + respond). At 100 RPS, that's 200 txs/sec. With 12s block time, each signer can pipeline ~1 tx per block optimistically, or ~5 txs/block with nonce pipelining. With N
signers doing round-robin, we need enough signers to absorb the throughput. But Sepolia's block gas limit and mempool behavior also constrain this.
─────────────────────────────────────────────────
Batch 2 — Architecture & Deployment Constraints:
4. You mentioned 10 gateway replicas with 10 task generator signer keys. Is each replica currently assigned exactly 1 signer key, or do all 10 replicas share the same key today?
This determines whether we're solving a within-replica parallelism problem (single replica needs multiple signers) or a cross-replica coordination problem (multiple replicas competing for the same nonce space). The 10-key setup suggests you've
already anticipated this — I want to confirm the current deployment topology.
5. For the 100 RPS target — is that 100 newt_createTask calls/sec, or 100 total RPC calls/sec across all methods?
At 100 task creations/sec on Sepolia (12s blocks), we'd need to land ~1,200 createNewTask + ~1,200 respondToTask = 2,400 txs per 12-second block. That's aggressive for a single chain. Are you planning to batch tasks, or is each RPC call = 1
on-chain task?
6. Is the respondToTask submission latency-critical, or only createNewTask?
Currently both flow through the same sequential queue. If respondToTask can be more relaxed (e.g., batched every few seconds), we could prioritize createNewTask in a fast lane and batch responses separately — significantly reducing queue
contention.
✻ Cooked for 2m 12s
❯ 4. Each replica is currently assigned exactly 1 signer key, but this can change if needed. For the deploy pipeline for stagef vs prod, you can refer to @~/projects/deploy 5. 100 newt_createTask calls/sec 6. Neither has a strict latency requirement,
but we definitely don't want to incur too much delay or discrepancy between offchain task/task response and onchain. And if possible, yes, we should explore a batching mechanism so that we can, for example, batch-submit task
creation/response every interval (that interval effectively becomes our "block/batch" time). I initially made it the same sequential queue to ensure that, for a single task, the task response NEVER gets submitted if the task has not been created
successfully. There is no difference between createNewTask and respondToTask in terms of how latency-critical they are
⎿  Listed directory ../deploy/
⏺ Great — the picture is becoming very clear. Let me ask one final batch to nail down the solution design.
★ Insight ─────────────────────────────────────
The math at 100 tasks/sec on Sepolia is extreme. 100 tasks/sec × 2 txs × 12s block time = 2,400 txs per block across the fleet. With 10 replicas at 1 signer each, that's 240 txs/block/replica. Individual transaction submission simply cannot
scale to this — batching is mandatory. A BatchTaskManager contract registered as a task generator could batch N createNewTask calls into a single tx, reducing the problem by 1-2 orders of magnitude.
─────────────────────────────────────────────────
Batch 3 — Solution Constraints:
7. Is deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) acceptable?
This is the highest-leverage change. A contract registered as a task generator that batches 50-100 tasks per transaction would cut the tx count by 50-100x. Same for respondToTask. The alternative — submitting 240 individual txs per block per
replica — is infeasible on any EVM chain regardless of nonce pipelining. Is adding a new contract to the deployment pipeline acceptable, or are we restricted to off-chain-only changes?
8. For the create→respond ordering invariant: is it acceptable if creates and responds are in different batches submitted sequentially (batch of creates first, then batch of responds a few seconds later)?
This preserves your invariant (no response without a successful create) while allowing independent batching windows. The tradeoff is slightly higher latency for the respond step (waits for the next batch interval).
9. What's the gas budget per transaction on Sepolia/mainnet for your task generator signers?
Batching 100 createNewTask calls in one tx could cost 20-50M gas. Sepolia's block gas limit is 30M. If individual createNewTask costs ~200-300K gas, we can fit ~100 per batch. But if it costs 500K+, we're limited to ~60. Do you know the
approximate gas cost of a single createNewTask and respondToTask today?
✻ Crunched for 39s
❯ 7. deploying a new batching contract (e.g., BatchTaskManager that calls taskManager.createNewTask in a loop) acceptable 8. yes acceptable 9. I do not, is there a way to estimate?
⏺ Let me estimate gas costs from the existing test suite and contract code.
⏺ Bash(cd /Users/dennis.won/projects/newton-prover-avs/contracts && forge test --match-test "test_createTask\|test_respondToTask\|test_createNewTask" --gas-report -vv…)
⎿  Compiling 150 files with Solc 0.8.27
Solc 0.8.27 finished in 22.85s
Compiler run successful with warnings:
… +13 lines (ctrl+o to expand)
⎿  (timeout 2m)
⏺ Explore(Explore createNewTask gas usage)
⎿  Done (36 tool uses · 91.3k tokens · 59s)
(ctrl+o to expand)
⏺ Good — gas estimates from code analysis:
- createNewTask: ~100-150K gas
- respondToTask: ~400-800K gas (BLS verification dominates)
- Sepolia block gas limit: 30M
★ Insight ─────────────────────────────────────
Batching math is very favorable. At 150K gas/create, we can fit ~200 creates per batch tx. At 800K gas/respond, ~37 responds per batch. Per replica at 10 tasks/sec over a 12s block: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = ~5
txs per block per signer. That's trivially achievable. Batching turns a 240-tx-per-block problem into a 5-tx-per-block problem.
─────────────────────────────────────────────────
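The batch-sizing math above can be sketched as a small function. The gas figures (150K per create, 800K per respond, 30M block limit) are the rough estimates quoted in this conversation, and the 100K batch-tx overhead and 80% headroom factor are assumptions for illustration:

```rust
// Sketch of gas-limited batch sizing from the estimates above.
const BLOCK_GAS_LIMIT: u64 = 30_000_000;
const BATCH_OVERHEAD: u64 = 100_000; // assumed base cost of the batch tx itself

fn max_batch_size(per_item_gas: u64, headroom: f64) -> u64 {
    // Leave headroom so a single batch never brushes the block gas limit.
    let budget = (BLOCK_GAS_LIMIT as f64 * headroom) as u64 - BATCH_OVERHEAD;
    budget / per_item_gas
}

fn main() {
    println!("creates per batch:  {}", max_batch_size(150_000, 0.8)); // 159
    println!("responds per batch: {}", max_batch_size(800_000, 0.8)); // 29
}
```

Even with 20% headroom reserved, the computed sizes comfortably exceed the per-block demand of 120 creates and ~30-40 responds per replica.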
I now have enough context to design the solution. Let me move to the Design phase.
⏺ Plan(Design batch tx submission plan)
⎿  Done (49 tool uses · 99.0k tokens · 3m 32s)
(ctrl+o to expand)
⏺ The plan from the design agent is comprehensive. Let me verify one critical detail before writing the final plan.
⎿  Loaded ../.claude/rules/solidity.md
Searched for 1 pattern, read 1 file (ctrl+o to expand)
⏺ Now I have the full picture. Let me read the rest of respondToTask to confirm the batch contract's parameter requirements, then write the final plan.
Read 1 file (ctrl+o to expand)
⏺ Good — I have the full picture. Now let me write the final plan.
⏺ Updated plan
⎿  /plan to preview
⏺ The plan is written. Here's the summary:
★ Insight ─────────────────────────────────────
The key insight is that batching transforms the problem entirely. Instead of 240 individual transactions competing for block space per replica per block (impossible), we get 5 batch transactions per block (trivial). The BatchTaskManager
contract is the highest-leverage change — it reduces the on-chain tx count by ~50x. The Rust-side batch accumulator then collects tasks into time-windowed batches (2s intervals), and nonce pipelining ensures those 5 txs don't block
sequentially on receipt confirmation. The entire solution is feature-gated behind batch_submitter.enabled, so the old sequential worker stays as fallback.
The ordering invariant (no respondToTask before createNewTask confirms) is preserved via a DashMap<TaskId, ()> shared between the create and respond accumulators — responds are held until their corresponding create is confirmed on-chain.
─────────────────────────────────────────────────
4 layers, 6 implementation steps:
1. BatchTaskManager.sol — batch N calls into 1 tx (Solidity, no Rust dependency)
2. AvsWriter batch methods — Rust bindings to call the batch contract
3. BatchTransactionSubmitter — replaces tx_worker.rs with time-windowed batch accumulation
4. Nonce pipelining — submit 5 batch txs concurrently, await receipts in parallel
5. Integration testing + load test at 100 tasks/sec
6. Multi-signer pool (only if needed — math says it won't be)
⏺ User approved Claude's plan
⎿  Plan saved to: ~/.claude/plans/valiant-giggling-moler.md · /plan to edit
Plan: High-Throughput Batch Transaction Submission
Context
Load testing at just 40 req/min revealed a critical bottleneck: TransactionSubmissionWorker processes ALL on-chain transactions (both createNewTask and respondToTask) through a single sequential loop, blocking on get_receipt() (12-60s per
tx on Sepolia). Only 41% of tasks completed end-to-end; 64.5% hit the 180s timeout.
Target: 100 tasks/sec sustained across 10 gateway replicas (10/sec/replica), with <3s offchain response and <30s on-chain e2e on Sepolia (12s blocks).
Root cause: The worker can process ~2-4 txs/min but needs 200 txs/sec. Throughput gap: ~3000x.
Solution: Four coordinated layers — a Solidity batch contract, a Rust batch accumulator, nonce pipelining, and an optional multi-signer pool.
---
Layer 1: BatchTaskManager Solidity Contract
What
A new standalone contract that batches N createNewTask and N respondToTask calls into single transactions. Registered as a task generator on OperatorRegistry.
Why
At 150K gas/create, one batch tx fits ~200 creates (30M block gas limit). At 800K gas/respond, ~37 per batch. Per replica per 12s block at 10 tasks/sec: 120 creates (1 batch tx) + 120 responds (3-4 batch txs) = 5 txs per block instead of
240. This is the highest-leverage change.
Design
Error semantics: try/catch per item (not revert-all). A single TaskAlreadyExists or TaskAlreadyResponded in a batch of 120 must not revert the other 119.
File: contracts/src/BatchTaskManager.sol
contract BatchTaskManager {
    INewtonProverTaskManager public immutable taskManager;

    // Per-item result events
    event BatchCreateResult(bytes32 indexed taskId, bool success, bytes reason);
    event BatchRespondResult(bytes32 indexed taskId, bool success, bytes reason);

    constructor(address _taskManager) {
        taskManager = INewtonProverTaskManager(_taskManager);
    }

    function batchCreateTasks(
        INewtonProverTaskManager.Task[] calldata tasks
    ) external returns (uint256 successCount) {
        for (uint256 i; i < tasks.length;) {
            try taskManager.createNewTask(tasks[i]) {
                emit BatchCreateResult(tasks[i].taskId, true, "");
                unchecked { ++successCount; }
            } catch (bytes memory reason) {
                emit BatchCreateResult(tasks[i].taskId, false, reason);
            }
            unchecked { ++i; }
        }
    }

    function batchRespondToTasks(
        INewtonProverTaskManager.Task[] calldata tasks,
        INewtonProverTaskManager.TaskResponse[] calldata responses,
        bytes[] calldata signatureDataArray
    ) external returns (uint256 successCount) {
        require(
            tasks.length == responses.length && responses.length == signatureDataArray.length,
            "BatchTaskManager: length mismatch"
        );
        for (uint256 i; i < tasks.length;) {
            try taskManager.respondToTask(tasks[i], responses[i], signatureDataArray[i]) {
                emit BatchRespondResult(responses[i].taskId, true, "");
                unchecked { ++successCount; }
            } catch (bytes memory reason) {
                emit BatchRespondResult(responses[i].taskId, false, reason);
            }
            unchecked { ++i; }
        }
    }
}
Access control: The BatchTaskManager address is registered as a task generator via OperatorRegistry.addTaskGenerator(). The contract itself needs no auth modifier — only registered task generators can call taskManager.createNewTask (the
contract is the msg.sender to TaskManager). Optionally add an onlyAuthorized mapping to prevent gas griefing from arbitrary callers.
Files to create/modify
┌─────────────────────────────────────────────────┬────────────────────────────────────────────────────────────┐
│ File │ Action │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/src/BatchTaskManager.sol │ Create — batch contract │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/src/interfaces/IBatchTaskManager.sol │ Create — interface for Rust bindings │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/test/BatchTaskManager.t.sol │ Create — unit tests │
├─────────────────────────────────────────────────┼────────────────────────────────────────────────────────────┤
│ contracts/script/BatchTaskManagerDeployer.s.sol │ Create — deployment script following DeploymentLib pattern │
└─────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
Deployment steps
1. Deploy BatchTaskManager with existing TaskManager address
2. Call OperatorRegistry.addTaskGenerator(batchTaskManagerAddress)
3. Write deployment to script/deployments/batch-task-manager/{chainId}-{env}.json
4. Run make generate-bindings to auto-generate Rust types in crates/core/src/generated/
---
Layer 2: Batch Accumulator (Rust — replaces tx_worker.rs)
What
Replace the sequential TransactionSubmissionWorker with BatchTransactionSubmitter that accumulates tasks into time-windowed batches and submits via the BatchTaskManager contract.
Architecture
sync.rs ─────► CreateBatch Accumulator ──flush every 2s──► batchCreateTasks() on-chain
(tracks confirmations via DashMap)
sync.rs ─────► RespondBatch Accumulator ─flush after creates confirm──► batchRespondToTasks()
Two independent accumulator loops as background tokio tasks, connected by a shared DashMap<TaskId, ()> that tracks which creates have been confirmed on-chain.
Batch flush triggers
1. Timer: batch_interval_ms (default 2000ms — one batch per ~1 block on Sepolia)
2. Size: batch reaches max_batch_size
3. Both accumulator loops run independently
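The size-or-interval flush loop above can be sketched std-only (the real submitter would be a tokio task draining an async channel; `Task` and the parameter names here are placeholders, not the gateway's actual types):

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

struct Task(u64);

// Minimal accumulator: flush when the batch fills OR the interval elapses.
fn run_accumulator(
    rx: Receiver<Task>,
    max_batch: usize,
    interval: Duration,
    mut flush: impl FnMut(Vec<Task>),
) {
    let mut batch = Vec::new();
    let mut deadline = Instant::now() + interval;
    loop {
        let wait = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(wait) {
            Ok(task) => {
                batch.push(task);
                if batch.len() >= max_batch {
                    flush(std::mem::take(&mut batch)); // size trigger
                    deadline = Instant::now() + interval;
                }
            }
            Err(RecvTimeoutError::Timeout) => {
                if !batch.is_empty() {
                    flush(std::mem::take(&mut batch)); // timer trigger
                }
                deadline = Instant::now() + interval;
            }
            Err(RecvTimeoutError::Disconnected) => {
                if !batch.is_empty() {
                    flush(batch); // drain remainder on shutdown
                }
                return;
            }
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    for i in 0..5 {
        tx.send(Task(i)).unwrap();
    }
    drop(tx);
    let mut flushed = Vec::new();
    run_accumulator(rx, 2, Duration::from_millis(50), |b| flushed.push(b.len()));
    println!("{flushed:?}"); // [2, 2, 1]
}
```

The same loop structure works for both the create and respond accumulators; only the flush callback differs.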
Ordering invariant
The respond accumulator holds items in a pending queue. On each flush tick, it moves items whose task_id appears in the confirmed_creates: DashMap<TaskId, ()> into the current respond batch. Items waiting longer than respond_wait_timeout_ms
(default 60s) are dropped with an error log.
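A sketch of that gating step, using a plain `HashMap` in place of the shared `DashMap` for illustration (names like `drain_ready` and `PendingRespond` are hypothetical, not from the codebase):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type TaskId = u64;

struct PendingRespond {
    task_id: TaskId,
    enqueued: Instant,
}

/// On each flush tick: move responds whose create is confirmed into the
/// batch, drop responds that waited past the timeout, keep the rest pending.
fn drain_ready(
    pending: &mut Vec<PendingRespond>,
    confirmed_creates: &HashMap<TaskId, ()>,
    timeout: Duration,
) -> (Vec<TaskId>, Vec<TaskId>) {
    let now = Instant::now();
    let (mut ready, mut expired) = (Vec::new(), Vec::new());
    pending.retain(|p| {
        if confirmed_creates.contains_key(&p.task_id) {
            ready.push(p.task_id); // create confirmed: include in this batch
            false
        } else if now.duration_since(p.enqueued) > timeout {
            expired.push(p.task_id); // waited too long: drop with an error log
            false
        } else {
            true // still waiting on its create
        }
    });
    (ready, expired)
}

fn main() {
    let mut pending = vec![
        PendingRespond { task_id: 1, enqueued: Instant::now() },
        PendingRespond { task_id: 2, enqueued: Instant::now() },
    ];
    let confirmed: HashMap<TaskId, ()> = HashMap::from([(1, ())]);
    let (ready, expired) = drain_ready(&mut pending, &confirmed, Duration::from_secs(60));
    println!("ready={ready:?} expired={expired:?} still_pending={}", pending.len());
}
```

The invariant holds because a task id can only enter `ready` after its create appears in `confirmed_creates`; an unconfirmed respond either keeps waiting or expires, never submits.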
Public interface (backward compatible)
BatchTransactionSubmitter exposes the same methods as TransactionSubmissionWorker:
- submit_create_task(task_id, params) -> oneshot::Receiver<Result<...>>
- submit_aggregated_response(task_id, task_request, service_response) -> Result<()>
- queue_capacity() -> usize / queue_depth() -> usize
This means sync.rs and handler/mod.rs only swap the type name.
Config
pub struct BatchSubmitterConfig {
    pub enabled: bool,                   // default: false (feature-gated rollout)
    pub max_create_batch_size: usize,    // default: 120
    pub max_respond_batch_size: usize,   // default: 30
    pub batch_interval_ms: u64,          // default: 2000
    pub respond_wait_timeout_ms: u64,    // default: 60000
    pub batch_contract_address: Address, // from deployment JSON
}
Files to create/modify
┌───────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────┐
│ File │ Action │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Create — new batch accumulator │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/mod.rs │ Modify — add mod batch_submitter export │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/config.rs │ Modify — add BatchSubmitterConfig to GatewayConfig │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/handler/mod.rs │ Modify — instantiate BatchTransactionSubmitter when enabled (lines ~549, ~856) │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs │ Modify — add batch_create_tasks() and batch_respond_to_tasks() methods │
├───────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/task/submitter.rs │ Modify — add batch methods to TaskSubmitter trait (or create BatchTaskSubmitter trait) │
└───────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────┘
Migration path
1. Phase A: Ship behind batch_submitter.enabled = false. Old TransactionSubmissionWorker remains default.
2. Phase B: Enable on stagef, load test. Flip default to true.
3. Phase C: Remove TransactionSubmissionWorker after production validation.
---
Layer 3: Nonce Pipelining
What
After batching reduces txs from 240/block to ~5/block, we still need those 5 txs submitted without blocking on each receipt sequentially. Split send() from get_receipt().
Design
In the batch flush path:
// Fire phase: submit all batch txs with incrementing nonces
let mut pending_txs: Vec<PendingTransaction> = Vec::new();
for batch in batches {
    let pending = batch_contract.batchCreateTasks(batch).send().await?;
    pending_txs.push(pending);
}

// Await phase: concurrently wait for all receipts
let receipts = futures::future::join_all(
    pending_txs.into_iter().map(|tx| tx.get_receipt()),
)
.await;
Alloy's nonce manager auto-increments on each send() call, so nonces [N, N+1, N+2, ...] are assigned correctly without manual tracking.
Recovery
If tx at nonce N fails after N+1 is in mempool: N+1 is stuck. The recovery path:
1. Detect the failure from the receipt
2. Re-submit N with bumped gas (replacement tx)
3. Or cancel N with a zero-value self-transfer and re-submit the batch
Files to create/modify
┌───────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────┐
│ File │ Action │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs │ Modify — add send_without_receipt() variant for pipelined submission │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Modify — use pipelined send in flush logic │
└───────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘
---
Layer 4: Multi-Signer Pool (Optional — only if needed)
What
Round-robin across N signer keys per replica. Each signer has independent nonce space.
When to implement
Only if load testing after Layers 1-3 shows a single signer can't keep up. The math says 5 txs/block is well within single-signer capacity, so this is likely unnecessary.
Design
pub struct SignerPool {
    signers: Vec<Arc<AvsWriter>>,
    next: AtomicUsize,
}
All signer addresses call the same BatchTaskManager contract (which is the registered task generator). No on-chain changes needed beyond Layer 1.
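The round-robin selection is a few lines of lock-free code. A self-contained sketch (the signer type is stubbed with a string here; the real pool would hold `Arc<AvsWriter>` handles):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct SignerPool<S> {
    signers: Vec<Arc<S>>,
    next: AtomicUsize,
}

impl<S> SignerPool<S> {
    fn new(signers: Vec<Arc<S>>) -> Self {
        assert!(!signers.is_empty(), "pool needs at least one signer");
        Self { signers, next: AtomicUsize::new(0) }
    }

    /// Lock-free round-robin: fetch_add hands out a unique counter value to
    /// each caller, and the modulo maps it into the pool. Each signer keeps
    /// its own independent nonce space.
    fn next_signer(&self) -> Arc<S> {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.signers.len();
        Arc::clone(&self.signers[i])
    }
}

fn main() {
    let pool = SignerPool::new(vec![Arc::new("signer-a"), Arc::new("signer-b")]);
    let picks: Vec<_> = (0..4).map(|_| *pool.next_signer()).collect();
    println!("{picks:?}"); // ["signer-a", "signer-b", "signer-a", "signer-b"]
}
```

`Ordering::Relaxed` suffices because only the counter itself needs atomicity; no other memory is synchronized through it.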
Files
┌───────────────────────────────────┬────────────────────────────────────────────────────────┐
│ File │ Action │
├───────────────────────────────────┼────────────────────────────────────────────────────────┤
│ crates/gateway/src/signer_pool.rs │ Create (only if needed) │
├───────────────────────────────────┼────────────────────────────────────────────────────────┤
│ crates/gateway/src/config.rs │ Modify — add additional_signers to TaskGeneratorConfig │
└───────────────────────────────────┴────────────────────────────────────────────────────────┘
---
Expected Performance
┌────────────────────────┬──────────────────┬────────────────────────────────────────┐
│ Metric │ Current │ After Fix │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ Offchain response │ 2.8s (OK) │ 2.8s (unchanged) │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ On-chain task creation │ p50=50s │ ~14s (2s accumulate + 12s block) │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ On-chain e2e │ p50=138s │ ~28s (create confirm + 2s + 12s block) │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ Completion rate │ 41% │ >99% │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ Max throughput/replica │ ~2-4 txs/min │ 10 tasks/sec (120/block) │
├────────────────────────┼──────────────────┼────────────────────────────────────────┤
│ Txs per block/replica │ 240 (impossible) │ 5 (trivial) │
└────────────────────────┴──────────────────┴────────────────────────────────────────┘
---
Implementation Sequence
Step 1: BatchTaskManager contract (no Rust dependency)
1. Create contracts/src/BatchTaskManager.sol and interface
2. Create contracts/test/BatchTaskManager.t.sol with tests:
- All succeed, partial failure, auth checks, gas measurement
3. Create deployment script following DeploymentLib pattern
4. forge test && forge fmt
5. Run make generate-bindings
Step 2: AvsWriter batch methods (depends on Step 1 bindings)
1. Add batch_create_tasks() and batch_respond_to_tasks() to AvsWriter
2. Unit tests for the new methods
Step 3: BatchTransactionSubmitter (depends on Step 2)
1. Implement BatchTransactionSubmitter in batch_submitter.rs
2. Add BatchSubmitterConfig to GatewayConfig
3. Wire into handler/mod.rs behind enabled flag
4. Unit tests for accumulator logic, ordering invariant, flush triggers
Step 4: Nonce pipelining (depends on Step 3)
1. Add pipelined send to the flush path
2. Add nonce gap recovery logic
Step 5: Integration testing
1. Update E2E tests to exercise batch path
2. Load test: 100 tasks/sec sustained for 5 minutes
3. Verify ordering invariant
4. Verify partial batch failure handling
Step 6: Multi-signer pool (only if Step 5 load test shows need)
---
Verification
1. forge test — all BatchTaskManager contract tests pass
2. make clippy — no new warnings
3. make tests — existing unit tests unaffected (feature-gated)
4. make gateway_e2e — existing E2E still works with old worker (default)
5. make gateway_e2e with BATCH_SUBMITTER_ENABLED=true — E2E works with new batch path
6. Load test at 100 tasks/sec for 5 min: >99% completion, <30s on-chain e2e
---
Risks and Mitigations
┌──────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────┐
│ Risk │ Mitigation │
├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Batch contract deployment fails │ Old sequential worker remains as fallback (feature flag) │
├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Partial batch failure loses tasks │ try/catch per item; failures logged with task ID for retry │
├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Nonce gap from failed pipeline tx │ Re-query nonce, re-submit remaining; fallback to sequential │
├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Respond submitted before create confirms │ DashMap gating enforces ordering; items time out rather than submit early │
├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Gas estimation wrong for large batches │ Conservative defaults (120 creates, 30 responds); configurable per env │
├──────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Offchain response latency regresses │ Batch accumulator is completely decoupled from RPC response path — no impact │
└──────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────┘
@denniswon


Implementation Plan

  1. Document the config architecture pattern

Add to CLAUDE.md the three-tier config pattern:

┌───────────────────────────────────────┬───────────────────────────────┬──────────────────────────────────────────────────┐
│ Source │ What goes here │ Examples │
├───────────────────────────────────────┼───────────────────────────────┼──────────────────────────────────────────────────┤
│ Deployment JSONs │ Addresses of contracts we │ newtonProverTaskManager, batchTaskManager, │
│ (script/deployments/) │ deploy and manage │ operatorRegistry │
├───────────────────────────────────────┼───────────────────────────────┼──────────────────────────────────────────────────┤
│ Config JSONs │ External/configurable values │ aggregator_addr (EOA), taskGenerator (EOAs), │
│ (newton_prover_config.{env}.json) │ we don't manage │ task_response_window_block, feature flags │
├───────────────────────────────────────┼───────────────────────────────┼──────────────────────────────────────────────────┤
│ Environment variables │ Secrets and runtime-only │ PRIVATE_KEY, DATABASE_URL, REDIS_URL │
│ │ config │ │
└───────────────────────────────────────┴───────────────────────────────┴──────────────────────────────────────────────────┘

  2. Add batchTaskManager to DeploymentLib.DeploymentData
  • Add field to DeploymentData struct in DeploymentLib.sol
  • Use readAddressOr pattern so existing JSONs don't break
  • BatchTaskManagerDeployer writes to newton-prover/{chainId}-{env}.json (not a separate directory)
  • Update existing deployment JSONs to include the field
  3. Integrate into deploy-avs.sh
  • Post-step after core AVS contracts: deploy BatchTaskManager if not already deployed
  • On upgrade=true: skip BatchTaskManager if address already exists in deployment JSON
  • Register as task generator on OperatorRegistry, authorize configured taskGenerator addresses
  • Support batch=false to skip (for debugging)
  4. Gateway config refactor
  • Remove batch_contract_address from BatchSubmitterConfig
  • Gateway reads batchTaskManager from deployment JSON / NewtonAvsConfig at startup
  • batch_submitter.enabled defaults to true
  • Keep batch=false / GATEWAY__BATCH_SUBMITTER__ENABLED=false as escape hatch for legacy path
  5. Integration tests default to batch mode
  • All E2E test infra deploys BatchTaskManager as part of setup
  • Gateway starts with batch submitter enabled by default
  • Add batch=false flag to test targets for legacy path testing
  • Update make gateway (local dev) to include batch contract in deploy-all
  6. CI cache version bump
  • Bump anvil state cache version so CI picks up fresh state with BatchTaskManager
  7. Makefile target
  • Add deploy-batch-task-manager standalone target (for manual one-off deploys on stagef/prod)
  • deploy-all calls it automatically after deploy-avs

Want me to proceed with implementation? This is a substantial set of changes across Solidity deploy scripts, Rust gateway
config, Makefile, E2E test infra, and CLAUDE.md. I'd suggest tackling them in the order listed above.
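As an illustration of the three-tier pattern in step 1, resolution can be modeled as a simple precedence chain (environment variable, then config JSON, then deployment JSON); the `resolve` function here is a hypothetical sketch, not the gateway's actual config loader:

```rust
// Hypothetical sketch of three-tier config resolution. Secrets/runtime
// overrides (env vars) win over config JSON values, which win over
// deployment JSON defaults. All names are illustrative.
fn resolve(
    env_var: Option<&str>,        // e.g. GATEWAY__BATCH_SUBMITTER__ENABLED
    config_json: Option<&str>,    // e.g. newton_prover_config.{env}.json
    deployment_json: Option<&str> // e.g. newton-prover/{chainId}-{env}.json
) -> Option<String> {
    env_var.or(config_json).or(deployment_json).map(String::from)
}

fn main() {
    // No env override: falls through to the config JSON value.
    assert_eq!(
        resolve(None, Some("0xabc"), Some("0xdef")),
        Some("0xabc".to_string())
    );
    // Env var present: takes precedence over both JSON tiers.
    assert_eq!(
        resolve(Some("0x1"), Some("0xabc"), None),
        Some("0x1".to_string())
    );
    println!("ok");
}
```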

@denniswon

The batchTaskManager address (0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900) IS present in newton-cross-chain/84532-stagef.json, so the
batch submitter should be initialized. The issue is likely that the on-chain submission is silently failing.

★ Insight ─────────────────────────────────────
Your Datadog filter is searching by task ID (l393a98f3a585a7cb754dcfa6b). The batch submitter's flush logs ("flushing create batch",
"flushing respond batch") don't include individual task IDs — they batch multiple tasks together. So these logs won't appear in a
task-ID-filtered search. Try removing the task ID filter and searching for "flushing" or "batch" to see if the on-chain submissions
are happening and/or failing.
─────────────────────────────────────────────────

That said, the most likely reason all submissions are failing is that the BatchTaskManager contract isn't registered as a
taskGenerator on the NewtonProverTaskManager. Let me verify.

Searched for 2 patterns

⏺ Now let me check how the BatchTaskManager contract calls the TaskManager — specifically whether msg.sender to createNewTask would be
the BatchTaskManager contract address.

Read 1 file

⏺ There it is. The BatchTaskManager contract calls taskManager.createNewTask(tasks[i]) at line 77 and taskManager.respondToTask(...) at
line 118. When it does, msg.sender to the TaskManager is the BatchTaskManager contract address, not the gateway EOA.

The TaskManager's onlyTaskGenerator modifier at NewtonProverTaskManagerShared.sol:30-36 checks:
IOperatorRegistry(operatorRegistry).isTaskGenerator(msg.sender)

If the BatchTaskManager contract (0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900) was not registered as a taskGenerator in the
OperatorRegistry on Base Sepolia, every batch call reverts with OnlyTaskGenerator. The gateway's EOA was already registered, but the
new BatchTaskManager contract is a separate address.

★ Insight ─────────────────────────────────────
This is a classic delegation pattern issue. Before PR #476, the gateway EOA called createNewTask/respondToTask directly on the
TaskManager (EOA → TaskManager). Now with batching, the call chain is EOA → BatchTaskManager → TaskManager. The TaskManager sees
msg.sender = BatchTaskManager contract, not the gateway EOA. The onlyTaskGenerator check fails because the BatchTaskManager address
was never added to the OperatorRegistry's task generator allowlist.

The batch contract's NatSpec even says: "This contract is registered as a task generator on OperatorRegistry" (line 9) — but this
registration is a deployment step that may have been missed.
─────────────────────────────────────────────────

Two issues to fix

  1. Register BatchTaskManager as a taskGenerator (critical — blocking all submissions)

On Base Sepolia (84532), call OperatorRegistry.addTaskGenerator()

cast send <OPERATOR_REGISTRY_ADDR> \
  "addTaskGenerator(address)" \
  0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900 \
  --rpc-url <BASE_SEPOLIA_RPC> \
  --private-key <ADMIN_KEY>

The OperatorRegistry address for stagef on Base Sepolia is 0xfd779c14ca9e4e2275fe11ff7b80742bf3e0da4f (from the deployment JSON).

Also register the gateway EOA as authorized on the BatchTaskManager itself (line 31 onlyAuthorized modifier):

cast send 0x4a7a311daf5ae9dbb53cc1106f81de9d7b558900 \
  "addAuthorized(address)" \
  <GATEWAY_TASK_GENERATOR_EOA> \
  --rpc-url <BASE_SEPOLIA_RPC> \
  --private-key <BATCH_TASK_MANAGER_OWNER_KEY>

  2. Policy schema mismatch (the original error from dd.csv)

The max_daily_spend schema validation failure is a separate, pre-existing issue: the IPFS schema expects max_daily_spend at the root, but the on-chain params nest it under agents. Fix it by updating either the schema or the params.

Why you see no error logs in Datadog

The batch submitter's flush errors would appear at error! level but potentially under a different service tag or without the task ID
you're filtering on. Try searching Datadog for:

  • "flushing create batch" or "batch create" without a task ID filter
  • "OnlyTaskGenerator" in error messages
  • Error/Warn levels in the newton-prover-gateway service around the same timeframe

@denniswon

Can you check Datadog for these (no task ID filter, full time range):

  1. "batch create accumulator cancelled" or "batch respond accumulator cancelled"
  2. "batch transaction submitter started" — when was the last occurrence?
  3. Any Error level logs between 12:39 and 15:43 from newton-prover-gateway

Immediate fix: restart the gateway service. This will restart the batch accumulator loops.

Root cause to investigate: what killed the accumulator. The flush_create_batch_with_retry has catch_unwind (line 390-392) to protect
against panics during flush, but a panic elsewhere in the loop (e.g., during channel receive or timeout handling) would silently kill
the spawned task.


❯ [Image #7] I don't see "batch create accumulator cancelled" or "batch respond accumulator cancelled". No Error level logs between 12:39 and 12:43 from newton-prover-gateway.

⏺ This confirms it. The batch submitter started at 11:15 and was working at 12:38-12:39. Then it silently died — no cancellation, no
error, no crash log.

The accumulator loops run inside tokio::spawn. When a spawned task panics, the panic goes to stderr — not to structured
tracing/Datadog unless a custom panic hook is configured. This is why there are zero Error logs.

Immediate fix

Restart the gateway. This will restart the batch accumulator loops and the pending creates will start flowing again.

Code fix needed

The accumulator loops need panic protection around the entire loop body, not just during flush. Currently catch_unwind only wraps the
flush_create_batch_with_retry call (line 390), but a panic anywhere else in the loop (channel receive, item processing, timer tick
handling) silently kills the spawned task.
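A minimal std-only sketch of that fix, assuming the loop body can be factored into a closure (`run_accumulator` and the simulated panic are illustrative, not the real accumulator):

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical sketch: wrap the *entire* loop body in catch_unwind so a
// panic in any step (receive, processing, flush) is counted/logged and
// the loop continues, instead of silently killing the spawned task.
fn run_accumulator(items: Vec<i32>) -> (u32, u32) {
    // Silence the default panic printer so simulated panics stay quiet.
    panic::set_hook(Box::new(|_| {}));
    let (mut processed, mut panicked) = (0u32, 0u32);
    for item in items {
        let result = panic::catch_unwind(AssertUnwindSafe(|| {
            // Stand-in for channel receive + flush; panics on a "bad" item.
            if item < 0 {
                panic!("simulated panic in loop body");
            }
        }));
        match result {
            Ok(()) => processed += 1,
            Err(_) => panicked += 1, // real code would log the panic payload
        }
    }
    (processed, panicked)
}

fn main() {
    // The loop survives the panic on -1 and keeps processing.
    assert_eq!(run_accumulator(vec![1, -1, 2]), (2, 1));
    println!("ok");
}
```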

Read 1 file

⏺ Looking at this code, I now think the issue isn't a panic — it's a hung RPC call. The accumulator loop is likely stuck inside
flush_create_batch_with_retry at line 344, which calls:

// writer.rs:503-507
let send_result = contract.batchCreateTasks(batch_tasks).send().await;
// ... `send_result` is unwrapped into `pending` ...
let receipt = pending.get_receipt().await; // ← can hang forever

If a batch tx was sent but the receipt never arrives (tx dropped from mempool, gas too low, RPC node stall), get_receipt().await
blocks indefinitely. The accumulator loop is stuck waiting, all subsequent items queue up in the channel but are never processed.

This explains everything:

  • No error/panic logs (nothing failed — it's just waiting)
  • No cancellation logs (loop didn't exit)
  • Creates queued successfully (channel has capacity)
  • But no flush ever fires (loop is blocked on the stuck receipt)
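The hang-versus-recovery behavior can be demonstrated with a std-only sketch using a channel whose sender never delivers (`wait_for_receipt` is a hypothetical stand-in for the receipt wait):

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

// Hypothetical sketch: an unbounded recv() would block forever if the
// receipt never arrives; a bounded wait returns after the deadline so
// the accumulator loop can recover and move on to the next batch.
fn wait_for_receipt(rx: &Receiver<&'static str>, deadline: Duration) -> Option<&'static str> {
    rx.recv_timeout(deadline).ok()
}

fn main() {
    // Sender held alive but never sends — the "dropped tx" scenario.
    let (_tx, rx) = std::sync::mpsc::channel::<&'static str>();
    // Bounded wait: returns None after 50ms instead of hanging forever.
    assert_eq!(wait_for_receipt(&rx, Duration::from_millis(50)), None);
    println!("recovered after timeout");
}
```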

Immediate fix

Restart the gateway.

Code fix

The batch_create_tasks call needs a timeout. Want me to add one? It would wrap the send+receipt in tokio::time::timeout:

// In AvsWriter::batch_create_tasks, wrap the receipt wait
let receipt = tokio::time::timeout(
    Duration::from_secs(60),
    pending.get_receipt(),
)
.await
.map_err(|_| ChainIoError::CreateNewTaskCallFail {
    reason: "batch create receipt timed out after 60s".into(),
})??;

Also need to check receipt.status() — per lessons.md, get_receipt() doesn't propagate on-chain reverts.
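A hedged sketch of that status check, with a stand-in `Receipt` struct rather than the real alloy receipt type:

```rust
// Hypothetical sketch: a receipt can come back Ok from the RPC even when
// the transaction reverted on-chain, so the submitter must inspect the
// status flag itself instead of trusting get_receipt()'s Ok.
struct Receipt {
    status: bool,          // true = success, false = reverted
    tx_hash: &'static str, // stand-in for the real tx hash type
}

fn check_receipt(receipt: &Receipt) -> Result<&'static str, String> {
    if receipt.status {
        Ok(receipt.tx_hash)
    } else {
        Err(format!("batch tx {} reverted on-chain", receipt.tx_hash))
    }
}

fn main() {
    assert!(check_receipt(&Receipt { status: true, tx_hash: "0xabc" }).is_ok());
    assert!(check_receipt(&Receipt { status: false, tx_hash: "0xdef" }).is_err());
    println!("ok");
}
```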

@denniswon

★ Insight ─────────────────────────────────────
What these changes fix:

  1. Receipt timeout (60s) — get_receipt().await can hang indefinitely if a tx is dropped from the mempool. The timeout ensures the
    accumulator loop recovers and continues processing subsequent batches instead of silently dying.
  2. Receipt status check — Per lessons.md, .get_receipt() returns Ok even for on-chain reverts. Without checking receipt.status(), the
    batch submitter would mark failed creates as "confirmed" in the DashMap, then the respond accumulator would try to submit responses
    for tasks that don't actually exist on-chain.
  3. Improved logging — Respond submissions now log at INFO level (was DEBUG), making successful on-chain submissions visible in
    Datadog. The batch summary log includes success/fail counts for quick health monitoring.
    ─────────────────────────────────────────────────

Summary of changes

┌───────────────────────────────────────────────┬───────────────────────────────────────────────┬─────────────────────────────────┐
│ File │ Change │ Why │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs │ 60s timeout on get_receipt() in both │ Prevents accumulator loop from │
│ │ batch_create_tasks and batch_respond_to_tasks │ hanging forever on dropped txs │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs │ receipt.status() check after receipt │ Detects on-chain reverts that │
│ │ │ get_receipt() silently swallows │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/chainio/src/avs/writer.rs │ Log tx_hash on success │ Makes on-chain txs traceable in │
│ │ │ Datadog │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Panic payload in error log (?panic_info) │ Makes panic root cause visible │
│ │ │ in Datadog │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Respond success log promoted to INFO │ Makes successful responds │
│ │ │ visible in Datadog │
├───────────────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────┤
│ crates/gateway/src/rpc/api/batch_submitter.rs │ Batch flush summary with success/fail counts │ Quick health monitoring per │
│ │ │ flush cycle │
└───────────────────────────────────────────────┴───────────────────────────────────────────────┴─────────────────────────────────┘
