★ Insight ─────────────────────────────────────
0xc73a136a = InvalidGlobalTableRoot() — this is from the OperatorTableUpdater contract (EigenLayer's IOperatorTableUpdater interface).
─────────────────────────────────────────────────
The transporter is calling confirmGlobalTableRoot on the destination chain's ECDSAOperatorTableUpdater (Base Sepolia, chain ID 84532), and the contract is rejecting the Merkle root being submitted.
This typically means one of:
1. Stale or mismatched operator table state — the source chain's BN254TableCalculator has operator weights/roots that don't match what the transporter is computing. Per the lessons in lessons.md (NEWT-845), if weights changed on the source chain without a successful sync, the roots diverge.
2. The root being submitted is zero or invalid — e.g., no operators are configured in the table calculator on the source chain.
3. Reference block issue — the transporter uses latest_block - 1 for the reference timestamp. If the source chain state at that block doesn't have a valid operator table, the computed root is invalid.
Let me check how the contract validates this.
Searched for 1 pattern
⏺ The key line is ECDSAOperatorTableUpdater.sol:156:
require(_globalTableRoots[referenceTimestamp] == globalTableRoot, InvalidGlobalTableRoot());
This reverts during updateOperatorTable — the contract checks that the globalTableRoot passed by the transporter matches the root previously stored via confirmGlobalTableRoot for that referenceTimestamp.
★ Insight ─────────────────────────────────────
The flow is two-step: (1) confirmGlobalTableRoot(referenceTimestamp, globalTableRoot) stores the root (owner-only), then (2) updateOperatorTable(...) verifies per-operator Merkle proofs against that stored root. The error means either the root was never stored for this timestamp, or the transporter is computing a different root than what's on-chain.
─────────────────────────────────────────────────
Most likely causes on stagef:
1. Operator table changed on source chain without re-syncing the root — if someone ran SeedTableCalculator or registered/deregistered operators on Sepolia, the source chain's table root changed. The transporter computes the new root locally, but the destination chain still has the old root stored. The confirmGlobalTableRoot call succeeds (stores the new root), but if that step failed or was skipped, updateOperatorTable sees a mismatch.
2. confirmGlobalTableRoot itself is reverting — check if the transporter signer is still the owner of ECDSAOperatorTableUpdater on Base Sepolia. Per lessons.md, if the owner doesn't match the transporter signer, root commits fail silently upstream, leaving _globalTableRoots[referenceTimestamp] as zero.
Quick diagnostics to run:
# Check who owns the ECDSAOperatorTableUpdater on Base Sepolia
cast call <ECDSAOperatorTableUpdater_address> "owner()" --rpc-url <BASE_SEPOLIA_RPC>
# Compare against the transporter's signer address
# (from TRANSPORTER__SIGNER__PRIVATE_KEY in stagef config)
If the owner doesn't match the transporter signer, you'll need to call transferOwnership() from the current owner. If they do match, the source chain table calculator state likely diverged — run make transport-sync network=testnet private_key=0x... to force a fresh full sync.
Interesting — the operator set is already configured and
both operators are already registered. The script failed
on the setOperatorInfoTreeRoot call because the earlier
reverts (caught by try/catch) still caused the overall
simulation to fail in Forge's strict mode.
But the key finding is: the source chain table calculator
state looks correct — operators are configured and
registered. The issue is likely that the
operatorInfoTreeRoot is stale or was never set after the
last operator change.
Let me check the current root on-chain vs what the script
computed.
⏺ Bash(# Check current operatorInfoTreeRoot on source chain
calculator
cast call 0xFC5e4999FF977EC4AE854dE714a365a332309cEd
"getOperatorSetInfo((address,uint32))" "(0x…)
⎿ 0x000000000000000000000000000000000000000000000000
0000000000000020daeeb249d70d26f62be31764833eaf4165
d35d666c474c64ceb640b9642cdb5e00000000000000000000
… +8 lines
⏺ The source chain is fine — getOperatorSetInfo returns
successfully with the correct root 0xdaeeb249... when
using the correct operator set (serviceManager, 0). My
earlier query used the wrong AVS address (transporter
signer instead of service manager).
The script failed because Forge's --broadcast mode treats
caught reverts as simulation failures. But the on-chain
state is already correct — operators are registered and
the root matches.
So the problem is specifically in the transporter's sync
to Base Sepolia. Let me check the transporter flow more
carefully.
Searched for 2 patterns, read 1 file
⏺ I see the flow now. Let me check the updater to understand
what's happening at the contract call level.
Read 1 file
⏺ Found the bug. Look at confirm_global_root in
updater.rs:43-51:
if !receipt.status() {
    warn!(
        "confirmGlobalTableRoot reverted for root={:?}, tx_hash={}",
        root, receipt.transaction_hash
    );
} else {
    info!("confirmed global table root: {:?}", root);
}
Ok(receipt) // <-- Always returns Ok, even on revert!
★ Insight ─────────────────────────────────────
This is a silent-revert bug. When confirmGlobalTableRoot
reverts on-chain (e.g., GlobalTableRootStale because the
referenceTimestamp is not strictly greater than the last
confirmed timestamp), the transporter logs a warning but
returns Ok(receipt) instead of an error. It then proceeds
to call updateOperatorTable, which checks
_globalTableRoots[referenceTimestamp] == globalTableRoot —
but no root was stored for that timestamp because the
first call reverted. Result: InvalidGlobalTableRoot.
The same pattern exists in update_operator_table (line
72-79) — also swallows reverts.
─────────────────────────────────────────────────
The chain of events on stagef:
- Transporter picks latest_block - 1 on Sepolia → gets referenceTimestamp
- Calls confirmGlobalTableRoot(root, timestamp, blockNumber) → reverts (likely GlobalTableRootStale because a previous sync already confirmed a root with a timestamp >= this one)
- Transporter logs a warn! but continues because it returns Ok(receipt)
- Calls updateOperatorTable(timestamp, root, ...) → reverts with InvalidGlobalTableRoot because no root exists at that timestamp
- This error propagates (the .send().await? in update_operator_table throws), and that's what you see in the logs
There are two things to fix:
- Immediate operational fix: Check the transporter logs for the warn! from "confirmGlobalTableRoot reverted". The underlying revert is likely GlobalTableRootStale — meaning the _latestReferenceTimestamp on Base Sepolia is already >= what the transporter is trying to confirm. This could happen if another transporter instance already synced, or if the transporter daemon is running with a too-short interval relative to Sepolia's block time.
- Code fix: Both confirm_global_root and update_operator_table need to return an error when !receipt.status(), not just log a warning. Want me to fix this?
Compiles clean. Now let me also add this to lessons
learned since it's a recurring pattern (silent reverts).
★ Insight ─────────────────────────────────────
- Silent revert pattern: When using alloy's .send().await?.get_receipt().await?, the ? only propagates RPC/network errors — it does NOT check whether the transaction itself reverted on-chain. You must explicitly check receipt.status() and return an error. This is different from .call().await?, which DOES propagate contract reverts.
- Cascading failure: The confirmGlobalTableRoot revert was silent, so the transporter proceeded to updateOperatorTable, which failed with a misleading InvalidGlobalTableRoot — the real error was upstream (likely GlobalTableRootStale).
─────────────────────────────────────────────────
Done. Here's a summary of what was fixed and what remains:
Code fix (crates/transporter/src/updater.rs):
- confirm_global_root — now returns Err when !receipt.status() instead of logging a warning and continuing
- update_operator_table — same fix
- Removed unused warn import
Lesson added (.claude/rules/lessons.md):
- Documented the alloy .send().get_receipt() silent-revert pattern under Code Patterns
Operational status on stagef:
- Source chain (Sepolia) table calculator is correctly configured — operators registered, operatorInfoTreeRoot set
- The transporter is failing because confirmGlobalTableRoot reverts (likely GlobalTableRootStale — the referenceTimestamp is not strictly greater than the last confirmed one) and the code was silently continuing
- After deploying this fix, the transporter will properly fail-fast on a confirmGlobalTableRoot revert and surface the real error in logs
- You may still need to investigate why confirmGlobalTableRoot is reverting — likely the _latestReferenceTimestamp on Base Sepolia is already ahead of what the transporter is computing, possibly from a concurrent or recent sync
Progress. confirmGlobalTableRoot succeeded — the root was stored on Base Sepolia. The failure is now nonce too low: next
nonce 15, tx nonce 14 on the updateOperatorTable call.
★ Insight ─────────────────────────────────────
This is the nonce race documented in lessons.md: the confirmGlobalTableRoot tx used nonce 14 and was mined, but the
local nonce cache still thinks the next nonce is 14 (not 15). The CachedNonceManager queried
eth_getTransactionCount("pending") at the start and cached it — but by the time updateOperatorTable tries to send, the
first tx has already been mined and incremented the on-chain nonce.
─────────────────────────────────────────────────
Let me just retry — the nonce cache will refresh on the second attempt, and confirmGlobalTableRoot will fail with
GlobalTableRootStale (same root already stored), which our code fix now properly surfaces. But actually, the timestamp
will be newer on retry so it should work.
⏺ Bash(sleep 15 && RPC_HTTP_URL_11155111="https://dimensional-few-slug.ethereum-sepolia.quiknode.pro/850b4f6d0352ef457cda2
7de9f97354e01514afa"
RPC_HTTP_URL_84532="ht…)
⎿ Finished dev profile [unoptimized + debuginfo] target(s) in 0.67s
Running target/debug/newton-prover-transporter sync --network testnet
2026-03-25T17:34:18.880759Z INFO newton_prover_transporter::commands: starting one-shot transport sync to 1 des
… +29 lines
⎿ (timeout 5m)
⏺ Transport sync succeeded:
confirmed global table root: 0xb815dcc2...
updated operator table at index 0
transport sync complete: success=1 failed=0 total=1
Summary of what happened and what was fixed:
- Root cause of the original InvalidGlobalTableRoot error: Most likely the stagef transporter ECS service was hitting RPC rate limits (429s) from default/shared RPC endpoints. When RPC calls fail mid-computation, the transporter produces an invalid root. Combined with the silent-revert bug, the real error was masked.
- Code fix (crates/transporter/src/updater.rs): confirm_global_root and update_operator_table now return Err on reverted receipts instead of swallowing them. This prevents the cascading InvalidGlobalTableRoot error when confirmGlobalTableRoot actually reverts for a different reason (e.g., GlobalTableRootStale).
- Operational fix: The transporter needed proper RPC_HTTP_URL_11155111 and RPC_HTTP_URL_84532 env vars pointing to endpoints with adequate rate limits. The stagef ECS task definition should have these configured.
- State is now healthy: The operator table root is confirmed on Base Sepolia and the operator table is updated at index 0. The transporter should continue working on subsequent sync intervals.
★ Insight ─────────────────────────────────────
The "Quorum NOT reached" on destination chains is a consistency gap, not a config problem. Three independent data sources must agree
but are read at different times:
- Gateway reads CURRENT operator weights from source chain's BN254TableCalculator.getOperatorSetWeights() to build merkle proofs
- Destination chain's BN254CertificateVerifier validates against its STORED merkle root (from the last transporter sync)
- Transporter syncs on a 24-hour interval (update_frequency = 86400) or on OperatorAdded/OperatorRemoved/OperatorSlashed events
Any change between syncs — stake weight updates, delegations, admin operations — causes the gateway's freshly-computed merkle tree to
differ from the verifier's stored root. The merkle proof fails silently (empty 0x revert), which surfaces as "Quorum NOT reached."
─────────────────────────────────────────────────
Why It Keeps Recurring
The transporter's event listener (crates/transporter/src/listener.rs) only watches for three AllocationManager events:
- OperatorAddedToOperatorSet
- OperatorRemovedFromOperatorSet
- OperatorSlashed
But these don't cover:
- Stake weight changes (delegations, undelegations, magnitude updates) — no event trigger
- Manual admin ops (SeedTableCalculator, configureOperatorSet) — no event trigger
- Transporter failures — logged but retried only at next 24h tick
- operatorInfoTreeRoot changes in the table calculator — no event trigger
The core code path in crates/aggregator/src/core.rs:1438-1480 shows the problem: build_non_signer_witnesses calls
getOperatorSetWeights() on the source chain's table calculator to get LIVE weights, then builds a merkle tree from them. But the dest
chain verifier validates against the merkle root committed by the transporter (possibly hours ago).
Three Fix Strategies (Increasingly Fundamental)
- Band-Aid: Reduce update_frequency + More Event Triggers
- Change default from 86400 to 300 (5 min)
- Add MagnitudeUpdated / weight-change events to SourceChainListener
- Still has a race window — NOT a fundamental fix
- Structural: Gateway Builds Certificates Against Verifier's Stored State
The gateway already reads latestReferenceTimestamp from the dest chain (line 1355-1359 in core.rs). Instead of then querying the
source chain for CURRENT weights, it should query the source chain at the BLOCK corresponding to that reference timestamp. This
requires archive RPC access.
- Fundamental: Shared Snapshot Between Transporter and Gateway
After each transport_update, the transporter persists the computed operator table + merkle tree to a shared store (Redis or Postgres).
The gateway reads this cached snapshot — not live source chain state — when building certificates. Both the dest chain verifier and
the gateway use identical data. Zero consistency gap.
Architecture change:
BEFORE (broken):
Gateway ─── reads CURRENT state ──→ Source Chain TableCalculator
Verifier ── reads STORED state ───→ Last Transporter Sync (hours ago)
AFTER (consistent):
Transporter ─── syncs ──→ Dest Chain Verifier
└── writes ──→ Shared Snapshot Store
Gateway ─── reads ──────→ Shared Snapshot Store (same data as verifier)
This eliminates the class of bugs entirely. The gateway can never build a certificate that disagrees with the verifier because they
read from the same snapshot.
Implementation Sketch for Option 3
- Transporter: After confirm_global_root + update_operator_table succeed, serialize the (operators, weights, merkle_root, reference_timestamp) tuple to Redis key transporter:snapshot:{dest_chain_id}
- Gateway: In build_non_signer_witnesses, read from the Redis snapshot instead of calling getOperatorSetWeights() on the source chain
- Fallback: If the snapshot is missing (first deploy, Redis down), fall back to current behavior + log a warning
- Health check: Gateway's /health endpoint already checks cert verifier staleness — extend it to also check snapshot freshness
Bonus: Why the "Quorum NOT reached" Error Message is Misleading
The actual failure is merkle proof verification in verifyCertificate, but it surfaces as "Quorum NOT reached" because the BLS
aggregation itself succeeds (operators DID sign), but the on-chain respondToTask transaction reverts when submitting the certificate —
the gateway interprets any submission failure as quorum failure.
Want me to dig deeper into any of these options, or should we start implementing Option 3 (shared snapshot)?
- Reduce default to 3600 — agreed, straightforward config change.
- How to map referenceTimestamp → source chain block
★ Insight ─────────────────────────────────────
The mapping already exists on-chain. The ECDSAOperatorTableUpdater stores a bidirectional mapping between referenceTimestamp and
referenceBlockNumber:
- _referenceBlockNumbers[referenceTimestamp] → source chain block number
- _referenceTimestamps[referenceBlockNumber] → source chain timestamp
- getLatestReferenceBlockNumber() → shortcut for current
Both are source chain values (the transporter reads source_chain.latest_block - 1 and passes both to confirmGlobalTableRoot). No
destination chain timestamp ambiguity.
So Option 2 (query at historical block) is viable WITHOUT archive nodes for the table calculator:
- Gateway reads latestReferenceTimestamp from the dest chain verifier (already does this, line 1355)
- Gateway reads getReferenceBlockNumberByTimestamp(ts) from the dest chain's ECDSAOperatorTableUpdater (new call, contract already supports it)
- Gateway calls getOperatorSetWeights(operatorSet).call().block(referenceBlockNumber) on the source chain table calculator at that historical block
Catch: getOperatorSetWeights reads from the source chain's BN254TableCalculator storage. Calling .call().block(N) requires the source
chain RPC to serve state at block N. Standard (non-archive) nodes typically serve ~128-256 blocks of history on Ethereum. If
referenceBlockNumber is within that window, it works without archive. If the transporter synced recently (e.g., every 1h with the new
default), the block is always recent enough.
─────────────────────────────────────────────────
This makes Option 2 the simplest fundamental fix — it requires zero new infrastructure (no Redis, no shared store). Just one
additional contract read + one historical block query.
- Shared Snapshot (Option 3) — Risks
┌─────────────────────────────────────────────────┬──────────┬────────────────────────────────────────────────────────────────────┐
│ Risk │ Severity │ Mitigation │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Redis as new dependency for certificate path │ Medium │ Redis already required for chain watcher; fallback to current │
│ │ │ behavior if unavailable │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Snapshot staleness (transporter crashes, Redis │ Medium │ TTL on snapshot key; gateway detects missing/expired snapshot and │
│ write fails) │ │ falls back │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Data format coupling between transporter and │ Low │ Shared Rust struct in core crate, versioned │
│ gateway │ │ │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Ordering issue: snapshot written before │ High │ Must write to Redis AFTER confirm_global_root + │
│ on-chain root committed │ │ update_operator_table both succeed │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Extra operational surface (Redis key │ Low │ Single key per dest chain, auto-expires │
│ monitoring, pruning) │ │ │
└─────────────────────────────────────────────────┴──────────┴────────────────────────────────────────────────────────────────────┘
The ordering risk is the main concern — if the transporter writes the snapshot to Redis but confirm_global_root fails on-chain, the
gateway would use a snapshot that doesn't match the verifier. The write order MUST be: on-chain first, Redis second.
Recommendation: Option 2 First, Option 3 Later
Option 2 (historical block query) is simpler and has fewer failure modes:
BEFORE:
Gateway → getOperatorSetWeights() at CURRENT block ← MISMATCH
Verifier → validates against STORED merkle root from referenceBlockNumber
AFTER:
Gateway → getReferenceBlockNumberByTimestamp(latestRefTS) from dest chain
Gateway → getOperatorSetWeights().block(refBlockNumber) at SAME block ← MATCH
Verifier → validates against STORED merkle root from referenceBlockNumber
Changes needed:
- create_certificate_signature_data — read referenceBlockNumber from the dest chain's ECDSAOperatorTableUpdater (one new contract call)
- build_non_signer_witnesses — pass referenceBlockNumber and use .call().block(refBlockNumber) on the source chain getOperatorSetWeights call instead of .call() (current/latest)
- Add ECDSAOperatorTableUpdater bindings to generated code (or use inline sol! like IOperatorTableCalculatorView)
No new infrastructure. No Redis dependency. No ordering concerns. The consistency is guaranteed by reading the same block the
transporter used.
Option 3 (shared snapshot) becomes relevant later when you need to decouple from archive-depth RPC requirements or optimize away the
extra contract read.
- Fixing the Misleading Error Message
The "Quorum NOT reached" message comes from the BLS aggregation layer, which doesn't know about certificate verification. When respondToTask reverts on-chain due to a merkle proof failure, the error bubbles up as a generic submission failure. The gateway should:
- When certificate generation (create_certificate_signature_data) fails, surface the actual error: "BN254 certificate verification failed: operator table mismatch between source chain (block X) and dest chain verifier (referenceTimestamp Y). Run transporter sync."
- When respondToTask reverts with empty 0x data on a destination chain, check if it's a certificate issue rather than a quorum issue.
Want me to plan the implementation for Option 2 + the error message fix?
★ Insight ─────────────────────────────────────
Summary of all changes:
- Core fix (crates/aggregator/src/core.rs): create_certificate_signature_data now resolves referenceBlockNumber from the dest chain's ECDSAOperatorTableUpdater and queries getOperatorSetWeights at that historical block. The gateway and the verifier now use identical operator state — eliminating the consistency gap entirely.
- Graceful fallback: If the updater address isn't configured (single-chain) or the historical block query fails, the code falls back to current-block behavior with a warning. No breaking change for existing deployments.
- Reduced exposure window (crates/transporter/src/config.rs): Default update_frequency reduced from 24h to 1h, meaning even in the fallback case, staleness is bounded.
- Actionable error messages (crates/gateway/src/rpc/api/sync.rs): Certificate failures now say "BN254 certificate generation failed" with a transporter sync hint, instead of the misleading "Quorum NOT reached".
- Documentation (.claude/rules/lessons.md): NEWT-845 entry updated with root cause, fix description, prevention invariant, and Option 3 (shared Redis snapshot) as a future consideration.
─────────────────────────────────────────────────
Files changed
┌───────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
│ File │ Change │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/aggregator/src/core.rs │ Historical block query in build_non_signer_witnesses, new dest_operator_table_updater │
│ │ field │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/aggregator/src/lib.rs │ Thread dest_operator_table_updater from config │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/task/aggregator.rs │ Thread dest_operator_table_updater from config │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/sync.rs │ Certificate-specific error messages with transporter sync hints │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/transporter/src/config.rs │ Default update_frequency: 86400 → 3600 │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/transporter/transporter.toml │ Default update_frequency: 86400 → 3600 │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ .claude/rules/lessons.md │ NEWT-845 entry expanded with fix details + Option 3 future consideration │
└───────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────┘
Want me to run make clippy and make fmt, or create a commit?
PR created: newt-foundation/newton-prover-avs#479
The uncommitted change warning is just the .gitignore modification (OMC directories) which we intentionally excluded from this PR.
★ Insight ─────────────────────────────────────
Recap of the full investigation and fix:
- Root cause: The gateway built BN254 certificates from live source chain state while the dest chain verifier validated against a transporter snapshot (up to 24h old). Any weight change in between caused a merkle proof mismatch → silent 0x revert → misleading "Quorum NOT reached".
- Fix: Read getReferenceBlockNumberByTimestamp from the dest chain, query getOperatorSetWeights at that historical block. Zero new infrastructure — just one extra contract read.
- Band-aid: update_frequency 24h → 1h reduces the fallback exposure window.
- Error UX: Certificate failures now say what they are, with actionable transporter sync hints.
- Future: Option 3 (Redis shared snapshot) documented in lessons.md for archive-constrained environments.
─────────────────────────────────────────────────
The logs reveal a completely different issue. This is NOT a stale certificate problem. Look at the actual error:
Policy evaluation failed: Failed to validate params schema:
Schema validation failed: Missing required property 'max_daily_spend' at ''
Both operators (operator-1 and operator-2) fail with the same error. The policy schema requires max_daily_spend in the params, but the
intent's sender (0xd1168701...) is not in the policy's agents map — so no agent-specific params are found, and the schema validation
fails on the missing field.
★ Insight ─────────────────────────────────────
The "Quorum NOT reached" here is genuine — zero operators signed because all of them hit a policy evaluation error. This is the exact
scenario mentioned in lessons.md under "Unified quorum broadcast must collect and surface operator RPC errors." The error IS being
surfaced now (line 10 shows the full OperatorError with the RPC error message), but the top-level error still says "Quorum not
reached" which makes it look like the certificate issue.
The policy params (hex-decoded from line 108) show:
{
"agents": {
"0x5Bf6a3fD231ffF4BE6E1bBfFc31EF9531B19B7": { "max_daily_spend": 10000000000, ... },
"0xd93e791123D48A49eDE6d328232...": { "max_daily_spend": 1000000000, ... }
},
"default_policy": "deny"
}
The intent's from is 0xd1168701c5df97bdb65361aaf9318ae2c7e39312 — this address is NOT in the agents map. The policy evaluates with
default_policy: "deny" path, but before it gets there, the schema validator checks for max_daily_spend at the root level (not nested
under an agent key) and fails.
─────────────────────────────────────────────────
The stale certificate fix (transporter sync + PR #479) is irrelevant to this failure. The certificate generation never even runs
because BLS aggregation fails first — no operator signed, so there's nothing to aggregate.
The fix here is either:
- Add the sender 0xd1168701... to the policy's agents map with a max_daily_spend value
- Or use a sender address that's already in the agents map (0x5Bf6a3fD... or 0xd93e7911...)
- Or fix the Rego policy to handle the case where the sender isn't in the agents map without requiring max_daily_spend at the root
level
This is a policy configuration issue, not an infrastructure issue.
does that mean the changes on this PR branch are invalid? should we then just close this PR branch entirely?
⏺ No — the PR is still valid. These are two independent issues:
- The stale certificate bug (NEWT-845) — real, structural, and will recur on any destination chain whenever operator weights drift between transporter syncs. PR #479 is the fundamental fix for this class of bugs.
- The Datadog log you just shared — a policy configuration error (max_daily_spend missing for the sender address). This fails at the operator Rego evaluation step, before BLS aggregation even starts. Certificate generation is never reached.
The confusing part is that both surface as "Quorum NOT reached." That's exactly why PR #479 also includes the error message
improvements — on destination chains, it now adds the transporter sync hint to help distinguish the two. In this case, the
operator_errors field in the response already shows the real cause (policy eval failure), which confirms it's not certificate-related.
★ Insight ─────────────────────────────────────
The stale certificate issue was the original problem you asked about ("Quorum not reached: Unified Quorum NOT reached for quorums [0]
(threshold: 40%)"). The transporter sync you just ran fixed the cert staleness. But the test task you submitted after the sync
happened to hit a different failure — the policy doesn't recognize the sender address. If you re-test with a sender that's in the
policy's agents map, the certificate path will be exercised and you'll see whether the sync actually resolved the original issue.
─────────────────────────────────────────────────
Keep the PR open. It fixes:
- The root cause of stale certificates (historical block query)
- The misleading error messages (which would have made this debugging session faster)
- The 24h→1h default sync interval
how do we 100% know that the changes/fixes in this PR are strictly better than before? if we were investigating with an incorrect hypothesis, doesn't that mean our approach was not correct to begin with? please explain
⏺ Fair challenge. Let me be precise about what we know vs what we assumed.
What we know for certain
- The consistency gap exists in the code. This is provable by reading the code — build_non_signer_witnesses calls getOperatorSetWeights() at the current block (line ~1448 in core.rs before this PR), while the BN254CertificateVerifier validates against the merkle root from referenceBlockNumber (committed by the transporter). If these blocks have different operator state, the merkle proof WILL fail. This is a logic bug, not a hypothesis.
- The NEWT-845 lesson already documents this exact failure. It was observed and diagnosed before — the previous "fix" was "run transporter sync manually." That's an operational workaround, not a code fix.
- The error message improvement is strictly better. Before: "Quorum not reached" for both policy failures AND certificate failures. After: certificate failures get a specific hint. This is why today's debugging was confusing — you couldn't tell which failure you were hitting.
- The update_frequency default of 24h was too long. 1h is strictly better — smaller staleness window, no downside.
What we did NOT prove
We did not reproduce the stale certificate failure in this session. The test task hit a policy config error before reaching the
certificate path. So we don't have end-to-end evidence that the historical block query fixes a live failure.
Why the fix is still correct
The historical block query is a tightening of an invariant, not a speculative change. Before:
gateway reads: getOperatorSetWeights() at block N (current)
verifier checks: merkle root from block M (transporter sync)
N != M → possible mismatch
After:
gateway reads: getOperatorSetWeights() at block M (from verifier)
verifier checks: merkle root from block M (transporter sync)
M == M → guaranteed match
The before-state has a race condition. The after-state eliminates it. Even if the race hasn't bitten you today, it's a latent bug that
fires whenever operator weights change between transporter syncs. The fallback behavior (current block) is preserved when the
historical query fails, so there's no regression risk.
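The block-selection logic described above can be sketched as a tiny model. This is a hypothetical reconstruction, not the actual code from core.rs — the struct, field, and function names (VerifierState, reference_block, weights_query_block) are illustrative assumptions:

```rust
// Hypothetical sketch of the fix's invariant: read operator weights at the
// verifier's reference block so the gateway and verifier agree on state.
#[derive(Clone, Copy)]
struct VerifierState {
    // Block M committed by the transporter sync; None if the historical
    // query is unavailable.
    reference_block: Option<u64>,
}

/// Pick the block at which to query getOperatorSetWeights().
/// After the PR: prefer the verifier's reference block (M == M, guaranteed
/// match); fall back to the current block only when the historical query
/// fails, preserving the pre-PR behavior as a safety net.
fn weights_query_block(verifier: VerifierState, current_block: u64) -> u64 {
    verifier.reference_block.unwrap_or(current_block)
}

fn main() {
    // Reference block available: gateway reads at block M, same as the verifier.
    let synced = VerifierState { reference_block: Some(100) };
    assert_eq!(weights_query_block(synced, 120), 100);

    // Historical query unavailable: preserved fallback to the current block,
    // which reintroduces the old race but avoids a regression.
    let unsynced = VerifierState { reference_block: None };
    assert_eq!(weights_query_block(unsynced, 120), 120);
}
```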
How to get 100% confidence
Run make multichain-routing-e2e — this exercises the full certificate generation path on a destination chain with a known-good policy.
If it passes, the historical block query works end-to-end. The policy config error you hit on stagef is orthogonal to this code path.
Want me to run it?
Another question: do we already return the actual underlying error for a failed policy evaluation task? (Specifically, things
like "Policy evaluation failed: Failed to validate params schema: Schema validation failed: Missing required property 'max_daily_spend'
at ''" that pinpoint the exact issue.) I thought we already had the gateway aggregate the operator error responses for their
failures/errors and report them back to the client. Why does the client still only report a high-level generic error? Or is this a
frontend bug — the frontend isn't using the aggregated operator error responses returned by the gateway?
⏺ Good question. Let me check what the gateway actually returns vs what the client/explorer shows.
Searched for 2 patterns (ctrl+o to expand)
⏺ The gateway does return operator_errors in the JSON-RPC response. The data is there. Let me check how the explorer/frontend consumes
it.
Searched for 2 patterns (ctrl+o to expand)
⏺ There it is. Neither the dashboard API nor the indexer reads operator_errors from the gateway response.
★ Insight ─────────────────────────────────────
The data flow for failed tasks is:
- Gateway returns CreateTaskResponse with error: "Quorum not reached: ..." AND operator_errors: [{ operator_address, message: "Policy evaluation failed: Missing required property 'max_daily_spend'" }]
- Indexer writes the task to the database via the optimistic indexer DB write — but it only stores the top-level error field, not operator_errors
- Dashboard API reads from the database and serves it to the explorer
- Explorer frontend displays only the generic error string
The detailed per-operator errors are returned in the JSON-RPC response to the direct caller (SDK, CLI), but they're not persisted to
the indexer database and therefore never make it to the explorer UI. The explorer only sees "Quorum not reached: Unified Quorum NOT
reached for quorums [0] (threshold: 40%)".
─────────────────────────────────────────────────
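The gap in the pipeline above can be modeled in a few lines. This is a toy sketch, not the real indexer code — the struct names (CreateTaskResponse, OperatorError) echo the RPC response shape described above, but the persisted_row function and all field details are illustrative assumptions:

```rust
// Toy model of the indexer gap: the RPC response carries per-operator
// detail, but the DB write keeps only the top-level error string.
struct OperatorError {
    operator_address: String,
    message: String,
}

struct CreateTaskResponse {
    error: Option<String>,               // generic summary, e.g. "Quorum not reached: ..."
    operator_errors: Vec<OperatorError>, // detailed per-operator failures (RPC only)
}

/// What the indexer effectively persists today: only the top-level error.
/// operator_errors is dropped here, so the explorer can never render it.
fn persisted_row(resp: &CreateTaskResponse) -> Option<String> {
    resp.error.clone()
}

fn main() {
    let resp = CreateTaskResponse {
        error: Some("Quorum not reached: Unified Quorum NOT reached for quorums [0]".to_string()),
        operator_errors: vec![OperatorError {
            operator_address: "0xabc".to_string(),
            message: "Policy evaluation failed: Missing required property 'max_daily_spend'".to_string(),
        }],
    };

    // The detail exists in the RPC response a direct caller (SDK/CLI) sees...
    assert_eq!(resp.operator_errors.len(), 1);
    // ...but only the generic string survives the DB write.
    let stored = persisted_row(&resp);
    assert!(stored.unwrap().starts_with("Quorum not reached"));
}
```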
So to answer your question: the gateway already does the right thing — it returns detailed operator_errors in the RPC response. The
bug is that the indexer/dashboard pipeline drops this information. A client calling the gateway RPC directly (e.g., via the Newton
SDK) would see the full error. The explorer doesn't because the indexer never stores operator_errors.
This is a separate issue from PR #479 — it's a dashboard/indexer gap. Want me to file it or look at what the indexer DB write
currently persists?
✻ Brewed for 34s
There it is. Both calls reveal the root cause:
★ Insight ─────────────────────────────────────
0xb9a620da = OperatorSetNotConfigured — the source chain's
BN254TableCalculator (0xFC5e...) on Sepolia doesn't have
the operator set configured. This is exactly the NEWT-845
pattern from lessons.md: the table calculator needs to be
seeded after deployment/redeployment.
The chain of failure:
1. The operator set was never configured → getOperatorSetInfo reverts with OperatorSetNotConfigured
2. The transporter computes an invalid/empty root from the unconfigured table calculator on Sepolia
3. The destination chain rejects the sync with InvalidGlobalTableRoot because the root doesn't match any valid operator state
─────────────────────────────────────────────────
The dest chain's latestReferenceTimestamp also reverts
with empty 0x, likely because no valid root has ever been
confirmed for this operator set.
Fix: Seed the BN254TableCalculator on Sepolia.
Let me check what the seeding script expects —
specifically the operator set ID and AVS address.
Searched for 1 pattern, read 1 file (ctrl+o to expand)
⏺ Clear picture now. The fix is straightforward:
Step 1: Seed the BN254TableCalculator on Sepolia:
cd contracts && forge script script/SeedTableCalculator.s.sol \
  --rpc-url "https://dimensional-few-slug.ethereum-sepolia.quiknode.pro/850b4f6d0352ef457cda27de9f97354e01514afa" \
  --broadcast \
  --private-key 0xe61d9db0d49a63b93fbafcc7aa58705b0e0b4a1b60d40d8ba9692dd459d8ce10
This will:
- Configure the operator set in the table calculator (skips if already done)
- Register the operators from the AllocationManager on Sepolia
- Set each operator's weight
Step 2: Sync to Base Sepolia via transporter:
make transport-sync network=testnet private_key=0xe61d9db0d49a63b93fbafcc7aa58705b0e0b4a1b60d40d8ba9692dd459d8ce10
This pushes the new operator table root from Sepolia →
Base Sepolia's ECDSAOperatorTableUpdater.
★ Insight ─────────────────────────────────────
Unlike EigenLayer's version, which reads live from AllocationManager.getMembers(), Newton's version requires explicit configureOperatorSet + registerOperator calls. This is because destination chains can't read the source chain's AllocationManager directly, so the state must be snapshotted and synced.
An unconfigured table calculator on the source chain cascades into 0xc73a136a (InvalidGlobalTableRoot) on the destination chain — the transporter can't compute a valid root from an unconfigured table calculator. Any cross-chain contract redeployment or fresh BN254TableCalculator deployment requires re-seeding.
─────────────────────────────────────────────────
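The cascade from an unseeded table calculator can be captured in a toy model. Everything here is illustrative — the real contracts are Solidity, and the type and function names (TableCalculator, compute_root) are assumptions standing in for the on-chain getOperatorSetInfo call and the transporter's root computation:

```rust
// Toy model of the seeding requirement: an unconfigured operator set means
// there is no state to compute a root from, so nothing valid can be synced
// to the destination chain.
use std::collections::HashMap;

#[derive(Default)]
struct TableCalculator {
    // operator set id -> operator weights; empty until explicitly seeded
    // via configureOperatorSet + registerOperator (modeled as seed()).
    operator_sets: HashMap<u32, Vec<u64>>,
}

impl TableCalculator {
    fn seed(&mut self, set_id: u32, weights: Vec<u64>) {
        self.operator_sets.insert(set_id, weights);
    }

    // Mirrors getOperatorSetInfo reverting with OperatorSetNotConfigured.
    fn operator_set_info(&self, set_id: u32) -> Result<&Vec<u64>, &'static str> {
        self.operator_sets.get(&set_id).ok_or("OperatorSetNotConfigured")
    }
}

// The transporter can only compute a root from a configured set; the sum
// here is a stand-in for the real merkle root over operator state.
fn compute_root(calc: &TableCalculator, set_id: u32) -> Result<u64, &'static str> {
    let weights = calc.operator_set_info(set_id)?;
    Ok(weights.iter().sum())
}

fn main() {
    let mut calc = TableCalculator::default();

    // Before seeding: the source-chain read fails, so no valid root exists
    // and the destination-side confirm can only ever see an invalid root.
    assert_eq!(compute_root(&calc, 0), Err("OperatorSetNotConfigured"));

    // After seeding: a root can be computed and synced to the destination.
    calc.seed(0, vec![40, 60]);
    assert_eq!(compute_root(&calc, 0), Ok(100));
}
```

This is why re-seeding after any redeployment is mandatory: a fresh TableCalculator starts with an empty map.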
Want me to run the seeding script, or would you prefer to
run it yourself?