@denniswon
Created March 25, 2026 17:08
transporter sync issue: initial startup sync failed, continuing with event loop
★ Insight ─────────────────────────────────────
0xc73a136a = InvalidGlobalTableRoot() — this is from the OperatorTableUpdater contract (EigenLayer's
IOperatorTableUpdater interface).
─────────────────────────────────────────────────
The transporter is calling confirmGlobalTableRoot on the destination chain's ECDSAOperatorTableUpdater (Base Sepolia
84532), and the contract is rejecting the Merkle root being submitted.
This typically means one of:
1. Stale or mismatched operator table state — the source chain's BN254TableCalculator has operator weights/roots that
don't match what the transporter is computing. Per the lessons in lessons.md (NEWT-845), if weights changed on the
source chain without a successful sync, the roots diverge.
2. The root being submitted is zero or invalid — e.g., no operators are configured in the table calculator on the source
chain.
3. Reference block issue — the transporter uses latest_block - 1 for the reference timestamp. If the source chain state
at that block doesn't have a valid operator table, the computed root is invalid.
Let me check how the contract validates this.
Searched for 1 pattern
⏺ The key line is ECDSAOperatorTableUpdater.sol:156:
require(_globalTableRoots[referenceTimestamp] == globalTableRoot, InvalidGlobalTableRoot());
This reverts during updateOperatorTable — the contract checks that the globalTableRoot passed by the transporter matches
the root previously stored via confirmGlobalTableRoot for that referenceTimestamp.
★ Insight ─────────────────────────────────────
The flow is two-step: (1) confirmGlobalTableRoot(referenceTimestamp, globalTableRoot) stores the root (owner-only), then
(2) updateOperatorTable(...) verifies per-operator Merkle proofs against that stored root. The error means either the
root was never stored for this timestamp, or the transporter is computing a different root than what's on-chain.
─────────────────────────────────────────────────
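The two-step dependency can be modeled with stand-in types (a minimal sketch, not the real contract or crate code): `confirm` plays the role of confirmGlobalTableRoot and `update` plays the role of the require() check in updateOperatorTable.

```rust
use std::collections::HashMap;

// Minimal model of the two-step flow, with stand-in types.
struct TableUpdater {
    global_table_roots: HashMap<u64, [u8; 32]>,
}

impl TableUpdater {
    // Stores the root for a referenceTimestamp, like confirmGlobalTableRoot.
    fn confirm(&mut self, reference_timestamp: u64, root: [u8; 32]) {
        self.global_table_roots.insert(reference_timestamp, root);
    }

    // Mirrors: require(_globalTableRoots[ts] == globalTableRoot, InvalidGlobalTableRoot())
    fn update(&self, reference_timestamp: u64, root: [u8; 32]) -> Result<(), &'static str> {
        match self.global_table_roots.get(&reference_timestamp) {
            Some(stored) if *stored == root => Ok(()),
            _ => Err("InvalidGlobalTableRoot"),
        }
    }
}
```

Both failure modes above reduce to this one check: either nothing was stored for the timestamp, or a different root was.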
Most likely causes on stagef:
1. Operator table changed on source chain without re-syncing the root — if someone ran SeedTableCalculator or
registered/deregistered operators on Sepolia, the source chain's table root changed. The transporter computes the new
root locally but the destination chain still has the old root stored. The confirmGlobalTableRoot call succeeds (stores
new root), but if that step failed or was skipped, updateOperatorTable sees a mismatch.
2. confirmGlobalTableRoot itself is reverting — check if the transporter signer is still the owner of
ECDSAOperatorTableUpdater on Base Sepolia. Per lessons.md, if the owner doesn't match the transporter signer, root
commits fail silently upstream, leaving _globalTableRoots[referenceTimestamp] as zero.
Quick diagnostics to run:
# Check who owns the ECDSAOperatorTableUpdater on Base Sepolia
cast call <ECDSAOperatorTableUpdater_address> "owner()" --rpc-url <BASE_SEPOLIA_RPC>
# Compare against the transporter's signer address
# (from TRANSPORTER__SIGNER__PRIVATE_KEY in stagef config)
If the owner doesn't match the transporter signer, you'll need to call transferOwnership() from the current owner. If
they do match, the source chain table calculator state likely diverged — run make transport-sync network=testnet
private_key=0x... to force a fresh full sync.
@denniswon (Author)

There it is. Both calls reveal the root cause:

★ Insight ─────────────────────────────────────
0xb9a620da = OperatorSetNotConfigured — the source chain's
BN254TableCalculator (0xFC5e...) on Sepolia doesn't have
the operator set configured. This is exactly the NEWT-845
pattern from lessons.md: the table calculator needs to be
seeded after deployment/redeployment.

The chain of failure:

  1. Source chain BN254TableCalculator has no operator set
    configured → getOperatorSetInfo reverts
  2. Transporter can't compute a valid operator table →
    computes an invalid/empty root
  3. confirmGlobalTableRoot stores this invalid root on Base
    Sepolia
  4. updateOperatorTable fails with InvalidGlobalTableRoot
    because the root doesn't match any valid operator state
    ─────────────────────────────────────────────────

The dest chain's latestReferenceTimestamp also reverts
with empty 0x, likely because no valid root has ever been
confirmed for this operator set.

Fix: Seed the BN254TableCalculator on Sepolia.

Let me check what the seeding script expects —
specifically the operator set ID and AVS address.

Searched for 1 pattern, read 1 file

⏺ Clear picture now. The fix is straightforward:

Step 1: Seed the BN254TableCalculator on Sepolia:

cd contracts && forge script script/SeedTableCalculator.s.sol \
  --rpc-url "https://dimensional-few-slug.ethereum-sepolia.quiknode.pro/850b4f6d0352ef457cda27de9f97354e01514afa" \
  --broadcast \
  --private-key 0xe61d9db0d49a63b93fbafcc7aa58705b0e0b4a1b60d40d8ba9692dd459d8ce10

This will:

  1. Call configureOperatorSet on 0xFC5e... (idempotent —
    skips if already done)
  2. Fetch all registered operators from EigenLayer's
    AllocationManager on Sepolia
  3. Register each operator with their BLS pubkey and stake
    weight
  4. Compute and set the operatorInfoTreeRoot Merkle root

Step 2: Sync to Base Sepolia via transporter:

make transport-sync network=testnet private_key=0xe61d9db0d49a63b93fbafcc7aa58705b0e0b4a1b60d40d8ba9692dd459d8ce10

This pushes the new operator table root from Sepolia →
Base Sepolia's ECDSAOperatorTableUpdater.

★ Insight ─────────────────────────────────────

  • The BN254TableCalculator is stateful — unlike
    EigenLayer's version that reads live from
    AllocationManager.getMembers(), Newton's version requires
    explicit configureOperatorSet + registerOperator calls.
    This is because destination chains can't read the source
    chain's AllocationManager directly, so the state must be
    snapshotted and synced.
  • The 0xb9a620da (OperatorSetNotConfigured) error on the
    source chain cascades into 0xc73a136a
    (InvalidGlobalTableRoot) on the destination chain — the
    transporter can't compute a valid root from an
    unconfigured table calculator.
  • This is a known pattern (NEWT-845 in lessons.md): any
    cross-chain contract redeployment or fresh
    BN254TableCalculator deployment requires re-seeding.
    ─────────────────────────────────────────────────

Want me to run the seeding script, or would you prefer to
run it yourself?

@denniswon (Author)

Interesting — the operator set is already configured and
both operators are already registered. The script failed
on the setOperatorInfoTreeRoot call because the earlier
reverts (caught by try/catch) still caused the overall
simulation to fail in Forge's strict mode.

But the key finding is: the source chain table calculator
state looks correct — operators are configured and
registered. The issue is likely that the
operatorInfoTreeRoot is stale or was never set after the
last operator change.

Let me check the current root on-chain vs what the script
computed.

⏺ Bash(# Check current operatorInfoTreeRoot on source chain
calculator
cast call 0xFC5e4999FF977EC4AE854dE714a365a332309cEd
"getOperatorSetInfo((address,uint32))" "(0x…)
⎿  0x000000000000000000000000000000000000000000000000
0000000000000020daeeb249d70d26f62be31764833eaf4165
d35d666c474c64ceb640b9642cdb5e00000000000000000000
… +8 lines

⏺ The source chain is fine — getOperatorSetInfo returns
successfully with the correct root 0xdaeeb249... when
using the correct operator set (serviceManager, 0). My
earlier query used the wrong AVS address (transporter
signer instead of service manager).

The script failed because Forge's --broadcast mode treats
caught reverts as simulation failures. But the on-chain
state is already correct — operators are registered and
the root matches.

So the problem is specifically in the transporter's sync
to Base Sepolia. Let me check the transporter flow more
carefully.

Searched for 2 patterns, read 1 file

⏺ I see the flow now. Let me check the updater to understand
what's happening at the contract call level.

Read 1 file

⏺ Found the bug. Look at confirm_global_root in
updater.rs:43-51:

if !receipt.status() {
    warn!(
        "confirmGlobalTableRoot reverted for root={:?}, tx_hash={}",
        root, receipt.transaction_hash
    );
} else {
    info!("confirmed global table root: {:?}", root);
}
Ok(receipt) // <-- Always returns Ok, even on revert!

★ Insight ─────────────────────────────────────
This is a silent-revert bug. When confirmGlobalTableRoot
reverts on-chain (e.g., GlobalTableRootStale because the
referenceTimestamp is not strictly greater than the last
confirmed timestamp), the transporter logs a warning but
returns Ok(receipt) instead of an error. It then proceeds
to call updateOperatorTable, which checks
_globalTableRoots[referenceTimestamp] == globalTableRoot —
but no root was stored for that timestamp because the
first call reverted. Result: InvalidGlobalTableRoot.

The same pattern exists in update_operator_table (line
72-79) — also swallows reverts.
─────────────────────────────────────────────────
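The fail-fast version of the pattern looks roughly like this (a sketch with a stand-in Receipt type; the real code uses alloy's receipt type and the crate's error handling):

```rust
// Stand-in receipt type for the sketch; the real code gets this from alloy.
struct Receipt {
    status: bool,
    transaction_hash: String,
}

// After the fix: a reverted receipt becomes a hard error instead of a warning,
// so the caller never proceeds to updateOperatorTable after a failed confirm.
fn confirm_global_root(receipt: Receipt) -> Result<Receipt, String> {
    if !receipt.status {
        return Err(format!(
            "confirmGlobalTableRoot reverted, tx_hash={}",
            receipt.transaction_hash
        ));
    }
    Ok(receipt)
}
```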

The chain of events on stagef:

  1. Transporter picks latest_block - 1 on Sepolia → gets
    referenceTimestamp
  2. Calls confirmGlobalTableRoot(root, timestamp,
    blockNumber) → reverts (likely GlobalTableRootStale
    because a previous sync already confirmed a root with a
    timestamp >= this one)
  3. Transporter logs a warn! but continues because it
    returns Ok(receipt)
  4. Calls updateOperatorTable(timestamp, root, ...) →
    reverts with InvalidGlobalTableRoot because no root exists
    at that timestamp
  5. This error propagates (the .send().await? in
    update_operator_table throws), and that's what you see in
    the logs

There are two things to fix:

  1. Immediate operational fix: Check the transporter logs
    for the warn! from confirmGlobalTableRoot reverted. The
    underlying revert is likely GlobalTableRootStale — meaning
    the _latestReferenceTimestamp on Base Sepolia is already
    >= what the transporter is trying to confirm. This could
    happen if another transporter instance already synced, or
    if the transporter daemon is running with a too-short
    interval relative to Sepolia's block time.

  2. Code fix: Both confirm_global_root and
    update_operator_table need to return an error when
    !receipt.status(), not just log a warning. Want me to fix
    this?

@denniswon (Author)

Compiles clean. Now let me also add this to lessons
learned since it's a recurring pattern (silent reverts).

★ Insight ─────────────────────────────────────

  • Silent revert pattern: When using alloy's
    .send().await?.get_receipt().await?, the ? only propagates
    RPC/network errors — it does NOT check whether the
    transaction itself reverted on-chain. You must explicitly
    check receipt.status() and return an error. This is
    different from .call().await? which DOES propagate
    contract reverts.
  • Cascading failure: The confirmGlobalTableRoot revert was
    silent, so the transporter proceeded to
    updateOperatorTable which failed with a misleading
    InvalidGlobalTableRoot — the real error was upstream
    (likely GlobalTableRootStale).
    ─────────────────────────────────────────────────

@denniswon (Author)

Done. Here's a summary of what was done and what remains:

Code fix (crates/transporter/src/updater.rs):

  • confirm_global_root — now returns Err when
    !receipt.status() instead of logging a warning and
    continuing
  • update_operator_table — same fix
  • Removed unused warn import

Lesson added (.claude/rules/lessons.md):

  • Documented the alloy .send().get_receipt() silent-revert
    pattern under Code Patterns

Operational status on stagef:

  • Source chain (Sepolia) table calculator is correctly
    configured — operators registered, operatorInfoTreeRoot
    set
  • The transporter is failing because
    confirmGlobalTableRoot reverts (likely
    GlobalTableRootStale — the referenceTimestamp is not
    strictly greater than the last confirmed one) and the code
    was silently continuing
  • After deploying this fix, the transporter will properly
    fail-fast on confirmGlobalTableRoot revert and surface the
    real error in logs
  • You may still need to investigate why
    confirmGlobalTableRoot is reverting — likely the
    _latestReferenceTimestamp on Base Sepolia is already ahead
    of what the transporter is computing, possibly from a
    concurrent or recent sync

@denniswon (Author)

Progress. confirmGlobalTableRoot succeeded — the root was stored on Base Sepolia. The failure is now nonce too low: next
nonce 15, tx nonce 14 on the updateOperatorTable call.

★ Insight ─────────────────────────────────────
This is the nonce race documented in lessons.md: the confirmGlobalTableRoot tx used nonce 14 and was mined, but the
local nonce cache still thinks the next nonce is 14 (not 15). The CachedNonceManager queried
eth_getTransactionCount("pending") at the start and cached it — but by the time updateOperatorTable tries to send, the
first tx has already been mined and incremented the on-chain nonce.
─────────────────────────────────────────────────
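The race can be reproduced with a toy model (stand-in types; the real code uses alloy's CachedNonceManager): the cached next nonce is fetched once and not refreshed after the first transaction mines.

```rust
// Toy reproduction of the nonce race. `cached_next` is what the stale local
// cache believes; `on_chain_next` is the chain's actual next-nonce counter.
fn send_with_nonce(cached_next: u64, on_chain_next: &mut u64) -> Result<u64, String> {
    if cached_next < *on_chain_next {
        return Err(format!(
            "nonce too low: next nonce {}, tx nonce {}",
            on_chain_next, cached_next
        ));
    }
    *on_chain_next = cached_next + 1; // tx mined; the chain advances
    Ok(cached_next)
}
```

The first send (confirmGlobalTableRoot, nonce 14) succeeds and advances the chain to 15; the second send reuses the stale cached 14 and fails exactly as in the log above.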

Let me just retry — the nonce cache will refresh on the second attempt, and confirmGlobalTableRoot will fail with
GlobalTableRootStale (same root already stored), which our code fix now properly surfaces. But actually, the timestamp
will be newer on retry so it should work.

⏺ Bash(sleep 15 && RPC_HTTP_URL_11155111="https://dimensional-few-slug.ethereum-sepolia.quiknode.pro/850b4f6d0352ef457cda2
7de9f97354e01514afa"
RPC_HTTP_URL_84532="ht…)
⎿   Finished dev profile [unoptimized + debuginfo] target(s) in 0.67s
Running target/debug/newton-prover-transporter sync --network testnet
2026-03-25T17:34:18.880759Z INFO newton_prover_transporter::commands: starting one-shot transport sync to 1 des
… +29 lines
⎿  (timeout 5m)

⏺ Transport sync succeeded:

confirmed global table root: 0xb815dcc2...
updated operator table at index 0
transport sync complete: success=1 failed=0 total=1

Summary of what happened and what was fixed:

  1. Root cause of the original InvalidGlobalTableRoot error: Most likely the stagef transporter ECS service was hitting
    RPC rate limits (429s) from default/shared RPC endpoints. When RPC calls fail mid-computation, the transporter produces
    an invalid root. Combined with the silent-revert bug, the real error was masked.
  2. Code fix (crates/transporter/src/updater.rs): confirm_global_root and update_operator_table now return Err on
    reverted receipts instead of swallowing them. This prevents the cascading InvalidGlobalTableRoot error when
    confirmGlobalTableRoot actually reverts for a different reason (e.g., GlobalTableRootStale).
  3. Operational fix: The transporter needed proper RPC_HTTP_URL_11155111 and RPC_HTTP_URL_84532 env vars pointing to
    endpoints with adequate rate limits. The stagef ECS task definition should have these configured.
  4. State is now healthy: The operator table root is confirmed on Base Sepolia and the operator table is updated at
    index 0. The transporter should continue working on subsequent sync intervals.

@denniswon (Author)


★ Insight ─────────────────────────────────────

The "Quorum NOT reached" on destination chains is a consistency gap, not a config problem. Three independent data sources must agree
but are read at different times:

  1. Gateway reads CURRENT operator weights from source chain's BN254TableCalculator.getOperatorSetWeights() to build merkle proofs
  2. Destination chain's BN254CertificateVerifier validates against its STORED merkle root (from the last transporter sync)
  3. Transporter syncs on a 24-hour interval (update_frequency = 86400) or on OperatorAdded/OperatorRemoved/OperatorSlashed events

Any change between syncs — stake weight updates, delegations, admin operations — causes the gateway's freshly-computed merkle tree to
differ from the verifier's stored root. The merkle proof fails silently (empty 0x revert), which surfaces as "Quorum NOT reached."

─────────────────────────────────────────────────
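The gap can be illustrated with a toy commitment over operator weights (this is NOT a real merkle root — just a stand-in commitment function for the sketch): any weight change after the last sync makes the gateway's freshly computed value diverge from the verifier's stored one.

```rust
// Toy stand-in for a merkle root over operator weights: any single weight
// change produces a different commitment, so a proof built from live weights
// no longer verifies against the root stored at the last transporter sync.
fn toy_commitment(weights: &[u128]) -> u128 {
    weights
        .iter()
        .enumerate()
        .fold(0u128, |acc, (i, w)| acc ^ w.wrapping_mul(i as u128 + 1))
}
```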

Why It Keeps Recurring

The transporter's event listener (crates/transporter/src/listener.rs) only watches for three AllocationManager events:

  • OperatorAddedToOperatorSet
  • OperatorRemovedFromOperatorSet
  • OperatorSlashed

But these don't cover:

  • Stake weight changes (delegations, undelegations, magnitude updates) — no event trigger
  • Manual admin ops (SeedTableCalculator, configureOperatorSet) — no event trigger
  • Transporter failures — logged but retried only at next 24h tick
  • operatorInfoTreeRoot changes in the table calculator — no event trigger

The core code path in crates/aggregator/src/core.rs:1438-1480 shows the problem: build_non_signer_witnesses calls
getOperatorSetWeights() on the source chain's table calculator to get LIVE weights, then builds a merkle tree from them. But the dest
chain verifier validates against the merkle root committed by the transporter (possibly hours ago).

Three Fix Strategies, From Band-Aid to Fundamental

  1. Band-Aid: Reduce update_frequency + More Event Triggers
  • Change default from 86400 to 300 (5 min)
  • Add MagnitudeUpdated / weight-change events to SourceChainListener
  • Still has a race window — NOT a fundamental fix
  2. Structural: Gateway Builds Certificates Against Verifier's Stored State

The gateway already reads latestReferenceTimestamp from the dest chain (line 1355-1359 in core.rs). Instead of then querying the
source chain for CURRENT weights, it should query the source chain at the BLOCK corresponding to that reference timestamp. This
requires archive RPC access.

  3. Fundamental: Shared Snapshot Between Transporter and Gateway

After each transport_update, the transporter persists the computed operator table + merkle tree to a shared store (Redis or Postgres).
The gateway reads this cached snapshot — not live source chain state — when building certificates. Both the dest chain verifier and
the gateway use identical data. Zero consistency gap.

Architecture change:
BEFORE (broken):
Gateway ─── reads CURRENT state ──→ Source Chain TableCalculator
Verifier ── reads STORED state ───→ Last Transporter Sync (hours ago)

AFTER (consistent):
Transporter ─── syncs ──→ Dest Chain Verifier
└── writes ──→ Shared Snapshot Store
Gateway ─── reads ──────→ Shared Snapshot Store (same data as verifier)

This eliminates the class of bugs entirely. The gateway can never build a certificate that disagrees with the verifier because they
read from the same snapshot.

Implementation Sketch for Option 3

  1. Transporter: After confirm_global_root + update_operator_table succeed, serialize the (operators, weights, merkle_root,
    reference_timestamp) tuple to Redis key transporter:snapshot:{dest_chain_id}
  2. Gateway: In build_non_signer_witnesses, read from Redis snapshot instead of calling getOperatorSetWeights() on the source chain
  3. Fallback: If snapshot is missing (first deploy, Redis down), fall back to current behavior + log a warning
  4. Health check: Gateway's /health endpoint already checks cert verifier staleness — extend it to also check snapshot freshness
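The ordering invariant from steps 1-3 can be sketched as follows (names are illustrative, not the real crate API; the HashMap stands in for Redis keyed by dest chain id): the snapshot is written only AFTER both on-chain calls succeed, so the gateway can never read state the verifier has not yet accepted.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Debug)]
struct Snapshot {
    merkle_root: [u8; 32],
    reference_timestamp: u64,
}

// Sketch of the write ordering: on-chain calls first, snapshot second.
fn sync_then_snapshot<C, U>(
    confirm_global_root: C,
    update_operator_table: U,
    snapshot: Snapshot,
    dest_chain_id: u64,
    store: &mut HashMap<u64, Snapshot>, // stand-in for Redis
) -> Result<(), &'static str>
where
    C: FnOnce() -> Result<(), &'static str>,
    U: FnOnce() -> Result<(), &'static str>,
{
    confirm_global_root()?;   // on-chain first...
    update_operator_table()?; // ...both calls must succeed...
    store.insert(dest_chain_id, snapshot); // ...then (and only then) the snapshot
    Ok(())
}
```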

Bonus: Why the "Quorum NOT reached" Error Message is Misleading

The actual failure is merkle proof verification in verifyCertificate, but it surfaces as "Quorum NOT reached" because the BLS
aggregation itself succeeds (operators DID sign), but the on-chain respondToTask transaction reverts when submitting the certificate —
the gateway interprets any submission failure as quorum failure.


Want me to dig deeper into any of these options, or should we start implementing Option 3 (shared snapshot)?

@denniswon (Author)


  1. Reduce default to 3600 — agreed, straightforward config change.

  2. How to map referenceTimestamp → source chain block

★ Insight ─────────────────────────────────────

The mapping already exists on-chain. The ECDSAOperatorTableUpdater stores a bidirectional mapping between referenceTimestamp and
referenceBlockNumber:

  • _referenceBlockNumbers[referenceTimestamp] → source chain block number
  • _referenceTimestamps[referenceBlockNumber] → source chain timestamp
  • getLatestReferenceBlockNumber() → shortcut for current

Both are source chain values (the transporter reads source_chain.latest_block - 1 and passes both to confirmGlobalTableRoot). No
destination chain timestamp ambiguity.

So Option 2 (query at historical block) is viable WITHOUT archive nodes for the table calculator:

  1. Gateway reads latestReferenceTimestamp from dest chain verifier (already does this, line 1355)
  2. Gateway reads getReferenceBlockNumberByTimestamp(ts) from dest chain's ECDSAOperatorTableUpdater (new call, contract already
    supports it)
  3. Gateway calls getOperatorSetWeights(operatorSet).call().block(referenceBlockNumber) on the source chain table calculator at that
    historical block

Catch: getOperatorSetWeights reads from the source chain's BN254TableCalculator storage. Calling .call().block(N) requires the source
chain RPC to serve state at block N. Standard (non-archive) nodes typically serve ~128-256 blocks of history on Ethereum. If
referenceBlockNumber is within that window, it works without archive. If the transporter synced recently (e.g., every 1h with the new
default), the block is always recent enough.

─────────────────────────────────────────────────

This makes Option 2 the simplest fundamental fix — it requires zero new infrastructure (no Redis, no shared store). Just one
additional contract read + one historical block query.

  3. Shared Snapshot (Option 3) — Risks

┌─────────────────────────────────────────────────┬──────────┬────────────────────────────────────────────────────────────────────┐
│ Risk │ Severity │ Mitigation │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Redis as new dependency for certificate path │ Medium │ Redis already required for chain watcher; fallback to current │
│ │ │ behavior if unavailable │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Snapshot staleness (transporter crashes, Redis │ Medium │ TTL on snapshot key; gateway detects missing/expired snapshot and │
│ write fails) │ │ falls back │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Data format coupling between transporter and │ Low │ Shared Rust struct in core crate, versioned │
│ gateway │ │ │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Ordering issue: snapshot written before │ High │ Must write to Redis AFTER confirm_global_root + │
│ on-chain root committed │ │ update_operator_table both succeed │
├─────────────────────────────────────────────────┼──────────┼────────────────────────────────────────────────────────────────────┤
│ Extra operational surface (Redis key │ Low │ Single key per dest chain, auto-expires │
│ monitoring, pruning) │ │ │
└─────────────────────────────────────────────────┴──────────┴────────────────────────────────────────────────────────────────────┘

The ordering risk is the main concern — if the transporter writes the snapshot to Redis but confirm_global_root fails on-chain, the
gateway would use a snapshot that doesn't match the verifier. The write order MUST be: on-chain first, Redis second.

Recommendation: Option 2 First, Option 3 Later

Option 2 (historical block query) is simpler and has fewer failure modes:

BEFORE:
Gateway → getOperatorSetWeights() at CURRENT block ← MISMATCH
Verifier → validates against STORED merkle root from referenceBlockNumber

AFTER:
Gateway → getReferenceBlockNumberByTimestamp(latestRefTS) from dest chain
Gateway → getOperatorSetWeights().block(refBlockNumber) at SAME block ← MATCH
Verifier → validates against STORED merkle root from referenceBlockNumber

Changes needed:

  1. create_certificate_signature_data — read referenceBlockNumber from dest chain's ECDSAOperatorTableUpdater (one new contract call)
  2. build_non_signer_witnesses — pass referenceBlockNumber and use .call().block(refBlockNumber) on the source chain
    getOperatorSetWeights call instead of .call() (current/latest)
  3. Add ECDSAOperatorTableUpdater bindings to generated code (or use inline sol! like IOperatorTableCalculatorView)

No new infrastructure. No Redis dependency. No ordering concerns. The consistency is guaranteed by reading the same block the
transporter used.

Option 3 (shared snapshot) becomes relevant later when you need to decouple from archive-depth RPC requirements or optimize away the
extra contract read.

  4. Fixing the Misleading Error Message

The "Quorum NOT reached" message comes from the BLS aggregation layer which doesn't know about certificate verification. When
respondToTask reverts on-chain due to merkle proof failure, the error bubbles up as a generic submission failure. The gateway should:

  1. When certificate generation (create_certificate_signature_data) fails, surface the actual error: "BN254 certificate verification
    failed: operator table mismatch between source chain (block X) and dest chain verifier (referenceTimestamp Y). Run transporter sync."
  2. When respondToTask reverts with empty 0x data on a destination chain, check if it's a certificate issue rather than a quorum issue.

Want me to plan the implementation for Option 2 + the error message fix?

@denniswon (Author)

★ Insight ─────────────────────────────────────

Summary of all changes:

  1. Core fix (crates/aggregator/src/core.rs): create_certificate_signature_data now resolves referenceBlockNumber from the dest chain's
    ECDSAOperatorTableUpdater and queries getOperatorSetWeights at that historical block. The gateway and the verifier now use identical
    operator state — eliminating the consistency gap entirely.
  2. Graceful fallback: If the updater address isn't configured (single-chain) or the historical block query fails, the code falls back
    to current-block behavior with a warning. No breaking change for existing deployments.
  3. Reduced exposure window (crates/transporter/src/config.rs): Default update_frequency reduced from 24h to 1h, meaning even in the
    fallback case, staleness is bounded.
  4. Actionable error messages (crates/gateway/src/rpc/api/sync.rs): Certificate failures now say "BN254 certificate generation failed"
    with a transporter sync hint, instead of the misleading "Quorum NOT reached".
  5. Documentation (.claude/rules/lessons.md): NEWT-845 entry updated with root cause, fix description, prevention invariant, and Option
    3 (shared Redis snapshot) as a future consideration.

─────────────────────────────────────────────────

Files changed

┌───────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
│ File │ Change │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/aggregator/src/core.rs │ Historical block query in build_non_signer_witnesses, new dest_operator_table_updater │
│ │ field │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/aggregator/src/lib.rs │ Thread dest_operator_table_updater from config │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/task/aggregator.rs │ Thread dest_operator_table_updater from config │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/gateway/src/rpc/api/sync.rs │ Certificate-specific error messages with transporter sync hints │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/transporter/src/config.rs │ Default update_frequency: 86400 → 3600 │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ crates/transporter/transporter.toml │ Default update_frequency: 86400 → 3600 │
├───────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ .claude/rules/lessons.md │ NEWT-845 entry expanded with fix details + Option 3 future consideration │
└───────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────┘

Want me to run make clippy and make fmt, or create a commit?

@denniswon (Author)

PR created: newt-foundation/newton-prover-avs#479

The uncommitted change warning is just the .gitignore modification (OMC directories) which we intentionally excluded from this PR.

★ Insight ─────────────────────────────────────

Recap of the full investigation and fix:

  • Root cause: Gateway built BN254 certificates from live source chain state while the dest chain verifier validated against a
    transporter snapshot (up to 24h old). Any weight change in between caused merkle proof mismatch → silent 0x revert → misleading
    "Quorum NOT reached".
  • Fix: Read getReferenceBlockNumberByTimestamp from dest chain, query getOperatorSetWeights at that historical block. Zero new
    infrastructure — just one extra contract read.
  • Band-aid: update_frequency 24h → 1h reduces fallback exposure window.
  • Error UX: Certificate failures now say what they are with actionable transporter sync hints.
  • Future: Option 3 (Redis shared snapshot) documented in lessons.md for archive-constrained environments.

@denniswon (Author)

The logs reveal a completely different issue. This is NOT a stale certificate problem. Look at the actual error:

Policy evaluation failed: Failed to validate params schema:
Schema validation failed: Missing required property 'max_daily_spend' at ''

Both operators (operator-1 and operator-2) fail with the same error. The policy schema requires max_daily_spend in the params, but the
intent's sender (0xd1168701...) is not in the policy's agents map — so no agent-specific params are found, and the schema validation
fails on the missing field.

★ Insight ─────────────────────────────────────

The "Quorum NOT reached" here is genuine — zero operators signed because all of them hit a policy evaluation error. This is the exact
scenario mentioned in lessons.md under "Unified quorum broadcast must collect and surface operator RPC errors." The error IS being
surfaced now (line 10 shows the full OperatorError with the RPC error message), but the top-level error still says "Quorum not
reached" which makes it look like the certificate issue.

The policy params (hex-decoded from line 108) show:
{
  "agents": {
    "0x5Bf6a3fD231ffF4BE6E1bBfFc31EF9531B19B7": { "max_daily_spend": 10000000000, ... },
    "0xd93e791123D48A49eDE6d328232...": { "max_daily_spend": 1000000000, ... }
  },
  "default_policy": "deny"
}

The intent's from is 0xd1168701c5df97bdb65361aaf9318ae2c7e39312 — this address is NOT in the agents map. The policy evaluates with
default_policy: "deny" path, but before it gets there, the schema validator checks for max_daily_spend at the root level (not nested
under an agent key) and fails.

─────────────────────────────────────────────────
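That failure mode can be reproduced in a few lines. This is an illustrative Python sketch, not the actual Rego policy or validator; the truncated addresses are placeholders for the full keys in the decoded params above.

```python
# Sketch of why validation fails: the sender is not a key in "agents",
# so no per-agent params are selected, and the schema check then fails
# on the missing required field. Names/shapes mirror the decoded params.

policy_params = {
    "agents": {
        "0x5Bf6...": {"max_daily_spend": 10_000_000_000},  # placeholder keys
        "0xd93e...": {"max_daily_spend": 1_000_000_000},
    },
    "default_policy": "deny",
}

def select_params(sender: str) -> dict:
    # Per-agent params if the sender is known; otherwise an empty dict,
    # which is what the schema validator ends up seeing.
    return policy_params["agents"].get(sender, {})

def validate_schema(params: dict) -> list:
    # Stand-in for the real JSON-schema required-property check.
    required = ["max_daily_spend"]
    return [f"Missing required property '{k}' at ''" for k in required if k not in params]

errors = validate_schema(select_params("0xd1168701c5df97bdb65361aaf9318ae2c7e39312"))
print(errors)  # ["Missing required property 'max_daily_spend' at ''"]
```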

The stale certificate fix (transporter sync + PR #479) is irrelevant to this failure. Certificate generation never even runs: no operator signed, so BLS aggregation has nothing to aggregate.

The fix here is either:

  1. Add the sender 0xd1168701... to the policy's agents map with a max_daily_spend value
  2. Or use a sender address that's already in the agents map (0x5Bf6a3fD... or 0xd93e7911...)
  3. Or fix the Rego policy to handle the case where the sender isn't in the agents map without requiring max_daily_spend at the root
    level
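Option 3 can be sketched as a guard before schema validation. This is an illustrative Python version of the evaluation order, not the actual Rego policy; the function and field names are assumptions.

```python
# Sketch of fix option 3: short-circuit unknown senders to default_policy
# instead of running schema validation on empty params.

policy_params = {
    "agents": {"0x5Bf6...": {"max_daily_spend": 10_000_000_000}},  # placeholder key
    "default_policy": "deny",
}

def evaluate(sender: str, params: dict) -> str:
    agent = params["agents"].get(sender)
    if agent is None:
        # Unknown sender: return a clean default decision rather than
        # validating a params object that can never satisfy the schema.
        return params.get("default_policy", "deny")
    if "max_daily_spend" not in agent:
        raise ValueError("Missing required property 'max_daily_spend'")
    return "allow"

print(evaluate("0xd116...", policy_params))  # deny: a decision, not a schema error
print(evaluate("0x5Bf6...", policy_params))  # allow
```

With this ordering the unknown-sender case produces a policy decision ("deny") instead of an operator error, so quorum failures would only reflect genuine disagreement.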

This is a policy configuration issue, not an infrastructure issue.

@denniswon
Author

Does that mean the changes on this PR branch are invalid? Should we just close this PR entirely?

⏺ No — the PR is still valid. These are two independent issues:

  1. The stale certificate bug (NEWT-845) — real, structural, and will recur on any destination chain whenever operator weights drift
    between transporter syncs. PR #479 is the fundamental fix for this class of bugs.
  2. The Datadog log you just shared — a policy configuration error (max_daily_spend missing for the sender address). This fails at the
    operator Rego evaluation step, before BLS aggregation even starts. Certificate generation is never reached.

The confusing part is that both surface as "Quorum NOT reached." That's exactly why PR #479 also includes the error message
improvements — on destination chains, it now adds the transporter sync hint to help distinguish the two. In this case, the
operator_errors field in the response already shows the real cause (policy eval failure), which confirms it's not certificate-related.

★ Insight ─────────────────────────────────────

The stale certificate issue was the original problem you asked about ("Quorum not reached: Unified Quorum NOT reached for quorums [0]
(threshold: 40%)"). The transporter sync you just ran fixed the cert staleness. But the test task you submitted after the sync
happened to hit a different failure — the policy doesn't recognize the sender address. If you re-test with a sender that's in the
policy's agents map, the certificate path will be exercised and you'll see whether the sync actually resolved the original issue.

─────────────────────────────────────────────────

Keep the PR open. It fixes:

  • The root cause of stale certificates (historical block query)
  • The misleading error messages (which would have made this debugging session faster)
  • The 24h→1h default sync interval

@denniswon
Author

How do we know for certain that the changes in this PR are strictly an improvement over what we had before? If we were investigating with an incorrect hypothesis, doesn't that mean our approach was wrong to begin with? Please explain.

⏺ Fair challenge. Let me be precise about what we know vs what we assumed.

What we know for certain

  1. The consistency gap exists in the code. This is provable by reading the code — build_non_signer_witnesses calls
    getOperatorSetWeights() at the current block (line ~1448 in core.rs before this PR), while the BN254CertificateVerifier validates
    against the merkle root from referenceBlockNumber (committed by the transporter). If these blocks have different operator state, the
    merkle proof WILL fail. This is a logic bug, not a hypothesis.
  2. The NEWT-845 lesson already documents this exact failure. It was observed and diagnosed before — the previous "fix" was "run
    transporter sync manually." That's an operational workaround, not a code fix.
  3. The error message improvement is strictly better. Before: "Quorum not reached" for both policy failures AND certificate failures.
    After: certificate failures get a specific hint. This is why today's debugging was confusing — you couldn't tell which failure you
    were hitting.
  4. The update_frequency default of 24h was too long. 1h is strictly better — smaller staleness window, no downside.

What we did NOT prove

We did not reproduce the stale certificate failure in this session. The test task hit a policy config error before reaching the
certificate path. So we don't have end-to-end evidence that the historical block query fixes a live failure.

Why the fix is still correct

The historical block query is a tightening of an invariant, not a speculative change. Before:

gateway reads: getOperatorSetWeights() at block N (current)
verifier checks: merkle root from block M (transporter sync)
N != M → possible mismatch

After:

gateway reads: getOperatorSetWeights() at block M (from verifier)
verifier checks: merkle root from block M (transporter sync)
M == M → guaranteed match

The before-state has a race condition. The after-state eliminates it. Even if the race hasn't bitten you today, it's a latent bug that
fires whenever operator weights change between transporter syncs. The fallback behavior (current block) is preserved when the
historical query fails, so there's no regression risk.
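The "no regression risk" claim rests on the fallback ordering, which can be sketched directly. This is an illustrative Python version (the real code is in core.rs; names here are assumptions): try the reference block first, fall back to the current block only if that read fails, e.g. on a non-archive node with pruned state.

```python
# Sketch of the preserved fallback: historical read first, current block
# only when the historical state is unavailable.

def get_weights(block):
    # Stand-in for getOperatorSetWeights at a given block.
    state = {900: {"op-1": 60, "op-2": 40}, 1000: {"op-1": 70, "op-2": 30}}
    if block not in state:
        raise RuntimeError("state pruned (non-archive node)")
    return state[block]

def weights_with_fallback(reference_block, current_block):
    try:
        # New path: read at the verifier's reference block.
        return get_weights(reference_block), "historical"
    except RuntimeError:
        # Pre-PR behavior preserved: read at the current block.
        return get_weights(current_block), "fallback"

print(weights_with_fallback(900, 1000))  # historical read, matches the committed root
print(weights_with_fallback(800, 1000))  # pruned state, falls back to current block
```

The worst case of the new code is exactly the old behavior, which is what "no regression risk" means here.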

How to get 100% confidence

Run make multichain-routing-e2e — this exercises the full certificate generation path on a destination chain with a known-good policy.
If it passes, the historical block query works end-to-end. The policy config error you hit on stagef is orthogonal to this code path.

Want me to run it?

@denniswon
Author

Another question: do we already return the actual underlying problem for a failed policy evaluation task? Something specific like "Policy evaluation failed: Failed to validate params schema: Schema validation failed: Missing required property 'max_daily_spend' at ''" that pinpoints the exact issue. I thought the gateway already aggregates the operators' error responses and reports them back to the client. Why does the client still only report a high-level generic error? Or is this a frontend bug, where it isn't using the aggregated operator error responses returned by the gateway?

⏺ Good question. Let me check what the gateway actually returns vs what the client/explorer shows.

Searched for 2 patterns

⏺ The gateway does return operator_errors in the JSON-RPC response. The data is there. Let me check how the explorer/frontend consumes
it.

Searched for 2 patterns

⏺ There it is. Neither the dashboard API nor the indexer reads operator_errors from the gateway response.

★ Insight ─────────────────────────────────────

The data flow for failed tasks is:

  1. Gateway returns CreateTaskResponse with error: "Quorum not reached: ..." AND operator_errors: [{ operator_address, message: "Policy
    evaluation failed: Missing required property 'max_daily_spend'" }]
  2. Indexer writes the task to the database via the optimistic indexer DB write — but it only stores the top-level error field, not
    operator_errors
  3. Dashboard API reads from the database and serves it to the explorer
  4. Explorer frontend displays only the generic error string

The detailed per-operator errors are returned in the JSON-RPC response to the direct caller (SDK, CLI), but they're not persisted to
the indexer database and therefore never make it to the explorer UI. The explorer only sees "Quorum not reached: Unified Quorum NOT
reached for quorums [0] (threshold: 40%)".

─────────────────────────────────────────────────

So to answer your question: the gateway already does the right thing — it returns detailed operator_errors in the RPC response. The
bug is that the indexer/dashboard pipeline drops this information. A client calling the gateway RPC directly (e.g., via the Newton
SDK) would see the full error. The explorer doesn't because the indexer never stores operator_errors.
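The gap is easy to show in miniature. This is a hypothetical sketch (the response shape and column names are assumptions for illustration, not the actual indexer schema): the fix is simply persisting the per-operator detail alongside the top-level error.

```python
# Illustrative sketch of the indexer gap: the gateway response carries
# operator_errors, but the DB write persists only the top-level error.

import json
import sqlite3

response = {
    "error": "Quorum not reached: Unified Quorum NOT reached for quorums [0] (threshold: 40%)",
    "operator_errors": [
        {"operator_address": "operator-1",
         "message": "Policy evaluation failed: Missing required property 'max_daily_spend'"},
    ],
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, error TEXT, operator_errors TEXT)")

# Current behavior: only the generic error is written, so only it reaches the explorer.
db.execute("INSERT INTO tasks (error) VALUES (?)", (response["error"],))

# Proposed: also persist the per-operator detail (e.g. as a JSON column).
db.execute("INSERT INTO tasks (error, operator_errors) VALUES (?, ?)",
           (response["error"], json.dumps(response["operator_errors"])))

rows = db.execute("SELECT error, operator_errors FROM tasks ORDER BY id").fetchall()
print(rows[0][1] is None)              # True: explorer sees only the generic error
print("max_daily_spend" in rows[1][1]) # True: the detail survives to the UI
```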

This is a separate issue from PR #479 — it's a dashboard/indexer gap. Want me to file it or look at what the indexer DB write
currently persists?

