MOAT Eval Report v10 — CTO discussion branch 2026-05-09

MOAT Data Agent — Evaluation Process Design

Version: v10 | Prepared by: Son Pham | Date: 2026-05-08


Executive Summary

MOAT is MoMo's AI data agent that answers business questions in natural language against BigQuery. This document describes the evaluation framework — both how we measure quality (IFAB) and how we continuously improve without degrading what already works.

Key points:

  • Evaluation starts Friday May 15, 12:00 — first release for feedback. Everyone invited to critique.
  • Two evaluation modes: offline (controlled, against known queries) and online (live production signals)
  • The knowledge base is human-gated: domain team leader approves all updates
  • DA time allocation is open — input needed from data managers and leadership

This is a design document. KPIs, milestones, and resource targets are deliberately open — your input is needed to complete the next version.


1. Evaluation Criteria — IFAB

IFAB evaluates query logic, not raw numbers, so it works even when result snapshots are time-sensitive. It is adapted from existing frameworks and refined for MoMo's use case. The weights are a starting point and are being validated with the data team.

| Component | Points | What it checks | Why it matters |
|---|---|---|---|
| I: Intent | 4 | Did the agent understand what the user is actually asking for? | Highest weight: if intent is wrong, the answer is confidently wrong. Worst failure mode. |
| F: Filter | 3 | Did it apply the right scope (which records / time range / segment to include)? | Wrong filter = right shape, wrong slice. Numbers look reasonable but answer a different question. |
| A: Aggregation | 2 | Did it summarize the data correctly (totals vs averages, by-customer vs overall)? | Wrong aggregation makes the result meaningless even if filters and intent are right. |
| B: Business rule | 1 | Did it apply MoMo-specific definitions (how "active user", "transaction", etc. are defined here)? | Lowest weight because business rules are the most learnable: DAs can teach them via corrections. Not foundational, but essential for production trust. |
| Returned Numbers | 0–3 | Exact / close / wrong / completely wrong | The reality check: does the final number match the golden answer? |
| Total | 13 | Weighted 70% IFAB : 30% Numbers | |
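
The table gives a 13-point total and a 70:30 weighting but does not say how the two combine. The sketch below is one possible reading, not the scoring actually used: it assumes each IFAB component is all-or-nothing, that 3 is the best value on the numbers scale, and that each part is normalised to 0–1 before blending. The names `JudgedQuery`, `raw_total`, and `weighted_score` are hypothetical.

```python
from dataclasses import dataclass

# Point values taken from the IFAB table.
IFAB_POINTS = {"intent": 4, "filter": 3, "aggregation": 2, "business_rule": 1}
MAX_IFAB = sum(IFAB_POINTS.values())  # 10
MAX_NUMBERS = 3                       # top of the 0-3 "Returned Numbers" scale


@dataclass
class JudgedQuery:
    passed: dict   # component name -> True if the judge marked it correct
    numbers: int   # 0-3 score for the returned numbers (assumed 3 = exact)


def ifab_points(q: JudgedQuery) -> int:
    """Sum the points of every component marked correct (assumed all-or-nothing)."""
    return sum(pts for name, pts in IFAB_POINTS.items() if q.passed.get(name, False))


def raw_total(q: JudgedQuery) -> int:
    """Raw score out of 13 (10 IFAB points + 3 numbers points)."""
    return ifab_points(q) + q.numbers


def weighted_score(q: JudgedQuery, ifab_weight: float = 0.7) -> float:
    """Normalise each part to 0-1, then blend 70% IFAB with 30% numbers."""
    ifab_part = ifab_points(q) / MAX_IFAB
    numbers_part = q.numbers / MAX_NUMBERS
    return ifab_weight * ifab_part + (1 - ifab_weight) * numbers_part
```

Under the raw 13-point total, IFAB carries 10/13 (about 77%) of the score, so the raw total and the 70:30 blend give slightly different rankings. Whichever is used in the Friday report should be stated explicitly.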

2. Offline Evaluation — Dataset Design

Two golden sets, one hard rule

| Set | Definition | Rule |
|---|---|---|
| G_core | Frozen golden queries, never changes | Regression baseline. Comparable forever. |
| G_new | Growing set, DAs add queries weekly | Coverage expansion. Tracks new domains. |

Hard rule: G_core failures must NOT be used as direct input to update K, the live knowledge base (see Section 3). Otherwise the agent just memorizes the test. K updates come from G_new queries and online signals, so G_core stays a clean held-out benchmark.

G_core is immutable within a version: the same queries run every Friday. This is what makes week-over-week comparison meaningful. (It only grows at a version bump; see "G_core versioning" in Section 3.)

G_new grows weekly as DA teams add new queries per domain (~60 domains across the company, coordinated by the data managers). Domain coverage is a manager-level responsibility.
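
The hard rule can be enforced in code rather than by convention. A minimal sketch, assuming golden queries are identified by string ids; the `GoldenSets` class and its method names are hypothetical and not part of the current system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSets:
    g_core: frozenset                # frozen query ids: the regression baseline
    g_new: frozenset = frozenset()   # growing query ids: coverage expansion

    def add_to_g_new(self, query_id: str) -> "GoldenSets":
        """Return a new GoldenSets with query_id added to G_new only."""
        if query_id in self.g_core:
            raise ValueError(f"{query_id} is already in G_core")
        return GoldenSets(self.g_core, self.g_new | {query_id})

    def may_feed_knowledge_update(self, query_id: str) -> bool:
        """Hard rule: a G_core failure must not drive an update to K."""
        return query_id not in self.g_core
```

A correction pipeline could call `may_feed_knowledge_update` before accepting a DA correction form, and reject any form whose source query sits in G_core.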

Weekly Regression: Every Friday from 15 May Onwards

flowchart LR
    A[DA Teams\nadd to G_new weekly] --> B[(G_core — frozen\nG_new — growing)]
    B --> C[Friday 12:00\nAuto-run]
    C --> D[MOAT scores\neach query — IFAB]
    D --> E[Regression score\nIFAB on G_core\nweek-over-week]
    D --> F[Coverage score\nIFAB on G_new\nnew domains]
    E --> G[Management Report\nflagged regressions]
    F --> G
    D --> H[DA Report\npoorly scored queries\n+ AI diagnosis]
    H --> I[DA fills correction form]

Management View — Friday Report

Per domain: regression score on G_core, week-over-week delta, coverage score on G_new. Domains with regressions or low absolute scores are flagged automatically.

| Domain | Regression (G_core) | vs last week | Coverage (G_new) | Status |
|---|---|---|---|---|
| Túi Thần Tài | 95% | +1% | 92% | |
| Vay Nhanh | 91% | +0% | 88% | |
| Bank Partnership | 89% | -1% | 85% | |
| VTS / PayLater | 72% | -8% | 60% | 🚩 Regression |

(60 domains total, onboarding progressively)

How to read this table:

  • Regression (G_core) = score on the frozen golden set. The honest signal of whether quality is holding.
  • vs last week = the comparable delta. Green if stable or improving, red if dropping.
  • Coverage (G_new) = score on this week's new queries. Tracks how well the agent handles fresh domain knowledge.
  • 🚩 Regression triggers an automatic alert — the domain manager and team leader investigate immediately.
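
The automatic flag could be computed along these lines. This is a sketch only: the document does not state the thresholds, so `REGRESSION_DROP_PTS` and `MIN_G_CORE_PCT` are hypothetical values chosen so that the example table above comes out as shown. Scores are integer percentages.

```python
from dataclasses import dataclass

REGRESSION_DROP_PTS = 5   # hypothetical: flag a drop of 5+ points week-over-week
MIN_G_CORE_PCT = 75       # hypothetical: flag a low absolute G_core score


@dataclass
class DomainWeek:
    domain: str
    g_core_now: int    # this week's G_core score, in percent
    g_core_prev: int   # last week's G_core score, in percent
    g_new_now: int     # this week's G_new coverage score, in percent


def status(row: DomainWeek) -> str:
    """Return the Status cell for one domain row of the Friday report."""
    delta = row.g_core_now - row.g_core_prev
    if delta <= -REGRESSION_DROP_PTS:
        return "Regression"
    if row.g_core_now < MIN_G_CORE_PCT:
        return "Low score"
    return ""
```

With these assumed thresholds, VTS / PayLater (72%, down 8 points) is flagged while Bank Partnership (89%, down 1 point) is not. The real thresholds are a decision for the data managers.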

3. Knowledge Update — K' → K

When a query scores poorly, the DA submits a correction. The promotion to live K is gated by the domain team leader, not an automated check.

DA Correction Form

| Field | Content |
|---|---|
| Golden SQL | The correct query that should have been generated |
| Explanation | Why the agent got it wrong |
| Agent log (optional) | Agent reasoning steps, provided automatically |

Promotion process

flowchart LR
    A[DA correction form] --> B[LLM extracts\nstructured knowledge]
    B --> C[(K' — staging)]
    C --> D[Agent re-tests\nonly the failing query q]
    D -->|Pass| E[Team leader\nreviews knowledge content\n+ approves]
    D -->|Fail| A
    E --> F[(K — live system)]

Why no automated G_core check here? Running the full agent across G_core for every correction would mean hundreds of LLM calls per fix — too expensive, too slow, and unnecessary. The team leader's domain expertise is the quality gate. Friday's G_core regression is the safety net that catches any drift introduced during the week.
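
The two gates in the promotion flowchart reduce to a small decision function. A minimal sketch; the `Outcome` names are hypothetical, and the behaviour when the team leader does not approve (the entry simply stays in K') is an assumption, since the flowchart does not show a rejection path.

```python
from enum import Enum


class Outcome(Enum):
    BACK_TO_DA = "back_to_da"   # re-test of the failing query q failed: DA revises the form
    IN_STAGING = "in_staging"   # re-test passed, entry waits in K' for approval (assumed)
    LIVE = "live"               # team leader approved: promoted from K' to K


def promotion_outcome(retest_passed: bool, leader_approved: bool) -> Outcome:
    """Apply the two gates in order: targeted re-test first, then human approval."""
    if not retest_passed:
        return Outcome.BACK_TO_DA
    if not leader_approved:
        return Outcome.IN_STAGING
    return Outcome.LIVE
```

Note that approval is never consulted when the re-test fails, which mirrors the flowchart: the team leader only reviews knowledge that already fixes the failing query.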

Conflict resolution: Domain team leader resolves conflicts between DA submissions.

G_core versioning

Once a month, G_core is expanded by promoting stable queries from G_new. When this happens, the version is bumped, and scores before and after a version bump are not directly comparable.

G_core_v2 = G_core_v1 ∪ { validated queries from G_new }
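
The union and the comparability rule can be sketched as follows. `GCore`, `bump`, and `comparable_delta` are hypothetical names; how "validated" is decided is left to the process described above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class GCore:
    version: int
    query_ids: frozenset


def bump(g_core: GCore, validated_from_g_new: Iterable[str]) -> GCore:
    """G_core_v(n+1) = G_core_v(n) united with the validated G_new queries."""
    return GCore(g_core.version + 1, g_core.query_ids | frozenset(validated_from_g_new))


def comparable_delta(prev_version: int, prev_score: float,
                     curr_version: int, curr_score: float) -> Optional[float]:
    """Week-over-week delta, or None when a version bump makes it meaningless."""
    if prev_version != curr_version:
        return None
    return curr_score - prev_score
```

Returning `None` across a version bump lets the Friday report show "n/a" for that week instead of a misleading delta.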


4. Online Evaluation

What it is: Monitoring real business user queries in production between weekly tests. MOAT users are not Data Analysts — they are marketing execs, product owners, and bosses who currently ask DAs for data.

Note on feedback signals: Thumbs up/down is not a usable signal for internal tools (response rates are ~5–8% in consumer apps and near zero here). Instead, we measure whether users get answers without bothering DAs.

flowchart LR
    A[Business user sends query] --> B[MOAT answers]
    B --> C{System signals}
    C -->|Got answer?| D[Log success]
    C -->|Gave up?| E[Log escalation to DA]
    C -->|/feedback command| F[Routed to Domain DA]
    B -->|SQL failed?| G[Log error]
    E --> H[DA fills correction form]
    F --> H
    G --> H

Primary online metrics:

| Metric | Definition | Target |
|---|---|---|
| Self-service rate | % of queries answered without human DA intervention | ≥ 70% |
| Escalation rate | % of queries where user gives up and asks a DA | ≤ 20% |
| Query success rate | % of agent-generated SQLs that execute without error | ≥ 85% |
| Time-to-answer | Median time from question to usable data | < 1 min (vs hours waiting for DA) |
| Weekly active users | Unique business users per week | Growing week-over-week |
| Query complexity | % of queries requiring JOINs / window functions vs single-table | Increasing over time |

The goal: Users who previously Slacked a DA for "active users this month" now get it themselves in 30 seconds. If they're still Slacking the DA, the tool has failed.


5. Knowledge Base Design

K = markdown wiki files, one folder per domain, managed with Git.

K/
├── domain-fs/
├── domain-paylater/
└── [one folder per domain]

Git provides versioning, review history, and conflict resolution natively — same principle as code review.

Stale knowledge: known limitation. Roadmap: knowledge graph approach (reference: gbrain, Garry Tan / YC) — knowledge entries carry explicit structure and relationships, enabling updates and invalidation as domains evolve.


6. Stakeholder Dashboard

Starting Friday May 15, 12:00 — one link, always accessible, updated weekly. The weekly report contains three things:

| # | Section | Content |
|---|---|---|
| 1 | Technical (offline scores) | G_core regression score + G_new coverage score, per domain, week-over-week |
| 2 | Adoption & efficiency (online) | Self-service rate, escalation rate, query success rate, time-to-answer, weekly active business users, query complexity (daily + weekly trend) |
| 3 | Documents / source code tracking | Three sources: (1) this evaluation document, (2) the memory design document, (3) the codebase |

7. Evaluating the Evaluator

The IFAB score for every query is produced by an LLM judge (LLM-as-a-Judge). This means every score in the Friday report depends on the judge being reliable. If the judge drifts or is inconsistent, regression detection silently breaks.

Who watches the watchers? We must continuously verify that the LLM judge is itself trustworthy.

Approach

flowchart LR
    A[Calibration set\n20–30 queries with\nhuman-validated IFAB scores] --> B[Run weekly\nalongside regression]
    B --> C[LLM judge scores\nthe calibration queries]
    C --> D{Agreement\nwith humans}
    D -->|≥ 90%| E[✅ Judge trusted\nFriday scores valid]
    D -->|< 90%| F[🚩 Judge drifted\nInvestigate before\ntrusting scores]
    F --> G[Re-tune judge prompt\nor escalate]

Mechanisms

| Mechanism | What it does | Cadence |
|---|---|---|
| Calibration set | Frozen queries with human-validated IFAB scores. Measures judge accuracy. | Every Friday |
| Human spot-check | DAs / team leaders hand-score a sample of LLM-judged queries; humans pick their own samples. Tracks human–LLM agreement rate. (Optional helper: a second LLM judge can flag disagreement cases as additional candidates. This is a minor input, not the main selection.) | Weekly sample |

Dashboard signal

Human–LLM agreement rate is tracked as a system health metric on the stakeholder dashboard. If it drops below 90%, that week's IFAB scores are flagged as untrusted until investigated.
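
The document does not define how agreement is counted. The sketch below assumes one simple option: the fraction of (query, component) labels on which the judge matches the human, compared against the 90% threshold. The function names and the label layout are hypothetical.

```python
def agreement_rate(human: dict, judge: dict) -> float:
    """Fraction of (query, IFAB component) labels where judge and human agree.

    Both arguments map query id -> {component name -> bool (component correct?)}.
    """
    total = 0
    matches = 0
    for query_id, human_labels in human.items():
        for component, human_value in human_labels.items():
            total += 1
            if judge[query_id][component] == human_value:
                matches += 1
    return matches / total if total else 0.0


def judge_trusted(rate: float, threshold: float = 0.90) -> bool:
    """True when this week's IFAB scores can be reported as trusted."""
    return rate >= threshold
```

With a calibration set of 20–30 queries, a single disagreement moves this rate by under one percentage point, so the threshold is sensitive to a cluster of errors rather than to one noisy label.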


8. What We Need — Management Support

Three DA KPIs — targets open for input:

| KPI | Definition | Target |
|---|---|---|
| Fix turnaround time | Error flagged → DA submits correction | (input needed from data managers) |
| Golden query production | Queries added to G_new per domain per week | (input needed from data managers) |
| Judge calibration / spot-check | DA hand-scores sample queries to validate the LLM judge (Section 7) | (input needed from data managers) |

Seeking input from data managers and leadership. DA time must be formally allocated for this work to be sustainable.


Infrastructure: Langfuse self-hosted · BigQuery · Claude Sonnet 4.6


9. CTO Review — Online Evaluation: DA Deflection Without Silent Wrong Answers

The offline evaluation design above answers: can MOAT produce correct SQL for known questions?

Online evaluation answers a different and more important production question:

Did MOAT actually reduce DA workload and help business users get usable answers faster, without creating silent wrong decisions?

This distinction matters because a data agent can score well offline while still failing in production:

  • Business users ask vaguer questions than golden queries.
  • Users may not know enough SQL/business logic to detect a wrong answer.
  • Explicit feedback is sparse; most users will not click thumbs up/down.
  • A confident wrong answer can look like successful DA deflection unless we audit it.

So online evaluation should not be framed as "user satisfaction." For MOAT, it should be framed as safe DA deflection.


9.1 North-star online metric

The first online north-star metric should be:

Self-service resolution rate: the percentage of business-user questions answered without human DA intervention, within an acceptable safety envelope.

Formula:

self_service_resolution_rate =
  resolved_without_DA / total_business_user_queries

But "resolved" must be defined carefully. A query is not resolved merely because SQL executed. It is resolved only when the user reached a usable answer and did not escalate to a DA within the observation window.

Recommended definition for pilot:

Resolved =
  SQL executed successfully
  AND answer was shown to user
  AND user did not click “Need DA help” / “Report wrong”
  AND user did not escalate the same or similar question within 24 hours

The 24-hour window is a starting point. Some domains may need a shorter or longer window depending on how business teams work.
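
The pilot definition translates directly into a predicate. A minimal sketch, assuming each query is logged as one record; the `QueryRecord` fields are hypothetical names for the four conditions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ESCALATION_WINDOW = timedelta(hours=24)  # pilot starting point, tunable per domain


@dataclass
class QueryRecord:
    sql_executed: bool
    answer_shown_at: Optional[datetime]      # None if no answer was rendered
    clicked_need_da_help: bool = False
    clicked_report_wrong: bool = False
    escalated_at: Optional[datetime] = None  # same/similar question reached a DA


def is_resolved(q: QueryRecord, window: timedelta = ESCALATION_WINDOW) -> bool:
    """All four conditions of the pilot 'Resolved' definition must hold."""
    if not q.sql_executed or q.answer_shown_at is None:
        return False
    if q.clicked_need_da_help or q.clicked_report_wrong:
        return False
    if q.escalated_at is not None and q.escalated_at - q.answer_shown_at <= window:
        return False
    return True


def self_service_resolution_rate(queries: list) -> float:
    """resolved_without_DA / total_business_user_queries."""
    if not queries:
        return 0.0
    return sum(is_resolved(q) for q in queries) / len(queries)
```

An escalation that arrives after the window does not count against the query, which is why the window length matters: a longer window makes the metric stricter.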


9.2 Query lifecycle instrumentation

Every production query should move through an explicit lifecycle. This gives us a funnel view instead of one vague success metric.

| State | Meaning | Why it matters |
|---|---|---|
| asked | User submitted a natural-language business question | Denominator for all online metrics |
| clarification_needed | Agent asks user to clarify scope/time/domain | Detects ambiguous requests and product friction |
| sql_generated | Agent produced SQL | Measures language-to-SQL coverage |
| sql_dry_run_passed | BigQuery dry-run succeeded | Separates syntax/schema issues from execution issues |
| sql_executed | Query ran successfully | Technical reliability |
| answer_rendered | User saw table/chart/summary | Product delivery, not just backend success |
| answer_used | User copied/exported/shared/saved or asked analytical follow-up | Strong implicit usefulness signal |
| reported_wrong | User/DA says answer is wrong | Safety signal |
| need_da_help | User requests DA help from the product | Explicit escalation |
| abandoned | User leaves without using answer or following up | Likely failure or low value |
| escalated_to_DA | Same/similar question goes to DA after MOAT answer | Main online failure signal |

This lifecycle should be logged with: user role, domain, timestamp, generated SQL hash, query id, session id, and whether the query is eligible for golden-set promotion.
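
One possible shape for a lifecycle log record, using the states and fields listed above. This is a sketch under assumptions: the class and field names are hypothetical, SHA-256 over whitespace-normalised SQL is just one reasonable hashing choice, and how the record is shipped to the logging backend is out of scope.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    ASKED = "asked"
    CLARIFICATION_NEEDED = "clarification_needed"
    SQL_GENERATED = "sql_generated"
    SQL_DRY_RUN_PASSED = "sql_dry_run_passed"
    SQL_EXECUTED = "sql_executed"
    ANSWER_RENDERED = "answer_rendered"
    ANSWER_USED = "answer_used"
    REPORTED_WRONG = "reported_wrong"
    NEED_DA_HELP = "need_da_help"
    ABANDONED = "abandoned"
    ESCALATED_TO_DA = "escalated_to_DA"


def hash_sql(sql: str) -> str:
    """Hash whitespace-normalised SQL so reformatted copies share one hash."""
    normalised = " ".join(sql.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


@dataclass
class LifecycleEvent:
    query_id: str
    session_id: str
    state: State
    user_role: str
    domain: str
    timestamp: datetime
    sql_hash: str = ""             # empty until SQL has been generated
    golden_eligible: bool = False  # candidate for golden-set promotion

    def to_json(self) -> str:
        """Serialise to one JSON line, with enum and datetime made JSON-safe."""
        record = asdict(self)
        record["state"] = self.state.value
        record["timestamp"] = self.timestamp.isoformat()
        return json.dumps(record)
```

Emitting one event per state transition, rather than one mutable row per query, keeps the funnel reconstructable after the fact.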


9.3 Online funnel dashboard

Online evaluation should be displayed as a funnel:

Business questions asked
  ↓
Clarified or understood
  ↓
SQL generated
  ↓
SQL dry-run passed
  ↓
SQL executed
  ↓
Answer rendered
  ↓
Answer used
  ↓
No DA escalation within 24h

Example dashboard shape:

| Funnel step | Metric | Pilot target |
|---|---|---|
| Questions asked | Total weekly queries | Growing week-over-week |
| SQL generated | % of questions with generated SQL | ≥ 90% |
| Dry-run passed | % generated SQL passing dry-run | ≥ 85% |
| Executed | % generated SQL executing successfully | ≥ 80% |
| Answer used | % rendered answers copied/exported/shared/followed-up | ≥ 40% initially |
| No escalation | % answers not escalated to DA within 24h | ≥ 60% initially |
| Reported wrong | % answers reported wrong by user/DA | ≤ 10% initially |

Targets should start modestly. Early production evaluation should expose failure modes, not pretend maturity.
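
The funnel percentages can be derived from lifecycle events. A minimal sketch, assuming events arrive as `(query_id, state)` pairs; the denominators follow the wording of the dashboard table (for example, "Executed" is a share of generated SQL, not of dry-run passes). Function and constant names are hypothetical.

```python
from collections import defaultdict

FUNNEL_STEPS = [
    "asked", "sql_generated", "sql_dry_run_passed",
    "sql_executed", "answer_rendered", "answer_used",
]

# Each rate's denominator, as worded in the dashboard table.
DENOMINATOR = {
    "sql_generated": "asked",
    "sql_dry_run_passed": "sql_generated",
    "sql_executed": "sql_generated",
    "answer_used": "answer_rendered",
}


def funnel_counts(events: list) -> dict:
    """Count distinct queries that reached each funnel step."""
    reached = defaultdict(set)
    for query_id, state in events:
        reached[state].add(query_id)
    return {step: len(reached[step]) for step in FUNNEL_STEPS}


def funnel_rates(counts: dict) -> dict:
    """Convert counts to the ratios shown on the dashboard (0.0 if no base)."""
    return {
        step: (counts[step] / counts[base] if counts[base] else 0.0)
        for step, base in DENOMINATOR.items()
    }
```

Counting distinct query ids per state means a retried or duplicated event does not inflate a step.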


9.4 Defining escalation

Escalation is the core online failure event.

For the May 15 pilot, escalation can be explicit:

  • User clicks Need DA help.
  • User clicks Report wrong.
  • DA marks a MOAT answer as needing correction.

Later, escalation should include implicit signals:

  • User sends the same or semantically similar question to a DA channel after using MOAT.
  • User asks “can someone verify this?” in the DA/business channel.
  • User repeats the same question multiple times with different wording.
  • User abandons MOAT and resumes the old DA workflow.

The long-term system should connect MOAT query logs to Lark/Slack/DA workflow signals, but this is not required for the first release. The first release only needs explicit escalation buttons and DA correction capture.


9.5 Safety: deflection alone can be dangerous

A pure DA-deflection metric can be gamed by confident wrong answers.

Bad outcome:

MOAT gives wrong answer
→ user trusts it
→ user does not escalate
→ dashboard counts it as successful deflection
→ business decision is wrong

Therefore online evaluation needs two layers:

| Layer | Question | Metrics |
|---|---|---|
| Efficiency | Did MOAT reduce DA interruptions? | self-service rate, escalation rate, time-to-answer, answer-used rate |
| Safety | Were the deflected answers actually trustworthy? | reported-wrong rate, DA audit sample, high-risk query review, online-offline correlation |

Recommended safety controls:

  1. DA audit sample: every week, DAs review a random sample of “successful” non-escalated answers.
  2. High-risk query flag: queries involving money, compliance, leadership reporting, or large business decisions require stronger review or confidence display.
  3. Confidence and caveats: MOAT should show when definitions/scope are uncertain instead of pretending certainty.
  4. Correction path: every answer must have an easy “Report wrong / Need DA help” path that captures the query, SQL, result, and user explanation.
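
Controls 1 and 2 could share one weekly selection step. The sketch below is an assumption-heavy illustration: the tag names in `HIGH_RISK_TAGS`, the dictionary layout of an answer, and the choice to always include every high-risk answer on top of the random sample are all hypothetical, not decisions made in this document.

```python
import random

# Hypothetical tags standing in for "money, compliance, leadership reporting".
HIGH_RISK_TAGS = {"money", "compliance", "leadership_reporting"}


def pick_audit_sample(answers: list, k: int, seed=None) -> list:
    """Pick non-escalated answers for DA review.

    Each answer is a dict with keys "id", "escalated" (bool) and "tags" (set).
    All high-risk answers are included; k more are sampled at random.
    """
    rng = random.Random(seed)
    candidates = [a for a in answers if not a["escalated"]]
    high_risk = [a for a in candidates if a["tags"] & HIGH_RISK_TAGS]
    normal = [a for a in candidates if not (a["tags"] & HIGH_RISK_TAGS)]
    return high_risk + rng.sample(normal, min(k, len(normal)))
```

Sampling only from non-escalated answers targets exactly the cases the deflection metric counts as successes, which is where silent wrong answers would hide.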

9.6 Online-to-offline flywheel

Online evaluation should feed offline evaluation, but not automatically update live knowledge.

Recommended loop:

Production query
  → success / failure / escalation observed
  → DA reviews failure or high-value success
  → candidate added to G_new
  → weekly offline evaluation catches regression/coverage
  → DA correction form proposes K update
  → domain team leader approves
  → K' promoted to K
  → future G_core version includes stable validated cases

Hard rule:

Do not update K directly from production failures without domain-lead approval.

This preserves trust and prevents the system from learning bad business definitions from noisy user behavior.


9.7 Online-offline correlation

A mature evaluation system should track whether offline scores predict online value.

For each domain, compare:

  • G_core IFAB score
  • G_new coverage score
  • SQL execution success rate
  • escalation rate
  • answer-used rate
  • reported-wrong rate

Key interpretation:

| Pattern | Meaning | Action |
|---|---|---|
| Offline ↑, escalation ↓ | Eval is aligned with real utility | Continue current loop |
| Offline ↑, escalation flat/up | Golden set is not representative | Source more G_new from real production traces |
| Offline flat, online improves | Users ask simpler questions than golden set | Rebalance golden set by actual usage distribution |
| Offline high, reported-wrong high | Dangerous silent-trust gap | Increase DA audit and high-risk query gates |

This correlation is the CTO health check for the entire evaluation program.
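
The interpretation table can be turned into a per-domain rule check. This is a rough sketch: every threshold (`flat_band`, `high_offline`, `high_reported_wrong`) is a hypothetical placeholder, and "online improves" is read here as "escalation rate falls", which is only one of the online signals listed above. All inputs are in percentage points.

```python
def interpret(offline_delta: float, escalation_delta: float,
              offline_score: float, reported_wrong_rate: float,
              flat_band: float = 1.0, high_offline: float = 85.0,
              high_reported_wrong: float = 10.0) -> str:
    """Map one domain's weekly numbers onto a row of the interpretation table."""
    # The safety pattern is checked first because it is the most dangerous one.
    if offline_score >= high_offline and reported_wrong_rate >= high_reported_wrong:
        return "silent-trust gap"
    offline_up = offline_delta > flat_band
    offline_flat = abs(offline_delta) <= flat_band
    escalation_down = escalation_delta < -flat_band
    if offline_up and escalation_down:
        return "aligned"
    if offline_up:
        return "golden set not representative"
    if offline_flat and escalation_down:
        return "users ask simpler questions"
    return "no pattern matched"
```

The table does not cover every combination (for example, offline scores falling), so the function returns "no pattern matched" rather than guessing.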


9.8 May 15 minimum online scope

For the May 15 release, do not overbuild online evaluation. Add only enough instrumentation to make the first dashboard honest.

Minimum scope:

  1. Log query lifecycle events: asked, sql_generated, sql_executed, answer_rendered.
  2. Add two explicit buttons: Need DA help and Report wrong.
  3. Track basic answer-use signals: copy, export, share, follow-up question.
  4. Define escalation window: 24 hours for pilot.
  5. Produce weekly online dashboard with: active users, query count, SQL success, time-to-answer, explicit escalation, reported-wrong, answer-used rate.
  6. Create DA correction intake from escalated/reported-wrong queries.

Defer until after May 15:

  • Lark/Slack semantic escalation detection.
  • Automated same-question matching.
  • Sophisticated confidence scoring.
  • Full online-offline correlation automation.

9.9 CTO summary recommendation

MOAT online evaluation should optimize for safe DA deflection, not generic satisfaction.

The online dashboard must answer three questions every week:

  1. Adoption: Are business users actually asking MOAT instead of DAs?
  2. Efficiency: Are they getting answers without DA intervention?
  3. Safety: Are those deflected answers trustworthy enough, or are we creating silent wrong decisions?

If we can answer these three questions, the evaluation system becomes a management instrument — not just an engineering benchmark.


10. Further Discussion Notes

Short notes from CTO consultant review. Not blockers for May 15.

1. Target audience correction: MOAT users are business people (marketing, product, bosses) who currently ask DAs for data — not DAs themselves. Online metrics must measure whether they self-serve or still bug DAs.

2. Golden query source: G_new should increasingly come from real user sessions and DA correction cases — not invented only by DAs. "Trace-to-golden" pipeline: log real queries → DA validates → promote accepted ones.

3. Judge calibration: Add precision/recall alongside agreement. A judge with 90% agreement can still miss real errors. Target: precision ≥ 85%, recall ≥ 80%.
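
A small worked example shows why agreement alone can mislead. The sketch assumes the positive class is "this label is an error" (human says the agent was wrong), so recall measures how many real errors the judge catches; the document does not fix this convention, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(human_is_error: list, judge_is_error: list) -> tuple:
    """Precision and recall of the judge at detecting errors, humans as truth."""
    pairs = list(zip(human_is_error, judge_is_error))
    tp = sum(1 for h, j in pairs if h and j)        # real error, judge caught it
    fp = sum(1 for h, j in pairs if not h and j)    # judge flagged a correct answer
    fn = sum(1 for h, j in pairs if h and not j)    # real error, judge missed it
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# 20 calibration labels: 2 real errors, the judge catches only one of them.
human = [True, True] + [False] * 18
judge = [True, False] + [False] * 18
agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
precision, recall = precision_recall(human, judge)
```

Here agreement is 95% and precision is 100%, yet recall is only 50%: the judge passes the 90% agreement gate while missing half of the real errors. Because errors are rare in a mostly-correct system, recall is the number to watch.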

4. Snapshot refresh: Refresh result snapshots weekly for G_new; keep them pinned for G_core (they change only on a version bump). Document the snapshot date in every Friday report.

5. Launch volume: 30-50 queries for G_core (~3-5 per domain). G_new grows 5-10 per domain/week. G_core v2 at month 1: 60-80 queries.


End of document. Questions and critiques welcome — this is a design draft, not a decree.
