Version: v10 | Prepared by: Son Pham | Date: 2026-05-08
MOAT is MoMo's AI data agent that answers business questions in natural language against BigQuery. This document describes the evaluation framework — both how we measure quality (IFAB) and how we continuously improve without degrading what already works.
Key points:
- Evaluation starts Friday, May 15, at 12:00 with the first release for feedback. Everyone is invited to critique.
- Two evaluation modes: offline (controlled, against known queries) and online (live production signals)
- The knowledge base is human-gated: domain team leader approves all updates
- DA time allocation is open — input needed from data managers and leadership
This is a design document. KPIs, milestones, and resource targets are deliberately open — your input is needed to complete the next version.
IFAB evaluates query logic rather than raw numbers, so it still works when result snapshots are time-sensitive. It is adapted from existing frameworks and refined for MoMo's use case. The weights are a starting point and are being validated with the data team.
| Component | Points | What it checks | Why it matters |
|---|---|---|---|
| I — Intent | 4 | Did the agent understand what the user is actually asking for? | Highest weight — if intent is wrong, the answer is confidently wrong. Worst failure mode. |
| F — Filter | 3 | Did it apply the right scope (which records / time range / segment to include)? | Wrong filter = right shape, wrong slice. Numbers look reasonable but answer a different question. |
| A — Aggregation | 2 | Did it summarize the data correctly (totals vs averages, by-customer vs overall)? | Wrong aggregation makes the result meaningless even if filters and intent are right. |
| B — Business rule | 1 | Did it apply MoMo-specific definitions (how "active user", "transaction", etc. are defined here)? | Lowest weight because business rules are the most learnable — DAs can teach them via corrections. Not foundational, but essential for production trust. |
| Returned Numbers | 0–3 | Exact / close / wrong / completely wrong | The reality check — does the final number match the golden answer. |
| Total | 13 | IFAB (10) + Returned Numbers (3) | Weighted 70% IFAB : 30% Numbers |
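To make the rubric concrete, here is a minimal sketch of how one query's score could be computed. The raw 13-point total splits 10:3 between IFAB and Returned Numbers, so this sketch treats the 70:30 weighting as a separate normalized blend; that reading is an assumption, not a confirmed rule. The names `IFAB_MAX`, `raw_total`, and `weighted_score` are hypothetical, as is the mapping of Returned Numbers to 3 = exact down to 0 = completely wrong.

```python
# Maximum points per IFAB component, taken from the rubric table.
IFAB_MAX = {"intent": 4, "filter": 3, "aggregation": 2, "business_rule": 1}
# Assumed mapping for Returned Numbers: 3 = exact, 2 = close, 1 = wrong, 0 = completely wrong.
NUMBERS_MAX = 3


def validate(ifab: dict, numbers: int) -> None:
    """Reject scores that fall outside the rubric's ranges."""
    if set(ifab) != set(IFAB_MAX):
        raise ValueError("expected exactly the four IFAB components")
    for name, points in ifab.items():
        if not 0 <= points <= IFAB_MAX[name]:
            raise ValueError(f"{name} must be between 0 and {IFAB_MAX[name]}")
    if not 0 <= numbers <= NUMBERS_MAX:
        raise ValueError(f"numbers must be between 0 and {NUMBERS_MAX}")


def raw_total(ifab: dict, numbers: int) -> int:
    """Awarded points out of 13 (10 IFAB + 3 Returned Numbers)."""
    validate(ifab, numbers)
    return sum(ifab.values()) + numbers


def weighted_score(ifab: dict, numbers: int, ifab_weight: float = 0.7) -> float:
    """Assumed reading of 70:30: normalize each part to [0, 1], then blend."""
    validate(ifab, numbers)
    ifab_part = sum(ifab.values()) / sum(IFAB_MAX.values())
    numbers_part = numbers / NUMBERS_MAX
    return ifab_weight * ifab_part + (1 - ifab_weight) * numbers_part


example = {"intent": 4, "filter": 3, "aggregation": 1, "business_rule": 0}
print(raw_total(example, numbers=2))                 # 10 of 13 points
print(round(weighted_score(example, numbers=2), 2))  # 0.76
```

Whichever reading the data team settles on, keeping both numbers in the report would make it clear when a query loses points on logic versus on the final number.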
| Set | Definition | Rule |
|---|---|---|
| G_core | Frozen golden queries — never changes | Regression baseline. Comparable forever. |
| G_new | Growing set — DAs add queries weekly | Coverage expansion. Tracks new domains. |
Hard rule: G_core failures must NOT be used as direct input to update K. Otherwise the agent just memorizes the test. K updates come from G_new queries and online signals — G_core stays a clean held-out benchmark.
G_core is immutable within a version: the same queries run every Friday. This is what makes week-over-week comparison meaningful.
G_new grows weekly as DA teams add new queries per domain (~60 domains across the company, coordinated by the data managers). Domain coverage is a manager-level responsibility.
flowchart LR
A[DA Teams\nadd to G_new weekly] --> B[(G_core — frozen\nG_new — growing)]
B --> C[Friday 12:00\nAuto-run]
C --> D[MOAT scores\neach query — IFAB]
D --> E[Regression score\nIFAB on G_core\nweek-over-week]
D --> F[Coverage score\nIFAB on G_new\nnew domains]
E --> G[Management Report\nflagged regressions]
F --> G
D --> H[DA Report\npoorly scored queries\n+ AI diagnosis]
H --> I[DA fills correction form]
Per domain: regression score on G_core, week-over-week delta, coverage score on G_new. Domains with regressions or low absolute scores are flagged automatically.
| Domain | Regression (G_core) | vs last week | Coverage (G_new) | Status |
|---|---|---|---|---|
| Túi Thần Tài | 95% | +1% | 92% | ✅ |
| Vay Nhanh | 91% | +0% | 88% | ✅ |
| Bank Partnership | 89% | -1% | 85% | ✅ |
| VTS / PayLater | 72% | -8% | 60% | 🚩 Regression |
| … | … | … | … | (60 domains total — onboarding progressively) |
How to read this table:
- Regression (G_core) = score on the frozen golden set. The honest signal of whether quality is holding.
- vs last week = the comparable delta. Green if stable or improving, red if dropping.
- Coverage (G_new) = score on this week's new queries. Tracks how well the agent handles fresh domain knowledge.
- 🚩 Regression triggers an automatic alert — the domain manager and team leader investigate immediately.
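The automatic flagging described above could look like the following sketch. The document does not state the thresholds, so `MAX_DROP_POINTS` and `MIN_ABSOLUTE_SCORE` are hypothetical placeholders, and `DomainScore` is an illustrative record shape rather than the real report schema.

```python
from dataclasses import dataclass


@dataclass
class DomainScore:
    domain: str
    g_core_now: float   # this week's G_core score, in percent
    g_core_prev: float  # last week's G_core score, in percent
    g_new_now: float    # this week's G_new coverage score, in percent


# Hypothetical thresholds; the real values are still to be agreed.
MAX_DROP_POINTS = 5.0
MIN_ABSOLUTE_SCORE = 75.0


def status(score: DomainScore) -> str:
    """Classify a domain as 'regression', 'low_score', or 'ok'."""
    delta = score.g_core_now - score.g_core_prev
    if delta <= -MAX_DROP_POINTS:
        return "regression"
    if score.g_core_now < MIN_ABSOLUTE_SCORE or score.g_new_now < MIN_ABSOLUTE_SCORE:
        return "low_score"
    return "ok"


rows = [
    DomainScore("Túi Thần Tài", 95, 94, 92),
    DomainScore("Bank Partnership", 89, 90, 85),
    DomainScore("VTS / PayLater", 72, 80, 60),
]
for row in rows:
    print(row.domain, status(row))
```

With these placeholder thresholds the sample rows reproduce the statuses in the table: a 1-point dip stays green, while an 8-point drop is flagged.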
When a query scores poorly, the DA submits a correction. The promotion to live K is gated by the domain team leader, not an automated check.
| Field | Content |
|---|---|
| Golden SQL | The correct query that should have been generated |
| Explanation | Why the agent got it wrong |
| Agent log (optional) | Agent reasoning steps — provided automatically |
flowchart LR
A[DA correction form] --> B[LLM extracts\nstructured knowledge]
B --> C[(K' — staging)]
C --> D[Agent re-tests\nonly the failing query q]
D -->|Pass| E[Team leader\nreviews knowledge content\n+ approves]
D -->|Fail| A
E --> F[(K — live system)]
Why no automated G_core check here? Running the full agent across G_core for every correction would mean hundreds of LLM calls per fix — too expensive, too slow, and unnecessary. The team leader's domain expertise is the quality gate. Friday's G_core regression is the safety net that catches any drift introduced during the week.
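The gate in the flowchart can be read as a small decision function. This is only a sketch: `Step` and `next_step` are hypothetical names, and the handling of a team-leader rejection (sending the entry back to the DA) is an assumption, since the flowchart does not show that path.

```python
from enum import Enum
from typing import Optional


class Step(Enum):
    BACK_TO_DA = "back_to_da"
    AWAITING_REVIEW = "awaiting_review"
    PROMOTED_TO_K = "promoted_to_k"


def next_step(retest_passed: bool, leader_approved: Optional[bool]) -> Step:
    """Decide where a staged correction (K') goes next.

    leader_approved is None while the team leader has not reviewed yet.
    """
    if not retest_passed:
        # The failing query q still fails against K': the DA revises the form.
        return Step.BACK_TO_DA
    if leader_approved is None:
        return Step.AWAITING_REVIEW
    if leader_approved:
        return Step.PROMOTED_TO_K
    # Assumption: a rejected entry returns to the DA for revision.
    return Step.BACK_TO_DA


print(next_step(retest_passed=True, leader_approved=None).value)
```

Note that nothing in this path runs G_core; the only automated check is the single failing query, and promotion always requires an explicit human approval.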
Conflict resolution: Domain team leader resolves conflicts between DA submissions.
Once a month, G_core is expanded by promoting stable queries from G_new. When this happens, the version is bumped, and scores before and after a version bump are not directly comparable.
G_core_v2 = G_core_v1 ∪ { validated queries from G_new }
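The version bump is a set union that produces a new frozen set rather than editing the old one. A minimal sketch, assuming queries are identified by string ids; `GoldenSet`, `bump_g_core`, and `comparable` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSet:
    version: int
    query_ids: frozenset


def bump_g_core(g_core: GoldenSet, validated_from_g_new: set) -> GoldenSet:
    """Return G_core_v(n+1) = G_core_v(n) ∪ validated queries; the old version is untouched."""
    return GoldenSet(
        version=g_core.version + 1,
        query_ids=g_core.query_ids | frozenset(validated_from_g_new),
    )


def comparable(version_a: int, version_b: int) -> bool:
    """Scores are only directly comparable within the same G_core version."""
    return version_a == version_b


v1 = GoldenSet(version=1, query_ids=frozenset({"q1", "q2"}))
v2 = bump_g_core(v1, {"q2", "q9"})
print(v2.version, sorted(v2.query_ids))
```

Storing the version next to every weekly score would let the report mark the week of a bump instead of showing a misleading delta.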
What it is: Monitoring real business user queries in production between weekly tests. MOAT users are not Data Analysts — they are marketing execs, product owners, and bosses who currently ask DAs for data.
Note on feedback signals: Thumbs up/down feedback is too sparse to be useful for an internal tool (click rates are ~5–8% in consumer apps and near zero here). Instead, we measure whether users get answers without bothering DAs.
flowchart LR
A[Business user sends query] --> B[MOAT answers]
B --> C{System signals}
C -->|Got answer?| D[Log success]
C -->|Gave up?| E[Log escalation to DA]
C -->|/feedback command| F[Routed to Domain DA]
B -->|SQL failed?| G[Log error]
E --> H[DA fills correction form]
F --> H
G --> H
Primary online metrics:
| Metric | Definition | Target |
|---|---|---|
| Self-service rate | % of queries answered without human DA intervention | ≥ 70% |
| Escalation rate | % of queries where user gives up and asks a DA | ≤ 20% |
| Query success rate | % of agent-generated SQLs that execute without error | ≥ 85% |
| Time-to-answer | Median time from question to usable data | < 1 min (vs hours waiting for DA) |
| Weekly active users | Unique business users per week | Growing week-over-week |
| Query complexity | % of queries requiring JOINs / window functions vs single-table | Increasing over time |
The goal: Users who previously Slacked a DA for "active users this month" now get it themselves in 30 seconds. If they're still Slacking the DA, the tool has failed.
K = markdown wiki files, one folder per domain, managed with Git.
K/
├── domain-fs/
├── domain-paylater/
└── [one folder per domain]
Git provides versioning, review history, and conflict resolution natively — same principle as code review.
Stale knowledge is a known limitation. The roadmap is a knowledge-graph approach (reference: gbrain, Garry Tan / YC), in which knowledge entries carry explicit structure and relationships, enabling updates and invalidation as domains evolve.
Starting Friday May 15, 12:00 — one link, always accessible, updated weekly. The weekly report contains three things:
| # | Section | Content |
|---|---|---|
| 1 | Technical (offline scores) | G_core regression score + G_new coverage score, per domain, week-over-week |
| 2 | Adoption & efficiency (online) | Self-service rate, escalation rate, query success rate, time-to-answer, weekly active business users, query complexity — daily + weekly trend |
| 3 | Documents / source code tracking | Three sources: (1) this evaluation document, (2) the memory design document, (3) the codebase |
The IFAB score for every query is produced by an LLM judge (LLM-as-a-Judge). This means every score in the Friday report depends on the judge being reliable. If the judge drifts or is inconsistent, regression detection silently breaks.
Who watches the watchers? We must continuously verify that the LLM judge is itself trustworthy.
flowchart LR
A[Calibration set\n20–30 queries with\nhuman-validated IFAB scores] --> B[Run weekly\nalongside regression]
B --> C[LLM judge scores\nthe calibration queries]
C --> D{Agreement\nwith humans}
D -->|≥ 90%| E[✅ Judge trusted\nFriday scores valid]
D -->|< 90%| F[🚩 Judge drifted\nInvestigate before\ntrusting scores]
F --> G[Re-tune judge prompt\nor escalate]
| Mechanism | What it does | Cadence |
|---|---|---|
| Calibration set | Frozen queries with human-validated IFAB scores. Measures judge accuracy. | Every Friday |
| Human spot-check | DAs / team leaders hand-score a sample of LLM-judged queries — humans pick their own samples. Tracks human–LLM agreement rate. (Optional helper: a second LLM judge can flag disagreement cases as additional candidates — minor input, not the main selection.) | Weekly sample |
Human–LLM agreement rate is tracked as a system health metric on the stakeholder dashboard. If it drops below 90%, that week's IFAB scores are flagged as untrusted until investigated.
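The weekly judge check reduces to one agreement rate compared against the 90% threshold. The sketch below assumes "agreement" means the judge's total IFAB score matches the human score for the same query, optionally within a tolerance; the document does not define this, and `agreement_rate` and `judge_trusted` are hypothetical names.

```python
TRUST_THRESHOLD = 0.90  # from the flowchart: >= 90% means the judge is trusted


def agreement_rate(human: dict, judge: dict, tolerance: int = 0) -> float:
    """Share of calibration queries where judge and human scores agree.

    Both arguments map query id -> IFAB score. Assumption: scores agree
    when they differ by at most `tolerance` points.
    """
    shared = human.keys() & judge.keys()
    if not shared:
        raise ValueError("no calibration queries in common")
    agreeing = sum(1 for q in shared if abs(human[q] - judge[q]) <= tolerance)
    return agreeing / len(shared)


def judge_trusted(rate: float) -> bool:
    return rate >= TRUST_THRESHOLD


human_scores = {f"q{i}": 10 for i in range(10)}
judge_scores = dict(human_scores, q0=6)  # the judge disagrees on one query
rate = agreement_rate(human_scores, judge_scores)
print(rate, judge_trusted(rate))
```

With only 20 to 30 calibration queries, each disagreement moves the rate by several points, so a tolerance or a per-component comparison may be worth discussing with the data team.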
Three DA KPIs — targets open for input:
| KPI | Definition | Target |
|---|---|---|
| Fix turnaround time | Error flagged → DA submits correction | (input needed from data managers) |
| Golden query production | Queries added to G_new per domain per week | (input needed from data managers) |
| Judge calibration / spot-check | DA hand-scores sample queries to validate the LLM judge (Section 7) | (input needed from data managers) |
Seeking input from data managers and leadership. DA time must be formally allocated for this work to be sustainable.
Infrastructure: Langfuse self-hosted · BigQuery · Claude Sonnet 4.6
The offline evaluation design above answers: can MOAT produce correct SQL for known questions?
Online evaluation answers a different and more important production question:
Did MOAT actually reduce DA workload and help business users get usable answers faster, without creating silent wrong decisions?
This distinction matters because a data agent can score well offline while still failing in production:
- Business users ask vaguer questions than golden queries.
- Users may not know enough SQL/business logic to detect a wrong answer.
- Explicit feedback is sparse; most users will not click thumbs up/down.
- A confident wrong answer can look like successful DA deflection unless we audit it.
So online evaluation should not be framed as "user satisfaction." For MOAT, it should be framed as safe DA deflection.
The first online north-star metric should be:
Self-service resolution rate: the percentage of business-user questions answered without human DA intervention, within an acceptable safety envelope.
Formula:
self_service_resolution_rate =
resolved_without_DA / total_business_user_queries
But "resolved" must be defined carefully. A query is not resolved merely because SQL executed. It is resolved only when the user reached a usable answer and did not escalate to a DA within the observation window.
Recommended definition for pilot:
Resolved =
SQL executed successfully
AND answer was shown to user
AND user did not click “Need DA help” / “Report wrong”
AND user did not escalate the same or similar question within 24 hours
The 24-hour window is a starting point. Some domains may need a shorter or longer window depending on how business teams work.
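The pilot definition of "resolved" translates directly into a predicate over logged signals. This is a sketch under assumptions: `QueryOutcome` and its field names are hypothetical, and how a "same or similar question" is detected is left outside the code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ESCALATION_WINDOW = timedelta(hours=24)  # pilot starting point


@dataclass
class QueryOutcome:
    sql_executed: bool
    answer_rendered: bool
    clicked_need_da_help: bool
    clicked_report_wrong: bool
    answered_at: datetime
    escalated_at: Optional[datetime] = None  # same/similar question reached a DA


def is_resolved(q: QueryOutcome, window: timedelta = ESCALATION_WINDOW) -> bool:
    """Apply the four AND-ed conditions of the pilot definition."""
    if not (q.sql_executed and q.answer_rendered):
        return False
    if q.clicked_need_da_help or q.clicked_report_wrong:
        return False
    if q.escalated_at is not None:
        gap = q.escalated_at - q.answered_at
        if timedelta(0) <= gap <= window:
            return False
    return True


def self_service_resolution_rate(queries: list) -> float:
    """resolved_without_DA / total_business_user_queries."""
    if not queries:
        return 0.0
    return sum(is_resolved(q) for q in queries) / len(queries)
```

Passing `window` as a parameter keeps the door open for per-domain windows without changing the definition itself.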
Every production query should move through an explicit lifecycle. This gives us a funnel view instead of one vague success metric.
| State | Meaning | Why it matters |
|---|---|---|
| asked | User submitted a natural-language business question | Denominator for all online metrics |
| clarification_needed | Agent asks user to clarify scope/time/domain | Detects ambiguous requests and product friction |
| sql_generated | Agent produced SQL | Measures language-to-SQL coverage |
| sql_dry_run_passed | BigQuery dry-run succeeded | Separates syntax/schema issues from execution issues |
| sql_executed | Query ran successfully | Technical reliability |
| answer_rendered | User saw table/chart/summary | Product delivery, not just backend success |
| answer_used | User copied/exported/shared/saved or asked analytical follow-up | Strong implicit usefulness signal |
| reported_wrong | User/DA says answer is wrong | Safety signal |
| need_da_help | User requests DA help from the product | Explicit escalation |
| abandoned | User leaves without using answer or following up | Likely failure or low value |
| escalated_to_DA | Same/similar question goes to DA after MOAT answer | Main online failure signal |
This lifecycle should be logged with: user role, domain, timestamp, generated SQL hash, query id, session id, and whether the query is eligible for golden-set promotion.
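One possible shape for a logged lifecycle event, carrying the fields listed above, is sketched here. The class and function names are hypothetical, hashing whitespace-normalized SQL with SHA-256 is an assumption, and only a subset of states is spelled out to keep the example short.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class State(Enum):
    ASKED = "asked"
    SQL_GENERATED = "sql_generated"
    SQL_EXECUTED = "sql_executed"
    ANSWER_RENDERED = "answer_rendered"
    ESCALATED_TO_DA = "escalated_to_DA"


@dataclass
class LifecycleEvent:
    query_id: str
    session_id: str
    state: State
    user_role: str
    domain: str
    timestamp: datetime
    sql_hash: Optional[str]  # None before SQL exists
    golden_eligible: bool    # candidate for golden-set promotion


def hash_sql(sql: str) -> str:
    """Hash SQL after collapsing whitespace so formatting changes do not alter the hash."""
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def to_json_line(event: LifecycleEvent) -> str:
    """Serialize one event as a JSON line for the log sink."""
    record = asdict(event)
    record["state"] = event.state.value
    record["timestamp"] = event.timestamp.isoformat()
    return json.dumps(record, ensure_ascii=False)


event = LifecycleEvent(
    query_id="q-001", session_id="s-42", state=State.SQL_EXECUTED,
    user_role="marketing", domain="paylater",
    timestamp=datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc),
    sql_hash=hash_sql("SELECT  COUNT(*)\nFROM t"), golden_eligible=False,
)
print(to_json_line(event))
```

Logging a hash rather than the SQL text keeps the event small and lets identical queries be grouped; the full SQL would still need to be stored elsewhere for the DA correction form.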
Online evaluation should be displayed as a funnel:
Business questions asked
↓
Clarified or understood
↓
SQL generated
↓
SQL dry-run passed
↓
SQL executed
↓
Answer rendered
↓
Answer used
↓
No DA escalation within 24h
Example dashboard shape:
| Funnel step | Metric | Pilot target |
|---|---|---|
| Questions asked | Total weekly queries | Growing week-over-week |
| SQL generated | % of questions with generated SQL | ≥ 90% |
| Dry-run passed | % generated SQL passing dry-run | ≥ 85% |
| Executed | % generated SQL executing successfully | ≥ 80% |
| Answer used | % rendered answers copied/exported/shared/followed-up | ≥ 40% initially |
| No escalation | % answers not escalated to DA within 24h | ≥ 60% initially |
| Reported wrong | % answers reported wrong by user/DA | ≤ 10% initially |
Targets should start modestly. Early production evaluation should expose failure modes, not pretend maturity.
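The funnel metrics can be derived from the lifecycle log by counting distinct queries that reached each state. The sketch below assumes events arrive as `(query_id, state)` pairs and uses the denominators named in the dashboard table; using rendered answers as the denominator for the reported-wrong rate is an assumption.

```python
from collections import defaultdict


def reached(events):
    """Map each state to the set of query ids that reached it."""
    by_state = defaultdict(set)
    for query_id, state in events:
        by_state[state].add(query_id)
    return by_state


def ratio(numerator: set, denominator: set) -> float:
    return len(numerator) / len(denominator) if denominator else 0.0


def funnel(events) -> dict:
    r = reached(events)
    return {
        "questions_asked": len(r["asked"]),
        "sql_generated_rate": ratio(r["sql_generated"], r["asked"]),
        "dry_run_pass_rate": ratio(r["sql_dry_run_passed"], r["sql_generated"]),
        "executed_rate": ratio(r["sql_executed"], r["sql_generated"]),
        "answer_used_rate": ratio(r["answer_used"], r["answer_rendered"]),
        # Assumption: "answers reported wrong" is measured against rendered answers.
        "reported_wrong_rate": ratio(r["reported_wrong"], r["answer_rendered"]),
    }


events = [
    ("q1", "asked"), ("q1", "sql_generated"), ("q1", "sql_dry_run_passed"),
    ("q1", "sql_executed"), ("q1", "answer_rendered"), ("q1", "answer_used"),
    ("q2", "asked"), ("q2", "sql_generated"), ("q2", "sql_dry_run_passed"),
    ("q2", "sql_executed"), ("q2", "answer_rendered"),
    ("q3", "asked"), ("q3", "sql_generated"),
    ("q4", "asked"),
]
print(funnel(events))
```

Counting distinct query ids rather than raw events means retries and duplicate log lines do not inflate any step.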
Escalation is the core online failure event.
For the May 15 pilot, escalation can be explicit:
- User clicks Need DA help.
- User clicks Report wrong.
- DA marks a MOAT answer as needing correction.
Later, escalation should include implicit signals:
- User sends the same or semantically similar question to a DA channel after using MOAT.
- User asks “can someone verify this?” in the DA/business channel.
- User repeats the same question multiple times with different wording.
- User abandons MOAT and resumes the old DA workflow.
The long-term system should connect MOAT query logs to Lark/Slack/DA workflow signals, but this is not required for the first release. The first release only needs explicit escalation buttons and DA correction capture.
A pure DA-deflection metric can be gamed by confident wrong answers.
Bad outcome:
MOAT gives wrong answer
→ user trusts it
→ user does not escalate
→ dashboard counts it as successful deflection
→ business decision is wrong
Therefore online evaluation needs two layers:
| Layer | Question | Metrics |
|---|---|---|
| Efficiency | Did MOAT reduce DA interruptions? | self-service rate, escalation rate, time-to-answer, answer-used rate |
| Safety | Were the deflected answers actually trustworthy? | reported-wrong rate, DA audit sample, high-risk query review, online-offline correlation |
Recommended safety controls:
- DA audit sample: every week, DAs review a random sample of “successful” non-escalated answers.
- High-risk query flag: queries involving money, compliance, leadership reporting, or large business decisions require stronger review or confidence display.
- Confidence and caveats: MOAT should show when definitions/scope are uncertain instead of pretending certainty.
- Correction path: every answer must have an easy “Report wrong / Need DA help” path that captures the query, SQL, result, and user explanation.
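The weekly DA audit could draw its sample as sketched below. Always including high-risk queries on top of the random sample is an assumption that combines the first two controls; the function name, parameters, and seed are hypothetical.

```python
import random


def weekly_audit_sample(resolved_ids, high_risk_ids, sample_size: int, seed: int) -> list:
    """Pick non-escalated answers for DA review.

    Assumption: every high-risk resolved query is reviewed, plus a
    reproducible random sample of the remaining resolved queries.
    """
    resolved = set(resolved_ids)
    high_risk = sorted(resolved & set(high_risk_ids))
    rest = sorted(resolved - set(high_risk_ids))
    rng = random.Random(seed)  # fixed seed so the sample can be re-derived later
    k = min(sample_size, len(rest))
    return high_risk + rng.sample(rest, k)


resolved_this_week = [f"q{i}" for i in range(100)]
sample = weekly_audit_sample(resolved_this_week, {"q3", "q7"}, sample_size=10, seed=20260515)
print(len(sample))
```

Recording the seed alongside the audit results makes it possible to show later that the sample was not hand-picked.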
Online evaluation should feed offline evaluation, but not automatically update live knowledge.
Recommended loop:
Production query
→ success / failure / escalation observed
→ DA reviews failure or high-value success
→ candidate added to G_new
→ weekly offline evaluation catches regression/coverage
→ DA correction form proposes K update
→ domain team leader approves
→ K' promoted to K
→ future G_core version includes stable validated cases
Hard rule:
Do not update K directly from production failures without domain-lead approval.
This preserves trust and prevents the system from learning bad business definitions from noisy user behavior.
A mature evaluation system should track whether offline scores predict online value.
For each domain, compare:
- G_core IFAB score
- G_new coverage score
- SQL execution success rate
- escalation rate
- answer-used rate
- reported-wrong rate
Key interpretation:
| Pattern | Meaning | Action |
|---|---|---|
| Offline ↑, escalation ↓ | Eval is aligned with real utility | Continue current loop |
| Offline ↑, escalation flat/up | Golden set is not representative | Source more G_new from real production traces |
| Offline flat, online improves | Users ask simpler questions than golden set | Rebalance golden set by actual usage distribution |
| Offline high, reported-wrong high | Dangerous silent-trust gap | Increase DA audit and high-risk query gates |
This correlation is the CTO health check for the entire evaluation program.
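The interpretation table can be turned into a per-domain check once both score series exist. This sketch is illustrative only: the thresholds (`flat_band`, `high_offline`, `high_reported_wrong`) are hypothetical, all inputs are in percentage points, and "online improves" is simplified to "escalation rate drops".

```python
def diagnose(
    offline_now: float,
    offline_prev: float,
    escalation_now: float,
    escalation_prev: float,
    reported_wrong_rate: float,
    flat_band: float = 1.0,             # hypothetical: changes within 1 point count as flat
    high_offline: float = 85.0,         # hypothetical: what counts as a "high" offline score
    high_reported_wrong: float = 10.0,  # hypothetical: mirrors the pilot reported-wrong target
) -> str:
    """Map one domain's offline/online movement to a pattern from the table."""
    # The safety pattern is checked first because it matters most.
    if offline_now >= high_offline and reported_wrong_rate > high_reported_wrong:
        return "silent_trust_gap"
    offline_delta = offline_now - offline_prev
    escalation_delta = escalation_now - escalation_prev
    offline_up = offline_delta > flat_band
    offline_flat = abs(offline_delta) <= flat_band
    escalation_down = escalation_delta < -flat_band
    if offline_up and escalation_down:
        return "aligned"
    if offline_up:
        return "golden_set_not_representative"
    if offline_flat and escalation_down:
        return "users_ask_simpler_questions"
    return "no_named_pattern"


print(diagnose(90, 85, 15, 25, reported_wrong_rate=3))
```

A single week of movement is noisy, so in practice this would more likely be applied to multi-week trends than to week-over-week deltas.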
For the May 15 release, do not overbuild online evaluation. Add only enough instrumentation to make the first dashboard honest.
Minimum scope:
- Log query lifecycle events: asked, sql_generated, sql_executed, answer_rendered.
- Add two explicit buttons: Need DA help and Report wrong.
- Track basic answer-use signals: copy, export, share, follow-up question.
- Define escalation window: 24 hours for pilot.
- Produce weekly online dashboard with: active users, query count, SQL success, time-to-answer, explicit escalation, reported-wrong, answer-used rate.
- Create DA correction intake from escalated/reported-wrong queries.
Defer until after May 15:
- Lark/Slack semantic escalation detection.
- Automated same-question matching.
- Sophisticated confidence scoring.
- Full online-offline correlation automation.
MOAT online evaluation should optimize for safe DA deflection, not generic satisfaction.
The online dashboard must answer three questions every week:
- Adoption: Are business users actually asking MOAT instead of DAs?
- Efficiency: Are they getting answers without DA intervention?
- Safety: Are those deflected answers trustworthy enough, or are we creating silent wrong decisions?
If we can answer these three questions, the evaluation system becomes a management instrument — not just an engineering benchmark.
Short notes from CTO consultant review. Not blockers for May 15.
1. Target audience correction: MOAT users are business people (marketing, product, bosses) who currently ask DAs for data — not DAs themselves. Online metrics must measure whether they self-serve or still bug DAs.
2. Golden query source: G_new should increasingly come from real user sessions and DA correction cases — not invented only by DAs. "Trace-to-golden" pipeline: log real queries → DA validates → promote accepted ones.
3. Judge calibration: Add precision/recall alongside agreement. A judge with 90% agreement can still miss real errors. Target: precision ≥ 85%, recall ≥ 80%.
4. Snapshot refresh: Refresh snapshots weekly for G_new and keep them pinned for G_core (changing only on a version bump). Document the snapshot date in every Friday report.
5. Launch volume: 30-50 queries for G_core (~3-5 per domain). G_new grows 5-10 per domain/week. G_core v2 at month 1: 60-80 queries.
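Note 3 can be illustrated with a small worked example showing how a judge can reach 90% agreement while still missing most real errors. The sketch assumes each query is reduced to a pass/fail label and that "positive" means "flagged as failing"; neither is defined in the notes above, and `judge_metrics` is a hypothetical name.

```python
def judge_metrics(human_fail: list, judge_fail: list) -> dict:
    """Agreement, precision, and recall of the judge at detecting failing queries."""
    if len(human_fail) != len(judge_fail) or not human_fail:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human_fail, judge_fail))
    tp = sum(1 for h, j in pairs if h and j)          # real error, judge caught it
    fp = sum(1 for h, j in pairs if not h and j)      # fine query, judge flagged it
    fn = sum(1 for h, j in pairs if h and not j)      # real error, judge missed it
    tn = sum(1 for h, j in pairs if not h and not j)  # fine query, judge agreed
    return {
        "agreement": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# 100 queries, 10 real errors: the judge catches 4, misses 6, and wrongly flags 4.
human = [True] * 10 + [False] * 90
judge = [True] * 4 + [False] * 6 + [True] * 4 + [False] * 86
print(judge_metrics(human, judge))
```

In this constructed case agreement is 90% but precision is 50% and recall is 40%, both well below the proposed targets, because most queries are correct and agreement is dominated by the easy cases.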
End of document. Questions and critiques welcome — this is a design draft, not a decree.