@esz135888
Created May 23, 2026 20:33
PLS job 28e3e2ec AI prediction D14 correction router scorecard

Acceptance Tests

Test 1: D7 Run Ingest

Given a completed D7 calibration run, when D14 router starts, then it must ingest all miss, partial, unknown, and reviewer-disputed items.

Pass:

  • source_calibration_run_id exists.
  • non-hit/dispute count > 0 or explicit no-op record exists.
  • every ingested item links to original evidence refs.
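A minimal sketch of the ingest check, assuming each D7 item is a dict with hypothetical `id`, `label`, and `evidence_refs` keys and that reviewer-disputed items carry the label `dispute` (the item shape is not defined in this pack):

```python
# Assumed label strings for everything that is not a hit.
NON_HIT_LABELS = {"miss", "partial", "unknown", "dispute"}

def ingest_non_hits(run_id, items):
    """Build the ingest record for one completed D7 calibration run."""
    if not run_id:
        raise ValueError("source_calibration_run_id is required")
    non_hits = [i for i in items if i["label"] in NON_HIT_LABELS]
    missing = [i["id"] for i in non_hits if not i.get("evidence_refs")]
    if missing:
        raise ValueError(f"items without evidence refs: {missing}")
    return {
        "source_calibration_run_id": run_id,
        "items": non_hits,
        # An empty ingest is recorded explicitly instead of being skipped.
        "no_op": len(non_hits) == 0,
    }
```

Raising on a missing evidence ref keeps an untraceable item from entering the router at all; a softer variant could instead route it as a source gap.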

Test 2: Route Completeness

Given the ingested non-hit list, when routes are generated, then every item must have exactly one active route or an explicit ignore reason.

Pass:

  • routed non-hit rate = 100%.
  • route type is one of the approved enum values.
  • owner, due date, evidence refs, and acceptance rule are not blank.
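The completeness rule can be sketched as a single pass over the routes. This assumes routes are dicts keyed by the `correction_route` field names, that every route passed in is active, and that an explicit ignore is represented by the `ignore_with_reason` route type:

```python
from collections import Counter

ROUTE_TYPES = {"rubric_fix", "source_adapter_gap", "owner_followup",
               "model_prompt_fix", "ignore_with_reason"}
REQUIRED = ("owner_user_id", "due_at", "evidence_refs", "acceptance_rule")

def check_route_completeness(non_hit_ids, routes):
    """Return (routed_non_hit_rate, problems) for one ingested non-hit list."""
    problems = []
    per_item = Counter(r["calibration_run_item_id"] for r in routes)
    for r in routes:
        if r["route_type"] not in ROUTE_TYPES:
            problems.append(f"{r['id']}: unknown route_type {r['route_type']!r}")
        blank = [f for f in REQUIRED if not r.get(f)]
        if blank:
            problems.append(f"{r['id']}: blank {', '.join(blank)}")
    for item_id in non_hit_ids:
        if per_item[item_id] != 1:
            problems.append(f"{item_id}: {per_item[item_id]} active routes, expected 1")
    routed = sum(1 for i in non_hit_ids if per_item[i] == 1)
    # An empty non-hit list counts as fully routed (the no-op case).
    rate = routed / len(non_hit_ids) if non_hit_ids else 1.0
    return rate, problems
```

The test passes only when the rate is 1.0 and the problem list is empty.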

Test 3: Source Gap Block

Given source adapter gaps, when unresolved gap rate is above 20%, then D30 weekly scorecard must remain blocked.

Pass:

  • weekly_prediction_scorecard.adoption_gate=repair_first or blocked.
  • no dashboard/deployment artifact is marked production.
  • each source gap has adapter owner and due date.
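The pack does not say what the unresolved gap rate is divided by. The sketch below assumes it is the share of `source_adapter_gap` routes whose status is not yet `verified` or `closed`; both the denominator and the set of resolved statuses are assumptions:

```python
RESOLVED_STATUSES = {"verified", "closed"}  # assumed meaning of "resolved"
GAP_THRESHOLD = 0.2

def unresolved_gap_rate(gap_routes):
    """Share of source-gap routes that are still unresolved."""
    if not gap_routes:
        return 0.0
    unresolved = sum(1 for r in gap_routes if r["status"] not in RESOLVED_STATUSES)
    return unresolved / len(gap_routes)

def gaps_missing_owner_or_due(gap_routes):
    """Ids of gap routes that lack an adapter owner or a due date."""
    return [r["id"] for r in gap_routes
            if not r.get("owner_user_id") or not r.get("due_at")]

def scorecard_must_stay_blocked(gap_routes):
    return unresolved_gap_rate(gap_routes) > GAP_THRESHOLD
```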

Test 4: Reviewer and Rubric Verification

Given rubric_fix or reviewer dispute routes, when zihrou reviews them, then each must have before/after rubric criteria and sample outcome.

Pass:

  • repair_action.action_type=edit_rubric.
  • before/after state present.
  • at least 10 historical or sampled cases are used for re-check when available.

Test 5: Re-run Cohort

Given repaired routes, when re-run starts, then it must create a new cohort without overwriting original labels.

Pass:

  • rerun_cohort.status exists.
  • before/after unknown and miss rates are recorded.
  • rerun cohort has at least 10 items unless fewer repaired items exist and reason is recorded.
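One way to keep the original labels untouched is to treat them as read-only input and return new cohort records. In this sketch the labels are a plain dict of item id to label, and `item_ids` and `small_cohort_reason` are hypothetical names that are not part of the `rerun_cohort` schema:

```python
MIN_COHORT = 10  # from the pass criteria above

def label_rate(labels, target):
    """Share of items in `labels` (item id -> label) that carry `target`."""
    return sum(1 for v in labels.values() if v == target) / len(labels) if labels else 0.0

def plan_rerun_cohort(run_id, original_labels, item_ids, reason=None):
    """Create a planned cohort; a cohort under 10 items needs a recorded reason."""
    if len(item_ids) < MIN_COHORT and not reason:
        raise ValueError("cohort under 10 items needs a recorded reason")
    before = {i: original_labels[i] for i in item_ids}
    return {
        "source_calibration_run_id": run_id,
        "status": "planned",
        "item_ids": list(item_ids),
        "before_unknown_rate": label_rate(before, "unknown"),
        "before_miss_rate": label_rate(before, "miss"),
        "after_unknown_rate": None,
        "after_miss_rate": None,
        "small_cohort_reason": reason,
    }

def record_rerun(cohort, rerun_labels):
    """Return a copy of the cohort with after-rates; the D7 labels are not modified."""
    return {
        **cohort,
        "after_unknown_rate": label_rate(rerun_labels, "unknown"),
        "after_miss_rate": label_rate(rerun_labels, "miss"),
    }
```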

Test 6: Weekly Scorecard Adoption Gate

Given re-run results, when weekly scorecard is generated, then it can ship only if routing and gap gates pass.

Pass:

  • routed non-hit rate = 100%.
  • unresolved gap rate <=20%.
  • reviewer agreement rate >=80% or disputes are routed.
  • adoption gate is one of ship_weekly_scorecard, repair_first, blocked.
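The gate can be read as a small pure function over the three rates. The thresholds come from the pass criteria above; which failure maps to `blocked` and which to `repair_first` is not specified in this pack, so the mapping below is an assumption:

```python
def adoption_gate(routed_non_hit_rate, unresolved_gap_rate,
                  reviewer_agreement_rate, disputes_routed):
    """Return ship_weekly_scorecard, repair_first, or blocked."""
    if routed_non_hit_rate < 1.0:
        # Assumption: unrouted non-hits are the hardest failure.
        return "blocked"
    if unresolved_gap_rate > 0.2:
        return "repair_first"
    if reviewer_agreement_rate < 0.8 and not disputes_routed:
        return "repair_first"
    return "ship_weekly_scorecard"
```

For example, `adoption_gate(1.0, 0.25, 0.9, True)` returns `repair_first`, because the unresolved gap rate exceeds 20% even though routing and reviewer agreement pass.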

Verification Result for This Pack

Local structural verification passed when:

  • Primary HTML artifact exists.
  • Production brief includes D1/D7/D14/D30, purpose-to-purpose E2E, value/money path, human capability improvement, adoption path, and LINE draft.
  • Data model includes schema, APIs, sync rules, permissions, audit, PLS backend and worker flow.
  • Decision record exists.
  • Learning memory is valid JSON.

Artifact URL or PR

Durable primary artifact:

https://gist.github.com/esz135888/7b875dd94847f9001405ce3b38b10484

Gist id:

7b875dd94847f9001405ce3b38b10484

Verification:

  • Public Gist URL responds with HTTP 200 after redirects.
  • File list includes HTML primary artifact, production brief, data model, acceptance tests, decision record, learning memory, market sources, and this artifact record.
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>AI Prediction Verification: D14 Correction Router & Weekly Scorecard</title>
<style>
:root {
--ink:#18222d; --muted:#607080; --line:#d7dde5; --bg:#f7f8fb; --panel:#fff;
--blue:#2258c9; --green:#147a53; --amber:#9c6200; --red:#b3261e; --violet:#6d3fc7;
}
*{box-sizing:border-box}
body{margin:0;background:var(--bg);color:var(--ink);font-family:Inter,ui-sans-serif,system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;line-height:1.5}
header{padding:24px clamp(18px,4vw,48px);background:#fff;border-bottom:1px solid var(--line)}
h1{margin:0;font-size:clamp(24px,3vw,36px);letter-spacing:0}
h2{margin:0 0 12px;font-size:18px}
h3{margin:0 0 8px;font-size:15px}
p{margin:0 0 10px}
main{padding:22px clamp(18px,4vw,48px) 48px;display:grid;gap:16px}
.sub{color:var(--muted);max-width:1100px;margin-top:8px}
.grid{display:grid;gap:16px}.cols4{grid-template-columns:repeat(4,minmax(0,1fr))}.cols3{grid-template-columns:repeat(3,minmax(0,1fr))}.cols2{grid-template-columns:repeat(2,minmax(0,1fr))}
.panel{background:var(--panel);border:1px solid var(--line);border-radius:8px;padding:16px}
.metric{min-height:118px;display:flex;flex-direction:column;justify-content:space-between}
.label{font-size:13px;color:var(--muted)}.value{font-size:30px;font-weight:760}.ok{color:var(--green)}.warn{color:var(--amber)}.stop{color:var(--red)}
.tag{display:inline-flex;align-items:center;height:24px;padding:0 8px;border:1px solid var(--line);border-radius:999px;background:#fbfcfe;color:var(--muted);font-size:12px;margin-right:6px}
table{width:100%;border-collapse:collapse;font-size:13px}th,td{border-bottom:1px solid var(--line);padding:10px 8px;text-align:left;vertical-align:top}th{background:#fbfcfe;color:var(--muted)}
code{background:#eef2f7;border-radius:4px;padding:2px 5px;font-size:12px}
.lane{border-left:4px solid var(--blue);padding-left:12px}.lane:nth-child(2){border-color:var(--amber)}.lane:nth-child(3){border-color:var(--violet)}.lane:nth-child(4){border-color:var(--green)}
ul{padding-left:18px;margin:0}li{margin:6px 0}
@media(max-width:980px){.cols4,.cols3,.cols2{grid-template-columns:1fr}}
</style>
</head>
<body>
<header>
<h1>D14 Correction Router & Weekly Scorecard</h1>
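<p class="sub">Continues from the D7 calibration run: turns miss, partial, unknown, and reviewer-dispute items into assignable correction routes, and defines the data, permission, audit, and adoption gates that must be met before the D30 weekly scorecard goes live.</p>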
<p><span class="tag">Owner: Louis</span><span class="tag">Review: zihrou / iron</span><span class="tag">Due: 2026-06-07</span><span class="tag">No route, no close</span></p>
</header>
<main>
<section class="grid cols4">
<div class="panel metric"><span class="label">D14 Goal</span><span class="value">100%</span><span class="label">non-hit items routed</span></div>
<div class="panel metric"><span class="label">Source Gap SLA</span><span class="value warn">7d</span><span class="label">adapter owner must respond</span></div>
<div class="panel metric"><span class="label">Re-run Cohort</span><span class="value ok">>=10</span><span class="label">fixed cases before D30 scorecard</span></div>
<div class="panel metric"><span class="label">Release Gate</span><span class="value stop">Block</span><span class="label">if unresolved gap > 20%</span></div>
</section>
<section class="panel">
<h2>30-Day Roadmap</h2>
<div class="grid cols4">
<div class="lane"><h3>D1</h3><p>Read the D7 run scorecard and lock the full set of miss, partial, unknown, and dispute items.</p></div>
<div class="lane"><h3>D7</h3><p>Finish route classification: rubric fix, source adapter gap, owner follow-up, model prompt fix.</p></div>
<div class="lane"><h3>D14</h3><p>Every route has an owner, due date, evidence, and repair action, and the re-run cohort starts.</p></div>
<div class="lane"><h3>D30</h3><p>The weekly scorecard goes live: accuracy trend, unknown trend, repair cycle time, adoption gate.</p></div>
</div>
</section>
<section class="panel">
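<h2>Purpose-to-Purpose E2E</h2>
<table>
<tr><th>Stage</th><th>Input</th><th>Output</th><th>How people adopt it</th><th>Metric improved</th></tr>
<tr><td>Original purpose</td><td>Predictions from the previous review plus multi-source evidence</td><td>D7 labels</td><td>Louis judges whether the results are trustworthy</td><td>Lower false confidence</td></tr>
<tr><td>D14 repair</td><td>miss/unknown/dispute</td><td><code>correction_route</code></td><td>zihrou/iron confirm responsibility attribution</td><td>Less time spent on manual tracing</td></tr>
<tr><td>D30 adoption</td><td>Repaired cohort and re-run results</td><td><code>weekly_scorecard</code></td><td>PLS pushes the weekly report and the next round of assignments</td><td>More governable AI review</td></tr>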
</table>
</section>
<section class="grid cols2">
<div class="panel">
<h2>Correction Router</h2>
<table>
<tr><th>Route</th><th>When</th><th>Owner</th><th>Acceptance</th></tr>
<tr><td>rubric_fix</td><td>prediction wording or success criteria was ambiguous</td><td>zihrou</td><td>new rubric tested on >=10 historical cases</td></tr>
<tr><td>source_adapter_gap</td><td>evidence exists but signal/action-item sync missed it</td><td>iron</td><td>adapter maps source id, timestamp, extractor version</td></tr>
<tr><td>owner_followup</td><td>human action status is unknown</td><td>Louis delegate</td><td>owner replies done / blocked / rejected</td></tr>
<tr><td>model_prompt_fix</td><td>model overpredicted from weak signal</td><td>PLS worker</td><td>prompt version and before/after eval recorded</td></tr>
</table>
</div>
<div class="panel">
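<h2>Value / Money Path</h2>
<ul>
<li>Revenue: only validated AI review patterns enter sales or management workflows, so bad recommendations do not drag down close rates and decision quality.</li>
<li>Cost savings: a route taxonomy replaces case-by-case manual discussion, so reviewers handle only high-risk samples.</li>
<li>Risk reduction: no dashboards until the unknown rate and source gaps are fixed, so polished charts cannot mask data gaps.</li>
<li>Freed capacity: each miss becomes a next step with owner/due/acceptance, with no manual chasing.</li>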
</ul>
</div>
</section>
<section class="grid cols3">
<div class="panel"><h2>Data / API</h2><p>Adds <code>correction_route</code>, <code>repair_action</code>, <code>rerun_cohort</code>, and <code>weekly_prediction_scorecard</code>. API: <code>POST /routes/bulk</code>, <code>PATCH /routes/:id</code>, <code>POST /reruns</code>, <code>GET /weekly-scorecard</code>.</p></div>
<div class="panel"><h2>Permissions / Audit</h2><p>The owner can close a route; reviewers can change dispute status; the worker can write routes and re-run results. Every change keeps the evidence hash, actor, timestamp, and model/prompt version.</p></div>
<div class="panel"><h2>Building Human Capability</h2><p>Louis gets a go/no-go cadence; zihrou turns vague predictions into rubrics; iron turns data gaps into an adapter backlog; the PLS worker knows which category to fix in the next round.</p></div>
</section>
<section class="panel">
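<h2>LINE Draft</h2>
<p>AI prediction verification is entering D14 correction routing. Louis, please confirm that every D7 non-hit has an owner/due/next action by 2026-06-07; zihrou reviews rubric_fix; iron reviews source_adapter_gap. If the unresolved gap is >20%, the D30 weekly scorecard does not go live and only source/rubric repairs are dispatched.</p>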
</section>
</main>
</body>
</html>

Data Model / API / Permissions and Audit

New Tables

correction_route

| Field | Type | Required | Description |
|---|---|---|---|
| id | uuid | yes | route id |
| calibration_run_id | uuid | yes | source D7 run |
| calibration_run_item_id | uuid | yes | non-hit/dispute item |
| route_type | enum | yes | rubric_fix, source_adapter_gap, owner_followup, model_prompt_fix, ignore_with_reason |
| severity | enum | yes | P0, P1, P2 |
| owner_user_id | uuid | yes | route owner |
| due_at | datetime | yes | repair deadline |
| evidence_refs | jsonb | yes | signals/action items/reviewer sample refs |
| acceptance_rule | text | yes | how the repair is judged complete |
| status | enum | yes | open, in_progress, ready_for_rerun, verified, closed, blocked |
| audit_ref | text | yes | decision record / worker run id |
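For worker code, the table above could be mirrored as a typed record. This is a sketch only: the class names are hypothetical, uuids are carried as plain strings, and defaulting `status` to `open` is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class RouteType(str, Enum):
    RUBRIC_FIX = "rubric_fix"
    SOURCE_ADAPTER_GAP = "source_adapter_gap"
    OWNER_FOLLOWUP = "owner_followup"
    MODEL_PROMPT_FIX = "model_prompt_fix"
    IGNORE_WITH_REASON = "ignore_with_reason"

class Severity(str, Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"

class RouteStatus(str, Enum):
    OPEN = "open"
    IN_PROGRESS = "in_progress"
    READY_FOR_RERUN = "ready_for_rerun"
    VERIFIED = "verified"
    CLOSED = "closed"
    BLOCKED = "blocked"

@dataclass
class CorrectionRoute:
    id: str
    calibration_run_id: str
    calibration_run_item_id: str
    route_type: RouteType
    severity: Severity
    owner_user_id: str
    due_at: datetime
    evidence_refs: list
    acceptance_rule: str
    audit_ref: str
    status: RouteStatus = RouteStatus.OPEN  # assumed default

    def __post_init__(self):
        # Coerce raw strings so an unknown enum value fails fast with ValueError.
        self.route_type = RouteType(self.route_type)
        self.severity = Severity(self.severity)
        self.status = RouteStatus(self.status)
```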

repair_action

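| Field | Type | Required | Description |
|---|---|---|---|
| id | uuid | yes | action id |
| correction_route_id | uuid | yes | parent route |
| action_type | enum | yes | edit_rubric, fix_adapter, ask_owner, update_prompt, document_ignore |
| before_state | jsonb | yes | evidence/rubric/prompt/source state before the repair |
| after_state | jsonb | no | state after the repair |
| actor_user_or_worker_id | text | yes | person or worker |
| completed_at | datetime | no | completion time |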

rerun_cohort

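| Field | Type | Required | Description |
|---|---|---|---|
| id | uuid | yes | cohort id |
| source_calibration_run_id | uuid | yes | original D7 run |
| route_ids | uuid[] | yes | routes to re-run |
| status | enum | yes | planned, running, passed, failed |
| before_unknown_rate | decimal | yes | unknown rate before repair |
| after_unknown_rate | decimal | no | unknown rate after repair |
| before_miss_rate | decimal | yes | miss rate before repair |
| after_miss_rate | decimal | no | miss rate after repair |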

weekly_prediction_scorecard

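| Field | Type | Required | Description |
|---|---|---|---|
| week_start | date | yes | reporting week |
| project_id | uuid | yes | PLS project |
| calibration_run_id | uuid | yes | latest run |
| routed_non_hit_rate | decimal | yes | target 1.0 |
| unresolved_gap_rate | decimal | yes | must be <=0.2 |
| reviewer_agreement_rate | decimal | yes | target >=0.8 |
| rerun_improvement_delta | decimal | yes | improvement from the re-run |
| adoption_gate | enum | yes | ship_weekly_scorecard, repair_first, blocked |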

API / Sync Spec

| API | Method | Purpose |
|---|---|---|
| /ai-prediction/correction-routes/bulk | POST | Bulk-create routes from a D7 run |
| /ai-prediction/correction-routes/:id | PATCH | Update owner, status, acceptance, evidence |
| /ai-prediction/repair-actions | POST | Record a repair action |
| /ai-prediction/rerun-cohorts | POST | Create a post-repair re-run cohort |
| /ai-prediction/weekly-scorecards/:project_id | GET | Serve the weekly report to the PLS backend/LINE |

Sync rules:

  • Every route must retain the D7 label, evidence refs, and reviewer sample result.
  • A source adapter gap must include source type, source id, last_seen_at, and extractor version.
  • Prompt/rubric repairs must retain the before/after version.
  • A re-run never overwrites the original D7 labels; it only adds cohort results.

Permissions / Audit Boundaries

  • Louis can approve the weekly scorecard adoption gate.
  • zihrou can verify rubric_fix and reviewer disputes.
  • iron can verify source_adapter_gap.
  • The PLS worker can create routes, write repair actions, and create re-run cohorts, but cannot close a route directly.
  • Every write stores actor, worker id, timestamp, model/prompt version, and evidence hash.
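A sketch of how these boundaries might be enforced at write time. The role keys, the action names, and the choice of SHA-256 over canonical JSON for the evidence hash are all assumptions; the pack only says that an evidence hash is stored.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical role -> allowed-action map derived from the bullets above.
PERMISSIONS = {
    "louis": {"approve_adoption_gate"},
    "zihrou": {"verify_rubric_fix", "verify_reviewer_dispute"},
    "iron": {"verify_source_adapter_gap"},
    "pls_worker": {"create_route", "write_repair_action", "create_rerun_cohort"},
}

def audit_write(actor, action, evidence, model_version, prompt_version, now=None):
    """Check the actor's permission, then return the audit record for the write."""
    if action not in PERMISSIONS.get(actor, set()):
        raise PermissionError(f"{actor} may not {action}")
    # Sort keys so the same evidence always hashes to the same value.
    canonical = json.dumps(evidence, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return {
        "actor": actor,
        "action": action,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "model_version": model_version,
        "prompt_version": prompt_version,
        "evidence_hash": hashlib.sha256(canonical).hexdigest(),
    }
```

Because `close_route` is absent from the worker's set, a worker call with that action raises `PermissionError`, matching the rule that the worker cannot close a route directly.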

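PLS Backend / Worker Flow

On startup, the worker first checks the latest calibration_run.status and weekly_prediction_scorecard.adoption_gate:

  • If there is no D7 run: return to the D7 calibration run.
  • If the D7 run has unrouted non-hits: run D14 routing.
  • If routes are repaired but not re-run: create a rerun cohort.
  • If unresolved gap <=20% and the re-run improvement is positive: generate the D30 weekly scorecard.
  • If unresolved gap >20%: dispatch source/rubric repairs only; no dashboard may go live.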
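The branching above can be sketched as one decision function. The step names are hypothetical, the checks run in the order the bullets are listed, and the final fallback (gap within 20% but no positive re-run improvement) is not covered by the flow, so sending it back to repair is an assumption.

```python
def next_worker_step(has_d7_run, unrouted_non_hits, repaired_not_rerun,
                     unresolved_gap_rate, rerun_improvement_delta):
    """Pick the worker's next step from the current run state."""
    if not has_d7_run:
        return "return_to_d7_calibration_run"
    if unrouted_non_hits > 0:
        return "run_d14_routing"
    if repaired_not_rerun > 0:
        return "create_rerun_cohort"
    if unresolved_gap_rate > 0.2:
        return "dispatch_source_and_rubric_repairs"
    if rerun_improvement_delta > 0:
        return "generate_d30_weekly_scorecard"
    # Assumption: without a positive improvement the scorecard is not generated.
    return "dispatch_source_and_rubric_repairs"
```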

Decision Record: D14 Correction Router

Date: 2026-05-24
Status: Recommended
Owner: Louis
Reviewers: zihrou / iron

Problem

D7 calibration can show whether predictions hit, missed, partially matched, or remain unknown. But without D14 correction routing, the project still cannot improve the system. A scorecard without repair routing would only describe failure.

Options

Option A: Build weekly dashboard immediately

Pros: easy to present.
Cons: dashboard may hide unresolved evidence gaps and create false confidence.

Option B: Ask reviewers to manually discuss each miss

Pros: uses human judgment.
Cons: slow, not scalable, and creates no worker-readable state.

Option C: Build correction router before weekly scorecard

Pros: turns every miss/source gap into owner/due/acceptance, supports re-run cohorts, and gives the weekly scorecard a real adoption gate.
Cons: needs stricter data model and route ownership.

Recommendation

Choose Option C. It is the right production step after D7 because it converts measurement into improvement and prevents dashboard-first theater.

Adopted Path

  1. Ingest D7 non-hit/dispute items.
  2. Route every item to rubric, source adapter, owner follow-up, prompt fix, or documented ignore.
  3. Require owner/due/evidence/acceptance.
  4. Re-run repaired cohort.
  5. Ship D30 weekly scorecard only if unresolved gap <=20%.

If Rejected, Required Feedback

The reviewer must specify which assumption failed:

  • D7 run was not completed.
  • non-hit evidence refs are missing.
  • zihrou/iron cannot own rubric/source review.
  • PLS backend cannot store correction routes.
  • unresolved gap threshold should change.

Without one of these concrete objections, the project should proceed with D14 routing.

{
"job_id": "28e3e2ec-b385-42a7-92a7-421129fe81a7",
"project_topic": "AI prediction verification module for signals and action-item evidence",
"current_artifact": "D14 Correction Router and Weekly Scorecard Pack",
"previous_artifact": "D7 Calibration Run Control Tower",
"owner": "Louis",
"reviewers": ["zihrou", "iron"],
"due": "2026-06-07",
"market_learning": [
"Current AI observability practice combines traces, evals, reviewer annotations, production examples, experiments, and monitoring scorecards.",
"OpenTelemetry GenAI conventions suggest preserving standard trace/evidence attributes rather than vendor-specific logs only.",
"Phoenix, Evidently, and LangSmith all point toward continuous evaluation and production monitoring rather than one-time reports."
],
"next_worker_rule": {
"if_no_d7_run": "Return to D7 Calibration Run Control Tower and create calibration_run first.",
"if_non_hits_unrouted": "Create correction_route records for every miss, partial, unknown, and dispute.",
"if_unresolved_gap_rate_gt_20_percent": "Do not ship weekly scorecard; dispatch source_adapter_gap and rubric_fix tasks.",
"if_routes_repaired_but_not_rerun": "Create rerun_cohort and compare before/after unknown and miss rates.",
"if_gates_pass": "Build D30 weekly scorecard/dashboard for PLS backend and LINE cadence."
},
"acceptance_gate": {
"routed_non_hit_rate": 1.0,
"unresolved_gap_rate_max": 0.2,
"rerun_cohort_min": 10,
"reviewer_agreement_target": 0.8
},
"do_not_repeat": [
"Do not build another generic AI prediction verification pack.",
"Do not build dashboard before correction routes and unresolved gap gate are checked.",
"Do not close a route without owner, due date, evidence refs, and acceptance rule."
]
}
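The "Learning memory is valid JSON" check can be made concrete with the standard `json` module. This sketch takes the file contents as a string and assumes the four `acceptance_gate` keys shown above are all required; the function name is hypothetical.

```python
import json

EXPECTED_GATE_KEYS = {
    "routed_non_hit_rate",
    "unresolved_gap_rate_max",
    "rerun_cohort_min",
    "reviewer_agreement_target",
}

def load_acceptance_gate(text):
    """Parse learning memory and return its acceptance_gate thresholds."""
    memory = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    gate = memory["acceptance_gate"]
    missing = EXPECTED_GATE_KEYS - gate.keys()
    if missing:
        raise KeyError(f"acceptance_gate is missing {sorted(missing)}")
    return gate
```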

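AI Prediction Verification: D14 Correction Router & Weekly Scorecard Pack

Scenario

The previous round built the D7 Calibration Run Control Tower. This round does not redo the concept; it adds the production layer for D14 correction routing and the D30 weekly scorecard, so that D7 miss, partial, unknown, and reviewer-dispute items become repair tasks with owner/due/acceptance.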

Owner: Louis
Reviewers: zihrou / iron
Due: 2026-06-07
Primary artifact: d14-correction-router-scorecard.html

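30-Day Roadmap

| Milestone | What it should look like | Acceptance |
|---|---|---|
| D1 | Read the D7 run scorecard and lock all non-hit/dispute items. | D7 run id, scorecard, and non-hit list are traceable. |
| D7 | Complete the correction route taxonomy. | 100% of non-hits have route type, owner, due date, and evidence. |
| D14 | Start repair actions and the re-run cohort. | At least 10 repaired items are re-run; unresolved gap is no more than 20%. |
| D30 | The weekly scorecard can enter the PLS backend and the LINE cadence. | accuracy/unknown/repair cycle/adoption gate can be updated weekly. |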

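Purpose-to-Purpose E2E

Original purpose: automatically check whether the predictions from the previous review hit.
This round's purpose: turn the reasons for non-hits into an improvement loop that can be repaired, assigned, and accepted.

E2E:

  1. The D7 calibration run produces hit/miss/partial/unknown/dispute.
  2. The D14 router assigns each non-hit to rubric_fix, source_adapter_gap, owner_followup, model_prompt_fix, or ignore_with_reason.
  3. Every route has an owner, due date, evidence refs, and acceptance rule.
  4. After repair, a rerun_cohort is created to check whether the unknown rate and miss recurrence drop.
  5. The D30 weekly scorecard opens only after the gates pass; otherwise the work returns to source/rubric repair.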

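Value / Money Path

  • Revenue: only validated AI review patterns feed business and management decisions, reducing deals lost to bad recommendations.
  • Cost: case-by-case discussion becomes a route taxonomy plus re-run cohorts, cutting repeated reviewer tracing.
  • Risk: the dashboard launch is blocked while unknown/source gaps are unrepaired, so management is not misled by incomplete data.
  • Conversion: Louis can use the weekly scorecard to decide which AI automations are worth wider adoption.
  • Freed capacity: the PLS worker can dispatch the next round directly by route type without re-reading all the context.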

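Building Human Capability

  • Louis: moves from a gut feeling about whether AI is useful to managing AI review quality with a scorecard.
  • zihrou: turns vague predictions into rubrics that can be taught, reviewed, and re-run.
  • iron: turns source gaps into an adapter backlog instead of scattered follow-up questions.
  • PLS worker: uses learning memory to decide whether the next round is a re-run, a source fix, or the weekly scorecard.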

Solution Stack

| Layer | Deliverable |
|---|---|
| Context framework | D7 non-hit/dispute -> D14 correction route -> D30 weekly scorecard. |
| Workflow | scorecard ingest -> route assignment -> owner repair -> re-run cohort -> adoption gate. |
| Data model | correction_route, repair_action, rerun_cohort, weekly_prediction_scorecard. |
| Operable tools | HTML control tower, data model, acceptance tests, LINE draft, learning memory. |
| Acceptance metrics | 100% non-hit routed, unresolved gap <=20%, re-run cohort >=10, scorecard schema ready. |
| Adoption upgrade | After passing D14, enter the PLS backend weekly report; otherwise dispatch source/rubric repairs only. |

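Market Context

In 2026, mature practice is not just reading chat logs; it connects an AI system's traces, evals, datasets, annotations, scorecards, and production monitoring into a closed loop. The OpenTelemetry GenAI semantic conventions provide a cross-tool trace vocabulary; Phoenix emphasizes traces, eval tests, production examples, and experiments; Evidently provides LLM/ML evaluation and monitoring; LangSmith emphasizes production online evals. Together these point to one conclusion: PLS should keep evidence refs, run ids, reviewer annotations, and route outcomes rather than producing summaries only.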

Market Context Sources

Checked on 2026-05-24 Asia/Taipei.

Sources

Takeaway

The comparable mature pattern is not a static report. It is traceable evidence plus evals, reviewer annotations, experiments/re-runs, and production monitoring. PLS should therefore promote D7 calibration labels into D14 correction routes and D30 weekly scorecards with auditable evidence refs.
