@esz135888
Last active May 24, 2026 01:44
PLS job 0a46b6bc AI prediction validation production pack

Acceptance Tests

  1. Primary artifact URL returns HTTP 200.
  2. Required appendix files exist: production brief, data model, acceptance tests, decision record, artifact URL record.
  3. D1 / D7 / D14 / D30 path exists.
  4. Purpose-to-purpose E2E maps review prediction to evidence, verdict, calibration, and next action.
  5. Every prediction bet has owner, due, expected outcome, confidence, and route.
  6. Verdict must be one of hit, partial, miss, unknown.
  7. Unknown verdicts must create a data_gap next action and cannot count as hit.
  8. Data model includes schema, API, permissions, sync, and audit.
  9. Market maturity includes at least two external comparable practices.
  10. People sync and learning memory exist.
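Tests 5 through 7 are mechanical enough to check in code. The following is a minimal Python sketch, not the pack's actual validator: the dictionary keys, the `next_action_type` parameter, and both function names are hypothetical and only mirror the wording of the list above.

```python
from typing import Any, Dict, List, Optional

# Field names are assumptions that mirror acceptance test 5.
REQUIRED_BET_FIELDS = ("owner", "due", "expected_outcome", "confidence", "route")
# Verdict values come from acceptance test 6.
ALLOWED_VERDICTS = {"hit", "partial", "miss", "unknown"}


def missing_bet_fields(bet: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty (test 5)."""
    return [name for name in REQUIRED_BET_FIELDS if bet.get(name) in (None, "")]


def check_verdict(verdict: str, next_action_type: Optional[str]) -> List[str]:
    """Return rule violations for one verdict (tests 6 and 7)."""
    problems = []
    if verdict not in ALLOWED_VERDICTS:
        problems.append(f"invalid verdict: {verdict}")
    if verdict == "unknown" and next_action_type != "data_gap":
        problems.append("unknown verdict needs a data_gap next action")
    return problems
```

An empty list from either function means the bet or verdict passes the corresponding tests; a non-empty list names what to fix.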

E2E Scenario

Given a previous review predicted that AI tool selection would be aligned within 14 days, when signals and action items show zihrou, iron, and Louis still diverge, then verdict is miss or partial, rationale cites evidence, and the next action creates a tool alignment decision memo.
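The scenario can be expressed as a small function. This is a sketch under stated assumptions: the agreement rule (everyone agrees gives hit, a strict majority gives partial, otherwise miss), the function name `alignment_verdict`, the placeholder tool names, and the `tool_alignment_decision_memo` action label are all hypothetical; the source only fixes the allowed outcomes and the follow-up memo.

```python
from collections import Counter
from typing import Dict


def alignment_verdict(tool_choices: Dict[str, str]) -> dict:
    """Judge a 'tool selection will be aligned' bet from each person's current choice.

    Assumed rule (not from the source): everyone agrees -> hit,
    a strict majority agrees -> partial, otherwise -> miss.
    """
    if not tool_choices:
        # No observations at all is a data gap, never a hit.
        return {"verdict": "unknown", "next_action": "data_gap",
                "rationale": "no tool choices observed"}
    counts = Counter(tool_choices.values())
    top_tool, top_count = counts.most_common(1)[0]
    if top_count == len(tool_choices):
        return {"verdict": "hit", "next_action": None,
                "rationale": f"all chose {top_tool}"}
    verdict = "partial" if top_count > len(tool_choices) / 2 else "miss"
    # The rationale cites the observed evidence, person by person.
    rationale = ", ".join(f"{person}: {tool}" for person, tool in sorted(tool_choices.items()))
    return {"verdict": verdict, "next_action": "tool_alignment_decision_memo",
            "rationale": rationale}
```

With three people on three different tools the result is a miss, and with two of three agreeing it is a partial; both cases request the decision memo.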

<!doctype html>
<html lang="zh-Hant">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Company AI Adoption: Prediction Verification Console</title>
<style>
:root{--ink:#17202a;--muted:#607080;--line:#d8e0e7;--paper:#f6f8fb;--card:#fff;--blue:#1d4ed8;--green:#0f7f5c;--amber:#a16207;--red:#b3361d;--purple:#6d28d9}
*{box-sizing:border-box} body{margin:0;background:var(--paper);color:var(--ink);font-family:Inter,ui-sans-serif,system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;line-height:1.5}
header{background:#fff;border-bottom:1px solid var(--line);padding:28px clamp(20px,4vw,56px)} main{padding:24px clamp(20px,4vw,56px) 48px}
h1{margin:0 0 12px;font-size:clamp(30px,4vw,52px);line-height:1.05;max-width:1050px} h2{margin:0 0 12px;font-size:22px} h3{margin:0 0 6px;font-size:16px} p{margin-top:0} code{background:#eef3f8;padding:1px 5px;border-radius:4px}
.sub{max-width:1060px;color:var(--muted);font-size:17px}.grid{display:grid;gap:16px}.kpis{grid-template-columns:repeat(4,minmax(0,1fr));margin-top:22px}.two{grid-template-columns:1.1fr .9fr}.three{grid-template-columns:repeat(3,minmax(0,1fr))}.timeline{grid-template-columns:repeat(4,minmax(0,1fr))}.flow{grid-template-columns:repeat(5,minmax(0,1fr))}
.card{background:var(--card);border:1px solid var(--line);border-radius:8px;padding:18px;box-shadow:0 1px 2px rgba(23,32,42,.04)}.metric{font-size:34px;font-weight:780}.label{color:var(--muted);font-size:13px}
.pill{display:inline-flex;border:1px solid var(--line);border-radius:999px;padding:4px 10px;font-size:12px;background:#fff;margin:0 6px 8px 0;white-space:nowrap}.ok{color:var(--green)}.warn{color:var(--amber)}.bad{color:var(--red)}.info{color:var(--blue)}
table{width:100%;border-collapse:collapse;font-size:14px} th,td{text-align:left;padding:10px;border-bottom:1px solid var(--line);vertical-align:top} th{color:var(--muted);font-size:12px;text-transform:uppercase}
.day{border-left:4px solid var(--purple)}.step{border:1px solid var(--line);border-radius:8px;padding:12px;min-height:126px;background:#fbfdff}.step strong{display:block;color:var(--purple);margin-bottom:6px}.source a{color:var(--blue);word-break:break-word}
@media(max-width:920px){.kpis,.two,.three,.timeline,.flow{grid-template-columns:1fr}h1{font-size:34px}}
</style>
</head>
<body>
<header>
<span class="pill info">PLS production delivery pack</span><span class="pill ok">Solution: eval / system</span>
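<h1>Company AI Adoption: AI Prediction Verification Console</h1>
<p class="sub">Turns the "add an AI prediction verification module" signal into a production-grade acceptance system: predictions from the previous review become a bets ledger, and signals, action items, GitHub commits, worker health, and LINE/Drive signals are used to automatically check hits, deviations, and data gaps and to drive the next round of calibration.</p>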
<section class="grid kpis">
<div class="card"><div class="metric">99%</div><div class="label">Company AI adoption progress signal moved from 98% to 99%</div></div>
<div class="card"><div class="metric ok">Bets</div><div class="label">Strategic predictions must enter the ledger before they can be verified</div></div>
<div class="card"><div class="metric warn">2 weeks</div><div class="label">Tool selection alignment risk across zihrou / iron / Louis</div></div>
<div class="card"><div class="metric">D30</div><div class="label">Prediction calibration loop for the AI management layer takes shape</div></div>
</section>
</header>
<main class="grid">
<section class="grid two">
<div class="card">
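<h2>Problem This Round</h2>
<p>Company AI adoption has accumulated commits for persona thinking maps, heartbeat nodes, worker isolation, prediction-depth vital signs, and more. But if the predictions from an AI review are not automatically checked against the next round's signals and action items, the AI becomes "good at talking" rather than "good at correcting itself".</p>
<span class="pill">Owner: Louis</span><span class="pill">Stakeholders: zihrou / iron</span><span class="pill">Due: D7 first scorecard</span><span class="pill">Acceptance: hit/deviation checks are re-runnable</span>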
</div>
<div class="card">
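<h2>Solution Selection</h2>
<p><strong>eval / system</strong>. This is not a one-off report or reminder but a regression test system for the AI management layer: predictions, evidence, hit verdicts, calibration, permissions, and audit must all be traceable.</p>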
</div>
</section>
<section class="card">
<h2>D1 / D7 / D14 / D30</h2>
<div class="grid timeline">
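<div class="card day"><h3>D1</h3><p>Define the prediction bet schema, evidence mapping, and hit/miss/partial/unknown rubric, and publish this console.</p></div>
<div class="card day"><h3>D7</h3><p>Ingest 20 bets from the previous review and use signals / action items / GitHub commits to automatically compute a first-pass hit rate.</p></div>
<div class="card day"><h3>D14</h3><p>Add a prediction calibration section to the Company AI Adoption weekly review, listing deviation causes and model adjustments for the next round.</p></div>
<div class="card day"><h3>D30</h3><p>Build the AI management layer scorecard: prediction quality, tool selection consistency, worker quality, and project progress hit rate governed in one table.</p></div>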
</div>
</section>
<section class="card">
<h2>Purpose-to-Purpose E2E</h2>
<div class="grid flow">
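<div class="step"><strong>Original purpose</strong>Company AI adoption should make AI part of the decision layer, not just a source of suggestions.</div>
<div class="step"><strong>Prediction</strong>Each review's expected outcome, risk, owner, and deadline are written to the bets ledger.</div>
<div class="step"><strong>Evidence</strong>Signals, action items, commits, worker health, and LINE/Drive signals are cross-checked automatically.</div>
<div class="step"><strong>Calibration</strong>Hit, partial hit, miss, or no data; outputs the next round's route and tool selection corrections.</div>
<div class="step"><strong>Value</strong>Fewer bad decisions carried forward, more unified tooling, shorter AI adoption alignment time, and more trustworthy worker output.</div>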
</div>
</section>
<section class="grid two">
<div class="card">
<h2>Verification Rubric</h2>
<table>
<thead><tr><th>Result</th><th>Criteria</th><th>Next step</th></tr></thead>
<tbody>
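<tr><td><strong>hit</strong></td><td>Direct evidence within the deadline supports the prediction.</td><td>Increase the weight of that route / signal.</td></tr>
<tr><td><strong>partial</strong></td><td>Direction is right but scope, timing, or owner deviates.</td><td>Record the deviation cause and adjust the next round's ask.</td></tr>
<tr><td><strong>miss</strong></td><td>Evidence shows the prediction did not happen or the direction was wrong.</td><td>Require a counterfactual analysis and a decision record.</td></tr>
<tr><td><strong>unknown</strong></td><td>Data is missing, so no verdict is possible.</td><td>Create a data_gap task; must not count as a hit.</td></tr>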
</tbody>
</table>
</div>
<div class="card">
<h2>Data / API / Permissions</h2>
<p><strong>Tables:</strong> <code>prediction_bets</code>, <code>prediction_evidence_links</code>, <code>prediction_verdicts</code>, <code>calibration_runs</code>, <code>tool_alignment_risks</code>.</p>
<p><strong>APIs:</strong> <code>POST /ai/reviews/:id/bets</code>, <code>POST /ai/predictions/verify</code>, <code>GET /ai/predictions/scorecard</code>.</p>
<p><strong>Permissions:</strong> AI workers can propose verdicts; Louis can override; zihrou/iron can add evidence; every override requires an audit reason.</p>
</div>
</section>
<section class="grid three">
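<div class="card"><h2>Value / Money Path</h2><p>AI progress no longer runs on gut feeling: wrong predictions are cut early, accurate routes get more investment, and tool divergence and rework costs drop.</p></div>
<div class="card"><h2>People Capability Uplift</h2><p>Louis sees when the AI is accurate and when it drifts; zihrou/iron can align on tool selection using evidence rather than pulling against each other from personal experience.</p></div>
<div class="card"><h2>Next Round Upgrade</h2><p>Connect the real review bet ledger and produce a weekly calibration report plus worker/route trust scores.</p></div>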
</section>
<section class="card source">
<h2>Market Maturity Inputs</h2>
<p>Evidently documents data and prediction drift monitoring for production AI quality checks: <a href="https://docs.evidentlyai.com/metrics/preset_data_drift">Evidently data and prediction drift</a>.</p>
<p>Google's ML Test Score offers an actionable production-readiness rubric for ML systems: <a href="https://research.google/pubs/whats-your-ml-test-score-a-rubric-for-ml-production-systems/">Google ML Test Score</a>.</p>
<p>Evidently monitoring overview emphasizes batch evaluation and continuous collaboration around AI quality: <a href="https://docs.evidentlyai.com/docs/platform/monitoring_overview">Evidently monitoring overview</a>.</p>
</section>
</main>
</body>
</html>

Data Model

prediction_bets

  • id: uuid primary key.
  • review_id: uuid.
  • project_id: uuid.
  • prediction_text: text.
  • expected_outcome: text.
  • owner_id: uuid nullable.
  • due_at: timestamptz.
  • route: text.
  • confidence: numeric.
  • created_by_worker_id: text.
  • created_at: timestamptz.
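As one way to carry this table in application code, the fields above map onto a Python dataclass. This is an illustrative sketch only: the class name `PredictionBet`, the generated defaults for `id` and `created_at`, and the assumption that `confidence` is a probability between 0 and 1 are not stated in the source.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class PredictionBet:
    review_id: UUID
    project_id: UUID
    prediction_text: str
    expected_outcome: str
    due_at: datetime
    route: str
    confidence: float
    created_by_worker_id: str
    owner_id: Optional[UUID] = None  # nullable in the schema
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        # Assumption: confidence is a probability in [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        # timestamptz implies timezone-aware values.
        if self.due_at.tzinfo is None:
            raise ValueError("due_at must be timezone-aware")
```

Rejecting naive datetimes at construction keeps `due_at` comparable with `observed_at` and `decided_at` in the other tables.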

prediction_evidence_links

  • id: uuid primary key.
  • prediction_bet_id: uuid.
  • source_type: enum signal, action_item, github_commit, line, drive, worker_health, heartbeat, manual.
  • source_ref: text.
  • evidence_summary: text.
  • observed_at: timestamptz.
  • polarity: enum supports, contradicts, neutral, data_gap.

prediction_verdicts

  • id: uuid primary key.
  • prediction_bet_id: uuid.
  • verdict: enum hit, partial, miss, unknown.
  • score: numeric.
  • rationale: text.
  • decided_by: enum worker, louis_override, review_board.
  • decided_at: timestamptz.
  • audit_reason: text nullable.

calibration_runs

  • id: uuid primary key.
  • project_id: uuid.
  • run_window_start: timestamptz.
  • run_window_end: timestamptz.
  • hit_rate: numeric.
  • partial_rate: numeric.
  • unknown_rate: numeric.
  • top_biases: jsonb.
  • next_actions: jsonb.
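The three rate columns can be derived from a list of verdicts. The sketch below assumes one reading of "unknown cannot count as hit": `hit_rate` and `partial_rate` use only decided verdicts (hit, partial, miss) as the denominator, while `unknown_rate` is taken over all verdicts. The function name `calibration_rates` and that denominator choice are assumptions, not something the pack specifies.

```python
from collections import Counter
from typing import Dict, List, Optional


def calibration_rates(verdicts: List[str]) -> Dict[str, Optional[float]]:
    """Compute hit, partial, and unknown rates for one calibration window."""
    counts = Counter(verdicts)
    total = len(verdicts)
    # Unknown verdicts are excluded from the denominator of the hit and partial rates.
    decided = counts["hit"] + counts["partial"] + counts["miss"]
    return {
        "hit_rate": counts["hit"] / decided if decided else None,
        "partial_rate": counts["partial"] / decided if decided else None,
        "unknown_rate": counts["unknown"] / total if total else None,
    }
```

Returning `None` instead of 0 when nothing has been decided avoids reporting a misleading 0% hit rate for a window that consists only of data gaps.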

API / Sync

  • POST /ai/reviews/:review_id/bets
  • POST /ai/predictions/verify
  • GET /ai/predictions/scorecard
  • POST /ai/predictions/:id/override
  • GET /ai/predictions/:id/audit

Permissions / Audit

Workers can propose verdicts. Louis can override verdicts with reason. zihrou and iron can attach evidence or comment on tool alignment risk. All overrides and evidence removals are append-only audit events.
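A minimal in-memory sketch of the override rule follows. It is not the pack's implementation: `AuditEvent`, `AuditLog`, `override_verdict`, and the `OVERRIDE_ACTORS` set are hypothetical names, and a real system would persist events rather than hold them in a list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumption: only Louis may override, per the permissions paragraph above.
OVERRIDE_ACTORS = {"Louis"}


@dataclass(frozen=True)
class AuditEvent:
    prediction_bet_id: str
    actor: str
    action: str
    detail: str
    audit_reason: str
    at: datetime


class AuditLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list = []

    def append(self, event: AuditEvent) -> None:
        self._events.append(event)

    def events(self) -> tuple:
        # A tuple copy keeps callers from mutating the stored history.
        return tuple(self._events)


def override_verdict(log: AuditLog, bet_id: str, actor: str,
                     new_verdict: str, audit_reason: str) -> AuditEvent:
    """Record a verdict override; reject it without permission or a reason."""
    if actor not in OVERRIDE_ACTORS:
        raise PermissionError(f"{actor} may not override verdicts")
    if not audit_reason.strip():
        raise ValueError("override requires an audit reason")
    event = AuditEvent(bet_id, actor, "override", new_verdict,
                       audit_reason.strip(), datetime.now(timezone.utc))
    log.append(event)
    return event
```

Rejected attempts leave the log unchanged, so the history only contains overrides that actually took effect.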

Decision Record

Decision

Use eval / system for this round.

Why

The project is no longer blocked by a lack of ideas; it is blocked by whether AI decisions and predictions can be checked, trusted, and corrected. A standalone report would not create a feedback loop. A pure dashboard would lack a rubric and an audit trail. The right product shape is an evaluation system with a production scorecard.

Options Considered

  • Communication: too small; tool choice alignment needs evidence and calibration.
  • Project pack: useful, but does not verify prediction accuracy.
  • Eval / system: best fit; supports bets ledger, evidence mapping, verdicts, calibration runs, permissions, and audit.

Adoption Status

Recommended for D1. D7 should ingest the first 20 previous review predictions.

Feedback Needed If Rejected

Clarify whether the blocker is missing historical review data, unclear verdict authority, or concern about AI overriding human judgment.

{
"project": "AI 自建專案:公司AI化 的最大化推進",
"job_id": "0a46b6bc-5e77-4a5c-bbb0-fc487a268d98",
"selected_solution": "eval/system",
"learned_signal": "AI prediction verification module now uses signals and action items to check whether previous review predictions were hit.",
"market_learning": "Production AI monitoring relies on drift checks, batch evaluation, scorecards, rubrics, and auditability.",
"next_run_bias": "Prefer bets ledger, evidence links, verdict rubric, and calibration report over narrative review summaries.",
"must_check_next": [
"Are historical review predictions available as structured bets?",
"Can each bet link to at least one signal, action item, commit, worker health event, LINE, or Drive source?",
"Who can override a verdict?",
"How are unknown verdicts excluded from hit rate?"
]
}

Market Maturity

Comparable Practices

PLS Gap

PLS has signals, action items, commits, heartbeat nodes, persona maps, and prediction module commits. The gap is a closed calibration loop that ties previous predictions to later evidence and forces a verdict.

This Round Upgrade

This pack adds verdict taxonomy, data schema, API design, permissions, audit trail, acceptance tests, people sync, and D30 scorecard path.

People Sync

LINE Draft

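Louis, this round I turned "Company AI Adoption AI prediction verification" into a console. From now on, each review's predictions will be written to the bets ledger and judged hit / partial / miss / unknown using evidence from signals, action items, GitHub commits, worker health, and LINE/Drive. For D7, I suggest running the first scorecard on 20 predictions from the previous review, especially to verify whether the AI tool selection of zihrou, iron, and Louis is actually aligned.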

Ask

Please confirm: does Louis retain final override authority over prediction verdicts, with zihrou / iron limited to adding evidence and raising objections?

If No Reply

Treat worker verdicts as drafts first; list every unknown as a data_gap and exclude it from the hit rate.

Production Brief

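Scenario

Project: AI self-built project: maximizing progress on Company AI Adoption (公司AI化).

Signal this round: a new AI prediction verification module uses multi-source evidence such as signals and action items to automatically check whether predictions from the previous review were hit. Related signals include prediction-depth vital signs, bets ledger backfill, persona thinking maps, worker isolation, heartbeat nodes, and Company AI Adoption progress moving from 98% to 99%.

Output

Build the "Company AI Adoption AI Prediction Verification Console", turning review predictions into a trackable bets ledger and using signals / action items / GitHub commits / worker health / LINE / Drive evidence to automatically assign hit, partial, miss, or unknown.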

D1 / D7 / D14 / D30

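  • D1: Define the prediction bet schema, evidence mapping, and verdict rubric.
  • D7: Ingest 20 bets from the previous review and produce a first-pass hit rate and data gap list.
  • D14: Make the prediction calibration section a fixed part of the weekly review.
  • D30: Form the AI management layer scorecard, connecting tool selection consistency, worker quality, and project progress hit rate.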

Owner / Due / Acceptance

  • Owner: Louis.
  • Stakeholders: zihrou, iron.
  • Due: D7 first scorecard.
  • Acceptance: every prediction has owner, due, expected outcome, evidence links, verdict, confidence, and next calibration action.

Production Readiness

Ready Now

  • Openable validation console.
  • Required production appendix pack.
  • Prediction bet schema and verdict rubric.
  • D1/D7/D14/D30 operating path.
  • People sync and learning memory.

Integration Required

  • Ingest historical review predictions.
  • Link signals, action items, GitHub commits, worker health, LINE, and Drive evidence.
  • Run weekly calibration.
  • Store verdicts and Louis overrides with audit trail.

Failure / Rollback

If evidence is insufficient, use unknown, not hit. If Louis overrides a worker verdict, require an audit reason. If data sources disagree, keep both the supporting and the contradicting evidence links and route the bet to review.
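These fallback rules can be sketched as a function over the `polarity` values of a bet's evidence links. The function name `draft_verdict`, the `route` labels, and the mapping of one-sided evidence to hit or miss are assumptions for illustration; the source only fixes the unknown and disagreement cases.

```python
from typing import List


def draft_verdict(polarities: List[str]) -> dict:
    """Turn evidence-link polarities into a draft verdict and a route."""
    supports = polarities.count("supports")
    contradicts = polarities.count("contradicts")
    if supports == 0 and contradicts == 0:
        # Only neutral or data_gap links, or none at all: insufficient evidence.
        return {"verdict": "unknown", "route": "data_gap"}
    if supports and contradicts:
        # Sources disagree: keep every link and escalate instead of picking a side.
        return {"verdict": None, "route": "review"}
    # Assumption: one-sided evidence yields a draft hit or miss for the worker to propose.
    return {"verdict": "hit" if supports else "miss", "route": "auto"}
```

The function never deletes or filters links, so both supporting and contradicting evidence remain attached when a bet is routed to review.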

Skill / Tool Usage

Tools Used

  • PLS helper: doctor, touch, claim, context, progress, upload-files, complete.
  • Web search: checked Evidently AI monitoring and Google ML Test Score.
  • GitHub CLI: publishes the production pack as a public Gist.
  • curl: verifies the primary artifact returns HTTP 200.

Evidence

The job was claimed and context was read through the helper. Progress was written before production work. External maturity references were checked. The final artifact is published and verified before completion.

Solution Selection

Selected route: eval / system.

This is not a generic project update. It is a prediction verification system for AI management quality. The system must evaluate whether previous predictions were right using multi-source evidence, then improve future routing, tool choice, and worker trust.

Production stack:

  • Framework: prediction bet -> evidence -> verdict -> calibration -> route update.
  • Workflow: capture bets, link signals, score verdicts, create next actions.
  • Data model: bets, evidence links, verdicts, calibration runs.
  • Tool: openable validation console.
  • Acceptance: pass/fail rubric and audit trail.
  • Upgrade: real weekly calibration report.