@esz135888
Last active May 23, 2026 22:34
PLS AI prediction verification eval console - job 073eefe3

E2E Acceptance Tests

A. Ledger Import

Given prediction-ledger-seed.csv, when import runs, then rows with prediction_id, expected_signal_type, expected_by, and risk_tier are accepted.

Pass: accepted_rows >= 3 and schema_errors = 0.
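
The import check above can be sketched as a small validator. This is a minimal illustration, not the PLS importer: the function name `import_ledger` and the error record shape are assumptions, and the sample uses only the four required columns with values taken from the seed rows.

```python
import csv
import io

# Required columns named in acceptance test A.
REQUIRED_FIELDS = ("prediction_id", "expected_signal_type", "expected_by", "risk_tier")


def import_ledger(csv_text):
    """Return (accepted_rows, schema_errors) for a ledger CSV string.

    Hypothetical helper: a row is accepted only when every required
    field is present and non-empty.
    """
    accepted, errors = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            errors.append({"line": line_no, "missing": missing})
        else:
            accepted.append(row)
    return accepted, errors


# Reduced sample: required columns only, values from the seed ledger.
SAMPLE = """prediction_id,expected_signal_type,expected_by,risk_tier
PRED-001,github_commit,2026-05-30,medium
PRED-002,action_item,2026-05-30,high
PRED-003,message_or_decision,2026-05-30,high
"""

accepted_rows, schema_errors = import_ledger(SAMPLE)
print(len(accepted_rows), len(schema_errors))
```

The pass condition then reduces to `len(accepted_rows) >= 3 and len(schema_errors) == 0`.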

B. Evidence Matching

Given a prediction expects github_commit, when a commit signal appears with matching project_id and semantic overlap, then prediction_evidence_links.match_strength >= 0.7.

Pass: evidence link is created with source_type github_commit.
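
A rough sketch of the matching rule follows. The source does not say how "semantic overlap" is computed, so this example substitutes a plain Jaccard word overlap as a stand-in; `token_overlap`, `match_commit`, and the commit fields `summary` and `sha` are hypothetical names.

```python
def token_overlap(a, b):
    """Jaccard overlap of lowercase word sets (a stand-in for semantic matching)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_commit(prediction, commit, threshold=0.7):
    """Return an evidence link dict, or None when the commit does not match."""
    if prediction["expected_signal_type"] != "github_commit":
        return None
    if prediction["project_id"] != commit["project_id"]:
        return None
    strength = token_overlap(prediction["prediction_text"], commit["summary"])
    if strength < threshold:
        return None
    return {
        "prediction_id": prediction["prediction_id"],
        "source_type": "github_commit",
        "source_id": commit["sha"],  # hypothetical commit identifier field
        "match_strength": round(strength, 3),
        "match_reason": "project_id match + token overlap",
    }


prediction = {
    "prediction_id": "PRED-001",
    "project_id": "d2afbba2-f20a-4ca5-ab6b-8e848e5532ef",
    "expected_signal_type": "github_commit",
    "prediction_text": "add evidence sync worker for prediction validation",
}
commit = {
    "project_id": "d2afbba2-f20a-4ca5-ab6b-8e848e5532ef",
    "sha": "abc123",  # illustrative value only
    "summary": "add evidence sync worker for prediction validation module",
}
link = match_commit(prediction, commit)
print(link)
```

A production matcher would likely replace `token_overlap` with an embedding or model-based score; the threshold check and the link shape stay the same.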

C. Verdict Scoring

Given evidence link strength >= 0.7 before expected_by, when scoring runs, then verdict is hit or partial based on rubric.

Pass: hit_score >= 70 for hit, 40-69 for partial.
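
The score bands can be expressed as a small mapping. This is a sketch under one assumption: the source defines the `hit` and `partial` bands only, so treating any score below 40 as `miss` is an inference from the verdict enum.

```python
def verdict_for(hit_score):
    """Map a 0-100 hit_score to a verdict using the bands in test C."""
    if not 0 <= hit_score <= 100:
        raise ValueError("hit_score must be between 0 and 100")
    if hit_score >= 70:
        return "hit"
    if hit_score >= 40:
        return "partial"
    return "miss"  # assumption: the source does not state the miss band


for score in (85, 70, 69, 40, 39):
    print(score, verdict_for(score))
```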

D. Watchdog Alert

Given hit_rate < 60% or evidence_coverage < 80% after 20 predictions, when watchdog runs, then people_sync alert is generated for Louis/zihrou/iron.

Pass: alert includes metric, threshold, owner, due date, and remediation.
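
One way to produce an alert carrying all five required fields is sketched below. The per-metric owners and remediation text follow the scoring rules table in the HTML console; the 7-day due window and the names `RULES` and `watchdog` are assumptions.

```python
from datetime import date, timedelta

# metric -> (alert threshold, owner, remediation)
RULES = {
    "hit_rate": (0.60, "Louis", "Pause automation expansion and rework the rubric."),
    "evidence_coverage": (0.80, "iron", "Fill in the missing source sync."),
}
MIN_PREDICTIONS = 20


def watchdog(metrics, prediction_count, today, due_in_days=7):
    """Return one alert dict per metric that is below its threshold."""
    if prediction_count < MIN_PREDICTIONS:
        return []  # thresholds only apply after 20 predictions
    alerts = []
    for metric, (threshold, owner, remediation) in RULES.items():
        if metrics[metric] < threshold:
            alerts.append({
                "metric": metric,
                "value": metrics[metric],
                "threshold": threshold,
                "owner": owner,
                "due_date": today + timedelta(days=due_in_days),  # assumed window
                "remediation": remediation,
            })
    return alerts


alerts = watchdog({"hit_rate": 0.55, "evidence_coverage": 0.90}, 20, date(2026, 5, 24))
print(alerts)
```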

E. Governance Guard

Given risk_tier = high, when auto verdict would change project state, then verdict is marked needs_human_review.

Pass: no high-risk project state changes without Louis or zihrou approval.
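
The guard can be sketched as a single decision function. The function name `apply_verdict` and its return shape are illustrative assumptions; the approver set comes from the pass condition above.

```python
APPROVERS = {"Louis", "zihrou"}


def apply_verdict(prediction, verdict, changes_project_state, approved_by=None):
    """Return the verdict to record and whether project state may change."""
    high_risk = prediction["risk_tier"] == "high"
    if high_risk and changes_project_state and approved_by not in APPROVERS:
        # Hold the verdict for a human instead of applying it.
        return {"verdict": "needs_human_review", "state_change_allowed": False}
    return {"verdict": verdict, "state_change_allowed": changes_project_state}


blocked = apply_verdict({"risk_tier": "high"}, "hit", changes_project_state=True)
approved = apply_verdict({"risk_tier": "high"}, "hit", True, approved_by="Louis")
print(blocked, approved)
```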

Current Verification

  • HTML Eval Console is openable.
  • JSON learning memory parses.
  • Gist returns HTTP 200.
  • Gist file list includes required files.
  • PLS upload-files reports uploaded files.

Artifact URL / PR Record

Primary artifact URL: https://gist.github.com/esz135888/a7a92a1d84f15c366669fad6dce04818

Type

Shared-cloud Gist artifact pack. No GitHub PR or deployment is claimed for this job.

Verification

  • Gist HTTP status: HTTP/2 200 verified on 2026-05-24 Asia/Taipei.
  • Gist file list: 13 files verified by gh gist view a7a92a1d84f15c366669fad6dce04818 --files.
  • PLS upload-files: pending before final PLS sync.

Required Contract Kinds

primary_artifact, solution_selection, market_context, production_readiness, e2e_verification, people_sync, learning_memory, skill_usage, market_maturity, production_acceptance, landing_record, tool, dashboard, doc.

Data Model / API / Sync / Permission Spec

Tables

prediction_ledger

| column | type | required | note |
| --- | --- | --- | --- |
| prediction_id | uuid | yes | prediction key |
| review_id | uuid | yes | source review |
| project_id | uuid | yes | PLS project |
| predicted_at | timestamptz | yes | created time |
| prediction_text | text | yes | what AI predicted |
| expected_signal_type | text | yes | action_completed/github_commit/status_change/message |
| expected_by | date | yes | validation deadline |
| risk_tier | enum | yes | low/medium/high |
| owner_profile_id | uuid | no | accountable person |
| status | enum | yes | pending/hit/miss/partial/expired |

prediction_evidence_links

| column | type | required | note |
| --- | --- | --- | --- |
| evidence_id | uuid | yes | evidence key |
| prediction_id | uuid | yes | linked prediction |
| source_type | enum | yes | signal/action_item/github_commit/deliverable/person_reflection |
| source_id | uuid/text | yes | source row id |
| event_time | timestamptz | yes | evidence time |
| match_strength | numeric | yes | 0-1 |
| match_reason | text | yes | why linked |

prediction_validation_scores

| column | type | required | note |
| --- | --- | --- | --- |
| score_id | uuid | yes | score key |
| prediction_id | uuid | yes | linked prediction |
| hit_score | numeric | yes | 0-100 |
| verdict | enum | yes | hit/partial/miss/needs_human_review |
| false_positive_risk | numeric | yes | 0-1 |
| validated_at | timestamptz | yes | score time |
| validated_by | text | yes | worker/model/human |

API / Sync

  • POST /api/prediction-validation/ledger/import imports prediction ledger rows.
  • POST /api/prediction-validation/evidence/sync syncs signals, action items, commits, deliverables, people reflections.
  • GET /api/prediction-validation/dashboard?project_id=... returns hit rate, evidence coverage, and stale predictions.
  • POST /api/prediction-validation/:prediction_id/review records human override and audit reason.

Permission / Audit

  • Louis can approve high-risk verdict changes.
  • zihrou can define high-risk rubric and governance exceptions.
  • iron can modify schema/API implementation.
  • PLS worker can score low/medium risk predictions but cannot auto-change high-risk project status.
  • Audit log must store old verdict, new verdict, reason, actor, timestamp, source evidence ids.
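
The audit requirement maps naturally onto an immutable record. This sketch assumes a hypothetical `AuditEntry` dataclass and `record_override` helper; only the stored fields come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One immutable audit row for a human verdict override."""
    prediction_id: str
    old_verdict: str
    new_verdict: str
    reason: str
    actor: str
    timestamp: str
    source_evidence_ids: tuple


def record_override(prediction_id, old_verdict, new_verdict, reason, actor,
                    source_evidence_ids, now=None):
    """Build an audit entry; reject overrides that carry no reason."""
    if not reason.strip():
        raise ValueError("an audit reason is required")
    now = now or datetime.now(timezone.utc)
    return AuditEntry(
        prediction_id=prediction_id,
        old_verdict=old_verdict,
        new_verdict=new_verdict,
        reason=reason,
        actor=actor,
        timestamp=now.isoformat(),
        source_evidence_ids=tuple(source_evidence_ids),
    )


entry = record_override(
    "PRED-002", "needs_human_review", "hit",
    reason="Approval matrix delivered before the deadline.",  # illustrative text
    actor="zihrou",
    source_evidence_ids=["EV-1", "EV-2"],  # illustrative ids
    now=datetime(2026, 5, 24, 6, 40, tzinfo=timezone.utc),
)
print(entry)
```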

PLS Backend / Worker Flow

  1. Review worker writes prediction ledger after each review.
  2. Nightly worker syncs new evidence.
  3. Eval worker scores prediction.
  4. Watchdog alerts if stale, low coverage, or high-risk conflict.
  5. Learning memory updates next review prompt.

Decision Record

Decision

Build Prediction Verification Eval Console for this round.

Problem

The source signal says AI prediction validation was added, but the project still needs a productized operating pack: ledger schema, scoring rubric, evidence sources, governance, watchdog thresholds, and adoption path. The org's AI tool choice is not unified, so the first durable step should be a tool-agnostic eval contract.

Options Considered

  1. Communication unblock only.
    • Rejected: useful for alignment but cannot validate predictions.
  2. Research-only benchmark.
    • Rejected: would not produce production artifact.
  3. Full autonomous agent.
    • Rejected: too risky before hit-rate and false-positive data exists.
  4. Eval + system + watchdog pack.
    • Recommended: measurable, tool-agnostic, and safe to adopt.

Recommendation

Adopt the eval console and seed ledger first. Require 20 validations before deciding whether to automate more authority.

Adoption Status

Proposed for job 073eefe3-10c0-438e-b2b7-42a3fbf0e85f.

Landing Path

Owner: Louis. Supervisor / governance: zihrou. Implementation support: iron. Due: 2026-05-30.

Feedback If Not Adopted

Return one of: actual prediction-validation code path, preferred data source priority, high-risk rubric, or reason why this should be promoted straight to agent.

{
  "job_id": "073eefe3-10c0-438e-b2b7-42a3fbf0e85f",
  "project": "AI 自建專案:公司AI化",
  "learned_at": "2026-05-24T06:40:00+08:00",
  "solution_selection": "eval + system + watchdog",
  "market_context": [
    {
      "source": "OpenAI evaluation best practices",
      "lesson": "Prediction validation should use production, historical, synthetic, and human-curated data, not a single judge score."
    },
    {
      "source": "LangSmith evaluation concepts",
      "lesson": "Evals should support lifecycle measurement, including production monitoring and benchmarking."
    },
    {
      "source": "MLOps patterns",
      "lesson": "AI systems need continuous monitoring, metadata, governance, and feedback loops tied to business KPIs."
    }
  ],
  "pls_next_checks": [
    "Check whether each review writes explicit predictions into a ledger.",
    "Do not trust prediction quality until evidence_coverage and false_positive_rate are tracked.",
    "Require human approval before high-risk prediction verdicts affect people, budget, or project state.",
    "If evidence sync fails twice, dispatch repo_change for backend integration."
  ],
  "assumptions_overturned": [
    "A commit that adds a validation module is not enough; PLS needs a visible eval contract and adoption path.",
    "AI prediction quality cannot be inferred from confidence text; it needs later evidence.",
    "Tool choice conflict should be resolved by schema and acceptance first, vendor second."
  ],
  "next_iteration_condition": "Run 20 prediction validations; if hit_rate >= 70%, false_positive_rate <= 15%, and evidence_coverage >= 80%, promote to backend workflow or agent."
}
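
The `next_iteration_condition` string above can be turned into an executable gate. This is a sketch only: `promotion_gate` is a hypothetical helper, and rates are assumed to be expressed as fractions between 0 and 1.

```python
def promotion_gate(validations, hit_rate, false_positive_rate, evidence_coverage):
    """Return (ready, failed_checks) for the next-iteration condition."""
    checks = {
        "validations >= 20": validations >= 20,
        "hit_rate >= 0.70": hit_rate >= 0.70,
        "false_positive_rate <= 0.15": false_positive_rate <= 0.15,
        "evidence_coverage >= 0.80": evidence_coverage >= 0.80,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)


print(promotion_gate(20, 0.72, 0.10, 0.85))
print(promotion_gate(20, 0.72, 0.20, 0.75))
```

Returning the failed checks alongside the boolean keeps the decision auditable, which fits the governance goals of the pack.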

Market Maturity

Current Practice Check

Checked on 2026-05-24 Asia/Taipei using web search.

Mature Market Pattern

Mature AI/LLM evaluation is a loop: production traces become eval cases, golden datasets are maintained, regression tests run before changes, observability captures traces and costs, and human review governs high-risk decisions.

Comparable Practices

  • OpenAI evaluation guidance emphasizes domain-specific, human-curated, production, historical, and synthetic eval data, and matching evaluation method to task.
  • LangSmith/LangChain eval concepts frame evals across lifecycle: pre-deployment testing, production monitoring, benchmarking, and human annotation workflows.
  • LangChain's production eval loop describes converting monitored production failures into test cases, verifying fixes, deploying, then monitoring again.
  • MLOps maturity patterns include monitoring, governance, drift, business KPIs, metadata tracking, and feedback loops.

PLS Gap

PLS has rich signals, action items, commits, deliverables, and reflections, but it does not yet consistently close the loop on prediction quality. Without a ledger and scoring rubric, an AI review can sound smart while no one knows whether it was right.

Upgrade This Round

This pack adds a measurable eval layer:

  • prediction ledger with expected evidence.
  • evidence matching across PLS sources.
  • hit/partial/miss scoring.
  • high-risk governance guard.
  • watchdog thresholds for stale or low-quality prediction loops.
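| prediction_id | review_id | project_id | predicted_at | prediction_text | expected_signal_type | expected_by | risk_tier | owner | status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PRED-001 | REV-20260523-001 | d2afbba2-f20a-4ca5-ab6b-8e848e5532ef | 2026-05-23T14:50:36Z | Adding the AI prediction verification module will create follow-up evidence sync requirements | github_commit | 2026-05-30 | medium | iron | pending |
| PRED-002 | REV-20260523-002 | d2afbba2-f20a-4ca5-ab6b-8e848e5532ef | 2026-05-23T14:50:36Z | zihrou needs to define the human-review boundary for high-risk predictions | action_item | 2026-05-30 | high | zihrou | pending |
| PRED-003 | REV-20260523-003 | d2afbba2-f20a-4ca5-ab6b-8e848e5532ef | 2026-05-23T14:50:36Z | Louis will decide whether to invest further in the AI management layer based on whether verifiable results appear within 7 days | message_or_decision | 2026-05-30 | high | Louis | pending |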
<!doctype html>
<html lang="zh-Hant">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>AI Prediction Verification Eval Console</title>
<style>
:root { --ink:#172033; --muted:#617086; --line:#d8dee9; --bg:#f5f7fb; --panel:#fff; --green:#087443; --red:#b42318; --amber:#a15c07; --blue:#175cd3; }
* { box-sizing:border-box; }
body { margin:0; background:var(--bg); color:var(--ink); font:14px/1.5 -apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif; }
header { background:var(--panel); border-bottom:1px solid var(--line); padding:28px 32px; }
h1 { margin:0 0 6px; font-size:26px; letter-spacing:0; }
h2 { margin:0 0 12px; font-size:17px; }
main { max-width:1240px; margin:0 auto; padding:22px 18px 42px; display:grid; gap:16px; }
section { background:var(--panel); border:1px solid var(--line); border-radius:8px; padding:18px; }
.grid { display:grid; grid-template-columns:repeat(4,minmax(0,1fr)); gap:12px; }
.card { border:1px solid var(--line); border-radius:8px; background:#fbfcff; padding:14px; min-height:116px; }
.label { color:var(--muted); font-size:12px; text-transform:uppercase; }
.value { font-size:24px; font-weight:750; margin-top:4px; }
.green { color:var(--green); } .red { color:var(--red); } .amber { color:var(--amber); } .blue { color:var(--blue); }
table { width:100%; border-collapse:collapse; }
th,td { text-align:left; vertical-align:top; border-bottom:1px solid var(--line); padding:10px 8px; }
th { color:var(--muted); font-size:12px; }
code { background:#eef2f7; border-radius:4px; padding:1px 5px; }
.small { color:var(--muted); font-size:12px; }
.pill { display:inline-block; border:1px solid var(--line); border-radius:999px; padding:2px 9px; background:#fff; }
@media (max-width:900px){ header{padding:22px 18px;} .grid{grid-template-columns:1fr;} }
</style>
</head>
<body>
<header>
<h1>AI Prediction Verification Eval Console</h1>
<div class="small">Job 073eefe3-10c0-438e-b2b7-42a3fbf0e85f · owner Louis · governance zihrou · implementation iron · due 2026-05-30</div>
</header>
<main>
<section>
<h2>Verification Status This Round</h2>
<div class="grid">
<div class="card"><div class="label">Solution Type</div><div class="value blue">eval + system</div><div class="small">Adds watchdog thresholds; no direct promotion to agent.</div></div>
<div class="card"><div class="label">Seed Ledger</div><div class="value green">3 rows</div><div class="small">Run 20 rows first to form the golden set.</div></div>
<div class="card"><div class="label">Risk Guard</div><div class="value amber">human review</div><div class="small">High-risk verdicts do not change state automatically.</div></div>
<div class="card"><div class="label">Next Gate</div><div class="value red">needs data</div><div class="small">Needs evidence coverage and hit rate.</div></div>
</div>
</section>
<section>
<h2>Scoring Rules</h2>
<table>
<tr><th>Metric</th><th>Pass</th><th>Action</th></tr>
<tr><td><code>evidence_coverage</code></td><td>&gt;= 80%</td><td>Below threshold, ask iron to fill in source sync.</td></tr>
<tr><td><code>hit_rate</code></td><td>&gt;= 70%</td><td>Below 60%, alert Louis to pause expansion of automation authority.</td></tr>
<tr><td><code>false_positive_rate</code></td><td>&lt;= 15%</td><td>Above threshold, rework the rubric/golden set.</td></tr>
<tr><td><code>time_to_validation</code></td><td>&lt;= 7 days</td><td>Overdue predictions go to the watchdog.</td></tr>
</table>
</section>
<section>
<h2>Data Flow</h2>
<table>
<tr><th>Source</th><th>Purpose</th><th>Join Key</th></tr>
<tr><td>signals</td><td>Find the actual events that followed the prediction</td><td>project_id + signal_type + semantic match</td></tr>
<tr><td>action_items</td><td>Verify whether the owner was assigned the work or completed it</td><td>project_id + assignee + due_date</td></tr>
<tr><td>github_commit</td><td>Verify whether the technical prediction landed</td><td>project_id + commit summary</td></tr>
<tr><td>deliverables</td><td>Verify whether the AI produced real files</td><td>hermes_job_id + deliverable_id</td></tr>
<tr><td>people_reflections</td><td>Verify whether the persona / recent focus was updated</td><td>profile_id + project_id</td></tr>
</table>
</section>
<section>
<h2>Watchdog Alert</h2>
<table>
<tr><th>Condition</th><th>Notify</th><th>Action</th></tr>
<tr><td>hit_rate &lt; 60% after 20 predictions</td><td>Louis</td><td>Pause expansion of AI prediction authority; rework the rubric.</td></tr>
<tr><td>High-risk verdict without human review</td><td>zihrou</td><td>Add the missing approval matrix.</td></tr>
<tr><td>evidence sync error &gt; 5%</td><td>iron</td><td>Dispatch repo_change to fix source sync.</td></tr>
</table>
</section>
</main>
</body>
</html>

Production Acceptance

Pass Conditions

  • Primary artifact opens through shared-cloud Gist.
  • Artifact pack includes production brief, solution selection, data model, acceptance tests, decision record, market maturity, skill usage, production acceptance, sources, learning memory, HTML console, seed CSV, and artifact URL record.
  • Artifacts JSON includes primary_artifact, solution_selection, market_context, production_readiness, e2e_verification, people_sync, learning_memory, skill_usage, market_maturity, production_acceptance.
  • Owner, due, adoption signal, and next iteration condition are explicit.

Fail Conditions

  • Summary-only completion.
  • No openable primary artifact URL.
  • No data/API/sync/permission path.
  • No governance guard for high-risk prediction changes.
  • No E2E acceptance tests.
  • No market maturity check.

Owner / Due / Acceptance

  • Owner: Louis.
  • Governance: zihrou.
  • Implementation support: iron.
  • Due: 2026-05-30.
  • Acceptance: 20 prediction validations, evidence_coverage >= 80%, hit_rate >= 70%, false_positive_rate <= 15%, all high-risk verdict changes require human approval.

Watchdog Thresholds

  • Alert Louis if hit_rate < 60% after 20 predictions.
  • Alert zihrou if high-risk predictions are auto-scored without review.
  • Alert iron if evidence sync error rate > 5% for two runs.
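
The last threshold depends on run history rather than a single value. A minimal sketch, assuming "for two runs" means the two most recent consecutive runs and that error rates are stored oldest first as fractions; `sync_alert_needed` is a hypothetical name.

```python
def sync_alert_needed(error_rates, threshold=0.05, runs=2):
    """True when each of the last `runs` sync runs exceeded the threshold."""
    if len(error_rates) < runs:
        return False  # not enough history yet
    return all(rate > threshold for rate in error_rates[-runs:])


print(sync_alert_needed([0.01, 0.06, 0.08]))  # last two runs both above 5%
print(sync_alert_needed([0.06, 0.02]))        # latest run recovered
```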

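Production Brief: AI Prediction Verification Eval Console

Scenario

The 公司AI化 (company AI adoption) project already shows a commit signal, "add AI prediction verification module", but zihrou, iron, and Louis have not yet agreed on an AI tool choice, so a systematic build-out can easily diverge. This round does not deliver a generic strategy document. It delivers an adoptable Prediction Verification Eval Console: it checks the predictions from the previous review against evidence from signals, action items, GitHub commits, deliverables, and people reflections to determine whether they hit, then writes the results back as dispatch and governance signals for the next round.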

Solution Selection

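Selected type: eval + system + watchdog

Not a smaller doc/sop: the problem is not unclear concepts; it needs repeatable scoring, cross-source sync, and alerts.

Not a larger agent: an agent should not yet automatically change predictions or block projects. Use eval/system/watchdog first to produce auditable results, and upgrade to an agent once accuracy and false positives stabilize.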

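30-Day Path

  • D1: Create the prediction ledger seed, scoring rules, evidence join keys, and HTML Console.
  • D7: Connect PLS signals/action_items/github_commit/deliverables and run the first round of 20 prediction validations.
  • D14: Add the watchdog: alert Louis/zihrou/iron when the hit rate falls below 60%, an overdue prediction is unvalidated, or a high-risk false positive occurs.
  • D30: Become a PLS backend module: review generates prediction → automatically wait for evidence → validate hit → produce learning memory → adjust the next round's AI advancement strategy.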

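Purpose-to-Purpose E2E

  • Original purpose: make PLS not only predict but also verify whether its previous predictions hit.
  • Outputs: an openable Eval Console, seed CSV, data model, API/sync spec, acceptance tests, decision record, market maturity, production acceptance, learning memory.
  • Adoption by people: Louis reviews hit rate and error types; zihrou coordinates tool selection and governance; iron implements the system build and data fields.
  • Metrics improved: prediction_hit_rate, evidence_coverage, false_positive_rate, time_to_validation, review_quality_delta.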

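Value / Money Path

  • Reduce risk: if AI predictions stay wrong over a long period, the watchdog raises an alert, so they do not silently affect decisions.
  • Save cost: stop manually going back through LINE/GitHub/action items to judge whether the AI was accurate.
  • Improve conversion: accurate predictions let PLS find the people and projects to follow up on faster.
  • Free up people: turn "did what was said last time actually happen" from a manual review into a quantifiable eval.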

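Improving People's Capabilities

Louis can use the hit rate to decide whether to invest further in AI management; zihrou can use false-positive types to diagnose process/authorization problems; iron can turn the tool-selection dispute into fields, APIs, and acceptance criteria instead of stalling at opinion alignment.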

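Solution Stack

  • Context framework: prediction ledger → evidence collector → scoring rubric → review console → watchdog alert → learning memory.
  • Operating workflow: each review writes predictions; evidence is collected daily; initial assessment after 48 hours, final assessment after 7 days.
  • Data model: see data-model.md
  • Operable tools: prediction-verification-eval-console.html, prediction-ledger-seed.csv
  • Acceptance metrics: see acceptance-tests.md, production-acceptance.md
  • Adoption and next-round upgrade: if the hit rate holds at >= 70% for 2 consecutive weeks, upgrade to an agent; if it falls below 60%, adjust the rubric/golden set first.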
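
The review cadence in the operating workflow (initial assessment after 48 hours, final assessment after 7 days) can be sketched as a schedule helper. The source does not state the reference point, so measuring both offsets from `predicted_at` is an assumption, as is the name `validation_schedule`.

```python
from datetime import datetime, timedelta, timezone


def validation_schedule(predicted_at):
    """Return the initial and final assessment times for one prediction."""
    return {
        "initial_review_at": predicted_at + timedelta(hours=48),
        "final_review_at": predicted_at + timedelta(days=7),
    }


# predicted_at taken from the seed ledger rows (2026-05-23T14:50:36Z).
predicted_at = datetime(2026, 5, 23, 14, 50, 36, tzinfo=timezone.utc)
schedule = validation_schedule(predicted_at)
print(schedule["initial_review_at"].isoformat())
print(schedule["final_review_at"].isoformat())
```

Under this assumption the final assessment for the seed rows lands on 2026-05-30, the same date as their `expected_by` value.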

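People Sync / LINE Drafts

Louis: This round turns "did the AI's last predictions hit" into a verifiable console. Please look at three numbers: hit rate, evidence coverage, and false positives. If we can accumulate 20 predictions within 7 days, we can judge whether the AI management layer is worth further investment.

zihrou: Please help define which predictions are high-risk and require human review before they can affect people or project state.

iron: Please use the data-model/API spec to align the join keys for signals, action_items, github_commit, and deliverables. Don't debate tool brands yet; get the validation fields working first.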

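Previous Issue → This Round's Change → Verification Result → Next-Round Recommendation

  • Previous issue: a prediction-validation commit signal existed, but a deliverable production contract and adoption path were missing.
  • This round's change: built the eval console, ledger seed, data model, API/permissions, acceptance criteria, and watchdog thresholds.
  • Verification result: local JSON validation, Gist HTTP 200, Gist file list, PLS upload-files.
  • Next-round recommendation: collect 20 prediction validations, then build the first golden set and dashboard trends.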

Skill / Tool Usage

Selected Skills / Tools

  • purpose_e2e_toolbox_v2: used for end-to-end purpose, value path, capability improvement, solution stack, data model, acceptance, and decision record.
  • PLS solution catalog: selected eval + system + watchdog.
  • Web search: used for current market maturity and comparable eval/observability practices.
  • Shell tools: python3 -m json.tool, find, rg.
  • GitHub CLI: used to create and verify shared-cloud Gist.
  • URL verification: curl -I -L -s verifies openable artifact.
  • PLS helper: doctor, touch, claim, context, progress, upload-files, complete.

Why These Tools

This is not a code repository change in the current workspace; it is a production pack for PLS to implement/route. Gist provides a durable shared artifact. CSV gives seed data. HTML gives a human-operable console. Markdown specs define backend implementation and governance.

Evidence / Test Result

  • prediction-verification-eval-console.html is primary artifact.
  • prediction-ledger-seed.csv gives testable seed data.
  • learning-memory.json passes JSON parsing.
  • Gist HTTP 200 and file list verified before complete.
  • PLS upload-files must return uploaded file count.

Solution Selection

Selected Type

eval + system + watchdog

Why

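AI prediction verification is not a one-off document but a long-term quality loop. It needs:

  • eval: judges whether a prediction hit.
  • system: connects signals, action_items, GitHub commits, deliverables, and people reflections.
  • watchdog: alerts people when the hit rate or coverage falls below threshold, so wrong predictions do not accumulate silently.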

Why Not Smaller

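  • communication can only push people to supply data; it cannot validate predictions.
  • doc can only explain the rules; it cannot produce a hit rate.
  • sop suits manual processes, but this needs automatic cross-source evidence joins.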

Why Not Bigger

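  • agent is too early at this stage. If an agent automatically changed predictions or adjusted people's tasks, it would raise governance risk. Build trust first with auditable eval results.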

Adoption Condition

Upgrade to an autonomous worker only after at least 20 validations over 2 consecutive weeks with hit_rate >= 70%, false_positive_rate <= 15%, and evidence_coverage >= 80%.

Market Context Sources

Checked date: 2026-05-24 Asia/Taipei.

  1. OpenAI, "Evaluation best practices". URL: https://platform.openai.com/docs/guides/evaluation-best-practices Use: supports multiple eval data sources and task-aligned evaluation methods.

  2. LangChain, "LangSmith evaluation concepts". URL: https://docs.langchain.com/langsmith/evaluation-concepts Use: supports lifecycle evals from pre-deployment testing to production monitoring and benchmarking.

  3. LangChain, "LLM Evals: Production Monitoring to Regression Tests". URL: https://www.langchain.com/articles/llm-evals Use: supports production monitoring → test case → fix verification → deploy → monitor loop.

  4. MLOps / ModelOps reference. URL: https://en.wikipedia.org/wiki/MLOps Use: supports continuous monitoring, metadata tracking, governance, and feedback loops as mature operating pattern.

  5. PLS context for this job. URL: inline PLS context Use: source signal says AI prediction validation module was added and tool-choice alignment is blocked across zihrou, iron, and Louis.
