@esz135888
Created May 23, 2026 20:24
PLS job 7a70ab5d AI prediction D7 calibration run control tower

Acceptance Tests

Test 1: D1 Readiness Gate

Given an accepted reviewer inbox and 50-case seed queue, when the worker starts a D7 calibration run, then the run must be created with exactly 50 unique calibration_run_item rows and source freshness metadata.

Pass evidence:

  • calibration_run.status=ready
  • total_items=50
  • duplicate prediction ids = 0
  • source snapshot timestamp exists
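
The pass evidence above could be checked with something like the following minimal sketch. The function name `check_readiness` and the returned dictionary keys are illustrative assumptions, not part of the PLS worker; falling back to `draft` when a check fails is also an assumption.

```python
from collections import Counter
from datetime import datetime
from typing import Optional


def check_readiness(prediction_ids: list[str],
                    source_snapshot_at: Optional[datetime],
                    expected_items: int = 50) -> dict:
    """Evaluate the Test 1 pass evidence for a newly created run (hypothetical helper)."""
    counts = Counter(prediction_ids)
    # Count the extra rows beyond the first occurrence of each prediction id.
    duplicates = sum(n - 1 for n in counts.values())
    has_snapshot = source_snapshot_at is not None
    ready = (
        len(prediction_ids) == expected_items
        and duplicates == 0
        and has_snapshot
    )
    return {
        "total_items": len(prediction_ids),
        "duplicate_prediction_ids": duplicates,
        "has_source_snapshot": has_snapshot,
        # Assumption: a run that fails any check stays in "draft".
        "status": "ready" if ready else "draft",
    }
```

A run with 49 distinct ids plus one repeated id reports `duplicate_prediction_ids` of 1 and stays in `draft`.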

Test 2: Batch Label Completeness

Given a running calibration run, when evidence from signals and action items is processed, then every item must receive one match_label and a rationale.

Pass evidence:

  • labeled item count = 50
  • labels only use hit, miss, partial, unknown
  • no blank label_rationale
  • every label has at least one evidence reference or a source-gap reason
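
One way to check these four conditions is sketched below. The item dictionaries mirror the `calibration_run_item` fields; `source_gap_reason` is a hypothetical field name, since the pack does not say where a source-gap reason is stored.

```python
ALLOWED_LABELS = {"hit", "miss", "partial", "unknown"}


def find_label_problems(items: list[dict]) -> list[str]:
    """Return one message per completeness violation; an empty list means the gate passes."""
    problems = []
    for item in items:
        item_id = item.get("id", "<no id>")
        if item.get("match_label") not in ALLOWED_LABELS:
            problems.append(f"{item_id}: invalid or missing match_label")
        if not (item.get("label_rationale") or "").strip():
            problems.append(f"{item_id}: blank label_rationale")
        has_evidence = bool(item.get("evidence_refs"))
        # "source_gap_reason" is an assumed field name for the source-gap reason.
        has_gap_reason = bool((item.get("source_gap_reason") or "").strip())
        if not (has_evidence or has_gap_reason):
            problems.append(f"{item_id}: no evidence reference or source-gap reason")
    return problems
```

The labeled item count for the first bullet is then `len(items)` once the returned list is empty.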

Test 3: Unknown Threshold

Given the completed labels, when the scorecard is calculated, then unknown rate must be below 25% for productization to proceed.

Pass evidence:

  • unknown_count / total_items < 0.25
  • if not, pass_status=blocked
  • blocked status creates source_adapter_gap correction routes
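
For a 50-item run the threshold works out to at most 12 unknown items: 12 / 50 = 0.24 passes and 13 / 50 = 0.26 blocks. A minimal sketch of the gate follows. The function name is hypothetical, and creating one `source_adapter_gap` route per unknown item is an assumption, because the pack does not say which items receive routes. Owner, due date, and next action still have to be filled in before Test 5 can pass.

```python
def apply_unknown_gate(items: list[dict], threshold: float = 0.25) -> dict:
    """Compute the unknown rate and, when the gate fails, draft correction routes."""
    total_items = len(items)
    if total_items == 0:
        raise ValueError("run has no items")
    unknown_ids = [i["id"] for i in items if i["match_label"] == "unknown"]
    unknown_rate = len(unknown_ids) / total_items
    blocked = unknown_rate >= threshold
    # Assumption: one draft route per unknown item when the run is blocked.
    routes = [
        {"calibration_run_item_id": item_id,
         "route_type": "source_adapter_gap",
         "status": "open"}
        for item_id in unknown_ids
    ] if blocked else []
    return {
        "unknown_rate": unknown_rate,
        # This gate only decides "blocked"; other statuses are left undecided here.
        "pass_status": "blocked" if blocked else None,
        "routes": routes,
    }
```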

Test 4: Reviewer Sampling

Given a labeled run, when reviewers sample the batch, then at least 5 items must have reviewer sample results.

Pass evidence:

  • reviewer_sample_count >= 5
  • sample includes at least one non-hit item when non-hit items exist
  • reviewer agreement rate is calculated
  • disputed cases are routed
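
The sampling rule and the agreement calculation could be sketched as follows. Both function names are hypothetical. Counting `needs_more_evidence` as non-agreement is an assumption the pack does not settle, and the seeded random pick is only one possible selection strategy.

```python
import random


def pick_sample(items: list[dict], sample_size: int = 5, seed: int = 0) -> list[dict]:
    """Pick a reviewer sample that contains a non-hit item whenever one exists."""
    rng = random.Random(seed)
    non_hits = [i for i in items if i["match_label"] != "hit"]
    sample = [rng.choice(non_hits)] if non_hits else []
    chosen_ids = {i["id"] for i in sample}
    remaining = [i for i in items if i["id"] not in chosen_ids]
    needed = min(sample_size, len(items)) - len(sample)
    sample.extend(rng.sample(remaining, needed))
    return sample


def agreement_rate(reviewer_labels: list[str]) -> float:
    """Share of sampled items where the reviewer label is 'agree'."""
    if not reviewer_labels:
        raise ValueError("no reviewer samples recorded")
    # Assumption: 'disagree' and 'needs_more_evidence' both count against agreement.
    return reviewer_labels.count("agree") / len(reviewer_labels)
```

With five samples, four `agree` results give a rate of 0.8, which meets the 80% target stated elsewhere in this pack; the fifth item would be routed as a dispute.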

Test 5: Correction Routing

Given any miss, partial, unknown, or reviewer dispute, when the run closes, then each item must have a correction route or a documented ignore reason.

Pass evidence:

  • unrouted non-hit count = 0
  • each route has owner, due date, route type, and next action
  • route types map to D14 work categories
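
A sketch of the unrouted check, assuming routes are dictionaries shaped like the `correction_route` entity and that an `ignore_with_reason` route counts as the documented ignore reason. The function name and the `disputed_ids` parameter are hypothetical.

```python
from typing import Optional

REQUIRED_ROUTE_FIELDS = ("owner_user_id", "due_at", "route_type", "next_action")


def unrouted_item_ids(items: list[dict],
                      routes: list[dict],
                      disputed_ids: Optional[set] = None) -> list:
    """Return ids of non-hit or disputed items that lack a complete route."""
    disputed = disputed_ids or set()
    # A route only counts when every required field is filled in.
    routed = {
        r["calibration_run_item_id"]
        for r in routes
        if all(r.get(field) for field in REQUIRED_ROUTE_FIELDS)
    }
    return [
        i["id"]
        for i in items
        if (i["match_label"] != "hit" or i["id"] in disputed) and i["id"] not in routed
    ]
```

The run can close only when this list is empty, which corresponds to an unrouted non-hit count of 0.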

Test 6: Completion and Adoption Gate

Given the scorecard and routes, when the worker marks the deliverable ready, then PLS must have a durable artifact URL, owner/due/acceptance, E2E evidence, data/toolbox upgrade path, and decision record.

Pass evidence:

  • primary artifact URL is openable
  • decision-record.md is present
  • learning-memory.json is valid JSON
  • LINE summary is short and references run gate, not raw document text

E2E Verification Result for This Pack

This artifact pack passes structural verification when:

  • HTML control tower is present.
  • Production brief includes D1/D7/D14/D30, purpose-to-purpose E2E, value/money path, and human capability improvement.
  • Data model defines DB entities, API contract, sync rules, permissions, audit, and PLS backend integration.
  • Acceptance tests define measurable gates before completion.
  • Decision record records options, recommendation, adoption state, landing path, and feedback needed if rejected.

Artifact URL or PR

Durable primary artifact:

https://gist.github.com/esz135888/e68b16f5df3de0cb2a2813fb0afc33ae

Gist id:

e68b16f5df3de0cb2a2813fb0afc33ae

Verification plan:

  • An HTTP HEAD request on the gist URL follows redirects to a 200 response from GitHub.
  • Gist file list includes the HTML control tower, production brief, data model, acceptance tests, decision record, learning memory, sources, and this artifact record.
<!doctype html>
<html lang="zh-Hant">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>AI Prediction Verification D7 Calibration Run Control Tower</title>
<style>
:root {
--ink: #172026;
--muted: #5c6873;
--line: #d8dde3;
--bg: #f6f8fb;
--panel: #ffffff;
--blue: #2457d6;
--green: #16845b;
--amber: #a96600;
--red: #b42318;
}
* { box-sizing: border-box; }
body {
margin: 0;
background: var(--bg);
color: var(--ink);
font-family: Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
line-height: 1.5;
}
header {
background: #ffffff;
border-bottom: 1px solid var(--line);
padding: 24px clamp(18px, 4vw, 48px);
}
h1 { margin: 0; font-size: clamp(24px, 3vw, 38px); letter-spacing: 0; }
h2 { margin: 0 0 12px; font-size: 18px; }
h3 { margin: 0 0 8px; font-size: 15px; }
p { margin: 0 0 10px; }
main { padding: 22px clamp(18px, 4vw, 48px) 48px; }
.sub { color: var(--muted); max-width: 1100px; margin-top: 8px; }
.grid { display: grid; gap: 16px; }
.cols-4 { grid-template-columns: repeat(4, minmax(0, 1fr)); }
.cols-3 { grid-template-columns: repeat(3, minmax(0, 1fr)); }
.cols-2 { grid-template-columns: repeat(2, minmax(0, 1fr)); }
.panel {
background: var(--panel);
border: 1px solid var(--line);
border-radius: 8px;
padding: 16px;
}
.metric {
display: flex;
flex-direction: column;
min-height: 118px;
justify-content: space-between;
}
.label { color: var(--muted); font-size: 13px; }
.value { font-size: 30px; font-weight: 760; letter-spacing: 0; }
.ok { color: var(--green); }
.warn { color: var(--amber); }
.stop { color: var(--red); }
.tag {
display: inline-flex;
align-items: center;
height: 24px;
padding: 0 8px;
border-radius: 999px;
border: 1px solid var(--line);
color: var(--muted);
font-size: 12px;
margin-right: 6px;
background: #fbfcfe;
}
.stage {
border-left: 4px solid var(--blue);
padding-left: 12px;
}
table {
width: 100%;
border-collapse: collapse;
font-size: 13px;
}
th, td {
border-bottom: 1px solid var(--line);
padding: 10px 8px;
text-align: left;
vertical-align: top;
}
th { color: var(--muted); font-weight: 650; background: #fbfcfe; }
code {
background: #eef2f7;
padding: 2px 5px;
border-radius: 4px;
font-size: 12px;
}
.flow {
display: grid;
grid-template-columns: repeat(6, minmax(0, 1fr));
gap: 10px;
margin-top: 10px;
}
.flow div {
border: 1px solid var(--line);
background: #fbfcfe;
border-radius: 8px;
padding: 10px;
min-height: 84px;
}
.line { height: 1px; background: var(--line); margin: 18px 0; }
ul { padding-left: 18px; margin: 0; }
li { margin: 6px 0; }
@media (max-width: 980px) {
.cols-4, .cols-3, .cols-2, .flow { grid-template-columns: 1fr; }
}
</style>
</head>
<body>
<header>
<h1>AI Prediction Verification D7 Calibration Run Control Tower</h1>
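<p class="sub">Advances the previous round's Reviewer Decision Inbox into an executable 50-case calibration run, with an owner, due date, acceptance thresholds, data model, audit boundaries, and memory for the next worker, so the work does not stall as a pile of documents.</p>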
<p><span class="tag">Owner: Louis</span><span class="tag">Reviewers: zihrou / iron</span><span class="tag">Due: 2026-05-31</span><span class="tag">Gate: unknown &lt; 25%</span></p>
</header>
<main class="grid">
<section class="grid cols-4">
<div class="panel metric">
<span class="label">Batch Scope</span>
<span class="value">50</span>
<span class="label">accepted seed cases from the review queue</span>
</div>
<div class="panel metric">
<span class="label">Reviewer Sample</span>
<span class="value ok">>=10%</span>
<span class="label">minimum 5 items manually checked</span>
</div>
<div class="panel metric">
<span class="label">Unknown Ceiling</span>
<span class="value warn">&lt;25%</span>
<span class="label">otherwise source adapter work blocks dashboarding</span>
</div>
<div class="panel metric">
<span class="label">Completion Rule</span>
<span class="value stop">No Pass, No Product</span>
<span class="label">scorecard and routing required before rollout</span>
</div>
</section>
<section class="panel">
<h2>30-Day Development Path</h2>
<div class="grid cols-4">
<div class="stage"><h3>D1 Readiness</h3><p>Confirm Reviewer Decision Inbox status, lock 50 accepted cases, validate signal/action item source freshness, and assign run owner.</p></div>
<div class="stage"><h3>D7 Batch Run</h3><p>Execute calibration, label hit/miss/partial/unknown, sample reviewer checks, route every miss or source gap, and publish scorecard.</p></div>
<div class="stage"><h3>D14 Correction Loop</h3><p>Cluster repeated miss reasons, dispatch source adapter or prediction rubric fixes, then re-run the affected cohort.</p></div>
<div class="stage"><h3>D30 Operating Cadence</h3><p>Turn the run into a weekly management scorecard with threshold history, people sync, and productization gate.</p></div>
</div>
</section>
<section class="panel">
<h2>Purpose-to-Purpose E2E</h2>
<div class="flow">
<div><strong>Original Purpose</strong><br>Know whether AI review predictions actually became true.</div>
<div><strong>Inputs</strong><br>Reviewer decisions, signals, action items, previous review predictions.</div>
<div><strong>Run</strong><br>Calibration batch labels hit, miss, partial, or unknown with evidence links.</div>
<div><strong>Adoption</strong><br>Louis approves go/no-go; zihrou/iron review disputed samples.</div>
<div><strong>Improvement</strong><br>Route miss reasons to correction tasks and source gaps to adapters.</div>
<div><strong>Measured Result</strong><br>Unknown rate, hit rate, reviewer agreement, cycle-time saved, risk reduced.</div>
</div>
</section>
<section class="grid cols-2">
<div class="panel">
<h2>Value and Money Path</h2>
<ul>
<li>Revenue: only productize AI recommendations once evidence shows useful prediction accuracy.</li>
<li>Cost: reduce repeated manual review meetings by turning decisions into batchable calibration runs.</li>
<li>Risk: prevent false confidence by blocking dashboards when unknown evidence exceeds threshold.</li>
<li>Conversion: give project owners a reliable go/no-go signal for AI workflow adoption.</li>
<li>Capacity: release reviewer time by sampling disputes instead of checking every prediction by hand.</li>
</ul>
</div>
<div class="panel">
<h2>Human Capability Improvement</h2>
<ul>
<li>Louis can govern AI reviews with measurable gates, not opinion-only status updates.</li>
<li>zihrou can see which prediction patterns need rubric correction.</li>
<li>iron can identify missing signals and source adapter gaps before automation spreads.</li>
<li>Future workers inherit a clear run state and do not restart discovery from zero.</li>
</ul>
</div>
</section>
<section class="panel">
<h2>Run Control Checklist</h2>
<table>
<thead><tr><th>Gate</th><th>Required Evidence</th><th>Owner</th><th>Pass Rule</th></tr></thead>
<tbody>
<tr><td>Seed lock</td><td><code>seed_queue.status=accepted</code> for 50 items</td><td>Louis</td><td>50 eligible cases, no duplicates</td></tr>
<tr><td>Source sync</td><td>Signals and action items synced after latest review date</td><td>iron</td><td>No stale source over 7 days</td></tr>
<tr><td>Batch labels</td><td><code>calibration_run_item.match_label</code> populated</td><td>PLS worker</td><td>hit/miss/partial/unknown for all cases</td></tr>
<tr><td>Reviewer sample</td><td>At least 5 sampled decisions with reviewer agreement</td><td>zihrou</td><td>Agreement >=80% or disputed cases routed</td></tr>
<tr><td>Unknown control</td><td>Unknown count and reason taxonomy</td><td>Louis</td><td>Unknown &lt;25%; otherwise source gap blocks release</td></tr>
<tr><td>Correction routing</td><td>Every miss/source gap has owner, due, and next action</td><td>Louis</td><td>100% routed before complete</td></tr>
</tbody>
</table>
</section>
<section class="grid cols-3">
<div class="panel">
<h2>Data and API Contract</h2>
<p>Primary entities: <code>calibration_run</code>, <code>calibration_run_item</code>, <code>match_label</code>, <code>reviewer_sample_result</code>, <code>correction_route</code>, <code>run_scorecard</code>.</p>
<p>Worker API: <code>POST /ai-prediction/calibration-runs</code>, <code>POST /items/:id/label</code>, <code>POST /routes</code>, <code>GET /scorecard</code>.</p>
</div>
<div class="panel">
<h2>Permissions and Audit</h2>
<p>Only project owner can start or close a run. Reviewers can update sample outcomes. Worker writes must include evidence source ids, timestamp, model version, and decision-record reference.</p>
</div>
<div class="panel">
<h2>Adoption Upgrade</h2>
<p>Once D7 passes, expose weekly scorecard in PLS project backend. If it fails, dispatch D14 source adapter and rubric correction tasks before dashboard productization.</p>
</div>
</section>
<section class="panel">
<h2>People Sync Draft</h2>
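<p>AI prediction verification has advanced from the Reviewer Inbox to the D7 calibration run. Louis is responsible for starting the 50-case batch by 2026-05-31; zihrou/iron sample at least 5 items. Acceptance requires unknown &lt;25%, reviewer agreement >=80%, and an owner/due/next action for every miss/source gap. If the run does not pass, the dashboard must not be productized.</p>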
</section>
</main>
</body>
</html>

Data Model and Application Contract

Entities

calibration_run

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Primary run id. |
| project_id | uuid | yes | PLS project id. |
| seed_queue_id | uuid | yes | Accepted 50-case seed queue. |
| status | enum | yes | draft, ready, running, sample_review, passed, failed, blocked. |
| owner_user_id | uuid | yes | Louis for this run. |
| due_at | datetime | yes | 2026-05-31 for D7. |
| model_version | text | yes | AI prediction model or worker version used. |
| source_snapshot_at | datetime | yes | Evidence sync boundary. |
| decision_record_ref | text | yes | Link to decision record. |
| created_at / updated_at | datetime | yes | Audit timestamps. |

calibration_run_item

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Primary item id. |
| calibration_run_id | uuid | yes | Parent run. |
| prediction_id | uuid | yes | Prior review prediction. |
| reviewer_decision_id | uuid | yes | Accepted reviewer inbox decision. |
| evidence_refs | jsonb | yes | Signals, action items, review notes. |
| match_label | enum | yes | hit, miss, partial, unknown. |
| match_confidence | decimal | yes | 0 to 1. |
| miss_reason | enum | no | bad_prediction, late_signal, missing_source, ambiguous_owner, changed_scope, other. |
| label_rationale | text | yes | Short evidence-based reason. |

reviewer_sample_result

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Primary sample id. |
| calibration_run_item_id | uuid | yes | Sampled item. |
| reviewer_user_id | uuid | yes | zihrou or iron. |
| reviewer_label | enum | yes | agree, disagree, needs_more_evidence. |
| reviewer_note | text | no | Dispute or confirmation note. |
| reviewed_at | datetime | yes | Audit timestamp. |

correction_route

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Primary route id. |
| calibration_run_item_id | uuid | yes | Miss or unknown item. |
| route_type | enum | yes | rubric_fix, source_adapter_gap, owner_followup, model_prompt_fix, ignore_with_reason. |
| owner_user_id | uuid | yes | Assigned action owner. |
| due_at | datetime | yes | Due date for correction. |
| status | enum | yes | open, in_progress, verified, closed. |
| next_action | text | yes | Concrete follow-up. |

run_scorecard

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| calibration_run_id | uuid | yes | Parent run. |
| total_items | integer | yes | Must be 50 for D7 run. |
| hit_count / miss_count / partial_count / unknown_count | integer | yes | Label counts. |
| unknown_rate | decimal | yes | Must be <0.25 to pass. |
| reviewer_sample_count | integer | yes | Must be >=5. |
| reviewer_agreement_rate | decimal | yes | Target >=0.80. |
| pass_status | enum | yes | passed, failed, blocked. |

API / Worker Contract

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /ai-prediction/calibration-runs | POST | Create D7 run from accepted seed queue. |
| /ai-prediction/calibration-runs/:id/items/:item_id/label | POST | Write hit/miss/partial/unknown with evidence refs. |
| /ai-prediction/calibration-runs/:id/samples | POST | Assign reviewer sample set. |
| /ai-prediction/calibration-runs/:id/routes | POST | Route miss and unknown correction work. |
| /ai-prediction/calibration-runs/:id/scorecard | GET | Return scorecard for PLS backend and LINE summary. |
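
As an illustration of the label endpoint, the sketch below builds a request description without sending it. The path comes from the contract above. The body field names are borrowed from `calibration_run_item`, and using them as the request shape is an assumption, since the pack does not define request bodies. The function name is hypothetical.

```python
ALLOWED_LABELS = {"hit", "miss", "partial", "unknown"}


def build_label_request(run_id: str, item_id: str, match_label: str,
                        label_rationale: str, evidence_refs: list[dict]) -> dict:
    """Describe a label write as method, path, and JSON body (nothing is sent)."""
    if match_label not in ALLOWED_LABELS:
        raise ValueError(f"unsupported match_label: {match_label!r}")
    return {
        "method": "POST",
        "path": f"/ai-prediction/calibration-runs/{run_id}/items/{item_id}/label",
        # Assumed body shape: reuses calibration_run_item field names.
        "json": {
            "match_label": match_label,
            "label_rationale": label_rationale,
            "evidence_refs": evidence_refs,
        },
    }
```

The returned dictionary can be handed to whichever HTTP client the worker already uses.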

Sync Rules

  • Source sync must include signals and action items after the latest prediction review timestamp.
  • Every evidence reference must include source type, source id, source timestamp, and extraction timestamp.
  • Re-runs must create a new calibration_run record linked to the previous run instead of overwriting labels.
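
The evidence-reference and re-run rules could look like this in a worker. The key names `source_type`, `source_id`, `source_timestamp`, `extraction_timestamp`, and `previous_run_id` are assumptions; the pack names the required information but not the keys. Both function names are hypothetical.

```python
REQUIRED_REF_FIELDS = ("source_type", "source_id", "source_timestamp", "extraction_timestamp")


def missing_ref_fields(ref: dict) -> list[str]:
    """List the required evidence-reference fields that are absent or empty."""
    return [field for field in REQUIRED_REF_FIELDS if not ref.get(field)]


def create_rerun(previous_run: dict, new_run_id: str) -> dict:
    """Build a new run record linked to the previous one; the old record is not modified."""
    return {
        "id": new_run_id,
        "project_id": previous_run["project_id"],
        "seed_queue_id": previous_run["seed_queue_id"],
        # "previous_run_id" is an assumed link field; labels are never copied over.
        "previous_run_id": previous_run["id"],
        "status": "draft",
    }
```

Keeping the earlier run untouched preserves its labels as audit history for comparison with the re-run.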

Permissions and Audit

  • Project owner can create, start, fail, and close calibration runs.
  • Reviewers can update only reviewer_sample_result.
  • PLS workers can write labels and correction routes only while run status is running or sample_review.
  • All writes require worker id, timestamp, source snapshot id, and decision record reference.
  • No dashboard productization is allowed when unknown_rate >= 0.25 or reviewer_sample_count < 5.
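
A minimal sketch of these rules as a deny-by-default guard. The role and target names (`project_owner`, `reviewer`, `worker`, `calibration_run`, and so on) are illustrative assumptions, and denying everything not listed is a design choice of this sketch rather than something the pack specifies.

```python
WRITABLE_STATUSES = {"running", "sample_review"}


def can_write(role: str, target: str, run_status: str) -> bool:
    """Return True only for the writes the rules above explicitly allow."""
    if role == "project_owner":
        # Owners manage the run lifecycle: create, start, fail, close.
        return target == "calibration_run"
    if role == "reviewer":
        return target == "reviewer_sample_result"
    if role == "worker":
        return (target in {"match_label", "correction_route"}
                and run_status in WRITABLE_STATUSES)
    return False


def productization_allowed(unknown_rate: float, reviewer_sample_count: int) -> bool:
    """Dashboard productization is refused at unknown_rate >= 0.25 or fewer than 5 samples."""
    return unknown_rate < 0.25 and reviewer_sample_count >= 5
```

The audit fields required on every write (worker id, timestamp, source snapshot id, decision record reference) would be validated separately from this guard.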

PLS Backend Integration

The PLS backend should show this as a run-level tab under the AI-native project. The first backend view should expose run status, unknown rate, reviewer sample coverage, open correction routes, and the decision record. The worker flow should check calibration_run.status before creating the next job.

Decision Record: D7 Calibration Run Control Tower

Date: 2026-05-24
Status: Recommended for adoption
Owner: Louis
Reviewers: zihrou / iron

Problem

The AI prediction verification chain has enough setup artifacts to start a real calibration run. Without a D7 run control tower, the project risks accumulating documents while never measuring whether previous AI review predictions were true.

Options Considered

Option A: Build dashboard first

Pros: visually satisfying and easier to present.
Cons: dangerous because source gaps and unknown rates are not yet measured. A dashboard could create false confidence.

Option B: Run a small manual sample only

Pros: fast and low effort.
Cons: too small to govern production decisions and does not create worker-readable state for future runs.

Option C: Create D7 50-case calibration run control tower

Pros: turns accepted seeds and reviewer decisions into measurable operating evidence; creates a pass/fail gate, data model, correction routing, and weekly scorecard path.
Cons: requires stricter source sync and reviewer sampling discipline.

Recommendation

Adopt Option C. It is the only option that converts the prior reviewer inbox into an execution layer with measurable pass/fail criteria and a clean D14/D30 path.

Adoption State

Recommended. This pack should be used to start the D7 calibration run once the reviewer inbox is confirmed ready.

Landing Path

  1. Louis confirms 50 accepted seed cases and source freshness.
  2. PLS worker creates calibration_run and labels all 50 items.
  3. zihrou and iron review at least 5 sampled items.
  4. Worker computes scorecard and routes every miss/source gap.
  5. If unknown <25% and reviewer sampling passes, move to D14 correction loop and weekly scorecard design.

Feedback Needed If Not Adopted

If rejected, the next reviewer must specify which gate is not ready:

  • seed queue not accepted,
  • reviewer inbox missing decisions,
  • signals/action item source sync unavailable,
  • unknown threshold unrealistic,
  • no reviewer capacity for sample review,
  • PLS backend cannot store run-level state.

Without that feedback, the project should not proceed to dashboard or productization.

{
"job_id": "7a70ab5d-bfd5-486d-9e96-17fe81064ead",
"project_topic": "AI prediction verification module for signals and action-item evidence",
"current_artifact": "D7 Calibration Run Control Tower",
"owner": "Louis",
"reviewers": ["zihrou", "iron"],
"due": "2026-05-31",
"next_worker_rule": {
"if_no_calibration_run_exists": "Create D7 calibration_run from 50 accepted reviewer-inbox seed cases.",
"if_run_status_ready_or_running": "Execute labels for all 50 cases and calculate unknown rate.",
"if_unknown_rate_gte_25_percent": "Do not build dashboard. Dispatch source_adapter_gap correction tasks.",
"if_reviewer_sample_lt_5": "Request reviewer sampling from zihrou or iron before completion.",
"if_run_passed": "Move to D14 correction loop and weekly scorecard backend/dashboard design."
},
"acceptance_gate": {
"total_items": 50,
"unknown_rate_max": 0.25,
"reviewer_sample_min": 5,
"reviewer_agreement_target": 0.8,
"unrouted_non_hit_count": 0
},
"do_not_repeat": [
"Do not create another generic AI prediction verification concept pack.",
"Do not complete with only text summary.",
"Do not productize dashboard before D7 run has pass evidence."
],
"artifact_files": [
"d7-calibration-run-control-tower.html",
"production-brief.md",
"data-model.md",
"acceptance-tests.md",
"decision-record.md",
"sources.md",
"artifact-url-or-pr.md"
]
}
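
A worker could select the applicable `next_worker_rule` key from the current run state as sketched below. The order in which the conditions are checked is an assumption, since the JSON lists the rules without priority, and the run dictionary keys follow the `calibration_run` and `run_scorecard` field names. Returning `None` when no rule applies is also an assumption.

```python
from typing import Optional


def select_next_worker_rule(run: Optional[dict]) -> Optional[str]:
    """Map the current run state to a key of next_worker_rule, or None if none applies."""
    if run is None:
        return "if_no_calibration_run_exists"
    if run["status"] in ("ready", "running"):
        return "if_run_status_ready_or_running"
    # Assumed priority: the unknown-rate gate is checked before reviewer sampling.
    if run.get("unknown_rate", 0.0) >= 0.25:
        return "if_unknown_rate_gte_25_percent"
    if run.get("reviewer_sample_count", 0) < 5:
        return "if_reviewer_sample_lt_5"
    if run["status"] == "passed":
        return "if_run_passed"
    return None
```

The returned key can be used to look up the instruction text in the JSON above.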

AI Prediction Verification D7 Calibration Run Control Tower

Scene

The project has already produced a calibration gate, evidence trial runner, 50-case seed queue, and reviewer decision inbox. The next useful production step is not another concept pack. It is the D7 execution control tower that turns accepted reviewer decisions into a measured calibration run.

Owner: Louis
Reviewers: zihrou / iron
Due: 2026-05-31
Primary artifact: d7-calibration-run-control-tower.html

30-Day Path

| Day | Outcome | Acceptance |
| --- | --- | --- |
| D1 | Confirm reviewer inbox and source readiness. Lock 50 accepted seed cases. | 50 eligible cases, no duplicates, source freshness checked. |
| D7 | Execute calibration batch and publish hit/miss/partial/unknown scorecard. | Unknown <25%, reviewer sample >=10%, all misses/gaps routed. |
| D14 | Correct repeated miss patterns and source adapter gaps. | Correction tasks have owner, due, evidence, and re-run cohort. |
| D30 | Weekly scorecard becomes the operating gate for AI review productization. | Threshold history, adoption owner, and dashboard sync are live. |

Purpose-to-Purpose E2E

Original purpose: know whether AI review predictions actually became true.

E2E chain:

  1. Previous AI review predictions and reviewer decisions are loaded.
  2. Signals and action items are synced as evidence sources.
  3. A 50-case accepted seed queue is locked into calibration_run.
  4. Worker labels each item hit, miss, partial, or unknown.
  5. Reviewers sample at least 5 items for agreement and dispute handling.
  6. Every miss or unknown is routed to a correction task or source adapter gap.
  7. Louis receives a go/no-go scorecard for productization.

Value and Money Path

This run prevents the organization from spending time or money on unverified AI recommendations. It creates a release gate that protects against false confidence, reduces manual review cycles, prioritizes automation fixes by observed miss taxonomy, and makes AI management decisions measurable enough to support adoption.

Human Capability Improvement

The artifact improves people, not just documents:

  • Louis gets a repeatable governance gate for AI prediction quality.
  • zihrou can review disputed samples and identify rubric weaknesses.
  • iron can see evidence-source gaps before automation expands.
  • Future workers inherit run state and can continue with D14 correction instead of rediscovering context.

Solution Stack

| Layer | Production Choice |
| --- | --- |
| Context framework | Prediction verification uses reviewer decisions, signals, action items, and evidence timestamps. |
| Workflow | D1 readiness -> D7 batch labels -> reviewer sample -> correction routing -> scorecard. |
| Data / DB model | calibration_run, calibration_run_item, match_label, reviewer_sample_result, correction_route, run_scorecard. |
| Operable tool | HTML control tower plus structured schema and acceptance tests. |
| Acceptance indicators | 50 cases, unknown <25%, reviewer agreement >=80%, 100% miss/gap routing. |
| Adoption and upgrade | Pass moves to weekly PLS backend scorecard; fail dispatches source adapter or rubric correction tasks. |

Owner, Due, Acceptance

Owner: Louis
Due: 2026-05-31
Accept when:

  • A 50-case calibration run exists.
  • Unknown rate is below 25%.
  • At least 5 reviewer samples are checked.
  • Reviewer agreement is at least 80% or disputed cases are routed.
  • Every miss and source gap has owner, due, and next action.
  • Decision record and learning memory are present.

Market and Technical Context

The D7 control tower follows current AI observability and evaluation practice: capture structured evidence, preserve traceable source references, measure outcomes with repeatable scorecards, and block productization when evidence quality is insufficient.

Sources Used

Implication for This Project

The project should treat prediction verification as an evaluation and observability workflow, not as a static report. That means every hit/miss/partial/unknown label must preserve evidence references, source timestamps, reviewer samples, run status, and correction routes.

Production Readiness Interpretation

The artifact is production-ready for the next PLS worker when it can answer:

  • Which run is active?
  • Which evidence snapshot was used?
  • Which predictions hit, missed, partially matched, or remain unknown?
  • Which reviewers sampled the result?
  • Which misses and source gaps have owner, due, and next action?
  • Whether the project can proceed to D14 correction or must fix sources first.