@esz135888
Created May 23, 2026 19:25
PLS job a2b47d9e AI prediction verification calibration gate

Acceptance Tests And E2E Verification

Artifact kind: e2e_verification.

Gate Tests

| Test | Method | Pass Criteria |
| --- | --- | --- |
| Label policy accepted | Review decision record and owner sync. | Louis has accepted hit/miss/unknown policy and miss taxonomy. |
| Seed set complete | Query prediction_claim seed list. | At least 10 seed predictions have owner, due, confidence, expected evidence, impact metric. |
| Batch trial complete | Run matcher on 50 claims. | 50 labels created with evidence provenance or explicit unknown reason. |
| Unknown rate threshold | Compute unknown / total. | Unknown rate below 25% by D7; below 15% by D30. |
| Reviewer sample | Sample match records. | At least 10% reviewed by Louis, zihrou, iron, or delegate. |
| Correction routing | Inspect misses. | Every repeated miss reason has a correction task with owner, due, and acceptance. |
| Duplicate dispatch prevention | Simulate repeat job request. | Dispatcher routes to next missing gate instead of creating another generic build. |
| Audit evidence | Inspect evidence events. | Every label references source refs or an explicit source gap. |
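
The unknown-rate and reviewer-sample criteria in the gate tests can be checked mechanically. The following is a minimal sketch, assuming labels arrive as plain strings; the function names and constants are illustrative and not part of PLS.

```python
# Thresholds taken from the gate tests; the names are illustrative.
D7_UNKNOWN_LIMIT = 0.25   # unknown rate must be below 25% by D7
D30_UNKNOWN_LIMIT = 0.15  # unknown rate must be below 15% by D30


def unknown_rate(labels):
    """Return unknown / total for a list of 'hit' | 'miss' | 'unknown' labels."""
    if not labels:
        raise ValueError("no labels to evaluate")
    return sum(1 for label in labels if label == "unknown") / len(labels)


def unknown_gate_passes(labels, horizon):
    """True when the unknown rate is strictly below the limit for 'D7' or 'D30'."""
    limit = {"D7": D7_UNKNOWN_LIMIT, "D30": D30_UNKNOWN_LIMIT}[horizon]
    return unknown_rate(labels) < limit


def min_reviewed(total_matches):
    """Smallest reviewed count that meets the 10% reviewer sample (integer ceiling)."""
    return (total_matches + 9) // 10


batch = ["hit"] * 30 + ["miss"] * 10 + ["unknown"] * 10
print(unknown_gate_passes(batch, "D7"), min_reviewed(len(batch)))
```

For a 50-case batch this means at least 5 reviewed match records. The comparison is strict because the pass criteria say "below", so a rate of exactly 25% fails the D7 gate in this sketch.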

E2E Verification Scenario

  1. Create 10 historical predictions from prior AI review notes.
  2. Sync evidence from action items, signals, commits, deployments, worker completions, and human review notes.
  3. Run matcher and produce labels.
  4. Human reviewer samples one hit, one miss, and one unknown.
  5. Create correction tasks for repeated miss reasons.
  6. Produce calibration summary with next gate.
  7. Attempt duplicate production dispatch and confirm it is blocked or redirected.

Expected result: PLS can answer whether the AI review loop is becoming more predictive and what operational correction is needed next.

Verification Completed In This Pack

  • Created a primary HTML gate artifact with D1/D7/D14/D30, E2E, value path, people sync, and solution stack.
  • Created production brief, data model, acceptance tests, decision record, source notes, and learning memory.
  • Local JSON validation passed for learning-memory.json.
  • Durable Gist URL must be verified before PLS completion.

Owner / Due / Acceptance

  • Owner: Louis.
  • Reviewers: zihrou and iron.
  • D1 due: 2026-05-27.
  • D7 due: 2026-05-31.
  • Acceptance: label policy accepted, 10 seeds selected, 50-case batch labeled, unknown below 25%, reviewer sample complete, correction task opened, decision record present.
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>AI Prediction Verification Calibration Gate</title>
<style>
:root {
--ink: #171717;
--muted: #66615b;
--line: #d8d1c7;
--paper: #faf8f3;
--panel: #ffffff;
--amber: #d28a16;
--blue: #1e5d8f;
--green: #19705f;
--red: #b14d42;
--shadow: 0 16px 40px rgba(38, 33, 26, 0.10);
}
* { box-sizing: border-box; }
body {
margin: 0;
background: var(--paper);
color: var(--ink);
font-family: ui-serif, Georgia, "Times New Roman", serif;
line-height: 1.45;
}
header {
padding: 42px 6vw 30px;
border-bottom: 1px solid var(--line);
background: linear-gradient(90deg, #fffdf8 0%, #f5efe4 100%);
}
.eyebrow {
color: var(--blue);
font: 700 12px/1.2 ui-monospace, SFMono-Regular, Menlo, monospace;
letter-spacing: .08em;
text-transform: uppercase;
}
h1 {
max-width: 980px;
margin: 12px 0 12px;
font-size: clamp(36px, 6vw, 76px);
line-height: .95;
letter-spacing: 0;
}
.lede {
max-width: 900px;
color: var(--muted);
font-size: 20px;
}
main {
padding: 28px 6vw 56px;
display: grid;
gap: 22px;
}
section {
background: var(--panel);
border: 1px solid var(--line);
border-radius: 8px;
box-shadow: var(--shadow);
padding: 24px;
}
h2 {
margin: 0 0 16px;
font-size: 24px;
}
.grid {
display: grid;
grid-template-columns: repeat(4, minmax(0, 1fr));
gap: 14px;
}
.two {
display: grid;
grid-template-columns: minmax(0, 1fr) minmax(0, 1fr);
gap: 16px;
}
.card {
border: 1px solid var(--line);
border-radius: 8px;
padding: 16px;
background: #fffdf8;
}
.tag {
display: inline-block;
border: 1px solid var(--line);
border-radius: 999px;
padding: 3px 9px;
margin-bottom: 10px;
color: var(--muted);
font: 700 11px/1.2 ui-monospace, SFMono-Regular, Menlo, monospace;
}
ul, ol { margin: 0; padding-left: 20px; }
li { margin: 7px 0; }
table {
width: 100%;
border-collapse: collapse;
font-size: 15px;
}
th, td {
border-bottom: 1px solid var(--line);
padding: 10px 8px;
text-align: left;
vertical-align: top;
}
th {
color: var(--blue);
font: 700 12px/1.2 ui-monospace, SFMono-Regular, Menlo, monospace;
text-transform: uppercase;
letter-spacing: .04em;
}
.status-pass { color: var(--green); font-weight: 700; }
.status-watch { color: var(--amber); font-weight: 700; }
.status-stop { color: var(--red); font-weight: 700; }
code {
background: #f2ece1;
border: 1px solid var(--line);
border-radius: 5px;
padding: 1px 5px;
font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
font-size: .92em;
}
@media (max-width: 920px) {
.grid, .two { grid-template-columns: 1fr; }
header, main { padding-left: 18px; padding-right: 18px; }
}
</style>
</head>
<body>
<header>
<div class="eyebrow">PLS purpose_e2e_toolbox_v2 / primary_artifact</div>
<h1>AI Prediction Verification Calibration Gate</h1>
<p class="lede">This pack prevents another vague rebuild of the prediction verification module. It defines the gate that decides whether the team should accept the label policy, seed evidence, run a batch trial, or reopen data-source work before any new engineering cycle is dispatched.</p>
</header>
<main>
<section>
<h2>Thirty Day Path</h2>
<div class="grid">
<div class="card"><span class="tag">D1</span><strong>Policy lock</strong><br>Louis accepts the hit/miss/unknown label policy and selects 10 prior AI review predictions as seed cases.</div>
<div class="card"><span class="tag">D7</span><strong>Batch trial</strong><br>50 predictions are auto-labeled from signals, action items, commits, worker logs, and review notes. Unknown rate must be below 25%.</div>
<div class="card"><span class="tag">D14</span><strong>Correction routing</strong><br>Miss reasons route to owners: direction gap, evidence gap, resource gap, authorization gap, or execution drift.</div>
<div class="card"><span class="tag">D30</span><strong>Review operating loop</strong><br>Calibration score becomes part of the weekly company AI review, with trend, owner, risk, and next action visible in PLS.</div>
</div>
</section>
<section>
<h2>Purpose To Purpose E2E</h2>
<div class="two">
<div class="card">
<span class="tag">Flow</span>
<ol>
<li>AI review produces a prediction with owner, due date, confidence, and expected evidence.</li>
<li>PLS collects signals, action items, commits, deployment logs, worker completions, and human review notes.</li>
<li>The matcher assigns <code>hit</code>, <code>miss</code>, or <code>unknown</code> plus evidence links.</li>
<li>Human reviewer samples labels and records override reasons.</li>
<li>PLS opens correction tasks for repeated miss reasons.</li>
<li>Next reviews use calibrated confidence, fewer vague predictions, and better owner routing.</li>
</ol>
</div>
<div class="card">
<span class="tag">Measurable end state</span>
<ul>
<li>Every reviewed prediction has evidence provenance or an explicit data gap.</li>
<li>Unknown labels fall below 25% by D7 and below 15% by D30.</li>
<li>Miss reasons create action items instead of passive commentary.</li>
<li>Weekly review decisions show whether AI predictions improved project, money, or risk indicators.</li>
</ul>
</div>
</div>
</section>
<section>
<h2>Preflight Decision Gate</h2>
<table>
<thead><tr><th>Condition</th><th>Decision</th><th>Owner</th><th>Acceptance</th></tr></thead>
<tbody>
<tr><td>Prior production pack exists but label policy is not accepted.</td><td class="status-stop">Stop rebuild</td><td>Louis</td><td>Accept label policy and seed set before engineering.</td></tr>
<tr><td>Policy accepted but fewer than 10 seed predictions exist.</td><td class="status-watch">Seed first</td><td>Louis + zihrou</td><td>10 predictions with expected evidence and review date.</td></tr>
<tr><td>Seed set exists but D7 batch has not run.</td><td class="status-watch">Run batch trial</td><td>iron</td><td>50 labels, unknown below 25%, reviewer sample completed.</td></tr>
<tr><td>Unknown rate above 25%.</td><td class="status-stop">Open data-source gap</td><td>iron</td><td>Missing source mapped to API/sync owner and due date.</td></tr>
<tr><td>Hit/miss labels stable and correction routes active.</td><td class="status-pass">Proceed to productization</td><td>Louis</td><td>D14 correction routing and D30 review dashboard accepted.</td></tr>
</tbody>
</table>
</section>
<section>
<h2>Value And Money Path</h2>
<div class="two">
<div class="card">
<span class="tag">Economic logic</span>
<ul>
<li>Revenue: better AI review predictions identify accounts and internal projects that are ready for monetizable delivery.</li>
<li>Cost: fewer repeated AI work dispatches when the real blocker is policy, seed data, or adoption.</li>
<li>Risk: false confidence is exposed before it shapes staffing, roadmap, or client commitments.</li>
<li>Conversion: prediction evidence gives teams a clearer reason to adopt AI operating routines.</li>
</ul>
</div>
<div class="card">
<span class="tag">Human capability lift</span>
<ul>
<li>Louis gets a calibration view instead of another text-only status report.</li>
<li>zihrou can separate direction, resource, and authorization misses.</li>
<li>iron can see exact evidence gaps for worker, repo, and signal ingestion.</li>
<li>Project owners learn to write predictions that are testable, not performative.</li>
</ul>
</div>
</div>
</section>
<section>
<h2>Solution Stack</h2>
<table>
<thead><tr><th>Layer</th><th>Production decision</th><th>Artifact in pack</th></tr></thead>
<tbody>
<tr><td>Context framework</td><td>Prediction is a claim with expected evidence, owner, due date, confidence, and review window.</td><td><code>production-brief.md</code></td></tr>
<tr><td>Workflow</td><td>Policy accept -> seed -> batch match -> reviewer sample -> correction route -> review dashboard.</td><td>This HTML gate</td></tr>
<tr><td>Data model</td><td>Prediction, evidence event, match result, reviewer override, correction task, calibration summary.</td><td><code>data-model.md</code></td></tr>
<tr><td>Tool/app</td><td>PLS dispatcher preflight that blocks duplicate build jobs until acceptance gates are met.</td><td>This HTML gate</td></tr>
<tr><td>Acceptance</td><td>Unknown threshold, reviewer sample rate, owner/due, and correction routing are testable.</td><td><code>acceptance-tests.md</code></td></tr>
<tr><td>Adoption upgrade</td><td>Weekly scorecard and learning memory tell the next worker what must happen next.</td><td><code>learning-memory.json</code></td></tr>
</tbody>
</table>
</section>
<section>
<h2>Owner, Due, Acceptance</h2>
<ul>
<li>Owner: Louis. Reviewers: zihrou and iron.</li>
<li>Due: 2026-05-27 for policy acceptance and 10 seed predictions; 2026-05-31 for first 50-case batch trial.</li>
<li>Acceptance: label policy accepted, seed set complete, D7 unknown rate below 25%, reviewer sample finished, decision record present, and next correction task opened.</li>
<li>People sync: send only short LINE summary; primary durable artifact is this pack and its Gist URL.</li>
</ul>
</section>
</main>
</body>
</html>

Artifact URL Or PR

Durable primary artifact:

https://gist.github.com/esz135888/2a5302a2239b41e4398fed68842a29c2

Verification:

  • Local JSON validation passed for learning-memory.json.
  • Required artifact kind anchors exist: market_context, production_readiness, e2e_verification, people_sync, learning_memory.
  • Gist URL must return HTTP 200 after redirect before PLS completion.

Data Model And Integration Contract

Artifact kind: production_readiness.

Tables

prediction_claim

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Stable prediction id. |
| project_id | uuid | yes | PLS project or AI-native project id. |
| review_id | uuid | yes | Source AI review. |
| claim_text | text | yes | Testable prediction statement. |
| owner_person_id | text | yes | Human accountable owner. |
| due_at | timestamptz | yes | Date when evidence should exist. |
| confidence | numeric | yes | 0 to 1 model confidence. |
| expected_evidence | jsonb | yes | Source types and matching hints. |
| impact_metric | text | yes | Revenue, cost, risk, conversion, labor, or delivery metric. |
| status | enum | yes | active, ready_for_match, matched, archived. |

evidence_event

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Evidence event id. |
| source_type | enum | yes | signal, action_item, github_commit, deployment, worker_completion, line_note, drive_doc, review_note. |
| source_ref | text | yes | URL or source id. |
| event_at | timestamptz | yes | When evidence happened. |
| actor_person_id | text | no | Human or worker actor. |
| project_id | uuid | no | Joined project if known. |
| payload | jsonb | yes | Raw normalized evidence. |
| audit_hash | text | yes | Tamper-evident hash of normalized payload. |
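
The audit_hash field is described only as a tamper-evident hash of the normalized payload. One way to produce such a value is sketched below, assuming SHA-256 over a canonical JSON rendering; the data model does not name the hash algorithm or the canonical form, so both are assumptions here.

```python
import hashlib
import json


def audit_hash(payload):
    """SHA-256 hex digest of a canonical JSON rendering of the payload.

    Sorting keys and fixing separators makes the digest independent of
    key order and whitespace, so equal payloads always hash the same.
    """
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(payload, stored_hash):
    """Recompute the hash and compare it with the stored audit_hash."""
    return audit_hash(payload) == stored_hash


# Hypothetical evidence payload, used only for illustration.
event = {"source_type": "github_commit", "source_ref": "abc123"}
stored = audit_hash(event)
print(is_untampered(event, stored))
```

Any later change to the stored payload changes the digest, which is what lets an audit pass detect it.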

prediction_match

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Match id. |
| prediction_claim_id | uuid | yes | Linked claim. |
| label | enum | yes | hit, miss, unknown. |
| label_confidence | numeric | yes | 0 to 1. |
| evidence_event_ids | uuid[] | no | Supporting events. |
| miss_reason | enum | no | direction_gap, evidence_gap, resource_gap, authorization_gap, execution_drift, timing_gap. |
| matcher_version | text | yes | Model/rule version. |
| created_at | timestamptz | yes | Match time. |

reviewer_override

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Override id. |
| prediction_match_id | uuid | yes | Match reviewed. |
| reviewer_person_id | text | yes | Louis, zihrou, iron, or delegate. |
| override_label | enum | no | hit, miss, unknown. |
| override_reason | text | yes | Why the machine label was accepted or changed. |
| created_at | timestamptz | yes | Review time. |

correction_task

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | PLS action item id. |
| prediction_match_id | uuid | yes | Source miss. |
| owner_person_id | text | yes | Responsible owner. |
| due_at | timestamptz | yes | Correction due date. |
| task_type | enum | yes | revise_prediction, add_source_adapter, clarify_direction, add_resource, authorize_decision, fix_execution. |
| acceptance | text | yes | Completion condition. |
| status | enum | yes | open, blocked, done, cancelled. |

calibration_summary

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| id | uuid | yes | Summary id. |
| window_start | date | yes | Reporting window. |
| window_end | date | yes | Reporting window. |
| total_predictions | integer | yes | Count. |
| hit_rate | numeric | yes | Hits / matched. |
| miss_rate | numeric | yes | Misses / matched. |
| unknown_rate | numeric | yes | Unknown / total. |
| top_miss_reason | text | no | Repeated issue. |
| next_gate | enum | yes | policy_acceptance, seed_predictions, batch_trial, data_gap, correction_routing, dashboard_rollout. |
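
The rate fields in calibration_summary can be derived directly from prediction_match rows. The sketch below assumes "matched" means hits plus misses, excluding unknown labels; the table does not define the term, so treat that reading as an assumption. The function name and dict shape are illustrative.

```python
from collections import Counter


def summarize(matches):
    """Compute calibration_summary rates from match rows.

    Each row is a dict with 'label' and an optional 'miss_reason'.
    Assumption: 'matched' = hits + misses (unknown labels excluded).
    """
    labels = Counter(row["label"] for row in matches)
    total = len(matches)
    matched = labels["hit"] + labels["miss"]
    reasons = Counter(
        row["miss_reason"]
        for row in matches
        if row["label"] == "miss" and row.get("miss_reason")
    )
    return {
        "total_predictions": total,
        "hit_rate": labels["hit"] / matched if matched else 0.0,
        "miss_rate": labels["miss"] / matched if matched else 0.0,
        "unknown_rate": labels["unknown"] / total if total else 0.0,
        "top_miss_reason": reasons.most_common(1)[0][0] if reasons else None,
    }


rows = (
    [{"label": "hit"}] * 6
    + [{"label": "miss", "miss_reason": "evidence_gap"}] * 2
    + [{"label": "unknown"}] * 2
)
print(summarize(rows))
```

Under this reading hit_rate and miss_rate always sum to 1 when any claim is matched, while unknown_rate is reported against all predictions.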

API Contract

POST /api/prediction-verification/claims/import
POST /api/prediction-verification/evidence/sync
POST /api/prediction-verification/matches/run
POST /api/prediction-verification/reviewer-overrides
POST /api/prediction-verification/correction-tasks
GET  /api/prediction-verification/calibration-summary?project_id=...
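
A client might assemble calls to these endpoints as sketched below. Only the paths and methods come from the contract above. The host, the JSON body shape, and the helper name are assumptions, and the requests are built but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://pls.example.invalid"  # hypothetical host, not a real PLS URL


def build_request(method, path, body=None, query=None):
    """Build (but do not send) a request for one of the contract endpoints."""
    url = BASE_URL + path
    if query:
        url += "?" + urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, headers=headers, method=method)


PROJECT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder uuid

# Assumed body shape: the contract does not specify request payloads.
run_matches = build_request(
    "POST", "/api/prediction-verification/matches/run", body={"project_id": PROJECT_ID}
)
summary = build_request(
    "GET", "/api/prediction-verification/calibration-summary", query={"project_id": PROJECT_ID}
)
print(run_matches.get_method(), summary.full_url)
```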

Sync And Audit Boundary

  • Sync frequency: daily for review notes and action items; near-real-time for worker completions and GitHub events where webhook coverage exists.
  • Permissions: owners can view project claims; reviewers can override labels; admins can edit source adapters; workers can append evidence but not mutate reviewer decisions.
  • Audit: every evidence event stores source_ref, normalized payload, timestamp, actor, and hash. Reviewer overrides are append-only.
  • PLS integration: dispatcher reads calibration_summary.next_gate before creating another solution_build job for the same module.
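
The permission and append-only rules above can be expressed as a small check. This is a sketch only: the role and action names are illustrative labels for the grants listed in the permissions bullet, and no additional grants are assumed.

```python
# Grants stated in the permissions bullet; nothing beyond them is assumed.
PERMISSIONS = {
    "owner": {"view_claims"},
    "reviewer": {"override_label"},
    "admin": {"edit_source_adapter"},
    "worker": {"append_evidence"},
}


def can(role, action):
    """True when the role holds the named permission."""
    return action in PERMISSIONS.get(role, set())


class OverrideLog:
    """Append-only store for reviewer overrides: no update or delete methods."""

    def __init__(self):
        self._entries = []

    def append(self, role, entry):
        if not can(role, "override_label"):
            raise PermissionError(f"{role} may not record reviewer overrides")
        self._entries.append(dict(entry))  # copy so callers cannot mutate later

    def entries(self):
        """Return copies so stored decisions stay unchanged."""
        return tuple(dict(entry) for entry in self._entries)


log = OverrideLog()
log.append("reviewer", {"override_label": "miss", "override_reason": "stale evidence"})
print(can("worker", "append_evidence"), can("worker", "override_label"))
```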

Duplicate Dispatch Rule

If the same project asks for prediction verification again:

  1. Check if a prior production pack exists.
  2. Check if label policy is accepted.
  3. Check if seed predictions exist.
  4. Check if D7 batch trial exists.
  5. Dispatch the next missing gate only.

This avoids rebuilding the same module while the adoption or data gate is still unresolved.
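
The preflight order above can be sketched as a single routing function. The GateState fields are hypothetical names for the checks in the list, the gate names come from the next_gate enum, and returning solution_build when no prior pack exists is an assumption, since the rule does not say what to dispatch in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateState:
    """Hypothetical snapshot of what the dispatcher knows about a project."""
    prior_pack_exists: bool
    label_policy_accepted: bool
    seed_count: int
    batch_trial_done: bool
    unknown_rate: Optional[float] = None
    correction_routing_active: bool = False


def next_gate(state):
    """Return the next missing gate instead of a generic rebuild."""
    if not state.prior_pack_exists:
        return "solution_build"  # assumption: a first build is allowed
    if not state.label_policy_accepted:
        return "policy_acceptance"
    if state.seed_count < 10:
        return "seed_predictions"
    if not state.batch_trial_done or state.unknown_rate is None:
        return "batch_trial"
    if state.unknown_rate >= 0.25:  # the gate passes only below 25%
        return "data_gap"
    if not state.correction_routing_active:
        return "correction_routing"
    return "dashboard_rollout"


print(next_gate(GateState(True, True, 4, False)))
```

Checking in this fixed order means a repeat request can only ever produce one outcome per state, which is what keeps the dispatcher from opening a second generic build.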

Decision Record

Decision

Adopt an AI Prediction Verification Calibration Gate as the next production artifact instead of rebuilding the same prediction verification module.

Context

The current job asks for a module that verifies whether prior AI review predictions were correct by matching them against signals, action items, and other evidence. A prior production pack already covered the broad verification cockpit. The repeated request indicates the next bottleneck is likely not another draft but a gate that forces policy acceptance, seed selection, batch verification, correction routing, and dispatcher preflight.

Options Considered

Option A: Build another full verification module pack

  • Pros: directly matches the literal request.
  • Cons: duplicates prior work and does not resolve adoption, seed data, or dispatcher repeat risk.

Option B: Build only a dashboard mock

  • Pros: easy to view.
  • Cons: weak on data model, API, audit, ownership, and measurable acceptance.

Option C: Build calibration gate and preflight contract

  • Pros: advances the project from artifact production to adoption and measurable use; blocks duplicate dispatch; creates owner/due/acceptance; supports PLS backend and worker routing.
  • Cons: requires Louis, zihrou, and iron to accept policy and seed cases before productization.

Recommendation

Choose Option C.

Adoption Status

Recommended for D1 acceptance. The pack is ready for Louis to accept the label policy and seed 10 prior predictions. Engineering productization should wait until the D1 gate is complete.

Landing Path

  1. Louis accepts the label taxonomy and seed criteria.
  2. Louis and zihrou select 10 prior predictions.
  3. iron confirms evidence sources and missing adapters.
  4. PLS runs the 50-case D7 batch trial.
  5. Repeated misses create correction tasks.
  6. Weekly company AI review includes calibration score.

Feedback Needed If Not Adopted

If the team rejects this direction, the required feedback is:

  • Which prior production pack is considered insufficient?
  • Which label policy or miss taxonomy is wrong?
  • Which source of truth should override PLS evidence?
  • Is the desired next step a UI, backend API, data ingestion adapter, or management operating rule?

Without this feedback, the next worker should not create another generic verification artifact.

{
"job_id": "a2b47d9e-e9b0-4895-9316-70205f378c54",
"project": "AI native project: company AI maximization",
"topic": "AI prediction verification",
"memory_type": "learning_memory",
"next_worker_instruction": "Before building another AI prediction verification module, check the calibration gate. If label policy is not accepted, route to policy_acceptance. If fewer than 10 seed predictions exist, route to seed_predictions. If no 50-case trial exists, route to batch_trial. If unknown rate is above 25%, route to data_gap. Only productize dashboard/backend after correction routing is active.",
"owners": {
"primary": "Louis",
"reviewers": ["zihrou", "iron"]
},
"acceptance": [
"Label policy accepted",
"10 seed predictions selected",
"50-case batch trial completed",
"Unknown rate below 25% by D7",
"Reviewer sample completed",
"Repeated miss reasons create correction tasks",
"Decision record remains attached"
],
"artifact_files": [
"ai-prediction-calibration-gate.html",
"production-brief.md",
"data-model.md",
"acceptance-tests.md",
"decision-record.md",
"sources.md",
"artifact-url-or-pr.md"
]
}
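
The local JSON validation mentioned for learning-memory.json could look like the sketch below. It assumes the check only needs to confirm that the file parses and that the top-level keys shown above are present with the expected types; the pack does not describe the actual validation, so the function and key list are illustrative.

```python
import json

# Top-level keys and types as they appear in the learning memory above.
REQUIRED_KEYS = {
    "job_id": str,
    "project": str,
    "topic": str,
    "memory_type": str,
    "next_worker_instruction": str,
    "owners": dict,
    "acceptance": list,
    "artifact_files": list,
}


def validate_learning_memory(text):
    """Return a list of problems; an empty list means the document passed."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top-level value must be an object"]
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in doc:
            problems.append(f"missing key: {key}")
        elif not isinstance(doc[key], expected):
            problems.append(f"wrong type for {key}")
    return problems


print(validate_learning_memory('{"job_id": "a2b47d9e"}'))
```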

AI Prediction Verification Calibration Gate - Production Brief

Scene

The project signal asks to add an AI prediction verification module that checks whether the last review's predictions were correct by using signals, action items, commits, worker logs, and other evidence. A prior production pack already defined the general verification cockpit. This round should not produce another loose module draft. It should add a calibration and dispatcher preflight gate so PLS knows when to build, when to seed evidence, and when to route a human decision.

D1 / D7 / D14 / D30 Path

| Horizon | Outcome | Owner | Acceptance |
| --- | --- | --- | --- |
| D1 | Label policy accepted and 10 prior predictions selected as seed cases. | Louis | Each seed has owner, due date, confidence, expected evidence, and review window. |
| D7 | 50 predictions are batch-labeled from available PLS evidence. | iron | Unknown rate below 25%; 10% reviewer sample completed. |
| D14 | Miss reasons create correction tasks. | zihrou + iron | Misses are categorized as direction gap, evidence gap, resource gap, authorization gap, or execution drift. |
| D30 | Calibration becomes part of weekly company AI review. | Louis | Dashboard shows hit rate, unknown rate, repeated miss reason, owner, due date, and money/risk impact. |

Purpose To Purpose E2E

Raw purpose: make company AI reviews more truthful by checking whether previous AI predictions actually happened.

Production chain:

  1. AI review emits prediction claims with owner, due date, confidence, expected evidence, and impact metric.
  2. PLS ingests evidence from signals, action items, GitHub commits, deployments, worker completions, LINE/Drive references, and human review notes.
  3. Matcher labels each claim as hit, miss, or unknown, with evidence links and confidence.
  4. Human reviewer samples labels and records override reasons.
  5. Repeated misses open correction tasks instead of becoming passive commentary.
  6. Weekly review uses calibration to improve decisions, staffing, project priority, and risk control.

Testable end purpose: the team can prove whether AI predictions improved project delivery, revenue readiness, risk reduction, or human decision quality.

Value And Money Path

  • Revenue: calibrated predictions identify which AI projects, accounts, or operating changes are ready to convert into sellable delivery.
  • Cost: duplicate worker dispatches drop because PLS checks whether the blocker is policy, seed data, or adoption before building again.
  • Risk: false confidence is caught before it shapes staffing, roadmap, or customer commitments.
  • Conversion: evidence-backed predictions help humans trust and adopt AI operating rhythms.
  • Labor leverage: reviewers spend time on exceptions and correction routes, not manual status archaeology.

Human Capability Improvement

This artifact is designed to make people better operators:

  • Louis can inspect calibration and decide whether the AI review loop is trustworthy.
  • zihrou can distinguish wrong direction from missing evidence or missing authorization.
  • iron can see which APIs, syncs, worker logs, or repo events are missing before another implementation sprint.
  • Project owners learn to write predictions that have measurable evidence and consequences.

Solution Stack

| Layer | Decision |
| --- | --- |
| Context framework | A prediction is not a sentence; it is a claim with owner, due date, evidence expectation, confidence, and review window. |
| Workflow | Accept label policy -> seed 10 predictions -> batch match 50 -> reviewer sample -> route corrections -> weekly calibration scorecard. |
| Data/DB model | prediction_claim, evidence_event, prediction_match, reviewer_override, correction_task, calibration_summary. |
| API/sync | Ingest signals, action items, commits, deployments, worker completions, and human review notes through source adapters with audit trails. |
| Tool/app | Dispatcher preflight gate blocks duplicate production builds until the next needed gate is satisfied. |
| Acceptance | Unknown rate, sample rate, owner/due completeness, correction routing, and dashboard readiness are measurable. |
| Adoption | Short people sync plus weekly scorecard; durable pack remains in shared artifact URL. |

Production Readiness

Readiness status: ready for D1 policy and seed acceptance; not ready for another build job until D1 is complete.

Primary go/no-go rule: if label policy and seed cases are absent, route to adoption/preflight work, not engineering rebuild work.

People Sync

Artifact kind: people_sync.

  • Louis: accept label policy and nominate 10 seed predictions by 2026-05-27.
  • zihrou: review miss taxonomy and decide which miss reasons require management action.
  • iron: confirm available evidence sources and list missing source adapters before D7 batch.

Next Round Upgrade

Run the first 50-case batch trial and publish the calibration summary. If unknown rate is above 25%, open a data-source gap job. If it is below 25%, move to D14 correction routing.

Market And Technical Context Sources

Artifact kind: market_context.

This pack uses current observability and ML monitoring patterns as market context. The core inference is that prediction verification should be treated as an evidence-backed observability and calibration loop, not just a report.

Sources

Context Applied

  • Modern AI/ML systems need monitoring, drift detection, and evidence trails.
  • Agentic work needs observability across actions, tools, traces, and outcomes.
  • PLS prediction verification should therefore combine claims, evidence events, match labels, reviewer overrides, and correction tasks.
  • The business value is not the label itself; it is the management correction that follows repeated misses.