<!doctype html>
<html lang="zh-Hant">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Codex Session Worker Stability Watchdog</title>
<style>
:root {
  --ink:#17212b; --muted:#637082; --line:#dbe2ea; --bg:#f5f7f9; --panel:#fff;
  --blue:#245f8f; --green:#1f7a5a; --amber:#b77a18; --red:#b13b32; --soft:#eef4f8;
}
*{box-sizing:border-box} body{margin:0;background:var(--bg);color:var(--ink);font-family:Inter,ui-sans-serif,system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;line-height:1.45}
header{padding:28px 36px 22px;background:#fff;border-bottom:1px solid var(--line)}
h1,h2,h3{margin:0;letter-spacing:0} h1{font-size:28px;max-width:1050px} h2{font-size:17px;margin-bottom:12px} h3{font-size:14px;margin-bottom:8px}
p{margin:0;color:var(--muted)} .eyebrow{font-size:12px;font-weight:800;text-transform:uppercase;color:var(--blue);margin-bottom:8px}
.wrap{padding:24px 36px 42px}.grid{display:grid;grid-template-columns:repeat(12,1fr);gap:16px;max-width:1280px;margin:0 auto}
.panel{background:var(--panel);border:1px solid var(--line);border-radius:8px;padding:18px;box-shadow:0 1px 2px rgba(0,0,0,.03)}
.span-12{grid-column:span 12}.span-8{grid-column:span 8}.span-6{grid-column:span 6}.span-4{grid-column:span 4}.span-3{grid-column:span 3}
.kpis{display:grid;grid-template-columns:repeat(4,1fr);gap:12px}.kpi{border:1px solid var(--line);border-radius:8px;padding:14px;min-height:92px;background:#fbfcfd}
.kpi strong{display:block;font-size:28px;margin-bottom:4px}.kpi span{color:var(--muted);font-size:13px}
table{width:100%;border-collapse:collapse;font-size:13px} th,td{text-align:left;padding:10px 8px;border-bottom:1px solid var(--line);vertical-align:top}
th{font-size:12px;text-transform:uppercase;color:var(--muted);background:#fbfcfd}.status{display:inline-flex;border-radius:999px;padding:4px 9px;font-size:12px;font-weight:800;white-space:nowrap}
.green{background:#e2f1ea;color:var(--green)}.amber{background:#f7ecd8;color:#81530f}.red{background:#f9e6e3;color:var(--red)}.blue{background:#e4eef8;color:var(--blue)}
.timeline{display:grid;grid-template-columns:repeat(4,1fr);gap:12px}.stage{border-left:4px solid var(--blue);background:#fbfcfd;padding:12px;border-radius:0 8px 8px 0;min-height:150px}
.stage:nth-child(2){border-color:var(--amber)}.stage:nth-child(3){border-color:var(--green)}.stage:nth-child(4){border-color:var(--red)}
ul{margin:8px 0 0 18px;padding:0;color:var(--muted)}li{margin:4px 0}.flow{display:grid;grid-template-columns:repeat(5,1fr);gap:10px}
.step{border:1px solid var(--line);border-radius:8px;padding:12px;background:#fbfcfd;min-height:118px}.step b{display:block;margin-bottom:6px}.small{font-size:12px;color:var(--muted)}
.code{background:#15202b;color:#e9f0f6;border-radius:8px;padding:12px;font:12px ui-monospace,SFMono-Regular,Menlo,monospace;white-space:pre-wrap;overflow:auto}
.ask{background:var(--soft);border-radius:8px;padding:12px;color:var(--ink);font-size:13px;white-space:pre-wrap}
@media(max-width:900px){header,.wrap{padding-left:18px;padding-right:18px}.span-8,.span-6,.span-4,.span-3{grid-column:span 12}.kpis,.timeline,.flow{grid-template-columns:1fr}}
</style>
</head>
<body>
<header>
<div class="eyebrow">PLS production artifact · codex-session-stability · 2026-05-24</div>
<h1>Codex Session Worker Stability Watchdog</h1>
<p>Turns the requirement that incoming signals must advance projects and strengthen huber clones into a worker reliability loop that can be monitored, self-healed, and acceptance-tested.</p>
</header>
<main class="wrap">
<section class="grid">
<div class="panel span-12">
<div class="kpis">
<div class="kpi"><strong>11</strong><span>Recent stability signals</span></div>
<div class="kpi"><strong>15m</strong><span>Lease expiry hard gate</span></div>
<div class="kpi"><strong>3</strong><span>Source projects recommended for merge</span></div>
<div class="kpi"><strong>D7</strong><span>First stuck-queue self-heal acceptance</span></div>
</div>
</div>

<div class="panel span-8">
<h2>Watchdog Rules</h2>
<table>
<thead><tr><th>Signal</th><th>Threshold</th><th>Action</th><th>Escalation</th><th>Pass Evidence</th></tr></thead>
<tbody>
<tr><td><span class="status red">lease_expired</span></td><td>now &gt; lease_expires_at + 60s</td><td>touch worker, mark stale lease, release or reclaim job</td><td>after 2 failures, notify Louis</td><td>job returns to claimable or running with new lease</td></tr>
<tr><td><span class="status amber">queue_stuck</span></td><td>pending/running with no heartbeat &gt; 10m</td><td>diagnose worker_kind, source, contract, last_progress_payload</td><td>create repair proposal if no eligible worker</td><td>queue age decreases within one heartbeat</td></tr>
<tr><td><span class="status blue">claim_loop_idle</span></td><td>3 no-job claims</td><td>inspect eligible kinds and backlog shape</td><td>recommend backlog/capability repair</td><td>next heartbeat has job or explicit repair</td></tr>
<tr><td><span class="status green">artifact_gate</span></td><td>before complete</td><td>verify primary artifact URL HTTP 200 and files listed</td><td>block complete if no openable artifact</td><td>curl/gh output stored in e2e verification</td></tr>
</tbody>
</table>
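<p>The time and count thresholds in the table above can be sketched as a pure function. This is a minimal Python sketch, not the watchdog's actual implementation: <code>JobSnapshot</code>, its field names, and <code>evaluate</code> are hypothetical, and it assumes the idle rule counts consecutive no-job claims. The <code>artifact_gate</code> row is left out because it is checked at completion time rather than against a threshold.</p>
<div class="code">
```python
from dataclasses import dataclass
from datetime import datetime, timedelta

LEASE_GRACE = timedelta(seconds=60)      # lease_expired row: lease_expires_at + 60s
HEARTBEAT_LIMIT = timedelta(minutes=10)  # queue_stuck row: no heartbeat for 10m
IDLE_CLAIM_LIMIT = 3                     # claim_loop_idle row: 3 no-job claims


@dataclass
class JobSnapshot:
    """Hypothetical view of one job; the field names are assumptions."""
    status: str
    lease_expires_at: datetime
    last_heartbeat_at: datetime


def evaluate(job: JobSnapshot, now: datetime, no_job_claims: int) -> list[str]:
    """Return the watchdog signals that fire for this snapshot, in table order."""
    signals = []
    # Strict comparison: a lease exactly at the grace boundary is still valid.
    if now > job.lease_expires_at + LEASE_GRACE:
        signals.append("lease_expired")
    active = job.status in ("pending", "running")
    if active and now - job.last_heartbeat_at > HEARTBEAT_LIMIT:
        signals.append("queue_stuck")
    if no_job_claims >= IDLE_CLAIM_LIMIT:
        signals.append("claim_loop_idle")
    return signals
```
</div>
<p>Passing <code>now</code> in as an argument, instead of reading the clock inside the function, keeps the rules easy to replay against simulated stale leases.</p>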
</div>

<div class="panel span-4">
<h2>Heartbeat Command Contract</h2>
<div class="code">doctor
touch
claim
context &lt;job_id&gt;
progress &lt;job_id&gt; "Processing..." reading_context
progress ... building_artifact
verify artifact URL
complete &lt;job_id&gt; summary artifacts_json</div>
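<p>One way to check that a worker followed the contract is to compare its command log against the listed order. A minimal Python sketch, assuming the order above is mandatory and that <code>progress</code> may repeat. <code>check_sequence</code> is a hypothetical helper, not part of the worker CLI, and it checks only ordering and the verify-before-complete gate, not that every step is present.</p>
<div class="code">
```python
from typing import Optional

# Command names in the order the contract lists them.
ORDER = ["doctor", "touch", "claim", "context", "progress", "verify", "complete"]


def check_sequence(commands: list[str]) -> Optional[str]:
    """Return a reason string if the log breaks the contract order, else None."""
    last_rank = -1
    seen = set()
    for command in commands:
        if command not in ORDER:
            return f"unknown command: {command}"
        rank = ORDER.index(command)
        # A command may repeat (equal rank) but may not move backwards.
        if last_rank > rank:
            return f"{command} issued out of order"
        # Mirrors the artifact gate: complete is blocked without a verify.
        if command == "complete" and "verify" not in seen:
            return "complete issued before verify"
        seen.add(command)
        last_rank = rank
    return None
```
</div>
<p>A log that stops early, for example after <code>claim</code>, returns <code>None</code> because nothing issued so far is out of order.</p>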
</div>

<div class="panel span-12">
<h2>D1 / D7 / D14 / D30 Path</h2>
<div class="timeline">
<div class="stage"><h3>D1 · Instrument</h3><ul><li>Define lease/queue/artifact gates.</li><li>Publish watchdog console and runbook.</li><li>Normalize artifacts_json enum values.</li></ul></div>
<div class="stage"><h3>D7 · Auto-triage</h3><ul><li>Detect stuck jobs and missing worker kinds.</li><li>Generate repair proposal when no eligible task exists.</li><li>Capture verification output before complete.</li></ul></div>
<div class="stage"><h3>D14 · Self-heal</h3><ul><li>Auto-touch or release stale leases under policy.</li><li>Route impossible jobs to pause/fail with reason.</li><li>Score worker health by role.</li></ul></div>
<div class="stage"><h3>D30 · Reliability OS</h3><ul><li>Dashboard enters PLS ops console.</li><li>Worker pools have SLO, audit and rollback.</li><li>Huber clone capability improves from evidence.</li></ul></div>
</div>
</div>

<div class="panel span-12">
<h2>Purpose-to-Purpose E2E</h2>
<div class="flow">
<div class="step"><b>Signal</b><span class="small">E2E test says signals must advance projects and strengthen huber clones.</span></div>
<div class="step"><b>Artifact</b><span class="small">Watchdog console, rules, schema, runbook and acceptance gates.</span></div>
<div class="step"><b>Adoption</b><span class="small">Worker follows command contract and records phase-specific progress.</span></div>
<div class="step"><b>Reliability</b><span class="small">Queue stuck, lease expiry and artifact failures are caught before silent drift.</span></div>
<div class="step"><b>Value</b><span class="small">Less manual supervision, fewer failed completions, more reliable AI project throughput.</span></div>
</div>
</div>

<div class="panel span-6">
<h2>Market Maturity Fit</h2>
<p>Google's SRE guidance on monitoring frames it as production visibility into urgent, actionable symptoms. Its guidance on automation scales reliability work by replacing repeatable toil with engineered automation. This watchdog applies both practices to PLS workers: alert only on actionable stuck states, and automate safe recovery before escalating to a human.</p>
<ul>
<li>https://sre.google/resources/book-update/monitoring-distributed-systems/</li>
<li>https://sre.google/resources/book-update/automation-at-google/</li>
</ul>
</div>

<div class="panel span-6">
<h2>Production Readiness</h2>
<table>
<tbody>
<tr><th>Data</th><td>worker_heartbeats, job_leases, watchdog_incidents, artifact_verifications</td></tr>
<tr><th>API</th><td>GET queue health, POST incident, PATCH lease, POST verification</td></tr>
<tr><th>Permissions</th><td>watchdog can diagnose; release/reclaim needs policy; destructive repair forbidden</td></tr>
<tr><th>Audit</th><td>Every auto-action stores old state, new state, command output, actor and reason</td></tr>
<tr><th>Rollback</th><td>Self-heal actions are reversible: restore previous owner/lease or pause with reason</td></tr>
</tbody>
</table>
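<p>The Audit and Rollback rows suggest a record shape along these lines. A minimal Python sketch: the five stored fields come from the Audit row, while the class name, the field types, <code>recorded_at</code>, and <code>rollback_target</code> are assumptions for illustration.</p>
<div class="code">
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """One auto-action entry; fields mirror the Audit row, types are assumed."""
    old_state: dict
    new_state: dict
    command_output: str
    actor: str
    reason: str
    # Assumption: a UTC timestamp is stored alongside the five listed fields.
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def rollback_target(record: AuditRecord) -> dict:
    """Return a copy of the state to restore when a self-heal action is reverted."""
    return dict(record.old_state)
```
</div>
<p>Because the old state is captured at write time, restoring the previous owner or lease needs only the record itself.</p>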
</div>

<div class="panel span-4">
<h2>Acceptance</h2>
<p><b>Owner:</b> PLS platform / Louis.</p>
<p><b>Due:</b> D7 for first auto-triage proof.</p>
<p><b>Pass:</b> one simulated stale lease, one no-job loop, and one artifact-gate failure are detected with evidence.</p>
</div>
<div class="panel span-4">
<h2>Project Merge Rule</h2>
<p>Merge <code>360bdd66</code>, <code>9ae8f0da</code>, and <code>d2afbba2</code> into one stability project because claim/lease/queue stuck behavior is a shared platform reliability concern.</p>
</div>
<div class="panel span-4">
<h2>People Sync</h2>
<div class="ask">Please confirm whether this watchdog's D7 acceptance should be wired into PLS ops:
1. Should stale leases be released or reclaimed automatically?
2. Should queue stuck alerts be pushed to LINE, or only shown on the dashboard?
3. When the artifact gate fails, should the worker pause or retry once?</div>
</div>
</section>
</main>
</body>
</html>