<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Codex Worker Self-Repair Watchdog Console</title>
<style>
:root{--ink:#111417;--paper:#f4f1e9;--panel:#fffaf0;--line:#20242a;--red:#b83b32;--green:#187354;--blue:#245f8d;--gold:#b67916;--muted:#676b70}
*{box-sizing:border-box}body{margin:0;color:var(--ink);background:linear-gradient(90deg,rgba(36,95,141,.08) 1px,transparent 1px) 0 0/42px 42px,linear-gradient(rgba(17,20,23,.05) 1px,transparent 1px) 0 0/42px 42px,var(--paper);font-family:ui-serif,Georgia,"Times New Roman",serif}
header{min-height:86vh;display:grid;grid-template-columns:1.1fr .9fr;gap:34px;align-items:end;padding:56px clamp(20px,5vw,84px) 42px;border-bottom:3px solid var(--line)}
h1{font-size:clamp(46px,8vw,112px);line-height:.88;margin:18px 0 22px;letter-spacing:0}.tag{display:inline-block;background:var(--ink);color:var(--paper);border:2px solid var(--line);padding:8px 12px;font:800 13px ui-monospace,SFMono-Regular,Menlo,monospace;text-transform:uppercase}.lead{font-size:clamp(18px,2vw,27px);line-height:1.38;max-width:820px;color:#30343a}
.board{border:3px solid var(--line);background:var(--panel);box-shadow:12px 12px 0 var(--line);padding:20px;display:grid;gap:14px}.stat{border:2px solid var(--line);background:#fff;padding:15px}.stat b{display:block;font:900 46px/1 ui-monospace,SFMono-Regular,Menlo,monospace}.stat span{font:800 12px ui-monospace,SFMono-Regular,Menlo,monospace;color:var(--muted);text-transform:uppercase}
main{padding:32px clamp(18px,4vw,64px) 78px}.grid{display:grid;grid-template-columns:repeat(12,1fr);gap:18px}section{border:2px solid var(--line);background:rgba(255,250,240,.94);padding:20px}.span-12{grid-column:span 12}.span-8{grid-column:span 8}.span-6{grid-column:span 6}.span-4{grid-column:span 4}
h2{margin:0 0 14px;font-size:29px}p,li{line-height:1.55}.flow{display:grid;grid-template-columns:repeat(5,1fr);gap:12px}.step,.action{border:2px solid var(--line);background:#fff;padding:14px;min-height:142px}.step b{display:block;font:900 18px ui-monospace,SFMono-Regular,Menlo,monospace;margin-bottom:8px}
table{width:100%;border-collapse:collapse;background:#fff}th,td{border:1px solid var(--line);padding:10px;text-align:left;vertical-align:top}th{background:#e4edf3;font-family:ui-monospace,SFMono-Regular,Menlo,monospace}.badge{display:inline-block;border:2px solid var(--line);padding:5px 9px;background:#fff;font:800 12px ui-monospace,SFMono-Regular,Menlo,monospace}.green{background:var(--green);color:#fff}.gold{background:var(--gold);color:#fff}.blue{background:var(--blue);color:#fff}.red{background:var(--red);color:#fff}
.actions{display:grid;grid-template-columns:repeat(3,1fr);gap:12px}@media(max-width:900px){header,.flow,.actions{grid-template-columns:1fr}.span-4,.span-6,.span-8,.span-12{grid-column:span 12}h1{font-size:52px}}
</style>
</head>
<body>
<header>
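<div><span class="tag">Codex Session / Worker Stability</span><h1>Before a worker breaks, have it leave behind the evidence needed to repair it.</h1><p class="lead">This console turns Codex Session / worker stability and self-repair into a production watchdog. It monitors claim, context, progress, and complete calls, lease expiry, stuck queues, 500 fetch failures, and the artifact gate, and it connects self-healing actions, escalation rules, audit, and D30 acceptance to PLS.</p></div>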
<aside class="board"><div class="stat"><span>Watchdog Scope</span><b>5 paths</b><p>doctor, touch, claim, context/progress, upload/complete.</p></div><div class="stat"><span>D30 Goal</span><b>80%</b><p>Coverage of automatic classification and repair suggestions for stuck jobs.</p></div><div class="stat"><span>Human Page</span><b>0 false green</b><p>Without an open artifact or verified output, a run must not be shown as successful.</p></div></aside>
</header>
<main><div class="grid">
<section class="span-12"><h2>D1 / D7 / D14 / D30</h2><div class="flow"><div class="step"><b>D1</b>Define watchdog metrics and thresholds for leases, queues, helper commands, and the artifact gate.</div><div class="step"><b>D7</b>Deliver the console, runbook, data model, and acceptance tests; validate first against manually collected helper output.</div><div class="step"><b>D14</b>Wire in PLS job events and backfill three incident types: fetch failed, lease expired, and artifact gate fail.</div><div class="step"><b>D30</b>Decide whether to continue, upgrade to a system/watchdog agent, or change owner.</div><div class="step"><b>Loop</b>Every stuck event automatically produces a repair action and a learning memory.</div></div></section>
<section class="span-8"><h2>Watchdog Rules</h2><table><tr><th>Signal</th><th>Threshold</th><th>Auto Repair</th><th>Escalate</th></tr><tr><td>claim no output</td><td>30 s with no JSON</td><td>touch + retry claim once</td><td>Notify owner after 2 failures</td></tr><tr><td>context/progress fetch failed</td><td>500 TypeError</td><td>touch + sleep 3 + retry</td><td>Keep the claim payload and continue the build</td></tr><tr><td>lease near expiry</td><td>&lt; 3 min</td><td>progress heartbeat</td><td>If the upload is interrupted, mark stuck</td></tr><tr><td>queue stuck</td><td>running &gt; lease + 5 min</td><td>release/reclaim policy</td><td>supervisor review</td></tr><tr><td>artifact gate fail</td><td>no HTTP 200 / no file list</td><td>block complete</td><td>mark stuck with evidence</td></tr></table></section>
<section class="span-4"><h2>Solution</h2><p><span class="badge red">watchdog</span> Core deliverable.</p><p><span class="badge blue">system</span> Requires DB, API, and audit.</p><p><span class="badge green">runbook</span> Self-repair operations.</p><p><span class="badge gold">eval</span> Pass/fail acceptance.</p></section>
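<section class="span-12"><h2>Sketch: Evaluating the Watchdog Rules</h2><p>The thresholds in the Watchdog Rules table could be evaluated roughly as follows. This is a minimal sketch, not the PLS implementation: <code>JobSnapshot</code>, its field names, <code>classify</code>, and <code>artifact_gate_passes</code> are hypothetical names introduced for illustration, and only the numeric thresholds and the signal and repair labels come from the table.</p>
<pre>
```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Thresholds copied from the Watchdog Rules table (constant names are assumptions).
CLAIM_TIMEOUT_S = 30        # claim: 30 s with no JSON
LEASE_WARN_S = 3 * 60       # lease near expiry: under 3 min left
QUEUE_GRACE_S = 5 * 60      # queue stuck: running longer than lease + 5 min


@dataclass
class JobSnapshot:
    """Hypothetical point-in-time view of one job; every field name is assumed."""
    seconds_since_claim_sent: float
    claim_json_received: bool
    lease_seconds_left: float
    running_seconds: float
    lease_length_seconds: float


def classify(job: JobSnapshot) -> List[Tuple[str, str]]:
    """Return (signal, auto_repair) pairs for every timing rule that fires."""
    fired = []
    if not job.claim_json_received and job.seconds_since_claim_sent >= CLAIM_TIMEOUT_S:
        fired.append(("claim no output", "touch + retry claim once"))
    if LEASE_WARN_S > job.lease_seconds_left:
        fired.append(("lease near expiry", "progress heartbeat"))
    if job.running_seconds > job.lease_length_seconds + QUEUE_GRACE_S:
        fired.append(("queue stuck", "release/reclaim policy"))
    return fired


def artifact_gate_passes(http_status: Optional[int], files: Sequence[str]) -> bool:
    """Artifact gate: complete is allowed only with HTTP 200 and a non-empty file list."""
    return http_status == 200 and len(files) > 0
```
</pre>
<p>The artifact gate is kept as a separate check in this sketch because it applies at completion time, while the other rules are evaluated while the job is still running.</p></section>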
<section class="span-6"><h2>Self-Repair Runbook</h2><table><tr><th>Situation</th><th>Standard handling</th></tr><tr><td>fetch failed</td><td>touch → sleep 3 → retry once → save the output</td></tr><tr><td>complete gate fail</td><td>Change artifacts_json to a shared_cloud Gist URL; do not use pls_file as primary</td></tr><tr><td>no job</td><td>Read doctor/touch/claim output and produce a backlog repair proposal</td></tr><tr><td>stale lease</td><td>Record the stuck reason so that two workers cannot both complete</td></tr></table></section>
<section class="span-6"><h2>Production Path</h2><table><tr><th>Layer</th><th>Spec</th></tr><tr><td>Data</td><td>worker_runs, job_leases, helper_events, repair_actions, artifact_checks</td></tr><tr><td>API</td><td>POST health, POST repair-action, PATCH lease, POST artifact-check</td></tr><tr><td>Permission</td><td>Worker may propose repairs; Supervisor may release or reassign; System may retry low-risk commands</td></tr><tr><td>Audit</td><td>Every retry, release, and complete records a hash of the command output</td></tr></table></section>
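<section class="span-12"><h2>Sketch: Audit Record with Output Hash</h2><p>The Audit row of the Production Path table asks for a hash of the command output on every retry, release, and complete. One way this might look is shown below. The function name <code>audit_record</code>, the record keys, and the choice of SHA-256 are assumptions made for illustration; the source does not specify a hash algorithm or a record schema.</p>
<pre>
```python
import hashlib

# The three audited actions named in the Audit row; the set name is an assumption.
AUDITED_ACTIONS = {"retry", "release", "complete"}


def audit_record(action: str, job_id: str, command_output: str) -> dict:
    """Build a hypothetical audit entry holding a SHA-256 digest of the raw output."""
    if action not in AUDITED_ACTIONS:
        raise ValueError("unaudited action: " + action)
    # Hash the UTF-8 bytes of the output so the same text always gives the same digest.
    digest = hashlib.sha256(command_output.encode("utf-8")).hexdigest()
    return {"action": action, "job_id": job_id, "output_sha256": digest}
```
</pre>
<p>Storing only the digest keeps the audit row small; the full command output would still need to be kept somewhere else if a supervisor has to read it later.</p></section>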
<section class="span-12"><h2>People Sync</h2><div class="actions"><div class="action"><b>For the PLS owner</b><p>Please confirm the watchdog thresholds: claim 30 s, lease &lt; 3 min, queue stuck at lease + 5 min, and a complete gate that requires HTTP 200 and a file list.</p></div><div class="action"><b>For the worker operator</b><p>Do not give up on a 500 fetch failed: touch, sleep 3, retry. If the artifact gate lacks an open URL, do not complete.</p></div><div class="action"><b>Escalation line</b><p>If 2 retries still fail, record the stuck reason and the full command output, and hand off to the supervisor to decide on release or reclaim.</p></div></div></section>
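<section class="span-12"><h2>Sketch: touch, sleep 3, retry</h2><p>The operator guidance above (touch, sleep 3, retry, then escalate with the full output) can be sketched as a small wrapper. This is an illustration under stated assumptions: <code>run_with_repair</code> is a hypothetical name, and <code>command</code> and <code>touch</code> stand in for whatever helper calls the worker actually uses, passed in as plain callables that return their output as text. It performs a single retry, as in the runbook.</p>
<pre>
```python
import time
from typing import Callable, List


def run_with_repair(command: Callable[[], str],
                    touch: Callable[[], str],
                    sleep: Callable[[float], None] = time.sleep,
                    pause_s: float = 3.0) -> dict:
    """Run command; on failure do touch, sleep, one retry. Every output is kept."""
    outputs: List[str] = []
    try:
        outputs.append(command())
        return {"ok": True, "outputs": outputs}
    except Exception as exc:
        # Keep the failure text as evidence instead of discarding it.
        outputs.append("first attempt failed: " + repr(exc))

    outputs.append(touch())   # refresh the session before retrying
    sleep(pause_s)            # the runbook's "sleep 3"

    try:
        outputs.append(command())
        return {"ok": True, "outputs": outputs}
    except Exception as exc:
        outputs.append("retry failed: " + repr(exc))

    # Both attempts failed: return a stuck reason plus the full output for escalation.
    return {"ok": False, "outputs": outputs, "stuck_reason": "fetch failed after retry"}
```
</pre>
<p>Passing <code>sleep</code> as a parameter is a design choice for testability: an acceptance test can substitute a stub and avoid waiting three real seconds.</p></section>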
</div></main>
</body></html>