@esz135888
Last active May 24, 2026 03:44
PLS Codex worker self-repair watchdog production pack

Acceptance Tests

Primary Artifact

  • PASS: codex-worker-self-repair-watchdog-console.html opens at a public URL.
  • FAIL: Only a Markdown runbook exists, with no dashboard or tool as the primary artifact.

Watchdog

  • PASS: A claim with no output for 30 seconds triggers touch + retry.
  • PASS: A 500 TypeError fetch failed on context/progress/complete triggers touch + sleep 3 + retry once.
  • PASS: A lease with less than 3 minutes remaining triggers a progress heartbeat.
  • FAIL: complete is still allowed without HTTP 200 or a file list.
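
The touch + sleep 3 + retry once rule above can be sketched as a small wrapper. This is only an illustration under stated assumptions: `run_with_single_retry`, `command`, and `touch` are hypothetical names, the helper commands are modeled as plain Python callables that raise on failure, and the returned status strings reuse the `worker_runs.status` values (`ok`, `retried`, `stuck`).

```python
import time
from typing import Callable, List, Tuple


def run_with_single_retry(
    command: Callable[[], str],
    touch: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    delay_seconds: float = 3.0,
) -> Tuple[str, List[str]]:
    """Run `command`; on failure call `touch`, wait, and retry exactly once.

    Returns (status, evidence). Every output or error message is kept in
    `evidence` so a failed run still leaves something to audit.
    """
    evidence: List[str] = []
    try:
        evidence.append(command())
        return "ok", evidence
    except Exception as exc:  # hypothetical: helper surfaces "500 TypeError fetch failed"
        evidence.append(f"first attempt failed: {exc}")

    touch()               # refresh the worker before retrying
    sleep(delay_seconds)  # short fixed delay (3 seconds in the rule above)

    try:
        evidence.append(command())
        return "retried", evidence
    except Exception as exc:
        evidence.append(f"retry failed: {exc}")
        return "stuck", evidence
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays; a second failure returns `stuck` rather than retrying again, which matches the single-retry limit.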

Data / API

  • PASS: Includes worker_runs, job_leases, helper_events, repair_actions, artifact_checks.
  • PASS: Includes API, permissions, and audit.

D30

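  • PASS: Incident evidence can be used to decide whether to continue, upgrade, or move to a watchdog agent.
  • FAIL: Only the final success state is checked; failure and repair evidence is not retained.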

Artifact URL or PR

Primary artifact: https://gist.github.com/esz135888/e36b37371330f2e74e73af799dd7f725#file-codex-worker-self-repair-watchdog-console-html

Public Gist: https://gist.github.com/esz135888/e36b37371330f2e74e73af799dd7f725

Verification commands:

  • curl -I -L -s "https://gist.github.com/esz135888/e36b37371330f2e74e73af799dd7f725#file-codex-worker-self-repair-watchdog-console-html" | head -n 8
  • gh gist view e36b37371330f2e74e73af799dd7f725 --files

Verification result: primary URL returned HTTP/2 200; public Gist includes 13 files; no local pending publish markers remain.
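
As a hedged illustration of how the header output from `curl -I -L` could be turned into a recorded status, the sketch below extracts the last status line, since `-L` prints one header block per redirect hop. The function name `final_http_status` is hypothetical, and the sample header text in the usage note is illustrative rather than captured output.

```python
import re
from typing import Optional

# Matches status lines such as "HTTP/1.1 301 Moved Permanently" or "HTTP/2 200".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})", re.MULTILINE)


def final_http_status(header_text: str) -> Optional[int]:
    """Return the status code of the last response in the header dump, or None."""
    codes = STATUS_LINE.findall(header_text)
    return int(codes[-1]) if codes else None
```

For example, `final_http_status("HTTP/2 200 \r\ncontent-type: text/html\r\n")` evaluates to `200`. Two caveats apply to the command as written: a URL fragment (`#file-...`) is not sent to the server, so a 200 confirms that the Gist page opens rather than that the specific file exists, and `head -n 8` may cut off the final response if a redirect occurs.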

<!doctype html>
<html lang="zh-Hant">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Codex Worker Self-Repair Watchdog Console</title>
<style>
:root{--ink:#111417;--paper:#f4f1e9;--panel:#fffaf0;--line:#20242a;--red:#b83b32;--green:#187354;--blue:#245f8d;--gold:#b67916;--muted:#676b70}
*{box-sizing:border-box}body{margin:0;color:var(--ink);background:linear-gradient(90deg,rgba(36,95,141,.08) 1px,transparent 1px) 0 0/42px 42px,linear-gradient(rgba(17,20,23,.05) 1px,transparent 1px) 0 0/42px 42px,var(--paper);font-family:ui-serif,Georgia,"Times New Roman",serif}
header{min-height:86vh;display:grid;grid-template-columns:1.1fr .9fr;gap:34px;align-items:end;padding:56px clamp(20px,5vw,84px) 42px;border-bottom:3px solid var(--line)}
h1{font-size:clamp(46px,8vw,112px);line-height:.88;margin:18px 0 22px;letter-spacing:0}.tag{display:inline-block;background:var(--ink);color:var(--paper);border:2px solid var(--line);padding:8px 12px;font:800 13px ui-monospace,SFMono-Regular,Menlo,monospace;text-transform:uppercase}.lead{font-size:clamp(18px,2vw,27px);line-height:1.38;max-width:820px;color:#30343a}
.board{border:3px solid var(--line);background:var(--panel);box-shadow:12px 12px 0 var(--line);padding:20px;display:grid;gap:14px}.stat{border:2px solid var(--line);background:#fff;padding:15px}.stat b{display:block;font:900 46px/1 ui-monospace,SFMono-Regular,Menlo,monospace}.stat span{font:800 12px ui-monospace,SFMono-Regular,Menlo,monospace;color:var(--muted);text-transform:uppercase}
main{padding:32px clamp(18px,4vw,64px) 78px}.grid{display:grid;grid-template-columns:repeat(12,1fr);gap:18px}section{border:2px solid var(--line);background:rgba(255,250,240,.94);padding:20px}.span-12{grid-column:span 12}.span-8{grid-column:span 8}.span-6{grid-column:span 6}.span-4{grid-column:span 4}
h2{margin:0 0 14px;font-size:29px}p,li{line-height:1.55}.flow{display:grid;grid-template-columns:repeat(5,1fr);gap:12px}.step,.action{border:2px solid var(--line);background:#fff;padding:14px;min-height:142px}.step b{display:block;font:900 18px ui-monospace,SFMono-Regular,Menlo,monospace;margin-bottom:8px}
table{width:100%;border-collapse:collapse;background:#fff}th,td{border:1px solid var(--line);padding:10px;text-align:left;vertical-align:top}th{background:#e4edf3;font-family:ui-monospace,SFMono-Regular,Menlo,monospace}.badge{display:inline-block;border:2px solid var(--line);padding:5px 9px;background:#fff;font:800 12px ui-monospace,SFMono-Regular,Menlo,monospace}.green{background:var(--green);color:#fff}.gold{background:var(--gold);color:#fff}.blue{background:var(--blue);color:#fff}.red{background:var(--red);color:#fff}
.actions{display:grid;grid-template-columns:repeat(3,1fr);gap:12px}@media(max-width:900px){header,.flow,.actions{grid-template-columns:1fr}.span-4,.span-6,.span-8,.span-12{grid-column:span 12}h1{font-size:52px}}
</style>
</head>
<body>
<header>
<div><span class="tag">Codex Session / Worker Stability</span><h1>Before a worker breaks, have it leave repairable evidence on its own.</h1><p class="lead">This console turns Codex Session / worker stability and self-repair into a production watchdog: it monitors claim/context/progress/complete, lease expiry, queue stuck, 500 fetch failed, and the artifact gate, and ties self-healing actions, escalation rules, audit, and D30 acceptance into PLS.</p></div>
<aside class="board"><div class="stat"><span>Watchdog Scope</span><b>5 paths</b><p>doctor, touch, claim, context/progress, upload/complete.</p></div><div class="stat"><span>D30 Goal</span><b>80%</b><p>Coverage of automatic stuck-job classification and repair suggestions.</p></div><div class="stat"><span>Human Page</span><b>0 false green</b><p>Without an open artifact or verified output, success must not be shown.</p></div></aside>
</header>
<main><div class="grid">
<section class="span-12"><h2>D1 / D7 / D14 / D30</h2><div class="flow"><div class="step"><b>D1</b>Define watchdog metrics and thresholds for lease, queue, helper command, and artifact gate.</div><div class="step"><b>D7</b>Deliver the console, runbook, data model, and acceptance tests; validate first with manual helper output.</div><div class="step"><b>D14</b>Connect PLS job events and backfill 3 incident types: fetch failed, lease expired, artifact gate fail.</div><div class="step"><b>D30</b>Decide whether to continue, upgrade to a system/watchdog agent, or change owner.</div><div class="step"><b>Loop</b>Every stuck event automatically produces a repair action and learning memory.</div></div></section>
<section class="span-8"><h2>Watchdog Rules</h2><table><tr><th>Signal</th><th>Threshold</th><th>Auto Repair</th><th>Escalate</th></tr><tr><td>claim no output</td><td>30s no JSON</td><td>touch + retry claim once</td><td>notify owner after 2 failures</td></tr><tr><td>context/progress fetch failed</td><td>500 TypeError</td><td>touch + sleep 3 + retry</td><td>keep claim payload and continue build</td></tr><tr><td>lease near expiry</td><td>&lt; 3 min</td><td>progress heartbeat</td><td>if upload is interrupted, mark stuck</td></tr><tr><td>queue stuck</td><td>running &gt; lease + 5 min</td><td>release/reclaim policy</td><td>supervisor review</td></tr><tr><td>artifact gate fail</td><td>no HTTP 200 / no file list</td><td>block complete</td><td>mark stuck with evidence</td></tr></table></section>
<section class="span-4"><h2>Solution</h2><p><span class="badge red">watchdog</span> Core deliverable.</p><p><span class="badge blue">system</span> Requires DB/API/audit.</p><p><span class="badge green">runbook</span> Self-repair operations.</p><p><span class="badge gold">eval</span> Pass/fail acceptance.</p></section>
<section class="span-6"><h2>Self-Repair Runbook</h2><table><tr><th>Situation</th><th>Fixed handling</th></tr><tr><td>fetch failed</td><td>touch → sleep 3 → retry once → save output</td></tr><tr><td>complete gate fail</td><td>Change artifacts_json to the shared_cloud Gist URL; do not use pls_file as primary</td></tr><tr><td>no job</td><td>Read doctor/touch/claim and produce a backlog repair proposal</td></tr><tr><td>stale lease</td><td>Record the stuck reason to prevent two workers from both completing</td></tr></table></section>
<section class="span-6"><h2>Production Path</h2><table><tr><th>Layer</th><th>Spec</th></tr><tr><td>Data</td><td>worker_runs, job_leases, helper_events, repair_actions, artifact_checks</td></tr><tr><td>API</td><td>POST health, POST repair-action, PATCH lease, POST artifact-check</td></tr><tr><td>Permission</td><td>Worker may suggest repairs; Supervisor may release/reassign; System may retry low-risk commands</td></tr><tr><td>Audit</td><td>Every retry, release, and complete keeps a command output hash</td></tr></table></section>
<section class="span-12"><h2>People Sync</h2><div class="actions"><div class="action"><b>For the PLS owner</b><p>Please confirm the watchdog thresholds: claim 30s, lease &lt;3min, queue stuck lease+5min, and a complete gate that requires HTTP 200 and a file list.</p></div><div class="action"><b>For the worker operator</b><p>When you hit 500 fetch failed, do not give up: touch, sleep 3, retry; if the artifact gate lacks an open URL, do not complete.</p></div><div class="action"><b>Escalation line</b><p>If 2 retries still fail, record the stuck reason and the full command output, and hand off to the supervisor to decide release/reclaim.</p></div></div></section>
</div></main>
</body></html>

Data Model

Tables

worker_runs

| field | type | note |
| --- | --- | --- |
| id | uuid | run id |
| worker_id | text | worker |
| job_id | uuid | job |
| phase | enum | doctor, touch, claim, context, progress, build, upload, complete |
| status | enum | ok, failed, retried, stuck |
| started_at | timestamptz | start |
| ended_at | timestamptz | end |

job_leases

| field | type | note |
| --- | --- | --- |
| job_id | uuid | job |
| lease_expires_at | timestamptz | lease |
| last_heartbeat_at | timestamptz | heartbeat |
| risk_level | enum | normal, near_expiry, expired, stale |
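
One way the `risk_level` values could be derived from `lease_expires_at` is sketched below. The 3-minute and 5-minute constants come from the watchdog thresholds in this pack; mapping "running > lease + 5 min" onto `stale` is an assumption, and `classify_lease` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

NEAR_EXPIRY = timedelta(minutes=3)  # "lease < 3 min" threshold
STALE_GRACE = timedelta(minutes=5)  # "running > lease + 5 min" threshold


def classify_lease(lease_expires_at: datetime, now: datetime) -> str:
    """Map the time left on a lease to a risk_level value."""
    remaining = lease_expires_at - now
    if remaining < -STALE_GRACE:
        return "stale"        # assumption: more than 5 minutes past expiry
    if remaining <= timedelta(0):
        return "expired"
    if remaining < NEAR_EXPIRY:
        return "near_expiry"  # the point at which a progress heartbeat is due
    return "normal"


now = datetime(2026, 5, 24, 3, 44, tzinfo=timezone.utc)
print(classify_lease(now + timedelta(minutes=2), now))  # near_expiry
```

Both timestamps should be timezone-aware (or both naive), since Python refuses to subtract a naive datetime from an aware one.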

helper_events

| field | type | note |
| --- | --- | --- |
| id | uuid | event |
| job_id | uuid | job |
| command | text | helper command |
| exit_code | integer | exit |
| output_hash | text | output hash |
| error_code | text | INTERNAL_ERROR etc |
| created_at | timestamptz | time |

repair_actions

| field | type | note |
| --- | --- | --- |
| id | uuid | action |
| job_id | uuid | job |
| trigger | text | fetch failed, stale lease, artifact gate |
| action | text | touch/retry/mark_stuck/release |
| authority | enum | worker, supervisor, system |
| result | enum | success, failed, needs_human |

artifact_checks

| field | type | note |
| --- | --- | --- |
| id | uuid | check |
| job_id | uuid | job |
| primary_url | text | URL |
| http_status | integer | 200 expected |
| file_count | integer | 13 expected |
| checked_at | timestamptz | time |
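
A minimal sketch of how an `artifact_checks` row could gate `complete` follows. It assumes the gate passes only when `http_status` is 200 and `file_count` equals the expected count (13 for this pack); treating the count as an exact match is an assumption, and `ArtifactCheck`, `gate_failures`, and `may_complete` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ArtifactCheck:
    """Subset of the artifact_checks columns needed for the gate decision."""
    primary_url: str
    http_status: int
    file_count: int


def gate_failures(check: ArtifactCheck, expected_file_count: int) -> List[str]:
    """List every reason the gate fails; an empty list means the gate passes."""
    failures: List[str] = []
    if check.http_status != 200:
        failures.append(f"http_status={check.http_status}, expected 200")
    if check.file_count != expected_file_count:
        failures.append(
            f"file_count={check.file_count}, expected {expected_file_count}"
        )
    return failures


def may_complete(check: ArtifactCheck, expected_file_count: int) -> bool:
    """complete is allowed only when no gate failure was recorded."""
    return not gate_failures(check, expected_file_count)
```

Returning the failure reasons instead of a bare boolean lets the caller attach them as evidence when the job is marked stuck.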

API / Sync

  • POST /api/watchdog/worker-runs
  • PATCH /api/watchdog/job-leases/:job_id
  • POST /api/watchdog/helper-events
  • POST /api/watchdog/repair-actions
  • POST /api/watchdog/artifact-checks

Permissions / Audit

Workers may record and suggest repairs; System may run low-risk retries; Supervisors may release/reclaim; every retry/complete/release must keep a command output hash.
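
The permission and audit rule could be enforced roughly as sketched below. Several details are assumptions rather than part of the spec: SHA-256 as the hash algorithm, the exact action sets per authority (including treating `touch` as a low-risk System action), and the names `PERMITTED`, `output_hash`, `can_execute`, and `audit_record`.

```python
import hashlib

# Assumed mapping: workers only suggest, System runs low-risk retries,
# Supervisors release or reclaim.
PERMITTED = {
    "worker": frozenset(),
    "system": frozenset({"touch", "retry"}),
    "supervisor": frozenset({"release", "reclaim"}),
}


def output_hash(command_output: str) -> str:
    """Hash the raw command output so the audit row can be verified later."""
    return hashlib.sha256(command_output.encode("utf-8")).hexdigest()


def can_execute(authority: str, action: str) -> bool:
    """True when the authority is allowed to run the action itself."""
    return action in PERMITTED.get(authority, frozenset())


def audit_record(authority: str, action: str, command_output: str) -> dict:
    """Build an audit row, refusing actions the authority may not execute."""
    if not can_execute(authority, action):
        raise PermissionError(f"{authority} may not execute {action}")
    return {
        "authority": authority,
        "action": action,
        "output_hash": output_hash(command_output),
    }
```

Storing only the hash keeps the audit table small; the full command output would still need to be kept elsewhere if a supervisor is expected to read it during escalation.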

Decision Record

Decision

Adopt watchdog + system + sop + eval + governance.

Options

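  1. Write only a runbook: fast, but it cannot monitor.
  2. Build only a dashboard: gives visibility, but it cannot self-heal.
  3. watchdog control pack: combines thresholds, data model, runbook, acceptance, and an escalation path.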

Recommendation

Adopt option 3. The core of this case is production reliability, not communication or summarization.

Adoption Status

Ready for owner review.

Landing Path

D1 define thresholds; D7 run on manual helper evidence; D14 connect event data; D30 upgrade to a system/watchdog agent.

If Rejected

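Please report which thresholds are too strict, which self-healing actions carry too much authority, and which event data cannot be obtained.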

E2E Verification

Plan

  1. Publish primary HTML and appendices to public Gist.
  2. Verify primary URL returns HTTP 200.
  3. Verify Gist includes 13 files.
  4. Upload files to PLS deliverable id.
  5. Complete with stable public artifact URLs.

Primary Artifact

https://gist.github.com/esz135888/e36b37371330f2e74e73af799dd7f725#file-codex-worker-self-repair-watchdog-console-html

Evidence

  • Published public Gist: https://gist.github.com/esz135888/e36b37371330f2e74e73af799dd7f725
  • Verification command: curl -I -L -s "https://gist.github.com/esz135888/e36b37371330f2e74e73af799dd7f725#file-codex-worker-self-repair-watchdog-console-html" | head -n 8
  • File list command: gh gist view e36b37371330f2e74e73af799dd7f725 --files
  • Result: primary URL returned HTTP/2 200; public Gist file list showed all 13 files; local pending marker scan returned no matches.

Acceptance Mapping

  • Openable main artifact: codex-worker-self-repair-watchdog-console.html.
  • Owner/due/acceptance: production-brief.md, people-sync.md, acceptance-tests.md.
  • Data/toolbox path: data-model.md, production-readiness.md.
  • Decision record: decision-record.md.
{
  "project_id": "360bdd66-9e28-4b00-9561-a2c56fedeaae",
  "job_id": "8db2eb2a-78ad-4701-9e2a-ef2a857966ac",
  "learned": [
    "Worker reliability needs command-output evidence, not final status alone.",
    "Artifact gates must verify open primary URL and file list before complete.",
    "Fetch failed can often be recovered by touch, short delay, and a single retry."
  ],
  "next_worker_should_check": [
    "Whether helper_events and artifact_checks can be stored in PLS DB.",
    "Whether lease near-expiry can trigger progress heartbeat automatically.",
    "Whether release/reclaim authority is supervisor-only."
  ],
  "assumptions_to_test": [
    "Most transient complete/context failures are recoverable with one retry.",
    "Queue stuck can be detected by lease_expires_at plus last_heartbeat_at.",
    "Operators will accept block-complete when no open artifact exists."
  ],
  "upgrade_trigger": "After 3 real incident evidence records, implement PLS watchdog cron or workflow app."
}

Market Maturity

Sources Checked

Comparable Practice

Mature SRE practice does not check whether a service "looks alive"; it relies on latency, traffic, errors, saturation, and actionable alerts. Agent/worker systems additionally need tool-call tracing, run-level evidence, runbook execution, and audit.

PLS Gap

PLS already has helper commands, but it lacks a unified watchdog schema, queue/lease stuck thresholds, an artifact gate, and an audit trail for self-healing actions.

This Round Upgrade

This round adds the dashboard, data model, API, runbook, acceptance tests, people sync, and learning memory.

People Sync

LINE Draft: PLS Owner

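I have organized Codex Session / worker stability and self-repair into a watchdog production pack. Please confirm the thresholds: claim with no JSON for 30 seconds, lease under 3 minutes, running longer than lease+5 minutes, and an artifact gate that requires HTTP 200 + file list. Once approved, the next round can build the DB/API/cron.

LINE Draft: Worker Operator

When you hit 500 TypeError fetch failed, do not fail immediately: first touch, sleep 3, retry once; if the primary artifact has no public URL or does not return HTTP 200, do not complete.

Escalation

If two retries on the same job still fail, mark it stuck and attach the command output; release/reclaim goes to the supervisor.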
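
The escalation rule can be illustrated by counting failed retries per job. This is a sketch under assumptions: retry outcomes are taken to be available as `(job_id, result)` pairs using the `repair_actions.result` values, and `jobs_to_escalate` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def jobs_to_escalate(
    retry_events: Iterable[Tuple[str, str]],
    max_failed_retries: int = 2,
) -> List[str]:
    """Return the job ids whose failed retries reached the escalation limit.

    Each event is (job_id, result), with result in
    {"success", "failed", "needs_human"}.
    """
    failed = Counter(job_id for job_id, result in retry_events if result == "failed")
    return sorted(job_id for job_id, count in failed.items() if count >= max_failed_retries)


events = [
    ("job-a", "failed"),
    ("job-a", "failed"),
    ("job-b", "failed"),
    ("job-b", "success"),
]
print(jobs_to_escalate(events))  # ['job-a']
```

Jobs returned here would be marked stuck with their command output attached; the release/reclaim decision itself stays with the supervisor.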

Expected Evidence

  • helper command output.
  • artifact HTTP 200.
  • file list count.
  • repair action event.

Production Brief

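Scenario

Codex Session / worker stability and self-repair needs to move from scattered manual experience to an acceptance-testable watchdog that monitors claim, context, progress, upload-files, complete, lease, queue stuck, and the artifact gate.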

D1 / D7 / D14 / D30

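  • D1: Define watchdog metrics, thresholds, and self-healing behavior.
  • D7: Deliver the console, runbook, data model, and acceptance tests.
  • D14: Backfill evidence for 3 incident types: fetch failed, lease near expiry, artifact gate fail.
  • D30: Decide whether to continue, upgrade to a system/watchdog agent, or change owner.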

Owner / Due / Acceptance

  • Owner: PLS/Codex worker operator.
  • D7 due: 2026-05-31.
  • D30 due: 2026-06-23.
  • Acceptance: The main deliverable opens; the data model, API, permissions, audit, self-healing runbook, and pass/fail acceptance are all in place.

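Value / Money Path

Reduces worker stuck states, rework, erroneous completes, and manual troubleshooting time, so production delivery does not stall on fetch failed or a missed artifact gate. Stable delivery raises AI throughput, reduces manual firefighting, and protects the credibility of customer deliveries.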

Purpose-to-Purpose E2E

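Original purpose: worker stability and self-repair. Output: watchdog console + runbook + schema. Human adoption: the worker operator handles incidents according to the thresholds. Metric improvement: stuck-job classification rate, retry success rate, artifact gate pass, and erroneous completes reduced to zero.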

Production Readiness

Ready Now

  • The console can serve as the watchdog spec.
  • The runbook can handle fetch failed, lease risk, and artifact gate fail.
  • The data model can be moved into the PLS back office.

Not Yet Ready

  • DB migration is not yet connected.
  • Hermes queue metrics are not yet read automatically.
  • release/reclaim is not yet automated; it requires supervisor authority.

Rollback / Fail-Safe

  • Automatic repair retries at most once.
  • release/reclaim requires supervisor approval.
  • No open artifact always blocks complete.

Upgrade Path

After D14 collects evidence for 3 incident types, build the PLS watchdog workflow app and cron/agent.

Skill Usage

Selected Skills / Tools

  • frontend-design: build the openable primary HTML dashboard.
  • verification-before-completion: verify Gist HTTP, file list, upload-files, and complete before finishing.
  • PLS helper: doctor, touch, claim, context, progress, upload-files, complete.
  • Web search: check SRE golden signals, production sanity checks, and comparable agent observability practice.
  • GitHub Gist CLI: publish the public artifact.

Evidence

Solution Selection

Selected

  • watchdog: the task explicitly calls for Codex Session / worker stability.
  • system: job leases, helper events, and artifact checks are needed.
  • sop: a fixed retry/runbook is needed.
  • eval: a pass/fail gate is needed.
  • governance: release/reclaim/complete are high-risk actions.

Why Not Smaller

LINE message scripts or ordinary documents cannot prevent stale lease, queue stuck, or erroneous complete.

Why Not Larger

For now, deliver the production pack without directly changing the PLS DB or deploying an agent; upgrade after D14 produces incident evidence.
