@esz135888
Last active May 24, 2026 00:44
PLS job 2f3ba779 Codex session worker stability watchdog production pack

Data Model And API Spec

Tables

worker_heartbeats

  • id uuid primary key
  • worker_id text
  • worker_role text
  • session_mode text
  • touched_at timestamptz
  • doctor_ok boolean
  • worker_kinds text[]
  • last_claim_job_id uuid nullable

job_leases

  • job_id uuid primary key
  • attempt_id uuid
  • worker_id text
  • status text
  • claimed_at timestamptz
  • lease_expires_at timestamptz
  • last_progress_phase text
  • last_progress_at timestamptz

watchdog_incidents

  • id uuid primary key
  • job_id uuid nullable
  • incident_type enum: lease_expired, queue_stuck, claim_loop_idle, artifact_gate_failed, schema_validation_failed
  • severity enum: info, warn, critical
  • detected_at timestamptz
  • action_taken text
  • evidence jsonb
  • resolved_at timestamptz nullable

artifact_verifications

  • id uuid primary key
  • job_id uuid
  • artifact_kind text
  • url text
  • http_status int
  • file_list text[]
  • verified_at timestamptz
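
As a rough illustration, two of the tables above could be mirrored in application code like this. It is a minimal sketch, not the PLS implementation: the class names `JobLease`, `WatchdogIncident`, `IncidentType` and `Severity` are assumptions; only the field names and enum values come from the table definitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional
from uuid import UUID, uuid4


class IncidentType(str, Enum):
    # Values copied from the watchdog_incidents.incident_type enum.
    LEASE_EXPIRED = "lease_expired"
    QUEUE_STUCK = "queue_stuck"
    CLAIM_LOOP_IDLE = "claim_loop_idle"
    ARTIFACT_GATE_FAILED = "artifact_gate_failed"
    SCHEMA_VALIDATION_FAILED = "schema_validation_failed"


class Severity(str, Enum):
    INFO = "info"
    WARN = "warn"
    CRITICAL = "critical"


@dataclass
class JobLease:
    # Mirrors the job_leases table; job_id is the primary key.
    job_id: UUID
    attempt_id: UUID
    worker_id: str
    status: str
    claimed_at: datetime
    lease_expires_at: datetime
    last_progress_phase: str
    last_progress_at: datetime


@dataclass
class WatchdogIncident:
    # Mirrors the watchdog_incidents table; job_id and resolved_at are nullable.
    incident_type: IncidentType
    severity: Severity
    detected_at: datetime
    action_taken: str
    evidence: Dict[str, Any] = field(default_factory=dict)
    job_id: Optional[UUID] = None
    resolved_at: Optional[datetime] = None
    id: UUID = field(default_factory=uuid4)
```

Using string-valued enums keeps the stored values identical to the enum labels listed in the table.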

APIs

  • GET /api/pls/watchdog/queue-health
  • POST /api/pls/watchdog/incidents
  • PATCH /api/pls/jobs/:job_id/lease
  • POST /api/pls/jobs/:job_id/artifact-verification
  • POST /api/pls/watchdog/repair-proposal
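
To make the incident endpoint concrete, here is a hedged sketch of how a client might build (not send) a request for POST /api/pls/watchdog/incidents. The host `BASE_URL`, the helper name `build_incident_request`, and the JSON body shape are assumptions; the spec above fixes only the path and the column names.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import Request

# Hypothetical host; the spec does not name a deployment target.
BASE_URL = "https://pls.example.invalid"


def build_incident_request(
    incident_type: str,
    severity: str,
    evidence: Dict[str, Any],
    job_id: Optional[str] = None,
) -> Request:
    """Build a POST request for the incidents endpoint without sending it."""
    # Assumed body shape: field names follow the watchdog_incidents columns.
    body = {
        "incident_type": incident_type,
        "severity": severity,
        "job_id": job_id,
        "evidence": evidence,
    }
    return Request(
        url=f"{BASE_URL}/api/pls/watchdog/incidents",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Building the request separately from sending it keeps the payload easy to inspect and store as incident evidence.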

Permissions And Audit

The watchdog may diagnose and write incidents automatically. Release/reclaim/pause actions require policy approval and must write old state, new state, command evidence, actor and reason.
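
The gating and audit rule can be sketched as follows. This is an illustration under assumptions: `apply_action`, `AuditRecord`, and the dict-based lease are hypothetical names and shapes; the five audited fields and the set of gated actions come from the paragraph above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Actions that require policy approval, per the rule above.
GATED_ACTIONS = {"release", "reclaim", "pause"}


@dataclass
class AuditRecord:
    action: str
    old_state: Dict[str, Any]
    new_state: Dict[str, Any]
    command_evidence: str
    actor: str
    reason: str


def apply_action(
    action: str,
    lease: Dict[str, Any],
    new_state: Dict[str, Any],
    *,
    policy_approved: bool,
    command_evidence: str,
    actor: str,
    reason: str,
    audit_log: List[AuditRecord],
) -> bool:
    """Apply an action to a lease; return False if policy blocks it."""
    if action in GATED_ACTIONS and not policy_approved:
        return False  # Blocked: nothing is changed and nothing is audited.
    old_state = dict(lease)  # Snapshot before mutating.
    lease.update(new_state)
    audit_log.append(
        AuditRecord(action, old_state, dict(lease), command_evidence, actor, reason)
    )
    return True
```

A real implementation would probably also record blocked attempts as incidents; that is left out here to keep the sketch short.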

Rollback

Rollback restores previous worker owner/lease when possible. If state cannot be safely restored, pause the pipeline with a visible blocker reason rather than forcing progress.
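
A minimal sketch of that rollback rule, assuming leases are plain dicts. The function name `rollback_lease`, the `paused` status value, and the `blocker_reason` key are hypothetical; the spec only says to restore the previous owner/lease or pause with a visible reason.

```python
from typing import Any, Dict, Optional


def rollback_lease(
    current: Dict[str, Any], previous: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Return a rolled-back copy of the lease, or a paused copy with a reason."""
    restorable = (
        previous is not None
        and previous.get("worker_id")
        and previous.get("lease_expires_at")
    )
    if restorable:
        # Restore the previous owner and lease expiry.
        return {
            **current,
            "worker_id": previous["worker_id"],
            "lease_expires_at": previous["lease_expires_at"],
        }
    # State cannot be restored safely: pause instead of forcing progress.
    return {
        **current,
        "status": "paused",
        "blocker_reason": "previous owner/lease could not be restored safely",
    }
```

The function returns a copy rather than mutating its input, so the pre-rollback state stays available for the audit record.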

Decision Record

Decision

Create a production watchdog console for Codex Session worker stability and self-repair.

Options Considered

  • Communication-only escalation: too weak for platform reliability.
  • Static runbook only: useful but cannot expose live gates.
  • HTML watchdog console plus schema/runbook/rules: best minimal production artifact.
  • Full PR/system implementation: better later, but current context has no repo target or deployment contract.

Recommendation

Adopt the console as the D1 artifact, then use the schema and runbook to implement scheduled checks in PLS.

Adoption Status

Ready for PLS review. Self-heal actions remain policy-gated until Louis/platform approval.

Feedback Needed If Not Adopted

Specify which policy is unacceptable: auto release/reclaim, alert channel, retry behavior, or audit requirements.

E2E Verification

Checks

  1. Context confirms codex-session-stability and watchdog solution selection.
  2. Primary artifact is an HTML console, not Markdown.
  3. Console includes D1/D7/D14/D30, purpose-to-purpose E2E, owner/due/acceptance, project merge rule, people sync and production readiness.
  4. Appendices include data model, API, permissions, audit, rollback, runbook, market maturity, learning memory and decision record.
  5. Public artifact URL must return HTTP 200 and gh gist view --files must list worker-stability-watchdog-console.html.
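
Check 5 can be expressed as a small pure function. This is a sketch under assumptions: `artifact_gate` is a hypothetical name, and it expects the HTTP status and file list to have been collected beforehand (for example from curl and gh gist view --files output).

```python
from typing import Iterable, Tuple


def artifact_gate(
    http_status: int, file_list: Iterable[str], primary_file: str
) -> Tuple[bool, str]:
    """Return (passed, reason) for the primary-artifact gate."""
    if http_status != 200:
        return False, f"primary artifact URL returned HTTP {http_status}"
    if primary_file not in set(file_list):
        return False, f"{primary_file} missing from file listing"
    return True, "ok"
```

Returning a reason string alongside the boolean gives the watchdog something to store as incident evidence when the gate fails.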

Current Result

Pass. Public Gist returned HTTP 200 on 2026-05-24 and gh gist view --files listed worker-stability-watchdog-console.html plus all appendices.

Primary URL: https://gist.github.com/esz135888/ad34202deb260f9da8d854bdac5bb4c5#file-worker-stability-watchdog-console-html

{
  "job_id": "2f3ba779-aa85-4a30-b207-83d93fdc2bb0",
  "topic_key": "codex-session-stability",
  "what_hermes_learned": [
    "PLS worker reliability depends on disciplined command sequencing plus explicit artifact verification.",
    "Completion schema errors are a recurring operational risk and should be treated as watchdog-detectable incidents.",
    "Self-heal should be policy-limited: diagnose and retry safely before release/reclaim."
  ],
  "assumptions_to_test_next": [
    "A 10-minute no-progress window is the right queue stuck threshold.",
    "Stale leases can be safely released or reclaimed after policy approval.",
    "Gist/URL verification is acceptable as a D1 artifact gate until PLS has native file tunnel verification."
  ],
  "next_round_priority": "Implement a scheduled watchdog worker that writes incidents and repair proposals from real queue and lease data."
}

Market Maturity

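Sources

  • https://sre.google/resources/book-update/monitoring-distributed-systems/
  • https://sre.google/resources/book-update/automation-at-google/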

Comparable Practice

Google SRE monitoring guidance treats monitoring as a way to expose actionable production symptoms. The PLS equivalent is not "more logs"; it is clear signals for lease expiry, queue stuck, claim-loop idle and artifact gate failure.

Google SRE automation guidance frames automation as the path away from repeated operational toil. The PLS equivalent is policy-limited self-heal: retry transient errors, correct known schema shape errors, and escalate only when safe automation cannot repair.

Maturity Gap

PLS has the helper commands and job contract, but the reliability loop is still mostly carried by the current worker's discipline. The next maturity level is an auditable watchdog that can detect, repair, or pause without relying on memory.

People Sync

Targets

  • 80241131-85fe-44f1-a347-35a39b8f6ce5
  • a99e5c60-898a-4dbb-aeab-47e62c51bcc7
  • b4e18b57-9add-4876-949a-e8103691030c

LINE Draft

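This round I turned Codex Session / worker stability into a watchdog console, so it no longer depends on someone remembering to check the logs.

Please help confirm the D7 acceptance policy:

  1. Should the watchdog be allowed to release/reclaim stale leases automatically?
  2. Should queue stuck push a LINE alert, or only appear in the PLS ops dashboard?
  3. When the artifact gate fails, should the worker pause, or is one retry allowed?

If there is no reply today, I will mark the policy as pending_approval first; the watchdog will only diagnose and generate repair proposals, and will not execute release/reclaim.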

Expected Reply Signal

The reply should approve or reject the three policy choices and name the owner for D7 auto-triage proof.

Production Acceptance

Owner

PLS platform / Louis.

Due

D7 after this iteration for first auto-triage proof.

Acceptance Criteria

  • Stale lease scenario produces a lease_expired incident with evidence.
  • Queue stuck scenario produces a queue_stuck incident and repair proposal.
  • Artifact gate failure blocks complete until a primary artifact is openable.
  • Completion schema errors are corrected into accepted array + enum shape.
  • All automated actions write audit records.

Stop Conditions

  • Do not auto-release/reclaim if policy approval is missing.
  • Do not complete without HTTP 200 primary artifact verification.
  • Do not treat no-job claim as success; inspect backlog and worker role eligibility.

Codex Session Worker Stability Watchdog Brief

Scene

The shared signal is Codex Session / worker stability and self-repair. Latest evidence says E2E signals should advance projects and strengthen huber clones. The risk is silent drift: jobs can be claimed, leased, or completed incorrectly without enough production visibility.

D1 / D7 / D14 / D30

  • D1: publish a watchdog console, define lease/queue/artifact gates, and normalize completion artifact schema.
  • D7: detect stale lease, queue stuck, no-job loop, and artifact gate failures with evidence.
  • D14: add safe self-heal actions for touch/release/reclaim/pause under explicit policy.
  • D30: integrate the watchdog into a PLS ops console with SLO, audit, rollback, and worker capability score.

Purpose-to-Purpose E2E

  • Original purpose: signals should push projects forward and improve worker capability.
  • Artifact: watchdog console plus rules, schema, runbook, and acceptance gates.
  • Human adoption: PLS operators use the console to approve thresholds and escalation policy.
  • System outcome: fewer stuck jobs, fewer invalid completes, clearer queue repair proposals.
  • Money/risk path: reduces manual supervision cost, protects delivery throughput, and prevents false claims of deployment/GitHub/LINE success.

Owner / Due / Acceptance

  • Owner: PLS platform / Louis.
  • Due: D7 for first auto-triage proof.
  • Acceptance: one simulated stale lease, one no-job loop, and one artifact-gate failure are detected and recorded with command evidence.

Watchdog Runbook

Trigger

Run on every heartbeat, after doctor, touch, and claim, and again before every complete.

Diagnostic Steps

  1. Verify doctor reports token, env file, worker role and worker kinds.
  2. Confirm touch succeeds and records a current heartbeat.
  3. Inspect claim output for job id, lease expiry, worker kind, and completion contract.
  4. If a job is claimed, read context and immediately write reading_context progress.
  5. Before complete, verify the primary artifact URL returns HTTP 200 and file listing includes the primary artifact.
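
The ordering in the steps above can be sketched as a small driver. The helper commands are injected as plain callables because their real signatures are not documented here; `run_heartbeat` and the `verify_artifact` key are hypothetical names.

```python
from typing import Callable, Dict, List


def run_heartbeat(commands: Dict[str, Callable[..., object]]) -> List[str]:
    """Run one heartbeat in contract order and return the commands invoked."""
    trace: List[str] = []

    def step(name: str, *args: object) -> object:
        trace.append(name)
        return commands[name](*args)

    if not step("doctor"):
        return trace  # Environment is not healthy: stop before touching.
    if not step("touch"):
        return trace  # Heartbeat could not be recorded.
    job_id = step("claim")
    if job_id is None:
        return trace  # No job claimed this heartbeat.
    step("context", job_id)
    step("progress", job_id, "reading_context")
    step("progress", job_id, "building_artifact")
    # Artifact building itself is out of scope for this sketch.
    if not step("verify_artifact", job_id):
        return trace  # Gate failed: complete is never called.
    step("complete", job_id)
    return trace
```

The returned trace doubles as command evidence: it shows which step the heartbeat stopped at.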

Repair Policy

  • Retry progress/complete once when the error is transient network or fetch failure.
  • Convert artifacts_json schema errors into corrected enum/array shape immediately.
  • Do not claim GitHub, deployment, LINE, or artifact success without verified command output.
  • Do not release/reclaim jobs unless the approved policy says the lease is stale.
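
The retry-once rule in the first bullet might be sketched like this. `TransientError` and `retry_once` are hypothetical names; how a real worker classifies an error as transient is an assumption left to the caller.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for transient network or fetch failures."""


def retry_once(operation: Callable[[], T]) -> T:
    """Run operation; on a transient failure, retry exactly one more time."""
    try:
        return operation()
    except TransientError:
        # A second failure, transient or not, propagates to the caller.
        return operation()
```

Non-transient errors are not caught, so a schema or permission problem surfaces immediately instead of being retried.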

Escalation

If no eligible job exists after repeated claims, produce a backlog/capability repair proposal rather than staying silent.
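
One way to detect "repeated claims" is a consecutive-miss counter. In this sketch `ClaimLoopMonitor` is a hypothetical class, and the default threshold of 3 follows the claim_loop_idle rule elsewhere in this pack.

```python
from typing import Optional


class ClaimLoopMonitor:
    """Track consecutive no-job claims and flag when escalation is due."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_empty = 0

    def record(self, job_id: Optional[str]) -> bool:
        """Record one claim result; return True when a repair proposal is due."""
        if job_id is None:
            self.consecutive_empty += 1
        else:
            self.consecutive_empty = 0  # Any successful claim resets the streak.
        return self.consecutive_empty >= self.threshold
```

When `record` returns True, the worker would produce the backlog/capability repair proposal instead of claiming again silently.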

Signal Annotations

  • project: AI native project 9c53826a-5f44-44da-b296-0b86fec226a6, topic codex-session-stability.
  • person: related profiles 80241131-85fe-44f1-a347-35a39b8f6ce5, a99e5c60-898a-4dbb-aeab-47e62c51bcc7, b4e18b57-9add-4876-949a-e8103691030c.
  • decision: merge source projects 360bdd66, 9ae8f0da, and d2afbba2 into one platform reliability project.
  • risk: silent worker drift, stale leases, stuck queues, schema validation failures, and unverifiable artifact completion.
  • source: E2E test says incoming signals must advance projects and strengthen huber clones.
  • solution_selection: watchdog/watchdog_config.

Skill Usage

  • purpose_e2e_toolbox_v2: applied to D1/D7/D14/D30, purpose-to-purpose E2E, value path, human capability, stack, schema/API/sync/permissions/audit and decision record.
  • PLS solution matrix: selected watchdog because the topic is reliability and self-repair.
  • Market maturity references: used Google SRE monitoring and automation guidance as inputs; the primary output remains a production watchdog artifact.

Solution Selection

Selected

watchdog/watchdog_config.

Rationale

The shared topic is platform reliability, not a single blocked owner. The production problem is observable and operational: jobs need lease checks, queue stuck detection, artifact gates, safe retry, audit, and escalation.

Rejected Options

  • LINE draft: useful only for human escalation, not for stability detection.
  • One-page memo: documents the issue but does not create a monitoring surface.
  • Generic dashboard: too broad unless tied to concrete watchdog rules.

Next Upgrade

When D7 proves detection, turn the rules into a scheduled PLS worker with policy-limited self-heal actions.

rules:
  - id: lease_expired
    threshold: "now > lease_expires_at + 60s"
    action: "touch worker; mark stale lease; release or reclaim only under approved policy"
    escalation: "notify Louis after 2 failed repairs"
  - id: queue_stuck
    threshold: "pending/running with no heartbeat for more than 10 minutes"
    action: "inspect worker_kind, source, contract and last_progress_payload"
    escalation: "create backlog/capability repair proposal if no eligible worker exists"
  - id: claim_loop_idle
    threshold: "3 consecutive no-job claims"
    action: "inspect eligible worker roles and backlog distribution"
    escalation: "recommend project/backlog/capability repair"
  - id: artifact_gate_failed
    threshold: "complete requested without openable primary artifact or HTTP 200 verification"
    action: "block complete and request production artifact"
    escalation: "retry once only if transient"
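
The two time-based thresholds in the rules above could be evaluated as follows. This is a sketch, not the scheduled worker: the function names are hypothetical, and the status strings `pending` and `running` are taken from the queue_stuck threshold text.

```python
from datetime import datetime, timedelta

LEASE_GRACE = timedelta(seconds=60)   # lease_expired: lease_expires_at + 60s
STUCK_WINDOW = timedelta(minutes=10)  # queue_stuck: no heartbeat for > 10 minutes


def is_lease_expired(now: datetime, lease_expires_at: datetime) -> bool:
    """True once the lease is past its expiry plus the grace period."""
    return now > lease_expires_at + LEASE_GRACE


def is_queue_stuck(now: datetime, status: str, last_heartbeat_at: datetime) -> bool:
    """True for pending/running jobs with no heartbeat inside the window."""
    return status in {"pending", "running"} and now - last_heartbeat_at > STUCK_WINDOW
```

Both comparisons are strict, so a lease exactly 60 seconds past expiry or a heartbeat exactly 10 minutes old does not yet trigger an incident. Callers should pass either all timezone-aware or all naive datetimes, since Python refuses to compare the two.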
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Codex Session Worker Stability Watchdog</title>
<style>
:root {
--ink:#17212b; --muted:#637082; --line:#dbe2ea; --bg:#f5f7f9; --panel:#fff;
--blue:#245f8f; --green:#1f7a5a; --amber:#b77a18; --red:#b13b32; --soft:#eef4f8;
}
*{box-sizing:border-box} body{margin:0;background:var(--bg);color:var(--ink);font-family:Inter,ui-sans-serif,system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;line-height:1.45}
header{padding:28px 36px 22px;background:#fff;border-bottom:1px solid var(--line)}
h1,h2,h3{margin:0;letter-spacing:0} h1{font-size:28px;max-width:1050px} h2{font-size:17px;margin-bottom:12px} h3{font-size:14px;margin-bottom:8px}
p{margin:0;color:var(--muted)} .eyebrow{font-size:12px;font-weight:800;text-transform:uppercase;color:var(--blue);margin-bottom:8px}
.wrap{padding:24px 36px 42px}.grid{display:grid;grid-template-columns:repeat(12,1fr);gap:16px;max-width:1280px;margin:0 auto}
.panel{background:var(--panel);border:1px solid var(--line);border-radius:8px;padding:18px;box-shadow:0 1px 2px rgba(0,0,0,.03)}
.span-12{grid-column:span 12}.span-8{grid-column:span 8}.span-6{grid-column:span 6}.span-4{grid-column:span 4}.span-3{grid-column:span 3}
.kpis{display:grid;grid-template-columns:repeat(4,1fr);gap:12px}.kpi{border:1px solid var(--line);border-radius:8px;padding:14px;min-height:92px;background:#fbfcfd}
.kpi strong{display:block;font-size:28px;margin-bottom:4px}.kpi span{color:var(--muted);font-size:13px}
table{width:100%;border-collapse:collapse;font-size:13px} th,td{text-align:left;padding:10px 8px;border-bottom:1px solid var(--line);vertical-align:top}
th{font-size:12px;text-transform:uppercase;color:var(--muted);background:#fbfcfd}.status{display:inline-flex;border-radius:999px;padding:4px 9px;font-size:12px;font-weight:800;white-space:nowrap}
.green{background:#e2f1ea;color:var(--green)}.amber{background:#f7ecd8;color:#81530f}.red{background:#f9e6e3;color:var(--red)}.blue{background:#e4eef8;color:var(--blue)}
.timeline{display:grid;grid-template-columns:repeat(4,1fr);gap:12px}.stage{border-left:4px solid var(--blue);background:#fbfcfd;padding:12px;border-radius:0 8px 8px 0;min-height:150px}
.stage:nth-child(2){border-color:var(--amber)}.stage:nth-child(3){border-color:var(--green)}.stage:nth-child(4){border-color:var(--red)}
ul{margin:8px 0 0 18px;padding:0;color:var(--muted)}li{margin:4px 0}.flow{display:grid;grid-template-columns:repeat(5,1fr);gap:10px}
.step{border:1px solid var(--line);border-radius:8px;padding:12px;background:#fbfcfd;min-height:118px}.step b{display:block;margin-bottom:6px}.small{font-size:12px;color:var(--muted)}
.code{background:#15202b;color:#e9f0f6;border-radius:8px;padding:12px;font:12px ui-monospace,SFMono-Regular,Menlo,monospace;white-space:pre-wrap;overflow:auto}
.ask{background:var(--soft);border-radius:8px;padding:12px;color:var(--ink);font-size:13px;white-space:pre-wrap}
@media(max-width:900px){header,.wrap{padding-left:18px;padding-right:18px}.span-8,.span-6,.span-4,.span-3{grid-column:span 12}.kpis,.timeline,.flow{grid-template-columns:1fr}}
</style>
</head>
<body>
<header>
<div class="eyebrow">PLS production artifact · codex-session-stability · 2026-05-24</div>
<h1>Codex Session Worker Stability Watchdog</h1>
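<p>Turn "incoming signals must advance projects and strengthen huber clones" into a worker reliability loop that can be monitored, self-repaired, and verified against acceptance criteria.</p>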
</header>
<main class="wrap">
<section class="grid">
<div class="panel span-12">
<div class="kpis">
<div class="kpi"><strong>11</strong><span>recent stability signals</span></div>
<div class="kpi"><strong>15m</strong><span>lease expiry hard gate</span></div>
<div class="kpi"><strong>3</strong><span>source projects recommended for merge</span></div>
<div class="kpi"><strong>D7</strong><span>first stuck-queue self-repair acceptance</span></div>
</div>
</div>
<div class="panel span-8">
<h2>Watchdog Rules</h2>
<table>
<thead><tr><th>Signal</th><th>Threshold</th><th>Action</th><th>Escalation</th><th>Pass Evidence</th></tr></thead>
<tbody>
<tr><td><span class="status red">lease_expired</span></td><td>now &gt; lease_expires_at + 60s</td><td>touch worker, mark stale lease, release or reclaim job only under approved policy</td><td>after 2 failed repairs notify Louis</td><td>job returns to claimable or running with new lease</td></tr>
<tr><td><span class="status amber">queue_stuck</span></td><td>pending/running with no heartbeat &gt; 10m</td><td>diagnose worker_kind, source, contract, last_progress_payload</td><td>create repair proposal if no eligible worker</td><td>queue age decreases within one heartbeat</td></tr>
<tr><td><span class="status blue">claim_loop_idle</span></td><td>3 no-job claims</td><td>inspect eligible kinds and backlog shape</td><td>recommend backlog/capability repair</td><td>next heartbeat has job or explicit repair</td></tr>
<tr><td><span class="status green">artifact_gate</span></td><td>before complete</td><td>verify primary artifact URL HTTP 200 and files listed</td><td>block complete if no openable artifact</td><td>curl/gh output stored in e2e verification</td></tr>
</tbody>
</table>
</div>
<div class="panel span-4">
<h2>Heartbeat Command Contract</h2>
<div class="code">doctor
touch
claim
context &lt;job_id&gt;
progress &lt;job_id&gt; "Processing..." reading_context
progress ... building_artifact
verify artifact URL
complete &lt;job_id&gt; summary artifacts_json</div>
</div>
<div class="panel span-12">
<h2>D1 / D7 / D14 / D30 Path</h2>
<div class="timeline">
<div class="stage"><h3>D1 · Instrument</h3><ul><li>Define lease/queue/artifact gates.</li><li>Publish watchdog console and runbook.</li><li>Normalize artifacts_json enum values.</li></ul></div>
<div class="stage"><h3>D7 · Auto-triage</h3><ul><li>Detect stuck jobs and missing worker kinds.</li><li>Generate repair proposal when no eligible task exists.</li><li>Capture verification output before complete.</li></ul></div>
<div class="stage"><h3>D14 · Self-heal</h3><ul><li>Auto-touch or release stale leases under policy.</li><li>Route impossible jobs to pause/fail with reason.</li><li>Score worker health by role.</li></ul></div>
<div class="stage"><h3>D30 · Reliability OS</h3><ul><li>Dashboard enters PLS ops console.</li><li>Worker pools have SLO, audit and rollback.</li><li>Huber clone capability improves from evidence.</li></ul></div>
</div>
</div>
<div class="panel span-12">
<h2>Purpose-to-Purpose E2E</h2>
<div class="flow">
<div class="step"><b>Signal</b><span class="small">E2E test says signals must advance projects and strengthen huber clones.</span></div>
<div class="step"><b>Artifact</b><span class="small">Watchdog console, rules, schema, runbook and acceptance gates.</span></div>
<div class="step"><b>Adoption</b><span class="small">Worker follows command contract and records phase-specific progress.</span></div>
<div class="step"><b>Reliability</b><span class="small">Queue stuck, lease expiry and artifact failures are caught before silent drift.</span></div>
<div class="step"><b>Value</b><span class="small">Less manual supervision, fewer failed completions, more reliable AI project throughput.</span></div>
</div>
</div>
<div class="panel span-6">
<h2>Market Maturity Fit</h2>
<p>Google SRE monitoring guidance frames monitoring as production visibility for urgent, actionable symptoms. Google SRE automation guidance makes reliability work scalable by replacing repeatable toil with engineered automation. This watchdog applies those practices to PLS workers: only alert on actionable stuck states, and automate safe recovery before human escalation.</p>
<ul>
<li>https://sre.google/resources/book-update/monitoring-distributed-systems/</li>
<li>https://sre.google/resources/book-update/automation-at-google/</li>
</ul>
</div>
<div class="panel span-6">
<h2>Production Readiness</h2>
<table>
<tbody>
<tr><th>Data</th><td>worker_heartbeats, job_leases, watchdog_incidents, artifact_verifications</td></tr>
<tr><th>API</th><td>GET queue health, POST incident, PATCH lease, POST artifact verification, POST repair proposal</td></tr>
<tr><th>Permissions</th><td>watchdog can diagnose; release/reclaim needs policy; destructive repair forbidden</td></tr>
<tr><th>Audit</th><td>Every auto-action stores old state, new state, command output, actor and reason</td></tr>
<tr><th>Rollback</th><td>Self-heal actions are reversible: restore previous owner/lease or pause with reason</td></tr>
</tbody>
</table>
</div>
<div class="panel span-4">
<h2>Acceptance</h2>
<p><b>Owner:</b> PLS platform / Louis.</p>
<p><b>Due:</b> D7 for first auto-triage proof.</p>
<p><b>Pass:</b> one simulated stale lease, one no-job loop, and one artifact-gate failure are detected with evidence.</p>
</div>
<div class="panel span-4">
<h2>Project Merge Rule</h2>
<p>Merge <code>360bdd66</code>, <code>9ae8f0da</code>, and <code>d2afbba2</code> into one stability project because claim/lease/queue stuck behavior is a shared platform reliability concern.</p>
</div>
<div class="panel span-4">
<h2>People Sync</h2>
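<div class="ask">Please confirm whether this watchdog's D7 acceptance should be wired into PLS ops:
1. Should stale leases be released/reclaimed automatically?
2. Should queue stuck alerts be pushed to LINE, or only shown in the dashboard?
3. When the artifact gate fails, should the worker pause or retry once?</div>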
</div>
</section>
</main>
</body>
</html>