@esz135888
Created May 23, 2026 23:34
PLS Codex Session stability watchdog pack - job 13f31bea

Artifact URL Or PR

Primary artifact: https://gist.github.com/esz135888/329091fb21a3b0d162a7878e8e802042

PR/deployment: not claimed. The PLS context provided no repository, branch, deployment target, or backend access.

Upload-files note: PLS context returned deliverable: null, so no deliverable_id was available for the fixed upload-files helper. This pack uses a shared-cloud Gist.

Verification:

  • Gist published publicly.
  • HTTP and GitHub CLI verification completed before PLS completion writeback.

Production Readiness: Data Model, API, Sync, Permissions, Audit

Data Model

worker_heartbeat

  • id
  • worker_id
  • automation_id
  • status
  • doctor_ok
  • touch_ok
  • claim_ok
  • current_job_id
  • lease_expires_at
  • created_at

worker_job_attempt

  • id
  • job_id
  • worker_id
  • attempt_id
  • phase
  • primary_artifact_ref
  • complete_status
  • error_code
  • error_message
  • started_at
  • completed_at

worker_watchdog_event

  • id
  • worker_id
  • job_id
  • event_type
  • severity
  • evidence_ref
  • action_taken
  • requires_human
  • created_at
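
The three records above could be sketched as Python dataclasses. This is only an illustration: the field names come from the lists above, but every type, and which fields are optional, is an assumption, since the pack does not define them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class WorkerHeartbeat:
    id: str
    worker_id: str
    automation_id: str
    status: str
    doctor_ok: bool
    touch_ok: bool
    claim_ok: bool
    current_job_id: Optional[str]         # assumed None when no job is claimed
    lease_expires_at: Optional[datetime]  # assumed None when no lease is held
    created_at: datetime


@dataclass
class WorkerJobAttempt:
    id: str
    job_id: str
    worker_id: str
    attempt_id: str
    phase: str                            # e.g. context, progress, artifact, complete
    primary_artifact_ref: Optional[str]
    complete_status: Optional[str]
    error_code: Optional[str]
    error_message: Optional[str]
    started_at: datetime
    completed_at: Optional[datetime]      # assumed None while the attempt is running


@dataclass
class WorkerWatchdogEvent:
    id: str
    worker_id: str
    job_id: Optional[str]                 # assumed None for no-job events
    event_type: str
    severity: str
    evidence_ref: Optional[str]
    action_taken: Optional[str]
    requires_human: bool
    created_at: datetime
```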

API / Sync Spec

POST /api/pls/watchdog/heartbeat

Stores doctor/touch/claim health and current lease state.

POST /api/pls/watchdog/job-attempt

Stores context/progress/artifact/complete phases.

GET /api/pls/watchdog/workers/:worker_id/status

Returns health, current job, stuck reason, and next action.
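
A minimal sketch of how a worker might build calls to these endpoints. The paths come from the spec above; the base URL, the payload field names, and the helper names are hypothetical, and nothing is sent over the network here.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://pls.example.invalid"  # hypothetical host, not from the pack


def build_heartbeat_request(worker_id, doctor_ok, touch_ok, claim_ok,
                            current_job_id=None, lease_expires_at=None):
    """Build (but do not send) a POST to the heartbeat endpoint."""
    payload = {
        "worker_id": worker_id,
        "doctor_ok": doctor_ok,
        "touch_ok": touch_ok,
        "claim_ok": claim_ok,
        "current_job_id": current_job_id,
        "lease_expires_at": lease_expires_at,  # assumed ISO 8601 string or None
    }
    return urllib.request.Request(
        BASE_URL + "/api/pls/watchdog/heartbeat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(worker_id):
    """URL for the per-worker status endpoint, with the ID percent-encoded."""
    quoted = urllib.parse.quote(worker_id, safe="")
    return f"{BASE_URL}/api/pls/watchdog/workers/{quoted}/status"
```

A caller would pass the returned request to `urllib.request.urlopen` once a real backend and authentication scheme exist; both are outside what this pack claims.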

Permissions

Worker status can be read by PLS operators and platform admins.

Raw token/env details must never be exposed.

Human-readable alerts can include worker ID, job ID, phase, and artifact URL, but not secrets.
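
One way to enforce the rule above is an allowlist rather than a blocklist: only the four permitted fields are copied into an alert, so an unexpected secret-bearing key is dropped by default. The dictionary keys and function name below are assumptions for illustration.

```python
# Fields the Permissions section allows in human-readable alerts.
ALERT_SAFE_FIELDS = ("worker_id", "job_id", "phase", "artifact_url")


def to_alert_fields(record):
    """Return only the allowlisted fields of a status record."""
    return {key: record[key] for key in ALERT_SAFE_FIELDS if key in record}
```

For example, a record that also carries a `token` or `env` key yields an alert body containing neither.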

Audit

Audit events: heartbeat received, job claimed, context read, progress written, artifact verified, complete succeeded, complete failed, no-job proposal produced, human escalation sent.
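
The nine audit events could be modelled as a closed enumeration so that a misspelled event name fails early. The snake_case identifiers and the record shape are assumptions; only the event meanings come from the list above.

```python
from datetime import datetime, timezone
from enum import Enum


class AuditEvent(str, Enum):
    HEARTBEAT_RECEIVED = "heartbeat_received"
    JOB_CLAIMED = "job_claimed"
    CONTEXT_READ = "context_read"
    PROGRESS_WRITTEN = "progress_written"
    ARTIFACT_VERIFIED = "artifact_verified"
    COMPLETE_SUCCEEDED = "complete_succeeded"
    COMPLETE_FAILED = "complete_failed"
    NO_JOB_PROPOSAL_PRODUCED = "no_job_proposal_produced"
    HUMAN_ESCALATION_SENT = "human_escalation_sent"


def audit_record(event, worker_id, job_id=None):
    """Build one audit entry; AuditEvent(event) raises ValueError on unknown names."""
    return {
        "event": AuditEvent(event).value,
        "worker_id": worker_id,
        "job_id": job_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```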

Decision Record

Decision

Create a Codex Session stability watchdog pack.

Problem

PLS workers can appear alive while projects do not advance, artifacts are missing, complete is rejected, or jobs become unavailable after claim. This creates invisible delivery risk.

Options

  1. Write a LINE warning only.
  2. Create an SOP for worker operators.
  3. Create watchdog config plus runbook, data model, E2E tests, and people sync.
  4. Build a full reliability console immediately.

Recommendation

Choose option 3.

Reasoning

The topic is explicitly worker stability and self-repair. Watchdog config creates measurable thresholds and bounded actions without pretending a full deployed system exists.

Adoption Status

Recommended for D1 implementation.

Feedback Needed If Not Adopted

Provide target alert channel, PLS backend table names, worker dashboard owner, and whether auto-release is allowed for expired leases.

E2E Verification Plan

Verification This Round

Artifact verification: shared-cloud Gist will be verified by HTTP and GitHub CLI before completion.

Runtime implementation: not claimed. The job context did not include a repo URL, PLS backend access, or deployment target.

Golden Tests

Test 1: Healthy no-job run. Expected: doctor ok, touch ok, claim ok with no job, repair proposal created after repeated no-job.

Test 2: Claimed job normal path. Expected: context ok, progress written, primary artifact verified, complete ok.

Test 3: Missing deliverable ID. Expected: upload-files skipped with explicit reason, shared-cloud artifact used, decision record notes gap.

Test 4: Complete rejected for missing kind. Expected: missing kind added, retry once, alert if rejected again.

Test 5: Job not found after claim. Expected: no false completion claim, artifact evidence preserved, ops notified.

Test 6: Lease expiring. Expected: alert fires before lease expiry and runbook suggests retouch or requeue decision.
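
Test 4 describes a bounded retry: add the missing kind, retry once, and alert if the second attempt is also rejected. A sketch of that flow follows. The shape of the complete result (`ok`, `missing_kind`) and the two callables are hypothetical stand-ins, since the pack does not specify the real complete contract.

```python
def complete_with_one_retry(complete, artifacts, make_artifact):
    """Call complete once, and at most once more after adding the missing kind.

    complete:      callable(list_of_artifacts) -> dict with "ok" and, on
                   rejection, an optional "missing_kind" (assumed shape).
    make_artifact: callable(kind) -> artifact of that kind.
    """
    first = complete(artifacts)
    if first.get("ok"):
        return {"status": "completed", "attempts": 1, "alert": False}

    missing = first.get("missing_kind")
    if missing is None:
        # Rejected for a reason this helper cannot repair: escalate, do not retry.
        return {"status": "rejected", "attempts": 1, "alert": True}

    second = complete(artifacts + [make_artifact(missing)])
    if second.get("ok"):
        return {"status": "completed", "attempts": 2, "alert": False}
    return {"status": "rejected", "attempts": 2, "alert": True}
```

The helper never reports success unless complete itself returned `ok`, which matches the pack's rule that a failed complete must not be presented as a success.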

Pass Threshold

The watchdog passes D1 if all six golden tests have expected status and at least one real worker run is recorded with artifact evidence.
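
The pass threshold is a conjunction of two conditions and can be checked mechanically. In this sketch the test keys (`test_1` to `test_6`) and the function name are assumptions.

```python
GOLDEN_TESTS = tuple(f"test_{i}" for i in range(1, 7))


def watchdog_passes_d1(test_results, real_runs_with_evidence):
    """True only if all six golden tests passed and at least one real
    worker run was recorded with artifact evidence."""
    all_tests_pass = all(test_results.get(name) is True for name in GOLDEN_TESTS)
    return all_tests_pass and real_runs_with_evidence >= 1
```

A missing test result counts as a failure, so an incomplete test run cannot pass D1 by omission.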

{
"job_id": "13f31bea-7900-462d-8a46-56718f34c66f",
"topic_key": "codex-session-stability",
"ai_native_project_id": "9c53826a-5f44-44da-b296-0b86fec226a6",
"learning": [
"PLS worker reliability should be judged by project advancement evidence, not only heartbeat liveness.",
"The recurring failure classes are claim/no-job loops, lease expiry, missing deliverable IDs, artifact gaps, complete rejection, and job-not-found after claim.",
"A watchdog should preserve evidence and escalate rather than pretending a failed complete succeeded."
],
"next_round": [
"Implement worker_heartbeat, worker_job_attempt, and worker_watchdog_event storage in PLS.",
"Add first alert for complete_rejected and lease_expiring_with_claimed_job.",
"Create a reliability console once D1/D7 events are collected."
]
}

Market Context And Maturity

External Context

Google SRE guidance distinguishes monitoring symptoms and causes, and emphasizes alerts that require timely human action. This supports alerting on worker stuck symptoms such as lease expiry, missing artifact, and complete rejection.

Google SRE also frames toil as manual, repetitive, automatable operational work. A Codex worker watchdog should reduce toil by turning repeated claim/lease/queue checks into structured events and bounded self-heal actions.

OpenTelemetry describes observability around signals such as logs, metrics, and traces. PLS should model worker heartbeat, job attempts, and watchdog events as observable signals rather than unstructured chat output.

PLS Maturity Rating

Current maturity: Level 2 of 5.

Reason: the doctor/touch/claim checks exist and workers can produce artifacts, but stuck states and result-contract failures are not yet a first-class reliability surface.

Target D30 maturity: Level 4 of 5.

Reason: by D30, PLS should have per-worker health, stuck classification, artifact evidence rate, result-contract failure rate, and an escalation runbook.

People Sync

Targets

  • PLS platform owner
  • Codex Session worker maintainer
  • Related profile IDs from PLS context:
    • 80241131-85fe-44f1-a347-35a39b8f6ce5
    • a99e5c60-898a-4dbb-aeab-47e62c51bcc7
    • b4e18b57-9add-4876-949a-e8103691030c

LINE Draft

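This round's Codex Session / worker stability work has been compiled into a watchdog pack. The point is not whether the worker wakes up, but whether it actually advances the project once awake: doctor/touch/claim/context/progress/artifact/complete must each have a status, and stuck claim/lease/queue states must be detectable.

Please reply before 2026-05-24 18:00: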

owner=<name>; adopt=watchdog_config|revise; alert_channel=<LINE/PLS ops>; first_metric=<lease_expiry|complete_rejected|artifact_missing|repeated_no_job>; blocker=<none/text>

Expected Adoption Signal

An owner accepts the watchdog thresholds and selects the first metric to implement in PLS ops.
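
The operator reply format in the LINE draft is regular enough to parse and validate. A minimal sketch, assuming replies are `key=value` pairs separated by semicolons and that the listed `adopt` and `first_metric` options are the only valid values; the function name is hypothetical.

```python
REQUIRED_KEYS = ("owner", "adopt", "alert_channel", "first_metric", "blocker")
ADOPT_VALUES = {"watchdog_config", "revise"}
FIRST_METRICS = {"lease_expiry", "complete_rejected", "artifact_missing", "repeated_no_job"}


def parse_operator_reply(text):
    """Parse 'owner=...; adopt=...; ...' into a dict, raising ValueError if malformed."""
    fields = {}
    for part in text.split(";"):
        if not part.strip():
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in segment: {part.strip()!r}")
        fields[key.strip()] = value.strip()

    missing = [key for key in REQUIRED_KEYS if key not in fields]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if fields["adopt"] not in ADOPT_VALUES:
        raise ValueError(f"unknown adopt value: {fields['adopt']!r}")
    if fields["first_metric"] not in FIRST_METRICS:
        raise ValueError(f"unknown first_metric: {fields['first_metric']!r}")
    return fields
```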

Production Acceptance

Owner

PLS platform owner and Codex Session worker maintainer.

Due

2026-05-24 18:00 Asia/Taipei.

Acceptance Criteria

  • Watchdog config includes checks, thresholds, self-heal actions, and alert channels.
  • Runbook covers claim, lease, queue stuck, result contract failure, missing deliverable ID, and job not found.
  • Data model can store heartbeat, job attempt, and watchdog event.
  • E2E tests define pass/fail outcomes.
  • People sync identifies operator message and expected owner response.
  • Learning memory records how Hermes should route future worker stability signals.

Stop Conditions

  • Worker silently loops touch/claim without artifact or repair proposal.
  • Complete failures are hidden or reported as success.
  • Secrets are exposed in status or alerts.
  • Worker claims multiple primary jobs in one heartbeat.

Codex Session Stability Watchdog Pack

Job: 13f31bea-7900-462d-8a46-56718f34c66f

AI native project: 9c53826a-5f44-44da-b296-0b86fec226a6

Topic: codex-session-stability

Solution selection: watchdog / watchdog_config

Situation

Hermes detected repeated Codex Session / worker stability signals. The latest evidence, an E2E test, says incoming signals must advance projects and strengthen the huber persona. This points to a reliability requirement: workers must not merely wake, touch, and claim; they must detect stuck claim/lease/queue states, self-heal where allowed, and create evidence that projects moved forward.

D1 / D7 / D14 / D30 Path

D1: Install the watchdog spec as an operating contract. Track doctor, touch, claim, context, progress, artifact publication, and complete result per run.

D7: Add stuck detection for lease expiry, repeated no-job claims, incomplete result contracts, missing deliverable IDs, and complete failures.

D14: Add self-heal runbook actions: re-touch, release/fail with reason, escalate contract gaps, and create repair proposals when no eligible tasks exist.

D30: Upgrade into a PLS worker reliability console: per-worker health, queue stuck age, result contract failures, artifact evidence rate, and project advancement score.

Purpose-To-Purpose E2E

Original purpose: when a signal arrives, the Codex Session worker should advance the project and strengthen the huber persona.

Output: watchdog config, runbook, data model, API/sync spec, E2E acceptance tests, and escalation LINE draft.

Human adoption: PLS operator uses the watchdog to see which workers are healthy, stuck, producing artifacts, or silently looping.

Project/money/risk improvement: fewer dead jobs, fewer missing artifacts, faster recovery from lease/queue issues, less manual checking, and more reliable AI project delivery.

Measurable loop: heartbeat fires -> doctor/touch/claim/context/progress logged -> job processed or no-job proposal made -> artifact/complete verified -> watchdog records pass/fail -> owner receives escalation when stuck.
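
The measurable loop can be read as an ordered pipeline that logs each phase and stops at the first failure. The sketch below assumes each phase is a zero-argument callable returning a boolean and that the caller already knows whether claim produced a job; both are simplifications, not the pack's actual worker interface.

```python
HEALTH_PHASES = ("doctor", "touch", "claim")
JOB_PHASES = ("context", "progress", "artifact", "complete")


def run_heartbeat(steps, job_claimed):
    """Run one heartbeat; return (phase_log, outcome).

    steps: dict mapping phase name -> zero-arg callable returning bool.
    outcome is "pass" or "escalate".
    """
    log = []
    for phase in HEALTH_PHASES:
        ok = bool(steps[phase]())
        log.append((phase, ok))
        if not ok:
            return log, "escalate"

    if not job_claimed:
        # No eligible job: record a repair proposal instead of ending at touch/claim.
        log.append(("no_job_proposal", True))
        return log, "pass"

    for phase in JOB_PHASES:
        ok = bool(steps[phase]())
        log.append((phase, ok))
        if not ok:
            return log, "escalate"
    return log, "pass"
```

Because the log is appended before the failure check, the failing phase is always the last entry, which gives the owner a concrete stuck point in the escalation.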

Value And Money Path

Revenue protection: reliable worker delivery makes PLS more trustworthy for paid project automation.

Cost saving: reduces manual babysitting of stuck sessions and repeated incomplete completes.

Risk reduction: catches silent failures before users believe a project advanced when it did not.

Conversion: improves confidence that AI-native projects produce durable artifacts, not transient chat summaries.

Human leverage: operators get clear failure categories and next action instead of reading raw logs.

Owner / Due / Acceptance

Owner: PLS platform owner and Codex Session worker maintainer.

Due: 2026-05-24 18:00 Asia/Taipei for D1 watchdog acceptance.

Acceptance:

  • Watchdog status schema covers doctor, touch, claim, context, progress, artifact, complete, and lease.
  • Stuck states have thresholds and escalation.
  • Self-heal actions are bounded and auditable.
  • E2E tests cover no job, claimed job, missing deliverable, complete rejection, job not found, and lease expiry.

Self-Heal Runbook

Trigger Conditions

Lease expiring: claimed job has less than 5 minutes remaining and no complete result.

Queue stuck: claim repeatedly returns no eligible job while backlog exists or capability mismatch is suspected.

Result contract failure: complete returns missing artifact kind, no primary artifact, or non-production result.

Job missing after claim: complete or context returns not found after the worker had a running job.

Silent loop: doctor/touch/claim succeed but no project advancement artifact appears for three heartbeats.
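
Two of the triggers above carry explicit numeric thresholds (5 minutes of lease remaining, three heartbeats without an artifact), so they can be sketched directly. The function signature and state names are assumptions; the other three triggers depend on PLS responses that this pack does not specify and are left out.

```python
from datetime import datetime, timedelta, timezone

LEASE_WARN = timedelta(minutes=5)
SILENT_LOOP_HEARTBEATS = 3


def classify_stuck_state(now, lease_expires_at, job_claimed, complete_ok,
                         heartbeats_without_artifact):
    """Return "lease_expiring", "silent_loop", or None if neither trigger fires."""
    if (job_claimed and not complete_ok and lease_expires_at is not None
            and lease_expires_at - now < LEASE_WARN):
        return "lease_expiring"
    if heartbeats_without_artifact >= SILENT_LOOP_HEARTBEATS:
        return "silent_loop"
    return None
```

Lease expiry is checked first because it is time-critical: a job that is both looping and about to lose its lease needs the lease handled before anything else.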

Actions

  1. Retouch worker and confirm doctor is healthy.
  2. If job is claimed, run context and write progress immediately.
  3. If deliverable ID is missing, publish shared-cloud artifact and record upload-files as blocked.
  4. If complete is rejected for artifact kind, add the missing required kind before retrying.
  5. If job not found after claim, alert PLS ops with artifact URL and failure reason.
  6. If no job exists, produce a concise backlog/capability repair proposal instead of ending at touch/claim.

Escalation Message

PLS worker stability alert: worker is alive but project advancement is blocked. State=<state>; job=<job_id>; evidence=<url_or_log>; next action=<retouch|retry_complete|repair_contract|inspect_queue>; owner=PLS platform owner; due=<time>.
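
The escalation template can be filled by a small formatter that rejects next actions outside the four listed options. The function name and argument order are assumptions.

```python
NEXT_ACTIONS = {"retouch", "retry_complete", "repair_contract", "inspect_queue"}


def format_escalation(state, job_id, evidence, next_action, due):
    """Fill the escalation template; raise ValueError on an unlisted next action."""
    if next_action not in NEXT_ACTIONS:
        raise ValueError(f"unsupported next action: {next_action!r}")
    return (
        "PLS worker stability alert: worker is alive but project advancement is blocked. "
        f"State={state}; job={job_id}; evidence={evidence}; next action={next_action}; "
        f"owner=PLS platform owner; due={due}."
    )
```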

Rollback

Do not reset user files or force-release jobs automatically. If a job appears inconsistent, preserve evidence, alert owner, and let PLS decide whether to requeue.
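
The rollback rule implies a gate in front of every self-heal action. A sketch using the action names from this pack's watchdog config; refusing unknown actions outright is an assumption added here, not something the pack states.

```python
ALLOWED = frozenset({
    "retouch_worker",
    "write_progress",
    "publish_shared_cloud_artifact",
    "complete_with_required_artifacts",
    "produce_no_job_repair_proposal",
})
REQUIRES_HUMAN = frozenset({
    "force_release_running_job",
    "delete_or_mutate_user_data",
    "claim_more_than_one_primary_job",
})


def gate_action(action):
    """Return "run", "escalate", or "reject" for a proposed self-heal action."""
    if action in ALLOWED:
        return "run"
    if action in REQUIRES_HUMAN:
        return "escalate"  # preserve evidence and let the owner decide
    return "reject"        # assumption: anything unlisted is refused
```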

Signal Annotations

Source

Source: company_signal_mastery

Topic key: codex-session-stability

Recent signal count: 11

Related people: 3

Related projects: 3

Latest evidence: E2E test says incoming signals must advance projects and strengthen huber persona.

Project Annotation

AI-native project 9c53826a-5f44-44da-b296-0b86fec226a6 should be treated as a reliability and self-repair project.

Person Annotation

Related profiles are operational reviewers or affected stakeholders until role names are resolved.

Decision Annotation

Decision: use watchdog, not LINE-only or generic analysis.

Risk Annotation

  • Worker alive but not advancing projects.
  • Lease expires mid-job.
  • Complete rejected after artifact work is done.
  • Missing deliverable ID blocks upload-files.
  • Job disappears after claim.
  • No-job loops hide backlog/capability mismatch.

Source Project Handling

Keep the three source projects separate until the watchdog records which failure mode each project contributes. Merge only if repeated evidence shows the same root cause across projects.

Skill Usage

Applied model: purpose_e2e_toolbox_v2

Application

30-day path: D1 watchdog contract, D7 stuck detection, D14 self-heal runbook, D30 reliability console.

Purpose-to-purpose: incoming signal becomes artifact-backed project advancement and reliability evidence.

Value path: reduces manual babysitting, failed completions, and silent non-delivery.

Human capability: operators can diagnose worker states using thresholds, runbook actions, and acceptance tests.

Solution stack: watchdog config, runbook, data model, API spec, E2E tests, people sync, and decision record.

Solution Selection

Selected Type

watchdog / watchdog_config

Options Considered

  1. Communication: insufficient because stability issues need monitoring and thresholds.
  2. SOP: helpful but passive.
  3. Watchdog: best fit for repeated worker health, claim, lease, queue, and result-contract failures.
  4. Full system: appropriate after D14/D30 once schema and alert rules are validated.

Recommendation

Adopt watchdog now and upgrade to system/agent when repeated signals prove the thresholds.

Adoption Status

Recommended for D1 use.

<!doctype html>
<html lang="zh-Hant">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Codex Session Watchdog</title>
<style>
body { margin: 0; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif; background: #f6f8fb; color: #20242d; }
main { max-width: 920px; margin: 0 auto; padding: 28px 18px; }
section { background: #fff; border: 1px solid #d7dce5; border-radius: 8px; padding: 22px; }
h1 { margin: 0 0 8px; font-size: 24px; line-height: 1.25; }
h2 { margin: 22px 0 8px; font-size: 16px; }
p, li { font-size: 15px; line-height: 1.55; }
.badge { display: inline-block; color: #0f766e; border: 1px solid #9ccbc4; border-radius: 999px; padding: 4px 10px; font-size: 13px; font-weight: 650; }
.grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(190px, 1fr)); gap: 10px; margin: 18px 0; }
.cell { border: 1px solid #d7dce5; border-radius: 6px; padding: 10px; background: #fbfcff; }
.label { display: block; color: #667085; font-size: 12px; margin-bottom: 4px; }
code { font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace; font-size: 13px; }
</style>
</head>
<body>
<main>
<section>
<span class="badge">Worker reliability watchdog</span>
<h1>Codex Session must advance projects, not just wake up</h1>
<p>This watchdog tracks doctor, touch, claim, context, progress, artifact verification, complete status, and lease health.</p>
<div class="grid">
<div class="cell"><span class="label">D1 Owner</span>PLS platform owner</div>
<div class="cell"><span class="label">Due</span>2026-05-24 18:00</div>
<div class="cell"><span class="label">First Alert</span>complete rejected / lease expiring</div>
<div class="cell"><span class="label">Pass</span>6 golden tests defined</div>
</div>
<h2>Watch States</h2>
<ul>
<li>Worker alive but no project advancement.</li>
<li>Claimed job nearing lease expiry.</li>
<li>Complete rejected by result contract.</li>
<li>Job not found after claim.</li>
<li>Missing deliverable ID blocks upload-files.</li>
</ul>
<h2>Operator Reply</h2>
<p><code>owner=&lt;name&gt;; adopt=watchdog_config|revise; alert_channel=&lt;LINE/PLS ops&gt;; first_metric=&lt;metric&gt;; blocker=&lt;none/text&gt;</code></p>
</section>
</main>
</body>
</html>
version: codex_session_watchdog_v1
owner: pls_platform_owner
timezone: Asia/Taipei
heartbeat_interval_minutes: 10
checks:
  doctor:
    command: doctor
    pass: ok == true and token_present == true and touch.ok == true
    severity: high
  touch:
    command: touch
    pass: ok == true and touched == true
    severity: high
  claim:
    command: claim
    pass: ok == true
    severity: medium
  context:
    required_when: job_claimed
    pass: ok == true
    severity: high
  progress:
    required_when: context_ok
    pass: status == running
    severity: medium
  artifact:
    required_when: job_claimed
    pass: primary_artifact_url_http_status in [200, 302]
    severity: high
  complete:
    required_when: artifact_verified
    pass: ok == true and status == completed
    severity: high
thresholds:
  lease_remaining_minutes_warn: 5
  repeated_no_job_warn_count: 3
  complete_rejection_warn_count: 1
  job_not_found_after_claim_critical_count: 1
  missing_deliverable_id_warn_count: 1
self_heal:
  allowed:
    - retouch_worker
    - write_progress
    - publish_shared_cloud_artifact
    - complete_with_required_artifacts
    - produce_no_job_repair_proposal
  requires_human:
    - force_release_running_job
    - delete_or_mutate_user_data
    - claim_more_than_one_primary_job
alerts:
  channel: line_or_pls_ops
  notify_on:
    - complete_rejected
    - lease_expiring_with_claimed_job
    - job_not_found_after_claim
    - artifact_missing
    - repeated_no_job
metric,threshold,severity,owner,status,next_action
lease_expiring_with_claimed_job,less_than_5_minutes,high,pls_platform_owner,pending,alert_and_retouch
complete_rejected,one_failure,high,codex_worker_maintainer,pending,add_missing_kind_or_escalate
job_not_found_after_claim,one_failure,critical,pls_platform_owner,pending,preserve_artifact_and_alert
artifact_missing,claimed_job_without_verified_artifact,high,codex_worker_maintainer,pending,publish_or_fail_with_reason
repeated_no_job,three_consecutive_claims,medium,pls_ops,pending,inspect_backlog_capabilities