PR/deployment: not claimed. The PLS context provided no repository, branch, deployment target, or backend access.
Upload-files note: PLS context returned deliverable: null, so no deliverable_id was available for the fixed upload-files helper. This pack uses a shared-cloud Gist.
Verification:
Gist published publicly.
HTTP and GitHub CLI verification completed before PLS completion writeback.
PLS workers can appear alive while projects do not advance, artifacts are missing, complete is rejected, or jobs become unavailable after claim. This creates invisible delivery risk.
Options
1. Write a LINE warning only.
2. Create an SOP for worker operators.
3. Create watchdog config plus runbook, data model, E2E tests, and people sync.
4. Build a full reliability console immediately.
Recommendation
Choose option 3.
Reasoning
The topic is explicitly worker stability and self-repair. Watchdog config creates measurable thresholds and bounded actions without pretending a full deployed system exists.
Adoption Status
Recommended for D1 implementation.
Feedback Needed If Not Adopted
Provide target alert channel, PLS backend table names, worker dashboard owner, and whether auto-release is allowed for expired leases.
"PLS worker reliability should be judged by project advancement evidence, not only heartbeat liveness.",
"The recurring failure classes are claim/no-job loops, lease expiry, missing deliverable IDs, artifact gaps, complete rejection, and job-not-found after claim.",
"A watchdog should preserve evidence and escalate rather than pretending a failed complete succeeded."
],
"next_round": [
"Implement worker_heartbeat, worker_job_attempt, and worker_watchdog_event storage in PLS.",
"Add first alert for complete_rejected and lease_expiring_with_claimed_job.",
"Create a reliability console once D1/D7 events are collected."
Google SRE guidance distinguishes between monitoring symptoms and monitoring causes, and emphasizes alerts that require timely human action. This supports alerting on worker-stuck symptoms such as lease expiry, missing artifacts, and complete rejection.
Google SRE also frames toil as manual, repetitive, automatable operational work. A Codex worker watchdog should reduce toil by turning repeated claim/lease/queue checks into structured events and bounded self-heal actions.
OpenTelemetry describes observability around signals such as logs, metrics, and traces. PLS should model worker heartbeat, job attempts, and watchdog events as observable signals rather than unstructured chat output.
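As an illustration of treating these three signals as structured records rather than chat output, here is a minimal sketch. The record names follow the worker_heartbeat, worker_job_attempt, and worker_watchdog_event storage named above; every field name and the `to_log_line` helper are assumptions made for this example, not the PLS schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class WorkerHeartbeat:
    # All field names here are hypothetical; PLS table columns are not known.
    worker_id: str
    doctor_ok: bool
    touch_ok: bool
    observed_at: str  # ISO 8601 timestamp


@dataclass
class WorkerJobAttempt:
    worker_id: str
    job_id: str
    claimed_at: str
    lease_expires_at: str
    artifact_url: Optional[str] = None
    complete_status: Optional[str] = None  # e.g. "accepted", "rejected", "not_found"


@dataclass
class WorkerWatchdogEvent:
    worker_id: str
    state: str  # e.g. "lease_expiring", "silent_loop"
    job_id: Optional[str]
    evidence: str  # URL or log reference
    observed_at: str


def to_log_line(record) -> str:
    """Serialize any of the records above as one JSON line tagged with its type."""
    payload = {"type": type(record).__name__, **asdict(record)}
    return json.dumps(payload, sort_keys=True)
```

Emitting one JSON line per record keeps the events greppable and easy to load into whatever store PLS eventually uses.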
PLS Maturity Rating
Current maturity: Level 2 of 5.
Reason: doctor/touch/claim exist and workers can produce artifacts, but stuck states and result-contract failures are not yet a first-class reliability surface.
Target D30 maturity: Level 4 of 5.
Reason: PLS should have per-worker health, stuck classification, artifact evidence rate, result-contract failure rate, and escalation runbook.
Hermes detected repeated Codex Session / worker stability signals. The latest evidence says E2E signals should advance projects and strengthen the huber persona. This points to a reliability requirement: workers must not merely wake, touch, and claim; they must detect stuck claim/lease/queue states, self-heal where allowed, and create evidence that projects moved forward.
D1 / D7 / D14 / D30 Path
D1: Install the watchdog spec as an operating contract. Track doctor, touch, claim, context, progress, artifact publication, and complete result per run.
D7: Add stuck detection for lease expiry, repeated no-job claims, incomplete result contracts, missing deliverable IDs, and complete failures.
D14: Add self-heal runbook actions: re-touch, release/fail with reason, escalate contract gaps, and create repair proposals when no eligible tasks exist.
D30: Upgrade into a PLS worker reliability console: per-worker health, queue stuck age, result contract failures, artifact evidence rate, and project advancement score.
Purpose-To-Purpose E2E
Original purpose: when a signal arrives, the Codex Session worker should advance the project and strengthen the huber persona.
Output: watchdog config, runbook, data model, API/sync spec, E2E acceptance tests, and escalation LINE draft.
Human adoption: PLS operator uses the watchdog to see which workers are healthy, stuck, producing artifacts, or silently looping.
Project/money/risk improvement: fewer dead jobs, fewer missing artifacts, faster recovery from lease/queue issues, less manual checking, and more reliable AI project delivery.
Measurable loop: heartbeat fires -> doctor/touch/claim/context/progress logged -> job processed or no-job proposal made -> artifact/complete verified -> watchdog records pass/fail -> owner receives escalation when stuck.
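The measurable loop above can be sketched as a per-run record that logs each step and reports the first step that did not pass. The step names come from the loop description; the `RunRecord` class, the `proposal` step name for the no-job path, and the pass rule are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Steps expected when a job was claimed, in loop order.
JOB_STEPS: Tuple[str, ...] = (
    "doctor", "touch", "claim", "context", "progress", "artifact", "complete",
)
# Steps expected when claim returned no job: a repair proposal replaces job work.
NO_JOB_STEPS: Tuple[str, ...] = ("doctor", "touch", "claim", "proposal")


@dataclass
class RunRecord:
    """Hypothetical per-heartbeat record of which loop steps succeeded."""
    worker_id: str
    job_claimed: bool
    results: Dict[str, bool] = field(default_factory=dict)

    def required_steps(self) -> Tuple[str, ...]:
        return JOB_STEPS if self.job_claimed else NO_JOB_STEPS

    def log(self, step: str, ok: bool) -> None:
        if step not in self.required_steps():
            raise ValueError(f"unexpected step for this run: {step}")
        self.results[step] = ok

    def first_failure(self) -> Optional[str]:
        """Return the first required step that is missing or failed, else None."""
        for step in self.required_steps():
            if self.results.get(step) is not True:
                return step
        return None

    def passed(self) -> bool:
        return self.first_failure() is None
```

A run that stops after touch and claim with no job reports `proposal` as its first failure, which is the silent-loop case the watchdog is meant to surface to the owner.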
Value And Money Path
Revenue protection: reliable worker delivery makes PLS more trustworthy for paid project automation.
Cost saving: reduces manual babysitting of stuck sessions and repeated incomplete completes.
Risk reduction: catches silent failures before users believe a project advanced when it did not.
Conversion: improves confidence that AI-native projects produce durable artifacts, not transient chat summaries.
Human leverage: operators get clear failure categories and next action instead of reading raw logs.
Owner / Due / Acceptance
Owner: PLS platform owner and Codex Session worker maintainer.
Due: 2026-05-24 18:00 Asia/Taipei for D1 watchdog acceptance.
Acceptance:
Watchdog status schema covers doctor, touch, claim, context, progress, artifact, complete, and lease.
Stuck states have thresholds and escalation.
Self-heal actions are bounded and auditable.
E2E tests cover no job, claimed job, missing deliverable, complete rejection, and job not found.
Lease expiring: claimed job has less than 5 minutes remaining and no complete result.
Queue stuck: claim repeatedly returns no eligible job while backlog exists or capability mismatch is suspected.
Result contract failure: complete returns missing artifact kind, no primary artifact, or non-production result.
Job missing after claim: complete or context returns not found after the worker had a running job.
Silent loop: doctor/touch/claim succeed but no project advancement artifact appears for three heartbeats.
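The stuck-state definitions above can be expressed as a small classifier. The 5-minute lease threshold and the three-heartbeat silent-loop threshold come from the definitions; the `Snapshot` fields, the state labels, and the default of three consecutive no-job claims for "repeatedly" are assumptions, since the source does not give that number.

```python
from dataclasses import dataclass
from typing import List, Optional

LEASE_WARNING_SECONDS = 5 * 60        # from the definition: less than 5 minutes
SILENT_LOOP_HEARTBEATS = 3            # from the definition: three heartbeats
CONTRACT_ERRORS = {
    "missing_artifact_kind", "no_primary_artifact", "non_production_result",
}


@dataclass
class Snapshot:
    """Hypothetical view of one worker at one heartbeat."""
    job_claimed: bool = False
    complete_ok: bool = False
    lease_seconds_remaining: Optional[float] = None
    consecutive_no_job_claims: int = 0
    backlog_exists: bool = False
    capability_mismatch_suspected: bool = False
    complete_error: Optional[str] = None
    job_not_found_after_claim: bool = False
    heartbeats_without_artifact: int = 0


def classify(s: Snapshot, no_job_threshold: int = 3) -> List[str]:
    """Return every stuck state that applies; an empty list means healthy."""
    states: List[str] = []
    if (s.job_claimed and not s.complete_ok
            and s.lease_seconds_remaining is not None
            and s.lease_seconds_remaining < LEASE_WARNING_SECONDS):
        states.append("lease_expiring")
    # no_job_threshold is an assumed value for "repeatedly".
    if (s.consecutive_no_job_claims >= no_job_threshold
            and (s.backlog_exists or s.capability_mismatch_suspected)):
        states.append("queue_stuck")
    if s.complete_error in CONTRACT_ERRORS:
        states.append("result_contract_failure")
    if s.job_not_found_after_claim:
        states.append("job_missing_after_claim")
    if s.heartbeats_without_artifact >= SILENT_LOOP_HEARTBEATS:
        states.append("silent_loop")
    return states
```

Returning a list rather than a single state lets one heartbeat report overlapping problems, for example a lease about to expire on a worker that is also looping silently.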
Actions
Retouch worker and confirm doctor is healthy.
If job is claimed, run context and write progress immediately.
If deliverable ID is missing, publish shared-cloud artifact and record upload-files as blocked.
If complete is rejected for artifact kind, add the missing required kind before retrying.
If job not found after claim, alert PLS ops with artifact URL and failure reason.
If no job exists, produce a concise backlog/capability repair proposal instead of ending at touch/claim.
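One way to keep these actions bounded and auditable is a fixed lookup from observed condition to action, with no path that releases or resets anything automatically. The condition and action identifiers below are hypothetical labels for the runbook steps above, not names from PLS.

```python
from typing import Iterable, List, Tuple

# Ordered (condition, action) pairs mirroring the runbook; labels are illustrative.
RULES: Tuple[Tuple[str, str], ...] = (
    ("job_claimed", "run_context_and_write_progress"),
    ("missing_deliverable_id", "publish_shared_cloud_artifact_and_mark_upload_blocked"),
    ("complete_rejected_artifact_kind", "add_missing_artifact_kind_then_retry"),
    ("job_not_found_after_claim", "alert_pls_ops_with_artifact_url_and_reason"),
    ("no_job", "write_backlog_capability_repair_proposal"),
)


def plan_actions(conditions: Iterable[str]) -> List[str]:
    """Build the action plan for one heartbeat.

    Retouch plus a doctor check always runs first; remaining actions follow
    runbook order. Unknown conditions are ignored rather than guessed at.
    """
    observed = set(conditions)
    plan = ["retouch_and_confirm_doctor"]
    for condition, action in RULES:
        if condition in observed:
            plan.append(action)
    return plan
```

Because the table is the only source of actions, reviewing it is enough to confirm that nothing outside the runbook can be triggered.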
Escalation Message
PLS worker stability alert: worker is alive but project advancement is blocked. State=<state>; job=<job_id>; evidence=<url_or_log>; next action=<retouch|retry_complete|repair_contract|inspect_queue>; owner=PLS platform owner; due=<time>.
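The template above can be filled by a small formatter that rejects any next action outside the four listed. The function name and its parameters are assumptions for this sketch; the wording and the allowed actions are taken from the template.

```python
ALLOWED_NEXT_ACTIONS = ("retouch", "retry_complete", "repair_contract", "inspect_queue")


def format_escalation(state: str, job_id: str, evidence: str,
                      next_action: str, due: str,
                      owner: str = "PLS platform owner") -> str:
    """Render the escalation message; raise if next_action is not in the template."""
    if next_action not in ALLOWED_NEXT_ACTIONS:
        raise ValueError(f"unsupported next action: {next_action}")
    return (
        "PLS worker stability alert: worker is alive but project advancement "
        f"is blocked. State={state}; job={job_id}; evidence={evidence}; "
        f"next action={next_action}; owner={owner}; due={due}."
    )
```

Validating `next_action` at format time keeps free-text instructions out of alerts, so every escalation maps to a known runbook step.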
Rollback
Do not reset user files or force-release jobs automatically. If a job appears inconsistent, preserve evidence, alert owner, and let PLS decide whether to requeue.
Source: company_signal_mastery
Topic key: codex-session-stability
Recent signal count: 11
Related people: 3
Related projects: 3
Latest evidence: E2E test says incoming signals must advance projects and strengthen huber persona.
Project Annotation
AI-native project 9c53826a-5f44-44da-b296-0b86fec226a6 should be treated as a reliability and self-repair project.
Person Annotation
Related profiles are operational reviewers or affected stakeholders until role names are resolved.
Decision Annotation
Decision: use watchdog, not LINE-only or generic analysis.
Risk Annotation
Worker alive but not advancing projects.
Lease expires mid-job.
Complete rejected after artifact work is done.
Missing deliverable ID blocks upload-files.
Job disappears after claim.
No-job loops hide backlog/capability mismatch.
Source Project Handling
Keep the three source projects separate until the watchdog records which failure mode each project contributes. Merge only if repeated evidence shows the same root cause across projects.