@esz135888
Created May 24, 2026 01:15
PLS job bd1fed04 worker self-repair adoption production pack

Acceptance Tests

Human Acceptance

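  1. The PLS owner can tell within 10 minutes which self-heal actions run automatically and which require approval.
  2. When a worker hits fetch failed, it can touch and retry once, then write the evidence to verification.
  3. When the same class of error repeats 2 times, the worker produces a repair proposal instead of forcing a complete.
  4. Louis can use a policy decision to judge whether release/reclaim is allowed.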
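
Items 2 and 3 above can be sketched as one small control loop. This is a minimal illustration, not the PLS helper itself: `FetchFailed`, `SelfHealLog`, and `run_with_retry_once` are hypothetical names, and `touch` is passed in as a callable because this document does not define what the helper's touch does.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class FetchFailed(Exception):
    """Hypothetical transient error raised by a helper call."""


@dataclass
class SelfHealLog:
    """Collects evidence lines destined for the verification record."""
    evidence: List[str] = field(default_factory=list)
    error_counts: Dict[str, int] = field(default_factory=dict)


def run_with_retry_once(call: Callable[[], str], touch: Callable[[], None],
                        log: SelfHealLog, error_key: str = "fetch_failed",
                        proposal_threshold: int = 2) -> str:
    """Run `call`; on FetchFailed, touch and retry exactly once.

    Returns "ok:<result>" on success, "repair_proposal" once the same error
    class has been seen `proposal_threshold` times, and "blocked" otherwise.
    It never reports a completion after a failure.
    """
    for attempt in (1, 2):
        try:
            result = call()
            log.evidence.append(f"attempt {attempt}: ok")
            return f"ok:{result}"
        except FetchFailed as exc:
            log.error_counts[error_key] = log.error_counts.get(error_key, 0) + 1
            log.evidence.append(f"attempt {attempt}: {error_key}: {exc}")
            if attempt == 1:
                touch()  # touch before the single allowed retry
                log.evidence.append("touch")
    if log.error_counts[error_key] >= proposal_threshold:
        return "repair_proposal"
    return "blocked"
```

With the default threshold of 2, a failed call followed by a failed retry already counts as two repeats, so the worker goes straight to a repair proposal.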

System Acceptance

  1. POST /worker-repair-events records the phase, raw command output, and repair action.
  2. POST /artifact-gate-checks must include the primary artifact URL, HTTP status, and upload count.
  3. complete is blocked until the artifact gate passes.
  4. An unapproved high-risk self-heal action must not run; it can only enter a draft policy decision.
  5. Each round, learning memory has at least one lesson plus a next_round_priority.
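
Items 2 and 3 above might be enforced by a check like the following. It is a sketch under assumptions: the class and function names are hypothetical, and the rule that a 5xx response maps to retry_needed is my own guess, since the document only names the three result values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArtifactGateCheck:
    """Mirrors the evidence fields of artifact_gate_checks."""
    primary_artifact_url: str
    http_status: int
    upload_files_count: int

    def result(self) -> str:
        # A pass needs both pieces of evidence: HTTP 200 and at least one upload.
        if self.http_status == 200 and self.upload_files_count > 0:
            return "pass"
        # Assumption: 5xx responses are treated as transient and worth a retry.
        if 500 <= self.http_status <= 599:
            return "retry_needed"
        return "fail"


def may_complete(check: Optional[ArtifactGateCheck]) -> bool:
    """complete stays blocked when no check exists or the check did not pass."""
    return check is not None and check.result() == "pass"
```

Treating a missing check the same as a failed one keeps the gate closed by default, which matches the rule that complete is blocked until the gate passes.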

E2E Pass Criteria

  • Collect 3 worker repair/friction events by D7.
  • At least 1 artifact guard pass has HTTP 200 and upload-files evidence.
  • At least 1 worker policy decision is approved or rejected.
  • At least one item in the next worker prompt/tooling is adjusted based on learning memory.
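
The four criteria above can be checked mechanically. The function below is a hypothetical sketch: its name, the argument shapes, and the labels it returns are illustrative and not part of the PLS API.

```python
from typing import Dict, List


def d7_unmet_criteria(repair_events: int, gate_checks: List[Dict],
                      policy_statuses: List[str],
                      learning_adjustments: int) -> List[str]:
    """Return the E2E pass criteria that are not yet satisfied (empty = pass)."""
    unmet = []
    if repair_events < 3:
        unmet.append("repair_events")
    # A qualifying gate check carries both HTTP 200 and upload-files evidence.
    has_gate_pass = any(
        c["result"] == "pass" and c["http_status"] == 200
        and c["upload_files_count"] > 0
        for c in gate_checks
    )
    if not has_gate_pass:
        unmet.append("artifact_guard_pass")
    # A decision only counts once it has left draft for approved or rejected.
    if not any(s in ("approved", "rejected") for s in policy_statuses):
        unmet.append("policy_decision")
    if learning_adjustments < 1:
        unmet.append("learning_adjustment")
    return unmet
```

Returning the list of unmet items rather than a boolean lets the D7 review state exactly which evidence is still missing.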

Data Model / API / Sync

Tables

worker_repair_events

  • id uuid primary key
  • job_id uuid
  • worker_id text
  • event_type enum: fetch_failed, schema_error, artifact_missing, claim_idle, lease_risk, manual_blocker
  • phase text
  • raw_command_output jsonb
  • repair_action text
  • created_at timestamptz
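
A worker could assemble a row for this table as follows. This is a sketch, assuming the POST body mirrors the columns one-to-one; `build_repair_event` is a hypothetical helper, and generating `id` and `created_at` on the client side is an assumption (the server may own both).

```python
import json
import uuid
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict


class EventType(str, Enum):
    """The six event_type values listed for worker_repair_events."""
    FETCH_FAILED = "fetch_failed"
    SCHEMA_ERROR = "schema_error"
    ARTIFACT_MISSING = "artifact_missing"
    CLAIM_IDLE = "claim_idle"
    LEASE_RISK = "lease_risk"
    MANUAL_BLOCKER = "manual_blocker"


def build_repair_event(job_id: str, worker_id: str, event_type: str, phase: str,
                       raw_command_output: Dict[str, Any],
                       repair_action: str) -> Dict[str, Any]:
    """Build a JSON-serializable event; raises ValueError on bad input."""
    return {
        "id": str(uuid.uuid4()),
        "job_id": str(uuid.UUID(job_id)),            # rejects a malformed uuid
        "worker_id": worker_id,
        "event_type": EventType(event_type).value,   # rejects values outside the enum
        "phase": phase,
        "raw_command_output": raw_command_output,    # stored as jsonb
        "repair_action": repair_action,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


event = build_repair_event(
    "bd1fed04-67cb-4729-b6ec-8b3b259054bf", "worker-1", "fetch_failed",
    "context", {"stderr": "fetch failed"}, "touch + retry once",
)
payload = json.dumps(event)  # body for POST /api/pls/worker-repair-events
```

Validating the enum and the uuid before sending means a schema_error is caught on the worker rather than after a round trip.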

worker_policy_decisions

  • id uuid primary key
  • policy_key text
  • allowed_action text
  • risk_level enum: low, medium, high
  • approval_required_by text
  • status enum: draft, approved, rejected, retired
  • decided_at timestamptz nullable

artifact_gate_checks

  • id uuid primary key
  • job_id uuid
  • primary_artifact_url text
  • http_status int
  • upload_files_count int
  • result enum: pass, fail, retry_needed
  • checked_at timestamptz

learning_memory_updates

  • id uuid primary key
  • job_id uuid
  • topic_key text
  • lesson text
  • next_round_priority text
  • adoption_signal text
  • created_at timestamptz

APIs

  • POST /api/pls/worker-repair-events
  • POST /api/pls/artifact-gate-checks
  • PATCH /api/pls/worker-policy/:policy_key
  • GET /api/pls/worker-reliability-scorecard
  • POST /api/pls/learning-memory-updates

Permissions

  • Worker: may diagnose, retry once, produce repair proposals, and write learning memory.
  • PLS owner: may approve or reject self-heal policy.
  • Louis: approves high-risk actions such as release/reclaim, cross-project permissions, and external commitments.
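
The permission split above amounts to a default-deny decision. The sketch below is illustrative only: the action names and the `decide` function are hypothetical, and `policy_status` stands in for a lookup against worker_policy_decisions.

```python
from typing import Dict

# Actions a worker may take without approval, per the Worker permission line.
WORKER_ACTIONS = frozenset(
    {"diagnose", "retry_once", "repair_proposal", "write_learning_memory"}
)
# High-risk actions named in the document; others could be added here.
HIGH_RISK_ACTIONS = frozenset({"release", "reclaim"})


def decide(action: str, policy_status: Dict[str, str]) -> str:
    """Return "execute" or "propose" for an action a worker wants to take."""
    if action in WORKER_ACTIONS:
        return "execute"
    if action in HIGH_RISK_ACTIONS and policy_status.get(action) == "approved":
        return "execute"
    # Unapproved high-risk actions and unknown actions never run directly.
    return "propose"
```

For example, `decide("release", {"release": "draft"})` yields `"propose"`, so the worker can only raise a proposal until the policy is approved.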

Audit / Rollback

Every self-heal policy change and repair action records actor, old_state, new_state, raw_command_output, and reason. An unapproved high-risk action may only pause or propose; it must not execute release/reclaim.
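
One way to make the five audit fields mandatory is to refuse to build an entry without them. This is a sketch; `make_audit_entry` is a hypothetical helper, and where the entry is stored is left open.

```python
from typing import Any, Dict

REQUIRED_AUDIT_FIELDS = ("actor", "old_state", "new_state",
                         "raw_command_output", "reason")


def make_audit_entry(**fields: Any) -> Dict[str, Any]:
    """Build an audit entry, rejecting any missing or empty required field."""
    missing = [name for name in REQUIRED_AUDIT_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing audit fields: {', '.join(missing)}")
    # Keep only the required fields, in a stable order.
    return {name: fields[name] for name in REQUIRED_AUDIT_FIELDS}
```

Rejecting an empty raw_command_output is a deliberate choice here: an action with no captured output cannot be verified or rolled back later.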

Decision Record

Decision

Upgrade this round's codex-session-stability from a plain watchdog to a worker self-repair adoption operating console built on watchdog + governance + project.

Options Considered

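  • watchdog only: can monitor stuck/lease/artifact, but cannot respond to this round's signal of "mobilizing frontline workers from the bottom up".
  • governance only: can define permissions, but has no real-time repair events or artifact gate.
  • watchdog + governance + project: the best option; it connects reliability events, self-heal policy, worker friction, and learning memory into a production loop.
  • full autonomous agent: not adopted for now, because high-risk actions such as release/reclaim still need policy approval.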

Recommendation

D1 adopts the HTML console and data model; D7 collects worker repair events; D14 connects the PLS ops tables; D30 uses learning memory to revise prompts/tools.

Adoption Status

Ready for PLS owner review. High-risk self-heal is off by default and only produces a proposal.

If Not Adopted

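Please state which blocker applies: self-heal permissions are too broad, the event schema is wrong, the acceptance criteria are too hard to collect, or the LINE/policy owner is unclear.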

E2E Verification

Checks

  1. Context confirmed AI-native project fde2bfd0-5d5b-4cf1-9900-bcbcb446f0b0 and deliverable bucket f296e9bc-ceb3-4694-a9c6-ae3418b67e9a.
  2. Context first attempt returned fetch failed; worker touched and retried context successfully.
  3. Primary artifact is HTML, not Markdown.
  4. Required files exist: production-brief.md, data-model.md, acceptance-tests.md, decision-record.md, artifact-url-or-pr.md.
  5. Market context has at least 2 external URLs.
  6. PLS upload-files and public Gist publication must both succeed.
  7. Public URL must return HTTP 200 and list worker-self-repair-adoption-console.html.

Current Result

Pass. Public Gist returned HTTP 200 on 2026-05-24 and gh gist view --files listed worker-self-repair-adoption-console.html plus all appendices.

Primary URL: https://gist.github.com/esz135888/3b5e6423212ca80975a607ec2b69982a#file-worker-self-repair-adoption-console-html

{
  "job_id": "bd1fed04-67cb-4729-b6ec-8b3b259054bf",
  "ai_native_project_id": "fde2bfd0-5d5b-4cf1-9900-bcbcb446f0b0",
  "topic": "codex-session-stability",
  "what_hermes_learned": [
    "Codex worker stability is not only an SRE problem; it is also an adoption and production-relations problem.",
    "Transient fetch failures should become structured repair events rather than disappearing in the transcript.",
    "High-risk self-heal actions need governance before they become autonomous agent behavior."
  ],
  "market_learning": [
    "SRE monitoring maturity focuses on actionable symptoms.",
    "Gen AI transformation maturity requires operating-model redesign and reinforcement, not only tool deployment."
  ],
  "assumptions_to_test_next": [
    "Retry once plus touch is an acceptable low-risk self-heal policy.",
    "Two repeated errors is the right threshold for repair proposal generation.",
    "Worker repair events can improve prompts/tools within 30 days."
  ],
  "next_round_priority": "Collect three real worker repair/friction events and convert them into a PLS reliability scorecard plus policy approval matrix."
}

Market Context / Market Maturity

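Sources

  • https://sre.google/resources/book-update/monitoring-distributed-systems/
  • https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/gen-ais-next-inflection-point-from-employee-experimentation-to-organizational-transformation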

Mature Practice

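Google SRE's mature practice is to monitor actionable symptoms rather than pile up logs nobody can act on. For PLS, the actionable symptoms are fetch_failed, schema_error, artifact_missing, claim_idle, and lease_risk.

McKinsey's view of gen AI transformation stresses that value comes from the operating model, workflow redesign, talent/skilling, and reinforcement. This matches this round's signal: rather than leadership forcing workers top-down, frontline workers write friction and repair proposals back into the system as standing practice.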

PLS Gap

PLS already has heartbeat and fixed helpers, but repair learning still depends on a single worker's judgment in the moment; there are no accumulating worker repair events, policy decisions, or artifact gate records.

This Round Upgrade

This round adds the self-repair adoption console, data model, API/sync, permissions/audit, acceptance tests, people sync, and learning memory. Market data is input only; the main deliverable is the production pack.

People Sync

Targets

  • PLS platform owner / Louis
  • Codex session worker pool
  • Huber avatar / education business line owner (as adoption beneficiary)

LINE Draft

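This round of Codex worker stability work does more than patch the watchdog; it makes frontline workers' self-repair reports part of the system.

Please confirm three things:

  1. When context/complete returns fetch failed, may the worker automatically touch and retry once?
  2. When the same class of error occurs 2 times in a row, should a repair proposal be generated automatically?
  3. Which self-heal actions require approval from Louis or the PLS owner?

Until a reply arrives, the worker only does diagnosis / retry once / proposal, and does not release/reclaim.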

Expected Reply Signal

Approve or reject the three policies, and name the owner for collecting repair/friction events by D7.

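Codex Worker Self-Repair Adoption Production Brief

Scenario

The latest signal from AI shared-signal project fde2bfd0-5d5b-4cf1-9900-bcbcb446f0b0 is not only worker reliability; it is that "the core of enterprise adoption of large models is adjusting production relations and mobilizing frontline employees from the bottom up". This round therefore cannot deliver monitoring rules alone; it must bring frontline workers' friction, repair proposals, policy approval, and learning memory into the production loop.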

D1 / D7 / D14 / D30

  • D1: publish the self-repair adoption console; define the retry once, artifact guard, and repair proposal policies.
  • D7: collect 3 frontline worker friction/repair signals; verify stuck_claim and artifact_guard.
  • D14: connect the PLS job/lease/progress/artifact tables; build the policy approval matrix.
  • D30: worker self-repair enters the PLS ops console; learning memory feeds prompt/tooling revisions weekly.

Purpose-to-Purpose E2E

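  • Original purpose: Codex Session / workers advance projects reliably and strengthen the Huber avatar.
  • Main deliverable: HTML console, data model, acceptance tests, decision record, people sync.
  • Human adoption: the PLS owner decides self-heal policy; workers report friction and repair proposals.
  • System improvement: fetch failed, schema error, and artifact missing become verifiable events.
  • Value path: less manual monitoring, fewer false completions, and AI delivery capacity accumulated into standing practice.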

Owner / Due / Acceptance

  • Owner: PLS platform / Louis.
  • Due: D7 worker adoption proof.
  • Acceptance: 3 repair/friction events, 1 artifact guard pass, 1 policy decision.

Production Readiness

Production Path

  1. D1: use the HTML console and the policy LINE ask.
  2. D7: collect 3 worker repair/friction events; verify the artifact gate and retry once.
  3. D14: connect the PLS job/lease/progress/artifact tables; produce the reliability scorecard.
  4. D30: worker self-repair enters the PLS ops console; learning memory feeds prompt/tooling revisions.

Sync

  • After each transient repair, the worker writes worker_repair_events.
  • Artifact verification writes artifact_gate_checks.
  • The PLS owner reviews worker_policy_decisions.
  • Learning updates write learning_memory_updates.

Permissions / Audit

Workers may diagnose, retry once, and propose; they must not release/reclaim on their own. Every policy change and repair action requires raw command output.

Failure / Rollback

If D7 does not collect enough repair/friction events, do not upgrade to an agent; stay on watchdog + manual approval and flag the gap as worker_adoption_insufficient.

Skill Usage

  • purpose_e2e_toolbox_v2: used for D1/D7/D14/D30, purpose-to-purpose E2E, the value/money path, people capability uplift, the solution stack, data/API/permissions/audit, and the decision record.
  • Web search: looked up Google SRE monitoring and McKinsey gen AI operating model/adoption scaling as input for market_context and market_maturity.
  • HTML/dashboard artifact: builds an openable worker self-repair adoption console.
  • PLS fixed helper: uses doctor/touch/claim/context/progress/upload-files/complete; after the first context call returned fetch failed, touch + retry succeeded.

Solution Selection

Selected

watchdog + governance + project

Why

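This round's shared signal covers both worker stability and the production-relations adjustment behind enterprise AI adoption. A watchdog alone stops at top-down monitoring; adding governance and a project cadence lets frontline workers report friction and raise repair proposals, with the PLS owner deciding self-heal policy.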

Why Not Smaller

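  • communication: can only ask Louis whether to approve; it cannot accumulate repair evidence.
  • doc: can only explain; it does not form a verifiable adoption loop.
  • watchdog only: cannot handle self-heal permissions or frontline adoption.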

Why Not Bigger

An agent or full workflow app first needs D7's worker repair events and policy decisions; otherwise high-risk automation would come too early.

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Codex Worker Self-Repair Adoption Console</title>
<style>
:root{--ink:#17202a;--muted:#627181;--line:#dbe3ea;--bg:#f5f7f9;--panel:#fff;--blue:#285f8f;--green:#20765a;--amber:#b67a18;--red:#ad3f36;--soft:#edf4f8}
*{box-sizing:border-box} body{margin:0;background:var(--bg);color:var(--ink);font-family:Inter,ui-sans-serif,system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;line-height:1.45}
header{padding:30px 36px 22px;background:#fff;border-bottom:1px solid var(--line)} h1,h2,h3{margin:0;letter-spacing:0} h1{font-size:28px;max-width:1120px} h2{font-size:17px;margin-bottom:12px} h3{font-size:14px;margin-bottom:8px}
p{margin:0;color:var(--muted)} .eyebrow{font-size:12px;font-weight:800;text-transform:uppercase;color:var(--blue);margin-bottom:8px}
.wrap{padding:24px 36px 42px}.grid{display:grid;grid-template-columns:repeat(12,1fr);gap:16px;max-width:1280px;margin:0 auto}
.panel{background:var(--panel);border:1px solid var(--line);border-radius:8px;padding:18px;box-shadow:0 1px 2px rgba(0,0,0,.03)}
.span-12{grid-column:span 12}.span-8{grid-column:span 8}.span-6{grid-column:span 6}.span-4{grid-column:span 4}.span-3{grid-column:span 3}
.kpis{display:grid;grid-template-columns:repeat(4,1fr);gap:12px}.kpi{border:1px solid var(--line);border-radius:8px;padding:14px;min-height:92px;background:#fbfcfd}.kpi strong{display:block;font-size:28px}.kpi span{font-size:13px;color:var(--muted)}
table{width:100%;border-collapse:collapse;font-size:13px} th,td{text-align:left;padding:10px 8px;border-bottom:1px solid var(--line);vertical-align:top} th{font-size:12px;text-transform:uppercase;color:var(--muted);background:#fbfcfd}
.status{display:inline-flex;border-radius:999px;padding:4px 9px;font-size:12px;font-weight:800;white-space:nowrap}.green{background:#e1f0e8;color:var(--green)}.blue{background:#e4eef8;color:var(--blue)}.amber{background:#f7ecd9;color:#81530f}.red{background:#f9e6e3;color:var(--red)}
.timeline{display:grid;grid-template-columns:repeat(4,1fr);gap:12px}.stage{border-left:4px solid var(--blue);background:#fbfcfd;padding:12px;border-radius:0 8px 8px 0;min-height:150px}.stage:nth-child(2){border-color:var(--amber)}.stage:nth-child(3){border-color:var(--green)}.stage:nth-child(4){border-color:var(--red)}
.flow{display:grid;grid-template-columns:repeat(5,1fr);gap:10px}.step{border:1px solid var(--line);border-radius:8px;padding:12px;background:#fbfcfd;min-height:116px}.step b{display:block;margin-bottom:6px}.small{font-size:12px;color:var(--muted)}
ul{margin:8px 0 0 18px;padding:0;color:var(--muted)} li{margin:4px 0}.script{background:var(--soft);border-radius:8px;padding:12px;color:var(--ink);font-size:13px;white-space:pre-wrap}
.code{background:#16212b;color:#eaf1f7;border-radius:8px;padding:12px;font:12px ui-monospace,SFMono-Regular,Menlo,monospace;white-space:pre-wrap;overflow:auto}
@media(max-width:900px){header,.wrap{padding-left:18px;padding-right:18px}.span-8,.span-6,.span-4,.span-3{grid-column:span 12}.kpis,.timeline,.flow{grid-template-columns:1fr}}
</style>
</head>
<body>
<header>
<div class="eyebrow">PLS production artifact · codex-session-stability · 2026-05-24</div>
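<h1>Codex Worker Self-Repair Adoption Console</h1>
<p>Upgrades "worker stability and self-repair" from a top-down requirement into a production loop in which frontline workers report, repair, verify, and accumulate learning memory from the bottom up.</p>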
</header>
<main class="wrap">
<section class="grid">
<div class="panel span-12"><div class="kpis">
<div class="kpi"><strong>2</strong><span>Latest shared signals both point to adjusting production relations</span></div>
<div class="kpi"><strong>3</strong><span>Related projects to merge into a reliability/adoption lane</span></div>
<div class="kpi"><strong>D7</strong><span>Complete worker reporting and self-repair acceptance</span></div>
<div class="kpi"><strong>D30</strong><span>Enter the PLS ops console and worker learning memory</span></div>
</div></div>
<div class="panel span-8">
<h2>Self-Repair Adoption Gates</h2>
<table>
<thead><tr><th>Gate</th><th>Trigger</th><th>Worker Action</th><th>Human Policy</th><th>Pass Evidence</th></tr></thead>
<tbody>
<tr><td><span class="status red">stuck_claim</span></td><td>claim returns no job, or context fetch failed</td><td>Retry once, touch, record the blocker phase</td><td>Must not force complete</td><td>progress / verification records the command output</td></tr>
<tr><td><span class="status amber">repair_suggested</span></td><td>Same class of error repeats 2 times</td><td>Generate a backlog/capability repair proposal</td><td>Louis or the PLS owner decides whether to add it to worker policy</td><td>decision-record has an adopted/not-adopted field</td></tr>
<tr><td><span class="status blue">artifact_guard</span></td><td>Before complete</td><td>Verify the primary URL returns HTTP 200 and upload-files succeeded</td><td>No completion without an openable primary deliverable</td><td>curl/gh/upload-files output</td></tr>
<tr><td><span class="status green">worker_learning</span></td><td>After each repair</td><td>Update learning_memory and next_round_priority</td><td>Frontline workers may report policy friction</td><td>D7 yields 3 adoption/blocker signals</td></tr>
</tbody>
</table>
</div>
<div class="panel span-4">
<h2>LINE Ask</h2>
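<div class="script">This round of Codex worker stability work does more than patch the watchdog; it makes frontline workers' self-repair reports part of operations.
Please confirm three things:
1. When context/complete returns fetch failed, may the worker automatically touch and retry once?
2. When the same class of error occurs 2 times in a row, should a repair proposal be generated automatically?
3. Which self-heal actions require approval from Louis or the PLS owner?
Until a reply arrives, the worker only does diagnosis / retry once / proposal, and does not release/reclaim.</div>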
</div>
<div class="panel span-12">
<h2>D1 / D7 / D14 / D30</h2>
<div class="timeline">
<div class="stage"><h3>D1 · Console + Policy</h3><ul><li>Publish the self-repair adoption console.</li><li>Define the retry once, artifact guard, and repair proposal policies.</li><li>Upload the production pack.</li></ul></div>
<div class="stage"><h3>D7 · Worker Adoption</h3><ul><li>Collect 3 frontline worker friction/repair signals.</li><li>Verify stuck_claim and artifact_guard.</li><li>Output the first reliability scorecard.</li></ul></div>
<div class="stage"><h3>D14 · Workflow App</h3><ul><li>Connect the PLS job/lease/progress/artifact tables.</li><li>Generate repair proposals automatically.</li><li>Build the policy approval matrix.</li></ul></div>
<div class="stage"><h3>D30 · Operating Model</h3><ul><li>Worker self-repair enters the PLS ops console.</li><li>Review learning memory weekly.</li><li>Use adoption evidence to revise worker prompts/tools.</li></ul></div>
</div>
</div>
<div class="panel span-12">
<h2>Purpose-to-Purpose E2E</h2>
<div class="flow">
<div class="step"><b>Original purpose</b><span class="small">Codex Session / workers must advance projects reliably and strengthen the Huber avatar.</span></div>
<div class="step"><b>Main deliverable</b><span class="small">Self-repair adoption console, data model, acceptance tests, policy LINE ask.</span></div>
<div class="step"><b>Human adoption</b><span class="small">The PLS owner decides self-heal policy; workers report friction.</span></div>
<div class="step"><b>System improvement</b><span class="small">fetch failed, schema error, and artifact missing become verifiable events.</span></div>
<div class="step"><b>Value path</b><span class="small">Less manual monitoring, fewer false completions, and AI delivery capacity accumulated into standing practice.</span></div>
</div>
</div>
<div class="panel span-6">
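<h2>Mature Market Practice</h2>
<p>Google SRE monitoring stresses managing production reliability through actionable symptoms; McKinsey's gen AI transformation work points out that value comes from the operating model, workflow redesign, and talent/adoption reinforcement. This round combines the two: it not only monitors workers but also lets frontline workers raise repair signals, which is the production-relations adjustment.</p>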
<ul>
<li>https://sre.google/resources/book-update/monitoring-distributed-systems/</li>
<li>https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/gen-ais-next-inflection-point-from-employee-experimentation-to-organizational-transformation</li>
</ul>
</div>
<div class="panel span-6">
<h2>Data / API / Permission</h2>
<div class="code">tables:
worker_repair_events
worker_policy_decisions
artifact_gate_checks
learning_memory_updates
apis:
POST /worker-repair-events
POST /artifact-gate-checks
PATCH /worker-policy/:policy_key
roles:
worker: diagnose/propose
PLS owner: approve policy
Louis: approve risky self-heal</div>
</div>
<div class="panel span-4">
<h2>Production Acceptance</h2>
<p><b>Owner:</b> PLS platform / Louis.</p>
<p><b>Due:</b> D7 worker adoption proof.</p>
<p><b>Pass:</b> 3 repair/friction events, 1 artifact guard pass, 1 policy decision.</p>
</div>
<div class="panel span-4">
<h2>Solution Selection</h2>
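<p>Selected <code>watchdog + governance + project</code>: it responds to this round's "production-relations adjustment" signal better than a plain watchdog; it stops short of a full agent because self-heal permissions still need policy approval.</p>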
</div>
<div class="panel span-4">
<h2>Next Upgrade</h2>
<p>After D7, connect real worker repair events to the PLS ops console; at D14 build the policy approval matrix; at D30 use learning memory to improve worker prompts/tooling automatically.</p>
</div>
</section>
</main>
</body>
</html>