PLS job a766341b anti Tokenmaxxing metric governance production pack

Acceptance Tests

  1. Primary artifact URL returns HTTP 200.
  2. Required appendix files exist.
  3. D1 / D7 / D14 / D30 path exists.
  4. Purpose-to-purpose E2E connects metric design to behavior, value, and risk.
  5. Token consumed, tool launches, and AI usage leaderboard are blocked by default.
  6. Task completion rate, actual time saved, customer satisfaction, and output quality score are allowed only with evidence and anti-gaming checks.
  7. Data model includes schema, API, sync, permissions, and audit.
  8. Market maturity includes at least two external sources.
  9. People sync and learning memory exist.
  10. Decision record explains why governance/eval/dashboard was selected.

E2E Scenario

Given a team proposes token_consumed as a performance metric, when the metric registry review runs, then the metric is blocked, rationale references Tokenmaxxing risk, and the system recommends task completion rate plus quality score as alternatives.
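The scenario above can be sketched as a small default gate. This is a minimal illustration, not the Operating Console implementation: the names `BLOCKED_BY_DEFAULT`, `SUGGESTED_ALTERNATIVES`, `ReviewResult`, and `review_metric` are hypothetical, and the metric keys other than `token_consumed` are assumed spellings of the metrics named in the acceptance tests.

```python
from dataclasses import dataclass, field

# Assumed keys for the three metrics that are blocked by default.
BLOCKED_BY_DEFAULT = {"token_consumed", "tool_launches", "ai_usage_leaderboard"}
# Assumed keys for the alternatives the scenario recommends.
SUGGESTED_ALTERNATIVES = ["task_completion_rate", "output_quality_score"]


@dataclass
class ReviewResult:
    metric_key: str
    decision: str  # one of the metric_reviews decisions, e.g. "block"
    rationale: str
    alternatives: list[str] = field(default_factory=list)


def review_metric(metric_key: str) -> ReviewResult:
    """Run the default gate for a proposed performance metric."""
    if metric_key in BLOCKED_BY_DEFAULT:
        return ReviewResult(
            metric_key=metric_key,
            decision="block",
            rationale="Tokenmaxxing risk: rewards activity volume, not verified outcomes.",
            alternatives=list(SUGGESTED_ALTERNATIVES),
        )
    # Metrics off the block list still need evidence and anti-gaming checks.
    return ReviewResult(
        metric_key=metric_key,
        decision="request_changes",
        rationale="Needs evidence and anti-gaming checks before approval.",
    )
```

Calling `review_metric("token_consumed")` yields a `block` decision whose rationale mentions Tokenmaxxing and whose alternatives list the two recommended metrics.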

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Operating Console: Anti-Tokenmaxxing Metric Governance Console</title>
<style>
:root{--ink:#18212f;--muted:#627083;--line:#d9e1e8;--paper:#f6f8fb;--card:#fff;--blue:#1d4ed8;--green:#0f7f5c;--amber:#a16207;--red:#b3361d;--violet:#6d28d9}
*{box-sizing:border-box}body{margin:0;background:var(--paper);color:var(--ink);font-family:Inter,ui-sans-serif,system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;line-height:1.5}
header{background:#fff;border-bottom:1px solid var(--line);padding:28px clamp(20px,4vw,56px)}main{padding:24px clamp(20px,4vw,56px) 48px}
h1{margin:0 0 12px;font-size:clamp(30px,4vw,52px);line-height:1.05;max-width:1080px}h2{margin:0 0 12px;font-size:22px}h3{margin:0 0 6px;font-size:16px}p{margin-top:0}code{background:#eef3f8;padding:1px 5px;border-radius:4px}
.sub{max-width:1080px;color:var(--muted);font-size:17px}.grid{display:grid;gap:16px}.kpis{grid-template-columns:repeat(4,minmax(0,1fr));margin-top:22px}.two{grid-template-columns:1.08fr .92fr}.three{grid-template-columns:repeat(3,minmax(0,1fr))}.timeline{grid-template-columns:repeat(4,minmax(0,1fr))}.flow{grid-template-columns:repeat(5,minmax(0,1fr))}
.card{background:var(--card);border:1px solid var(--line);border-radius:8px;padding:18px;box-shadow:0 1px 2px rgba(24,33,47,.04)}.metric{font-size:34px;font-weight:780}.label{color:var(--muted);font-size:13px}
.pill{display:inline-flex;border:1px solid var(--line);border-radius:999px;padding:4px 10px;font-size:12px;background:#fff;margin:0 6px 8px 0;white-space:nowrap}.ok{color:var(--green)}.warn{color:var(--amber)}.bad{color:var(--red)}.info{color:var(--blue)}
table{width:100%;border-collapse:collapse;font-size:14px}th,td{text-align:left;padding:10px;border-bottom:1px solid var(--line);vertical-align:top}th{color:var(--muted);font-size:12px;text-transform:uppercase}.badcell{color:var(--red);font-weight:700}.goodcell{color:var(--green);font-weight:700}
.day{border-left:4px solid var(--violet)}.step{border:1px solid var(--line);border-radius:8px;padding:12px;min-height:126px;background:#fbfdff}.step strong{display:block;color:var(--violet);margin-bottom:6px}.source a{color:var(--blue);word-break:break-word}
@media(max-width:920px){.kpis,.two,.three,.timeline,.flow{grid-template-columns:1fr}h1{font-size:34px}}
</style>
</head>
<body>
<header>
<span class="pill info">PLS production delivery pack</span><span class="pill ok">Solution: governance / eval / dashboard</span>
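<h1>Operating Console: Anti-Tokenmaxxing Metric Governance Console</h1>
<p class="sub">Upgrades "anti-Tokenmaxxing" from a document section into an acceptance-testable metric governance system. It explicitly blocks activity metrics that invite padding (token consumption, tool launch counts, AI usage leaderboards) and replaces them with task completion rate, actual time saved, customer satisfaction, output quality score, and anti-gaming audits.</p>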
<section class="grid kpis">
<div class="card"><div class="metric bad">3</div><div class="label">Blocked metrics: tokens, launch counts, usage leaderboard</div></div>
<div class="card"><div class="metric ok">4</div><div class="label">Replacement metrics: completion rate, time saved, CSAT, quality</div></div>
<div class="card"><div class="metric">D7</div><div class="label">First version of the metric registry and gate complete</div></div>
<div class="card"><div class="metric">D30</div><div class="label">Integrated into Operating Console performance governance</div></div>
</section>
</header>
<main class="grid">
<section class="grid two">
<div class="card">
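<h2>Problem This Round</h2>
<p>If Operating Console treats token consumption, tool launch counts, and AI usage leaderboards as performance results, it pushes people toward "looking very AI" rather than "actually completing tasks." This is a textbook case of Goodhart's law: once a measure becomes a target, it ceases to be a good measure.</p>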
<span class="pill">Owner: Operating Console owner</span><span class="pill">Due: D7 metric registry</span><span class="pill">Acceptance: bad metrics blocked</span>
</div>
<div class="card">
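<h2>Solution Selection</h2>
<p><strong>governance / eval / dashboard</strong>. This is not a one-off addition to the spec; it is a risk in the performance evaluation system. It needs a metric registry, a blocked list, replacement metrics, a review workflow, audit, and exception approval.</p>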
</div>
</section>
<section class="card">
<h2>D1 / D7 / D14 / D30 Path</h2>
<div class="grid timeline">
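<div class="card day"><h3>D1</h3><p>Build the registry of blocked and replacement metrics; define each metric's owner, formula, data source, and anti-gaming check.</p></div>
<div class="card day"><h3>D7</h3><p>Add a governance gate to Operating Console metric settings; bad metrics cannot go live, and exceptions require a decision record.</p></div>
<div class="card day"><h3>D14</h3><p>Connect three real AI workflows and use outcome metrics to verify that the dashboard reflects real value.</p></div>
<div class="card day"><h3>D30</h3><p>Establish AI performance governance: metrics, quality, customer satisfaction, time saved, and audit anomalies feed decisions from a single table.</p></div>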
</div>
</section>
<section class="card">
<h2>Purpose-to-Purpose E2E</h2>
<div class="grid flow">
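<div class="step"><strong>Original purpose</strong>Operating Console should measure AI's real value to the business.</div>
<div class="step"><strong>Risk</strong>Tokenmaxxing pushes people toward higher consumption and performative usage.</div>
<div class="step"><strong>Governance</strong>The metric registry blocks bad metrics and replaces them with outcome/quality/time/customer metrics.</div>
<div class="step"><strong>Adoption</strong>Managers evaluate on verifiable results; employees focus on completing tasks and raising quality.</div>
<div class="step"><strong>Result</strong>Less waste, higher task completion rate, better customer satisfaction, and no system-induced bad behavior.</div>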
</div>
</section>
<section class="grid two">
<div class="card">
<h2>Metric Registry Gate</h2>
<table>
<thead><tr><th>Metric</th><th>Status</th><th>Reason / Replacement</th></tr></thead>
<tbody>
<tr><td>Token consumed</td><td class="badcell">Blocked</td><td>Invites padding and inefficiency; replace with task completed per verified outcome.</td></tr>
<tr><td>Tool launches</td><td class="badcell">Blocked</td><td>A launch is not adoption; replace with workflow completion rate.</td></tr>
<tr><td>AI usage leaderboard</td><td class="badcell">Blocked</td><td>Invites ranking anxiety and performative use; replace with team outcome score.</td></tr>
<tr><td>Task completion rate</td><td class="goodcell">Allowed</td><td>Requires defined completion evidence and a quality threshold.</td></tr>
<tr><td>Actual time saved</td><td class="goodcell">Allowed</td><td>Requires a baseline and sampled verification.</td></tr>
<tr><td>Customer satisfaction</td><td class="goodcell">Allowed</td><td>Must be linked to the AI-assisted workflow; avoid single-point attribution.</td></tr>
<tr><td>Output quality score</td><td class="goodcell">Allowed</td><td>Requires a rubric, reviewers, a sample size, and a dispute process.</td></tr>
</tbody>
</table>
</div>
<div class="card">
<h2>Data / API / Permissions</h2>
<p><strong>Tables:</strong> <code>metric_registry</code>, <code>metric_reviews</code>, <code>metric_observations</code>, <code>gaming_signals</code>, <code>governance_exceptions</code>.</p>
<p><strong>APIs:</strong> <code>POST /console/metrics/register</code>, <code>POST /console/metrics/:id/review</code>, <code>GET /console/metrics/governance-scorecard</code>.</p>
<p><strong>Permissions:</strong> team owner can propose metrics; governance owner approves; Louis can override with reason; blocked metrics require exception audit.</p>
</div>
</section>
<section class="grid three">
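<div class="card"><h2>Value / Money Path</h2><p>Avoid spending budget on tokens and tool launch counts; direct investment toward time saved, quality, customer satisfaction, and task completion to reduce institutional waste.</p></div>
<div class="card"><h2>People Capability Uplift</h2><p>Managers learn to design metrics that are hard to game; employees understand that the goal of using AI is to deliver results, not to pile up activity.</p></div>
<div class="card"><h2>Next-Round Upgrade</h2><p>Connect to the actual Operating Console metric settings UI and add a bad metric blocker, a metric review workflow, and gaming alerts.</p></div>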
</section>
<section class="card source">
<h2>Market Maturity Inputs</h2>
<p>McKinsey notes productivity data can damage organizations if simple activity metrics such as lines of code or commit counts are misused: <a href="https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/yes-you-can-measure-software-developer-productivity?cid=other-eml-mtg-mip-mck">McKinsey developer productivity measurement</a>.</p>
<p>The SPACE framework balances Satisfaction, Performance, Activity, Communication, and Efficiency to avoid over-optimizing one visible activity metric: <a href="https://space-framework.com/">SPACE framework</a>.</p>
<p>DORA metrics connect delivery performance with reliability and stability rather than raw activity volume: <a href="https://dora.dev/guides/dora-metrics/">DORA metrics guide</a>.</p>
</section>
</main>
</body>
</html>

Data Model

metric_registry

  • id: uuid primary key.
  • metric_key: text unique.
  • metric_name: text.
  • category: enum outcome, quality, time_saved, customer, activity, blocked.
  • status: enum proposed, approved, blocked, deprecated, exception.
  • formula: text.
  • data_source: text.
  • owner_id: uuid.
  • anti_gaming_check: text.
  • quality_gate: text.

metric_reviews

  • id: uuid primary key.
  • metric_id: uuid.
  • reviewer_id: uuid.
  • decision: enum approve, block, request_changes, exception.
  • rationale: text.
  • reviewed_at: timestamptz.

metric_observations

  • id: uuid primary key.
  • metric_id: uuid.
  • subject_type: enum workflow, team, project, customer, artifact.
  • subject_id: text.
  • value: numeric.
  • baseline_value: numeric nullable.
  • evidence_ref: text.
  • observed_at: timestamptz.

gaming_signals

  • id: uuid primary key.
  • metric_id: uuid.
  • signal_type: enum activity_spike, quality_drop, token_spike, tool_launch_spike, leaderboard_chasing, manual_report_mismatch.
  • severity: enum low, medium, high.
  • evidence_json: jsonb.
  • created_at: timestamptz.
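As one way to populate `gaming_signals` from `metric_observations`, the sketch below flags an activity spike or a quality drop by comparing the latest value with the mean of earlier values. The thresholds (2.0x for a spike, 0.8x for a drop), the severity rule, and the function names are assumptions for illustration; the source does not specify a detection method.

```python
from statistics import mean
from typing import Optional


def spiked(values: list[float], ratio: float = 2.0) -> bool:
    """True when the latest value exceeds `ratio` times the mean of earlier values."""
    if len(values) < 2:
        return False
    baseline = mean(values[:-1])
    return baseline > 0 and values[-1] > ratio * baseline


def dropped(values: list[float], ratio: float = 0.8) -> bool:
    """True when the latest value falls below `ratio` times the mean of earlier values."""
    if len(values) < 2:
        return False
    baseline = mean(values[:-1])
    return baseline > 0 and values[-1] < ratio * baseline


def detect_signal(activity: list[float], quality: list[float]) -> Optional[dict]:
    """Return a gaming_signals-shaped dict, or None when nothing looks unusual."""
    activity_up = spiked(activity)
    quality_down = dropped(quality)
    if not (activity_up or quality_down):
        return None
    return {
        "signal_type": "activity_spike" if activity_up else "quality_drop",
        # Assumed rule: activity rising while quality falls is the strongest hint.
        "severity": "high" if activity_up and quality_down else "medium",
        "evidence_json": {"activity": activity, "quality": quality},
    }
```

A real detector would likely also need per-team baselines and the sampled verification mentioned for time-saved metrics; this sketch only shows the shape of the check.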

governance_exceptions

  • id: uuid primary key.
  • metric_id: uuid.
  • requested_by: uuid.
  • reason: text.
  • expires_at: timestamptz.
  • approved_by: uuid nullable.
  • status: enum pending, approved, rejected, expired.
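Because every exception carries `expires_at`, a blocked metric should become blocked again once the exception lapses. The sketch below shows one possible check over the `governance_exceptions` fields; the class and function names are hypothetical, and the rule that an approver must be recorded is an assumption drawn from the `approved_by` column.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class GovernanceException:
    metric_id: str
    requested_by: str
    reason: str
    expires_at: datetime
    approved_by: Optional[str] = None
    status: str = "pending"  # pending, approved, rejected, expired


def effective_status(exc: GovernanceException, now: datetime) -> str:
    """Treat an approved exception as expired once `expires_at` has passed."""
    if exc.status == "approved" and now >= exc.expires_at:
        return "expired"
    return exc.status


def allows_blocked_metric(exc: GovernanceException, now: datetime) -> bool:
    """Only an approved, unexpired exception with a named approver lifts the block."""
    return exc.approved_by is not None and effective_status(exc, now) == "approved"


if __name__ == "__main__":
    exc = GovernanceException(
        metric_id="m1",
        requested_by="u1",
        reason="cost diagnostics only",
        expires_at=datetime(2026, 6, 30, tzinfo=timezone.utc),
        approved_by="u2",
        status="approved",
    )
    print(allows_blocked_metric(exc, datetime(2026, 6, 1, tzinfo=timezone.utc)))
```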

API / Sync

  • POST /console/metrics/register
  • POST /console/metrics/:id/review
  • GET /console/metrics/governance-scorecard
  • POST /console/metrics/:id/observations
  • POST /console/metrics/:id/exceptions

Permissions / Audit

Team owners can propose metrics. Governance owner approves or blocks. Louis can override with reason. Blocked metrics require exception approval and expiry. Every decision writes append-only audit fields.
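The permission rules above could be enforced roughly as follows. This is a sketch under assumptions: the role names, the `override` role standing in for Louis, and the in-memory `AUDIT_LOG` list are all hypothetical, and a production system would persist audit entries rather than hold them in a list.

```python
from datetime import datetime, timezone

# Hypothetical role names; the actions mirror the paragraph above.
ROLE_ACTIONS = {
    "team_owner": {"propose"},
    "governance_owner": {"approve", "block"},
    "override": {"override"},  # the source names Louis as the override holder
}

# Append-only by convention: entries are only ever added, never edited or removed.
AUDIT_LOG: list[dict] = []


def record_decision(actor: str, role: str, action: str, metric_key: str, reason: str = "") -> dict:
    """Check the role's permissions, then append the decision to the audit log."""
    if action not in ROLE_ACTIONS.get(role, set()):
        raise PermissionError(f"{role} may not {action}")
    if action == "override" and not reason.strip():
        raise ValueError("override requires a reason")
    entry = {
        "actor": actor,
        "role": role,
        "action": action,
        "metric_key": metric_key,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    AUDIT_LOG.append(entry)
    return entry
```

Rejected attempts raise before anything is written, so the log contains only decisions that were actually permitted.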

Decision Record

Decision

Use governance / eval / dashboard.

Why

The risk is not that the document lacks a paragraph. The risk is that Operating Console may institutionalize bad incentives. Metric design affects behavior, budget, and trust. It therefore needs governance gates, review workflow, audit, and dashboard monitoring.

Options Considered

  • Doc only: too weak; does not block bad metrics.
  • Spreadsheet scorecard: useful but lacks approval and audit.
  • Governance / eval / dashboard: best fit; blocks bad metrics, approves better metrics, and monitors gaming signals.

Adoption Status

Recommended. By D7, implement the metric registry gate in the Operating Console specification and UI.

Feedback Needed If Rejected

Clarify whether blocked metrics are still needed for cost monitoring only, not performance evaluation, and who can approve exceptions.

{
"project": "AI 自建專案:Operating Console 工具規格",
"job_id": "a766341b-1d53-4a61-9363-e6c74790cea2",
"selected_solution": "governance/eval/dashboard",
"learned_signal": "Operating Console spec added anti-Tokenmaxxing guidance: do not use token consumption, tool launch count, or AI usage leaderboard as performance metrics.",
"market_learning": "Mature productivity measurement avoids simple activity metrics and uses balanced outcome, quality, satisfaction, flow, and reliability metrics.",
"next_run_bias": "Treat metric design as incentive governance; block bad metrics before they reach scorecards.",
"must_check_next": [
"Is token consumption used only for cost diagnostics?",
"Are activity metrics excluded from performance evaluation?",
"Does each approved metric have formula, data source, owner, baseline, quality gate, and anti-gaming check?",
"Who can approve metric exceptions?"
]
}

Market Maturity

Comparable Practices

PLS Gap

PLS has the anti-Tokenmaxxing principle in the spec, but needs a production governance gate to block bad metrics from becoming incentives.

This Round Upgrade

This pack adds metric registry, blocked/allowed metric taxonomy, anti-gaming checks, data model, APIs, permissions, acceptance tests, and D30 dashboard path.

People Sync

LINE Draft

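Louis, this round I turned Operating Console's "anti-Tokenmaxxing" section into a metric governance console. Token consumption, tool launch counts, and AI usage leaderboards are blocked by default. Task completion rate, actual time saved, customer satisfaction, and output quality score are allowed, but each must have a formula, data source, baseline, quality gate, and anti-gaming check. For D7, I recommend wiring this registry gate into Operating Console metric settings.

Ask

Please confirm: should token consumption be kept only for cost diagnostics and barred from performance evaluation and leaderboards?

If No Reply

Mark token / tool launch / usage ranking as blocked across the board; any exception requires a decision record and an expiry date.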

Production Brief

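Scenario

Project: maximum advancement of "AI Self-Built Project: Operating Console Tool Specification".

Signal this round: the Operating Console metric design document formally adds an "anti-Tokenmaxxing" section that prohibits token consumption, tool launch counts, and AI usage leaderboards, and uses task completion rate, actual time saved, customer satisfaction, and output quality score instead.

Output This Round

Built the "Anti-Tokenmaxxing Metric Governance Console," upgrading the document section into a production metric governance pack: metric registry, bad metric blocker, replacement metrics, data model, permissions, audit, acceptance tests, and the path to UI/API rollout in the next round.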

D1 / D7 / D14 / D30

  • D1: Build the registry of blocked and replacement metrics.
  • D7: Add a governance gate to Operating Console metric settings.
  • D14: Connect three real AI workflows and validate with outcome metrics.
  • D30: Establish the AI performance governance dashboard.

Owner / Due / Acceptance

  • Owner: Operating Console owner.
  • Governance owner: Louis or delegated metric reviewer.
  • Due: D7 metric registry.
  • Acceptance: bad metrics are blocked; allowed metrics have owner, formula, data source, baseline, quality gate, and anti-gaming check.
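The acceptance rule for allowed metrics can be checked mechanically. The sketch below assumes a metric is represented as a dict keyed like the `metric_registry` columns; the `baseline` key is an assumption, since the data model stores baselines on observations (`baseline_value`) rather than on the registry row.

```python
# Field names follow metric_registry where possible; "baseline" is an assumed key.
REQUIRED_FIELDS = (
    "owner_id",
    "formula",
    "data_source",
    "baseline",
    "quality_gate",
    "anti_gaming_check",
)


def missing_fields(metric: dict) -> list[str]:
    """List required fields that are absent or blank on a proposed metric."""
    return [name for name in REQUIRED_FIELDS if not str(metric.get(name) or "").strip()]


def ready_for_approval(metric: dict) -> bool:
    """A metric meets the acceptance rule only when nothing is missing."""
    return not missing_fields(metric)
```

For example, a proposal that supplies only `owner_id` and `formula` would report the other four fields as missing and stay unapproved.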

Production Readiness

Ready Now

  • Openable governance console.
  • Required production appendix pack.
  • Metric registry model.
  • Blocked metrics and alternatives.
  • D1/D7/D14/D30 path.

Integration Required

  • Add metric registry to Operating Console backend.
  • Add UI blocker for blocked metrics.
  • Add metric review workflow.
  • Add gaming signal detection from observations.

Failure / Rollback

If an activity metric is needed for cost monitoring, label it as diagnostic only and exclude it from performance evaluation. If any blocked metric was already used in scorecards, freeze scorecard decisions and backfill outcome metrics before resuming.

Skill / Tool Usage

Tools Used

  • PLS helper: doctor, touch, claim, context/progress attempted, upload-files, complete.
  • Web search: checked McKinsey developer productivity, SPACE framework, and DORA metrics.
  • GitHub CLI: published the production pack as a public Gist.
  • curl: verified that the primary artifact URL returns HTTP 200.

Evidence

Claim succeeded. Context and progress were temporarily blocked by PLS 502, so the claim payload was used as the production source. External maturity references were checked. The final artifact is published and verified before completion.

Solution Selection

Selected route: governance / eval / dashboard.

This is a metric incentive problem. The output must prevent bad behavior before the Operating Console becomes a management system. The right product shape is a metric governance dashboard with registry, review, anti-gaming checks, and exception audit.

Production stack:

  • Framework: metric proposal -> governance review -> approved/blocked -> observations -> gaming signal -> decision.
  • Workflow: D1 registry, D7 gate, D14 real workflow validation, D30 governance dashboard.
  • Data model: metric registry, reviews, observations, gaming signals, exceptions.
  • Tool: openable HTML governance console.
  • Acceptance: bad metrics blocked by default.
  • Upgrade: Operating Console UI/API integration.
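The framework line above implies a lifecycle over the `metric_registry` status values. One possible transition table is sketched below; the specific allowed moves (for example, that a blocked metric can only proceed through an exception) are assumptions for illustration, not rules stated in the pack.

```python
# Hypothetical transition table over the metric_registry status enum.
ALLOWED_TRANSITIONS = {
    "proposed": {"approved", "blocked"},
    "approved": {"blocked", "deprecated"},  # e.g. blocked after a gaming signal
    "blocked": {"exception"},               # only an exception lifts a block
    "exception": {"blocked"},               # expiry or rejection restores the block
    "deprecated": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Keeping the table as data means a governance owner can review the allowed moves without reading control flow.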