j-rig-binary-eval

Software-grade release discipline for Claude Skills

Binary evaluation harness that treats SKILL.md artifacts as production software. Package integrity, trigger precision, functional quality, regression gating, baseline comparison, model-aware testing, and evidence-backed rollout decisions — all through binary yes/no criteria with external evaluators.

Links: GitHub · Master Blueprint · Epic Index


One-Pager

The Problem

Claude Skills ship on instinct. A skill author writes a SKILL.md, eyeballs it, maybe runs it once, and pushes. There is no regression gate, no trigger precision measurement, no baseline comparison, no model-variance tracking, and no evidence trail for rollout decisions. When a skill breaks silently after a model update, or a description tweak causes sibling confusion across a pack, nobody knows until users complain.

The Solution

J-Rig Binary Eval is a seven-layer evaluation harness that scores every skill change across seven product surfaces before it ships:

  1. Package Integrity — Does it parse, validate, and reference real files?
  2. Trigger Quality — Does it fire on the right prompts and stay silent on the wrong ones?
  3. Functional Quality — Does it complete its task and produce correct artifacts?
  4. Regression Protection — Did this change break anything that previously worked?
  5. Baseline Value — Does the skill actually outperform the naked model?
  6. Model Variance — Does it work across Haiku, Sonnet, and Opus?
  7. Rollout Safety — Any prompt leakage, overreach, or unsafe automation?

Every criterion is binary (yes/no). The evaluator is always separate from the skill under test. Observed behavior outranks claimed behavior.
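
To make those two rules concrete, here is a minimal sketch in the project's TypeScript/zod stack of what a binary criterion result and a separate external judge could look like. The field names, the surface enum, and the `ExternalJudge` interface are illustrative assumptions, not the project's actual schema.

```typescript
import { z } from "zod";

// Hypothetical shape of a single binary criterion result.
// Every criterion resolves to a hard yes/no plus the evidence behind it.
export const CriterionResult = z.object({
  criterionId: z.string(),
  surface: z.enum([
    "package_integrity",
    "trigger_quality",
    "functional_quality",
    "regression_protection",
    "baseline_value",
    "model_variance",
    "rollout_safety",
  ]),
  passed: z.boolean(),  // binary: yes or no, never a fuzzy score
  blocker: z.boolean(), // a failed blocker cannot be averaged out
  evidence: z.string(), // what was actually observed
});
export type CriterionResult = z.infer<typeof CriterionResult>;

// The evaluator is always separate from the skill under test:
// a judge only sees the observed output, never the skill's own claims.
export interface ExternalJudge {
  judge(observedOutput: string, criterionId: string): Promise<CriterionResult>;
}
```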

W5

Who Claude Skill authors, skill pack maintainers, enterprise skill library operators
What Evaluation harness + regression gate + optimization engine for Claude Skills
Where Local CLI (author workflow), CI/CD (PR gate), team dashboard (reporting)
When Every skill change: new skill, description edit, body rewrite, model update
Why Skills are production software — they need release-quality discipline, not vibes

Stack

| Layer | Technology |
| --- | --- |
| Runtime | TypeScript, Node.js 20+, pnpm |
| CLI/Parsing | commander, @clack/prompts, picocolors, yaml, unified/remark |
| Validation | zod |
| LLM Integration | @anthropic-ai/sdk |
| Persistence | better-sqlite3, drizzle-orm |
| Concurrency | p-limit, async-retry |
| Artifact Extraction | pdf-parse, mammoth |
| Dashboard (future) | Next.js, Tailwind, shadcn/ui |
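
As a rough illustration of how the execution pieces of this stack might fit together, the sketch below drives test prompts through the Anthropic SDK with bounded concurrency via p-limit. The `runCase` helper, the prompt shape, and the idea of passing model IDs in as parameters are assumptions for the sketch, not the harness's actual API.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import pLimit from "p-limit";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const limit = pLimit(4);        // cap concurrent model calls

// Hypothetical helper: run one test prompt against one model and capture the text output.
async function runCase(model: string, prompt: string): Promise<string> {
  const message = await client.messages.create({
    model,
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  const block = message.content[0];
  return block.type === "text" ? block.text : "";
}

// Model-aware testing: the same case is executed independently per model.
export async function runAcrossModels(prompt: string, models: string[]) {
  return Promise.all(models.map((model) => limit(() => runCase(model, prompt))));
}
```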

Key Differentiators

  • Binary criteria only — if a criterion can't be answered yes or no, it isn't ready. No fuzzy scores, no vibes.
  • External evaluators — the skill under test never judges itself. Deterministic checks first, LLM judges second.
  • Sacred regressions — a change that improves average score but breaks a sacred case is rejected. Period.
  • One change at a time — the optimizer proposes exactly one atomic change per experiment. No multi-variable confusion.
  • Baseline gating — if the base model already does the job without the skill, the skill gets flagged for obsolete review.
  • Model-aware — Haiku, Sonnet, and Opus are tested independently. Model variance is product reality, not noise.
  • Evidence-backed rollout — every ship/no-ship decision comes with a structured evidence trail.
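
The "sacred regressions" differentiator above can be made concrete with a small sketch: a hypothetical decision helper that compares per-case pass maps for the previous and candidate skill versions and rejects the change if any sacred case flips from pass to fail, regardless of the overall pass rate. The shapes and the pass-rate tiebreaker are assumptions for illustration.

```typescript
// Hypothetical per-case results for one skill version: caseId -> passed.
type CaseResults = Map<string, boolean>;

export function acceptCandidate(
  previous: CaseResults,
  candidate: CaseResults,
  sacredCaseIds: Set<string>
): { accept: boolean; reason: string } {
  // Sacred regressions: any sacred case that flips pass -> fail blocks the change,
  // no matter how much the average improved.
  for (const caseId of sacredCaseIds) {
    if (previous.get(caseId) === true && candidate.get(caseId) === false) {
      return { accept: false, reason: `sacred case regressed: ${caseId}` };
    }
  }

  const passRate = (r: CaseResults) =>
    [...r.values()].filter(Boolean).length / Math.max(r.size, 1);

  return passRate(candidate) >= passRate(previous)
    ? { accept: true, reason: "no sacred regressions and pass rate did not drop" }
    : { accept: false, reason: "overall pass rate dropped" };
}
```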

Operator-Grade System Analysis

Architecture (Seven Layers)

┌─────────────────────────────────────────────────┐
│                   CLI / CI / API                 │  Layer 7: Surfaces
├─────────────────────────────────────────────────┤
│                 Evidence Layer                   │  Layer 6: Persistence
├─────────────────────────────────────────────────┤
│               Optimization Layer                 │  Layer 5: Experiments
├─────────────────────────────────────────────────┤
│                Judgment Layer                    │  Layer 4: Scoring
├─────────────────────────────────────────────────┤
│              Observation Layer                   │  Layer 3: Capture
├─────────────────────────────────────────────────┤
│               Execution Layer                    │  Layer 2: Harness
├─────────────────────────────────────────────────┤
│                  Spec Layer                      │  Layer 1: Contracts
└─────────────────────────────────────────────────┘
| Layer | Responsibility | Key Entities |
| --- | --- | --- |
| Spec | Human-authored YAML eval contracts, criteria, test cases | eval_specs, criteria, test_cases |
| Execution | Runs skills against trigger, functional, regression, adversarial, baseline cases | runs, skill_versions |
| Observation | Captures outputs, artifacts, cost, latency, timing, observed outcomes | observed_outcomes, outputs |
| Judgment | Deterministic checks first, external LLM judges second, calibration, disagreement handling | criterion_results |
| Optimization | Failure clustering, weakest-criterion targeting, single atomic changes, accept/reject/revert | experiments |
| Evidence | Stores runs, scores, artifacts, diffs, regressions, baselines, launch reports | regressions, baselines, launch_reports |
| CLI/CI/API | Local author workflows, PR gating, team reporting, dashboard | — |
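
To make the Spec layer concrete, here is a hedged sketch of how a human-authored YAML eval contract might be parsed and validated with the stack's yaml and zod packages. The field names and contract shape are illustrative assumptions; the real schema is defined in Epic 02.

```typescript
import { readFileSync } from "node:fs";
import { parse } from "yaml";
import { z } from "zod";

// Illustrative contract shape: a spec bundles binary criteria and test cases.
const EvalSpec = z.object({
  skill: z.string(), // slug of the skill under test
  criteria: z.array(
    z.object({
      id: z.string(),
      question: z.string(), // must be answerable yes/no
      blocker: z.boolean().default(false),
    })
  ),
  testCases: z.array(
    z.object({
      id: z.string(),
      kind: z.enum(["trigger", "functional", "regression", "adversarial", "baseline"]),
      prompt: z.string(),
      sacred: z.boolean().default(false),
    })
  ),
});

// Load a contract from disk and fail loudly if it does not match the schema.
export function loadSpec(path: string) {
  return EvalSpec.parse(parse(readFileSync(path, "utf8")));
}
```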

Epic Roadmap (10 Epics, Sequential)

| # | Epic | Scope |
| --- | --- | --- |
| 01 | Repo Foundation | Workspace skeleton, governance, CI |
| 02 | Spec Layer | YAML eval contracts, criteria schema, test case format |
| 03 | Package Integrity | Deterministic structure/metadata validation |
| 04 | Evidence Layer | SQLite persistence, run lifecycle, evidence serialization |
| 05 | Trigger Harness | Roster builder, trigger simulation, precision/recall |
| 06 | Functional Execution | Skill invocation, context injection, artifact capture |
| 07 | Judgment Layer | Binary judge engine, calibration, per-model matrix |
| 08 | Regression/CLI/CI | Regression comparison, baseline gating, score aggregation, CLI, PR gate |
| 09 | Optimizer | Failure clustering, one-change proposals, experiment runner |
| 10 | Team Product | Dashboard, eval packs, drift reevaluation, obsolete-review |
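
Epic 05's precision/recall numbers are standard metrics over trigger observations. A minimal sketch, assuming a hypothetical per-case record of whether the skill should have fired and whether it actually did:

```typescript
// One trigger observation: the prompt either should or should not fire the skill,
// and the harness records whether it actually fired.
interface TriggerObservation {
  shouldFire: boolean;
  didFire: boolean;
}

export function triggerMetrics(observations: TriggerObservation[]) {
  const tp = observations.filter((o) => o.shouldFire && o.didFire).length;
  const fp = observations.filter((o) => !o.shouldFire && o.didFire).length;
  const fn = observations.filter((o) => o.shouldFire && !o.didFire).length;

  return {
    precision: tp + fp === 0 ? 1 : tp / (tp + fp), // fired only when it should
    recall: tp + fn === 0 ? 1 : tp / (tp + fn),    // fired whenever it should
  };
}
```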

Non-Negotiable Design Principles

  1. Criteria must be binary — yes or no, no gradients
  2. Evaluator is always separate — the skill never judges itself
  3. Observed behavior outranks claimed behavior — grade what happened, not what the skill says it does
  4. Regression tests are sacred — a regression on a sacred case blocks release regardless of average improvement
  5. One change at a time — optimizer proposes exactly one atomic change per experiment
  6. Blockers block release — a blocker failure cannot be averaged out
  7. Baseline value matters — if the naked model matches the skill, flag for obsolete review
  8. Model-aware testing is required — Haiku/Sonnet/Opus differences are product reality
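
Principles 1, 6, and 7 translate directly into a small aggregation sketch. The verdict values and the `baselineAlreadyPasses` flag are assumptions made for illustration; the real aggregation lives in Epic 08.

```typescript
// Minimal local shapes for the sketch.
interface CriterionOutcome {
  passed: boolean;  // binary only (principle 1)
  blocker: boolean; // blocker failures cannot be averaged out (principle 6)
}

export function rolloutVerdict(
  results: CriterionOutcome[],
  baselineAlreadyPasses: boolean
): "ship" | "no-ship" | "obsolete-review" {
  // Principle 6: any failed blocker blocks release outright.
  if (results.some((r) => r.blocker && !r.passed)) return "no-ship";

  // Principle 7: if the naked model already does the job, flag for obsolete review.
  if (baselineAlreadyPasses) return "obsolete-review";

  // Everything else is still a hard yes/no per criterion.
  return results.every((r) => r.passed) ? "ship" : "no-ship";
}
```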

Reference Library (32 files)

Self-contained library of templates, reference standards, agent patterns, and workflow diagrams:

| Directory | Contents |
| --- | --- |
| templates/skill-templates/ | 6 SKILL.md structural patterns |
| templates/eval-schemas/ | Eval JSON schemas |
| references/skill-standards/ | AgentSkills.io spec, source-of-truth, frontmatter, validation rules |
| references/eval-patterns/ | Eval methodology, workflows, output patterns |
| references/agents/ | Grader, comparator, analyzer agent patterns |
| references/enterprise-standards/ | 100-point rubric, production validator schema registry |
| references/drift-and-consistency/ | Drift categories, source-of-truth hierarchy |
| references/epic-workflows/ | 10 ASCII workflow diagrams (one per epic) |

Current Status

Phase: Planning complete, Epic 01 in progress (repo foundation).

No application code yet — the repository contains governance files, 10 epic reference documents, the master build blueprint, and a 32-file reference library mapped to specific implementation beads.


Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.2.7] - 2026-03-25

Added

  • Templates & references library (32 files)
    • 6 skill templates from skill-creator (Tier 1)
    • Eval JSON schemas
    • 4 skill-standards references (AgentSkills.io spec, source-of-truth, frontmatter, validation)
    • 3 eval-patterns references
    • 3 agent patterns (grader, comparator, analyzer)
    • 2 enterprise standards
    • 2 drift-and-consistency references
    • 10 epic workflow diagrams
  • Epic reference documents 05-10 (6 files, ~3000 lines)
  • Pattern A README with one-pager and operator-grade system analysis

Changed

  • Audited library for bloat: removed 9 files (975 lines) that were either already consumed or belonged to the wrong product
  • Added cross-reference headers to skill-standards files
  • Mapped all library files to specific beads (43 bd update commands)

[0.2.6] - 2026-03-24

Added

  • Epic 04 reference file (evidence layer, persistence, run lifecycle)

[0.2.5] - 2026-03-24

Added

  • Epic 03 reference file (package integrity and deterministic checks)

[0.2.4] - 2026-03-24

Added

  • Epic 02 reference file (spec layer and contract system)

[0.2.3] - 2026-03-24

Added

  • Epic index and Epic 01 reference file

[0.2.2] - 2026-03-24

Added

  • Master build blueprint (007-PP-PLAN)

[0.2.1] - 2026-03-24

Fixed

  • Cleaned up duplicate CHANGELOG entry from the release workflow

[0.2.0] - 2026-03-24

Added

  • Beads issue tracking integration
  • Document filing index (000-INDEX.md)

[0.1.0] - 2026-03-24

Added

  • Initial project setup with full governance
  • README, LICENSE, CODE_OF_CONDUCT, CONTRIBUTING, SECURITY, SUPPORT
  • CI/CD workflows (lint, test, release automation)
  • Enterprise documentation set (6-doc planning suite)
  • GitHub issue templates and PR template
  • Dependabot configuration
  • EditorConfig and gitattributes

Author: Jeremy Longshore · Intent Solutions
License: MIT
