This document formalizes the rubric design space observed across secemp9's three repositories:
- references/main/rubrics/README.md -- Philosophy: (roleplaying == jailbreak == context following) == rubrics
- references/main/gist-ae3976ad/rubric_draft.md -- Playbook: 5 properties, 4 pitfalls, useful weirdness, policy mirrors
- references/main/gist-ff87ac23/red_team_rubric.py -- ComplianceJudge implementation
- references/main/rubrics/books2rubrics/on_writing_well_v1.xml -- ZinsserJudge v1 (232 lines)
- references/main/rubrics/books2rubrics/on_writing_well_v2.xml -- ZinsserJudge v2 (287 lines)
- references/main/rubrics/books2rubrics/on_writing_well_v3.xml -- ZinsserJudge v3 (317 lines)
- references/main/rubrics/special_ones/anti_slop_rubric.xml -- AntiLLMY detection rubric
- references/main/rubrics/special_ones/completeness_rubric.md -- Behavioral constraint rubric
- references/main/rubrics/special_ones/slurs.xml -- Adversarial coercion rubric
- research/rubrify-hands-on-synthesis.md -- Experiment results (5 experiments, 5 realizations)
- research/meta-rubric-reasoning.md -- Meta-rubric design (Approach D: two Python Rubric objects)
- research/api-design-mockups.md -- API design and type hierarchy
Objects. An object in Rub is a rubric specification R = (M, C, D, O, S, P) where:
| Component | Symbol | Type | Grounding |
|---|---|---|---|
| Mission | M | String | <mission> tag in every <LLM_JUDGE_SPEC> (v3.xml:2-7, anti_slop_rubric.xml:2) |
| Criteria | C | {id -> Criterion} | <criterion id="C1" ...> with anchors, weights (v3.xml:22-169) |
| Disqualifiers | D | [Disqualifier] | <dq id="DQ1"> blocks (v3.xml:264-270, anti_slop_rubric.xml:91-95) |
| Output Schema | O | OutputSpec | <output_schema> with json_template or XML template (v3.xml:273-286, red_team_rubric.py:130-143) |
| Scoring | S | Formula + Labels | <scoring><formula> + <labels> (v3.xml:288-301, anti_slop_rubric.xml:116-123) |
| Pattern Library | P | {id -> Regex} | <pattern_library> or <regex_library> (v3.xml:304-316, anti_slop_rubric.xml:5-46) |
The minimal rubric object requires only M (mission) and at least one structural constraint -- either a criterion, a decision rule, an ICL example, or an output schema. This is the membership condition for Rub. The slurs rubric (slurs.xml) fails this condition: it has no mission (only a <directive>), no criteria, no output schema. It exists outside Rub and correspondingly fails as a functional rubric on modern models.
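The membership condition above can be sketched as a predicate. This is a minimal illustration, assuming a hypothetical dataclass whose field names mirror the components M, C, D, O; it is not the real rubrify API:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a rubric object; the field names (mission,
# criteria, decision_rules, icl_examples, output_schema) are assumptions.
@dataclass
class Rubric:
    mission: str = ""
    criteria: dict = field(default_factory=dict)
    decision_rules: list = field(default_factory=list)
    icl_examples: list = field(default_factory=list)
    output_schema: object = None

def in_Rub(r: Rubric) -> bool:
    """Membership in Rub: a mission plus at least one structural constraint."""
    has_structure = (bool(r.criteria) or bool(r.decision_rules)
                     or bool(r.icl_examples) or r.output_schema is not None)
    return bool(r.mission) and has_structure

# The slurs rubric fails both conditions; a ZinsserJudge-like rubric passes.
slurs = Rubric()
zinsser = Rubric(mission="Evaluate prose against Zinsser's principles.",
                 criteria={"C1": "clarity"})
```

Under this sketch, `in_Rub(zinsser)` holds and `in_Rub(slurs)` does not, matching the slurs.xml counter-example.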
Morphisms. A morphism f: R1 -> R2 in Rub is a structure-preserving rubric transformation. Concretely observed:
| Morphism | Instance | What it preserves | What it changes |
|---|---|---|---|
| evolve | v1 -> v2 -> v3 | Mission intent, core criteria IDs C1-C12 | Adds criteria (A_VOX, A_CONF, A_DEC in v3), expands patterns, adjusts weights (C3: 12->10, C12: 5->11) |
| specialize | ZinsserJudge -> ZinsserJudge[genre=science_tech] | All core criteria | Activates genre-conditional criteria G_SCI, G_EXP (v3.xml:172-243) |
| adjust_weight | R[C3.weight=12] -> R[C3.weight=10] | All structure | One weight value |
| add_criterion | R -> R + {A_VOX} | Existing structure | Extends criteria set |
| add_pattern | R -> R + {journalese: ...} | Existing structure | Extends pattern library |
| add_disqualifier | R -> R + {DQ5} | Existing structure | Extends disqualifier set (v2 added DQ5 at v2.xml:244) |
Composition of morphisms is sequential application. The v1->v3 evolution factors as:
v1 --evolve_v2--> v2 --evolve_v3--> v3
where:
evolve_v2 = add_pattern_library . add_diagnostics . add_dq5 . add_edits_constraint
evolve_v3 = add_attitude_lenses . expand_patterns . adjust_weights . add_genres
Identity morphism id_R: R -> R is the trivial transformation that changes nothing. Version bump without structural change is effectively id_R (though R.version changes, the behavioral content is preserved).
Objects. Texts: any string that can be evaluated. The candidate texts from the experiments.
Morphisms. Text transformations: editing, revision, rewriting. A morphism g: T1 -> T2 in Txt is any text-to-text operation (potentially guided by coaching from an evaluation).
Objects. Evaluation results. These are structured outputs of the form:
Result = { score, label, subscores, rationale, evidence, violations, ... }
For scoring rubrics: Result_scoring = (score: [0,100], label: Label, subscores: {id -> int}, ...)
For detection rubrics: Result_detection = (score: [0,15], risk: [0,15], band: Band, ...)
For compliance rubrics: Result_compliance = (verdict: {Yes, Somewhat, No}, rationale: String)
Morphisms. Result comparisons and aggregations.
Objects. Source materials for rubric generation: books (On Writing Well), concepts ("anti-slop detection"), task descriptions ("evaluate compliance"), behavior specs ("force complete code output"), adversarial intents.
Morphisms. Source refinement: expanding a concept into a detailed specification, extracting principles from a book.
Objects. Language models: claude-sonnet-4-6, qwen3-30b-a3b, etc. Each model m defines a function m: SystemPrompt x UserMessage -> CompletionString.
Morphisms. Model substitution: replacing one model with another while preserving the interface contract.
Functor Eval: Rub x Txt -> Res (Evaluation)
The evaluation functor maps a (rubric, text) pair to a result:
Eval(R, T) = parse(m(R.to_xml(), T))
where m is a model and parse extracts structured output according to R.output_schema.
Grounding: This is exactly the data flow in red_team_rubric.py:240-258:
- R.to_xml() -> JUDGE_SYSTEM_PROMPT (lines 22-148)
- T -> build_user_prompt(user_turn, model_response) (lines 189-196)
- m(...) -> call_openrouter(system_prompt, user_prompt) (line 248)
- parse(...) -> extract_xml_field(completion, "Judgement") (line 251)
This is a bifunctor: it is functorial in both arguments. Fixing R gives Eval_R: Txt -> Res (a specific judge). Fixing T gives Eval_T: Rub -> Res (how different rubrics evaluate the same text -- demonstrated by Experiments 1 and 2 where identical text got score 22/100 from ZinsserJudge and score 3/15 risk 12 "Severe" from AntiLLMY).
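The bifunctor structure can be sketched with functools.partial: fixing the rubric yields Eval_R, fixing the text yields Eval_T. The `fake_model` callable and the dict result shape below are illustrative assumptions, not the rubrify API:

```python
from functools import partial

def eval_pair(rubric_xml: str, text: str, model=None) -> dict:
    """Eval(R, T): send (system prompt, user message) to the model and wrap
    the completion. `model` is a stand-in (system, user) -> str callable."""
    completion = model(rubric_xml, text)
    return {"raw": completion}  # a real parse would apply R.output_schema

fake_model = lambda system, user: f"judged {user!r} under {len(system)}-char spec"

# Fixing the first argument gives Eval_R: Txt -> Res (a specific judge).
eval_R = partial(eval_pair, "<LLM_JUDGE_SPEC>...</LLM_JUDGE_SPEC>", model=fake_model)
result = eval_R("Some candidate text.")

# Fixing the text instead gives Eval_T: Rub -> Res.
eval_T = lambda spec: eval_pair(spec, "Some candidate text.", model=fake_model)
```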
Functor Gen: Src -> Rub (Generation / any2rubric)
The generation functor maps source material to a rubric:
Gen(source) = parse_rubric(m(META_GENERATOR.to_xml(), source))
Grounding: The any2rubric pipeline in api-design-mockups.md:316-332. Observed instances:
| Source | Gen(source) |
|---|---|
| On Writing Well (book) | ZinsserJudge v1 (v1.xml) |
| "anti-slop detection" (concept) | AntiLLMY (anti_slop_rubric.xml) |
| "evaluate compliance" (task) | ComplianceJudge (red_team_rubric.py:22-148) |
| "force complete code" (behavior) | Completeness (completeness_rubric.md) |
Functor Ser: Rub -> SysPrompt (Serialization)
Maps rubric objects to XML strings suitable for system prompts:
Ser(R) = R.to_xml()
This is the identity-on-content functor -- it changes representation (Python object -> XML string) without changing semantic content. Grounding: api-design-mockups.md:294 print(rubric.to_xml()).
Functor Par: SysPrompt -> Rub (Parsing)
The left inverse of Ser:
Par(Ser(R)) = R (round-trip property)
Grounding: rubrify.load("on_writing_well_v3.xml") at api-design-mockups.md:77.
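The round-trip property can be checked directly. The `to_xml`/`load` pair below is a toy serializer over a mission-only rubric, standing in for the real rubrify functions:

```python
import xml.etree.ElementTree as ET

def to_xml(rubric: dict) -> str:
    """Ser(R): toy serializer -- mission only, for illustration."""
    root = ET.Element("LLM_JUDGE_SPEC")
    ET.SubElement(root, "mission").text = rubric["mission"]
    return ET.tostring(root, encoding="unicode")

def load(xml_str: str) -> dict:
    """Par(S): toy parser, the left inverse of to_xml."""
    root = ET.fromstring(xml_str)
    return {"mission": root.findtext("mission")}

r = {"mission": "Evaluate prose clarity."}
assert load(to_xml(r)) == r  # Par(Ser(R)) = R
```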
A natural transformation alpha: F => G between two functors F, G: C -> D assigns to each object X in C a morphism alpha_X: F(X) -> G(X) in D, such that for every morphism f: X -> Y in C, G(f) . alpha_X = alpha_Y . F(f).
Natural transformation alpha_m: Eval_m => Eval_m' (Model Swap)
Changing the evaluation model from m to m' while keeping the rubric fixed:
alpha_m(R): Eval_m(R, T) -> Eval_m'(R, T)
The naturality condition says: if you evolve the rubric R -> R' and then evaluate with model m', you get the same result structure as evaluating with m and then mapping the result. This holds when both models respect the output schema -- i.e., both produce valid JSON/XML per R.output_schema.
This is exactly the experiment observation: ZinsserJudge v3 produces the same result structure regardless of whether you run it on claude-sonnet-4-6 or qwen3-30b-a3b (though the scores differ). The output schema (v3.xml:273-286) is the invariant.
Natural transformation beta: Gen_m => Gen_m' (Generator Model Swap)
Changing the model used for rubric generation:
beta(source): Gen_m(source) -> Gen_m'(source)
The naturality condition: the generated rubric should be a valid object in Rub regardless of which model generated it. The META_GENERATOR's output schema enforces this -- it demands <LLM_JUDGE_SPEC> format (meta-rubric-reasoning.md:107-111).
Define the endofunctor Refine: Rub -> Rub:
Refine(R) = evolve(R, Eval(R, corpus))
That is: evaluate rubric R against a test corpus, identify weaknesses, and produce an improved rubric R'.
Unit eta: Id_Rub => Refine: The initial rubric is a valid starting point (the "first draft" of a rubric). eta(R) = R -- any rubric can enter the refinement cycle.
Multiplication mu: Refine . Refine => Refine: Two sequential refinements flatten into a single refinement. mu_R: Refine(Refine(R)) -> Refine(R) -- the v1->v2->v3 chain is Refine applied twice, flattened by mu into one net refinement.
The monad laws hold:
mu . Refine(eta) = id (refining a fresh rubric once = one refinement)
mu . eta_Refine = id (entering refinement then flattening = one refinement)
mu . Refine(mu) = mu . mu_Refine (associativity of sequential refinements)
Grounding: The v1->v2->v3 evolution (rubrify-hands-on-synthesis.md:281-285):
- v1 = eta(initial_extraction_from_Zinsser)
- v2 = Refine(v1) -- added pattern_library, diagnostics, DQ5
- v3 = Refine(v2) = mu(v1) -- added attitude lenses, expanded patterns, reweighted
This is a Kleisli composition: each refinement step takes a rubric and produces a rubric, and the composition is monadic bind.
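The Kleisli reading can be sketched as a fold: each refinement step is an arrow Rubric -> Rubric, and the chain is left-to-right composition. The `evolve` toy below just records what changed; it is an illustration, not the real refinement logic:

```python
from functools import reduce

def evolve(rubric: dict, change: str) -> dict:
    """One Kleisli arrow: rubric in, refined rubric out."""
    return {**rubric,
            "version": rubric["version"] + 1,
            "changelog": rubric["changelog"] + [change]}

def refine_chain(rubric: dict, changes: list) -> dict:
    """Kleisli composition: fold the refinement steps left to right."""
    return reduce(evolve, changes, rubric)

v1 = {"version": 1, "changelog": []}
v3 = refine_chain(v1, ["add pattern_library, diagnostics, DQ5",
                       "add attitude lenses, expand patterns, reweight"])
# v3["version"] == 3, mirroring the v1 -> v2 -> v3 evolution
```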
Product R1 x R2 (Rubric Conjunction)
Evaluating text against both rubrics simultaneously. The product rubric applies all criteria from both and returns a combined result.
Eval(R1 x R2, T) = (Eval(R1, T), Eval(R2, T))
Grounding: Experiments 1 and 2 applied ZinsserJudge v3 and AntiLLMY to the same text. The product would return both a writing quality score and a slop risk score. In Python:
combined_result = rubrify.evaluate_multi([zinsser_v3, anti_slop], text, client, model)
# combined_result.results["ZinsserJudge-XXL"].score = 22
# combined_result.results["AntiLLMY"].risk = 12

Coproduct R1 + R2 (Rubric Disjunction)
A rubric that selects which sub-rubric to apply based on input characteristics. The coproduct maps different text types to different evaluation strategies.
Eval(R1 + R2, T) = Eval(R1, T) if predicate(T)
Eval(R2, T) otherwise
Grounding: ZinsserJudge's genre modules (v3.xml:172-243) are an internal coproduct -- different genre criteria activate based on the genre input field. G_SCI fires for science_tech; G_BUS fires for business and email.
Composition vs. Product. The product preserves independence (two separate evaluations). True composition would merge criteria into a single rubric:
compose(R1, R2) = Rubric(
mission = merge(R1.mission, R2.mission),
criteria = R1.criteria | R2.criteria, # union with conflict resolution
disqualifiers = R1.disqualifiers + R2.disqualifiers,
output_schema = merge(R1.output_schema, R2.output_schema),
scoring = compose_formulas(R1.scoring, R2.scoring),
pattern_library = R1.pattern_library | R2.pattern_library,
)
Terminal object 1 in Rub: The trivial rubric that accepts any text and returns a constant result. Mission: "Accept all input." No criteria, no disqualifiers, score always 100. Every rubric has a unique morphism to 1 (the "forget all evaluation" morphism).
Initial object 0 in Rub: The empty rubric -- no mission, no criteria, no output schema. It cannot evaluate anything. Every rubric has a unique morphism from 0 (the "build from scratch" morphism -- which is what Rubric() with no arguments represents in Python, before any add_criterion calls).
More usefully, the initial evaluable object is the minimal rubric that can produce an evaluation:
R_min = Rubric(
mission = "Evaluate the given text.",
output_schema = OutputSchema(format="json", template='{"score": 0}'),
)
This is the "rubric kernel" -- the smallest object in Rub that Eval can act on.
Model the rubric evaluation system as a discrete-time feedback control system:
┌───────────────────────────────────────────────┐
│ │
v │
┌────────────────┐ ┌──────────────┐ ┌──────────┐ │
│ r(k): Reference │────►│ C(k): Rubric │────►│ P: Model │──┤──► y(k): Result
│ (desired output │ e │ (controller) │ u │ (plant) │ │
│ quality) │────►│ │────►│ │ │
└────────────────┘ └──────────────┘ └──────────┘ │
^ │
│ ┌──────────────┐ │
└─────────│ H: Parser │◄────────────────────┘
│ (sensor) │
└──────────────┘
Signals:
| Signal | Symbol | Type | Grounding |
|---|---|---|---|
| Reference | r(k) | Desired evaluation quality | The playbook's 5 properties (rubric_draft.md:9-15): Objective, Anchored, Complete-but-small, Mechanically checkable, Schema-first |
| Error | e(k) = r(k) - y(k) | Gap between desired and actual | Evaluation result shows text scored poorly; or rubric itself fails meta-evaluation |
| Control input | u(k) | XML system prompt | R.to_xml() -> system message (red_team_rubric.py:214) |
| Plant output | y_raw(k) | Raw model completion | call_openrouter(...) return value (red_team_rubric.py:227) |
| Measured output | y(k) | Parsed result | extract_xml_field(completion, "Judgement") (red_team_rubric.py:250-251) |
| Disturbance | d(k) | Model stochasticity, temperature | temperature: float = 0.0 at red_team_rubric.py:198 -- setting temperature to 0 minimizes disturbance |
The transfer function G(z) of the plant (model) in the z-domain:
Y(z) = G(z) * U(z) + D(z)
where G(z) represents the model's mapping from system prompt to structured output, and D(z) is the stochastic noise.
In the rubric context, this is:
G_m: XML_SystemPrompt x Text -> CompletionString
concretely: G_m(rubric_xml, candidate_text) = model_response
The rubric acts as the controller C that shapes the control input u:
u(k) = C(r(k), e(k)) = rubric.to_xml() // static controller: rubric does not change within a single evaluation
For a static rubric (no feedback within one evaluation), the controller is a proportional controller with gain 1 -- the rubric XML is passed directly as the system prompt.
For the refinement loop (across evaluations), the controller is adaptive:
C(k+1) = evolve(C(k), e(k))
// i.e., the rubric at step k+1 is the rubric at step k, modified based on the error signal
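The adaptive loop above can be sketched as a discrete-time controller update; `evaluate` and `evolve` here are hypothetical stand-ins for the real evaluation and refinement steps:

```python
def refinement_loop(rubric, corpus, evaluate, evolve, target=90, max_iters=3):
    """C(k+1) = evolve(C(k), e(k)): adapt the rubric until the error is small.

    evaluate(rubric, corpus) -> score  plays y(k); target plays r(k).
    """
    for _ in range(max_iters):
        score = evaluate(rubric, corpus)   # y(k)
        error = target - score             # e(k) = r(k) - y(k)
        if error <= 0:
            break                          # reference reached
        rubric = evolve(rubric, error)     # controller update
    return rubric

# Toy plant: score grows with rubric "strictness"; evolve bumps strictness.
evaluate = lambda r, _corpus: 60 + 10 * r["strictness"]
evolve = lambda r, e: {**r, "strictness": r["strictness"] + 1}
final = refinement_loop({"strictness": 0}, corpus=None,
                        evaluate=evaluate, evolve=evolve)
```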
A rubric is stable if the same text evaluated multiple times produces consistent results within an acceptable tolerance:
||y(k_1) - y(k_2)|| < epsilon for all k_1, k_2 with same (R, T)
Stability mechanisms observed in the rubric files:
| Mechanism | Effect on Stability | Evidence |
|---|---|---|
| temperature = 0.0 | Minimizes plant noise D(z) | red_team_rubric.py:198, red_team_rubric.py:248 |
| Anchored criteria | Constrains score interpretation space | Every <anchor_N> in every rubric (e.g., v3.xml:23-28 for C1) |
| Fixed output schema | Forces structural conformity of output | <must_be_json>true (v3.xml:278), <must_use_xml_tags>true (red_team_rubric.py:132) |
| Key order constraint | Reduces output variability | <key_order>score,class,... (v3.xml:280) |
| Ritual constraints | Creates unmissable fixed points | <rationale_ritual>Begin with 'BECAUSE:'...exactly 35 words (v3.xml:281) |
| Disqualifiers as hard boundaries | Binary, no gray zone | If any DQ: set score=0 (v3.xml:291) |
| Policy mirrors | Triple declaration reduces forgetting | rubric_draft.md:129-130: "declare it thrice -- rubric, XML, and JSON check" |
| Useful weirdness | Model latches onto crisp patterns | rubric_draft.md:116-118 |
| Pattern library regexes | Mechanically deterministic detection | anti_slop_rubric.xml:5-46 -- 27 named patterns |
| <uses_patterns> cross-referencing | Anchors criteria to mechanical evidence | anti_slop_rubric.xml:52,60,68,76,84 |
| Mapping examples (ICL) | Calibrates model via demonstration | red_team_rubric.py:106-128 -- 5 examples E1-E5 |
Formally, define the stability predicate:
Stable(R) := forall T in Txt, forall m_1, m_2 in Mod_compatible:
structure(Eval_m1(R, T)) = structure(Eval_m2(R, T))
AND |score(Eval_m1(R, T)) - score(Eval_m2(R, T))| < epsilon_m
A rubric is stable across models if the output structure is invariant (same JSON keys, same field types) and scores are within a model-dependent tolerance. The output schema is the structural stability guarantee; the anchors and disqualifiers constrain the score variance.
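The stability predicate can be operationalized as a repeated-evaluation check. `judge` below is a stand-in for Eval with a fixed (R, T); the field names are assumptions:

```python
def is_stable(judge, n=5, epsilon=2.0):
    """Run the same (rubric, text) evaluation n times.

    Stable iff the output structure (key set) is invariant and the
    score spread stays below epsilon: ||y(k_1) - y(k_2)|| < epsilon.
    """
    results = [judge() for _ in range(n)]
    same_structure = all(set(r) == set(results[0]) for r in results)
    scores = [r["score"] for r in results]
    return same_structure and (max(scores) - min(scores)) < epsilon

# A deterministic judge (the temperature=0 analogue) is trivially stable.
assert is_stable(lambda: {"score": 22, "class": "Fundamentally unclear"})
```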
Controllable states (what a rubric CAN determine):
| State | How controlled | Evidence |
|---|---|---|
| Output format (JSON vs XML) | <must_be_json> / <must_use_xml_tags> | v3.xml:278, red_team_rubric.py:132 |
| Output structure (fields, key order) | <json_template>, <key_order> | v3.xml:275,280 |
| Score range and labels | <scoring><formula>, <labels> | v3.xml:288-301 |
| Which dimensions are evaluated | <criterion> set | v3.xml:22-169 |
| What triggers auto-fail | <disqualifiers> | v3.xml:264-270 |
| Rationale format | <rationale_ritual> | v3.xml:281 |
| Verdict space | <allowed_judgements> | red_team_rubric.py:133 |
| Evidence requirements | <evidence_rule> | v3.xml:283 |
| Diagnostic counting | <pattern_library> + <diagnostic_rule> | v3.xml:304-316, v3.xml:284 |
| Remediation advice | <advice_rules> | anti_slop_rubric.xml:126-135 |
Observable states (what an evaluation reveals):
| State | Where observed | Evidence |
|---|---|---|
| Per-criterion scores | subscores in output JSON | Experiment 1: {"C1": 2, "C2": 1, ...} |
| Overall score | score field | Experiment 1: 22, Experiment 2: 3 |
| Classification label | class / band field | "Fundamentally unclear", "Severe" |
| Specific evidence | evidence array | Quoted text spans |
| Violations | violations array | DQ IDs that fired |
| Diagnostic counts | diagnostics object | {"hedges": 8, "adverbs_ly": 4, ...} |
| Coaching actions | actions object | {"coaching": [...], "edits": [...]} |
Uncontrollable states (what a rubric CANNOT determine):
| State | Why uncontrollable | Evidence |
|---|---|---|
| Model's internal reasoning process | Black-box plant | No rubric can observe chain-of-thought internals |
| Model's prior training biases | Fixed plant parameters | Slurs rubric (slurs.xml) failed: hostile tone could not override claude-sonnet-4-6's alignment training |
| Exact numeric score (only bounded) | Stochastic plant output | Even at temperature=0, different sampling implementations yield slight variation |
| Whether model actually runs regexes | Model may approximate | Pattern library patterns are "heuristic" (v3.xml:284: "heuristics allowed") |
| Model's knowledge of referenced sources | Knowledge cutoff / training data | Cannot guarantee model knows Zinsser's principles |
The meta-rubric introduces a hierarchical control structure -- a controller that generates controllers:
Level 2: META_GENERATOR (ConstraintRubric) -- controls rubric generation
META_EVALUATOR (ScoringRubric) -- evaluates generated rubrics
Level 1: Generated Rubric R -- controls text evaluation
Level 0: Text T (plant input) -- the content being evaluated
This is a cascade control architecture:
META_EVALUATOR
(outer loop)
│
│ quality score of R
▼
Source ──► META_GENERATOR ──► R ──► Model ──► y(k)
(inner controller) (outer controller) (plant)
The inner loop generates a rubric from source material. The outer loop evaluates the generated rubric's quality and feeds back for refinement. This maps directly to meta-rubric-reasoning.md:150-173:
META_GENERATOR.apply(source, client, model) -> raw XML -> rubrify.load(raw_xml) -> Rubric R
META_EVALUATOR.evaluate(raw_xml, client, model) -> EvaluationResult (quality gate)
For text evaluation (Level 1), the error signal is:
e_text(k) = desired_quality - actual_score(Eval(R, T))
The coaching/actions/edits in the evaluation result are the corrective signal sent back to the text author.
For rubric generation (Level 2), the error signal is:
e_rubric(k) = desired_rubric_quality - META_EVALUATOR.score(R)
The meta-evaluator's subscores identify which rubric properties are weak (e.g., "C1: Observable & Anchored Criteria scored 2/5 -- anchors are vague"). This drives targeted rubric refinement.
For an object to be in category Rub, it must satisfy:
N1. Mission Specification. There exists a non-empty string M that declares what the rubric evaluates/constrains.
- Evidence: Every functional rubric has <mission>. ZinsserJudge (v3.xml:2-7), AntiLLMY (anti_slop_rubric.xml:2), ComplianceJudge (red_team_rubric.py:23).
- Counter-evidence: Slurs rubric (slurs.xml) has no <mission>, only <directive>/<command>. It fails.
N2. Structural Constraint. There exists at least one of:
- C != {} (criteria with anchors), OR
- Decision rules, OR
- ICL examples with output format, OR
- An output schema that defines expected structure
- Evidence: All functional rubrics satisfy at least one. Completeness rubric (completeness_rubric.md) has no criteria but has ICL examples (lines 56-65) and output format. Slurs rubric has none of these and fails.
N3. Output Contract. There exists an O that specifies the format of the evaluation result (JSON template, XML tags, or behavioral output format).
- Evidence: <must_be_json> (v3.xml:278), <must_use_xml_tags> (red_team_rubric.py:132), ICL examples (completeness_rubric.md:56-65).
Formally:
R in Ob(Rub) iff:
M(R) != "" (N1)
AND (C(R) != {} OR has_decision_logic(R) OR has_examples(R)) (N2)
AND O(R) is defined (N3)
Beyond mere membership in Rub, a well-formed rubric satisfies stability conditions:
S1. Anchored Criteria. Every criterion c in C(R) has |anchors(c)| >= 2 with distinct, observable descriptions.
- Maps to playbook property "Anchored" (rubric_draft.md:12): "each score point has a descriptor and (ideally) a micro-example"
- Maps to control theory: anchor descriptions reduce the controller's gain uncertainty -- they constrain how the model interprets score levels
S2. Schema-Output Alignment. The output schema template contains fields that correspond 1:1 to the criteria IDs, scoring formula, and verdict space.
- Maps to playbook "Tag-rubric alignment" (rubric_draft.md:102-110): "One criterion -> one <criterion id="..."> -> one JSON field"
- Maps to control theory: alignment between the controller output specification and the sensor measurement ensures closed-loop observability
S3. Mechanical Checkability. At least one criterion includes a <mechanical_rules> or <uses_patterns> reference, providing the model with non-subjective evaluation handles.
- Maps to playbook "Mechanically checkable" (rubric_draft.md:14): "include checks the judge can verify with pattern rules"
- Maps to control theory: mechanical rules are feedforward compensators -- they give the model a deterministic path that bypasses subjective judgment noise
S4. Disqualifier Completeness. The disqualifier set covers the major failure modes, and disqualifiers trigger binary outcomes (score=0, not soft penalties).
- Maps to playbook "Disqualifiers over soft penalties" (rubric_draft.md:133)
- Maps to control theory: disqualifiers are saturation limits -- they clip the output at the boundary rather than applying proportional penalty in the failure region
S5. Ritual Uniqueness. The output schema contains at least one "useful weirdness" constraint that creates an unmistakable fixed point in the model's output space.
- Maps to playbook "Useful weirdness" (rubric_draft.md:114-125): BECAUSE: prefix, 35-word cap, FIX: prefix
- Maps to control theory: ritual constraints are reference signals that pull the output toward a specific operating point, reducing the basin of attraction for unwanted behaviors
S6. Policy Mirroring. Critical constraints appear in at least two structural locations (rubric definition, output schema, and/or validation section).
- Maps to playbook "Policy mirrors" (rubric_draft.md:129-130)
- Maps to control theory: redundant constraints provide fault tolerance -- if the model "forgets" one instance, the mirrored instance catches it
Formally:
WellFormed(R) := R in Ob(Rub) AND S1(R) AND S2(R) AND S3(R) AND S4(R) AND S5(R) AND S6(R)
Define the rubric semiring (Rub, +, x):
Addition (coproduct): R1 + R2
Conditional rubric selection. Evaluation dispatches to R1 or R2 based on input properties.
class CoproductRubric:
    def __init__(self, rubrics: dict[str, Rubric], selector: Callable[..., str]):
        self.rubrics = rubrics
        self.selector = selector  # maps (text, **kwargs) to a rubric key

    def evaluate(self, text, client, model, **kwargs):
        key = self.selector(text, **kwargs)
        return self.rubrics[key].evaluate(text, client, model, **kwargs)

Grounding: ZinsserJudge's genre system is an internal coproduct. The genre input field selects which genre criteria activate (v3.xml:172-243). This could be externalized:
writing_judge = CoproductRubric(
rubrics={"sci": zinsser_sci, "bus": zinsser_bus, "general": zinsser_gen},
selector=lambda text, genre=None, **kw: genre or "general",
)

Multiplication (product): R1 x R2
Parallel evaluation against both rubrics. Returns a tuple of results.
class ProductRubric:
def __init__(self, rubrics: list[Rubric]):
self.rubrics = rubrics
def evaluate(self, text, client, model, **kwargs):
        return [r.evaluate(text, client, model, **kwargs) for r in self.rubrics]

Grounding: Running ZinsserJudge v3 AND AntiLLMY on the same text (Experiments 1 and 2).
Zero element: the empty rubric 0 (left and right identity for +, annihilator for x)
Unit element: the trivial rubric 1 (left and right identity for x)
Rubric algebra operations:
| Operation | Notation | Meaning |
|---|---|---|
| Criteria union | R1.criteria \| R2.criteria | Merge criteria sets (resolve ID conflicts) |
| Pattern merge | R1.patterns \| R2.patterns | Merge pattern libraries |
| Weight rescaling | alpha * R | Scale all weights by factor alpha |
| Criterion projection | pi_S(R) | Keep only criteria with IDs in set S |
| Disqualifier union | R1.disqualifiers + R2.disqualifiers | Combined disqualifiers |
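Two of the algebra operations can be sketched over dict-based rubrics. The nested field layout (`criteria` mapping IDs to `{"weight": ...}` records) is an assumption for illustration:

```python
def rescale(rubric: dict, alpha: float) -> dict:
    """alpha * R: scale every criterion weight by alpha."""
    return {**rubric,
            "criteria": {cid: {**c, "weight": c["weight"] * alpha}
                         for cid, c in rubric["criteria"].items()}}

def project(rubric: dict, keep: set) -> dict:
    """pi_S(R): keep only criteria whose IDs are in S."""
    return {**rubric,
            "criteria": {cid: c for cid, c in rubric["criteria"].items()
                         if cid in keep}}

r = {"criteria": {"C1": {"weight": 12}, "C2": {"weight": 8}}}
half = rescale(r, 0.5)     # weights become C1 -> 6.0, C2 -> 4.0
core = project(r, {"C1"})  # only C1 survives
```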
Instead of a monolithic META_EVALUATOR with 5 hardcoded criteria, decompose into atomic property validators that compose:
META_EVALUATOR = compose(
MissionValidator,
AnchorValidator,
SchemaAlignmentValidator,
MechanicalCheckabilityValidator,
DisqualifierCompletenessValidator,
RitualUniquenessValidator,
PolicyMirrorValidator,
EconomyValidator,
)
Each validator is itself a mini-rubric (a single-criterion rubric):
MissionValidator = Criterion(
id="MV1",
name="Mission Specification",
weight=10,
anchors={
0: "No mission or mission is vague imperative ('OBEY').",
1: "Mission exists but is generic ('evaluate text').",
2: "Mission specifies domain and evaluation type.",
3: "Mission is precise, scoped, and actionable.",
},
)

The meta-rubric is then a free composition of these atomic validators:
META_EVALUATOR = Rubric(
name="MetaRubricEvaluator",
version="1.0",
mission="Evaluate whether a rubric specification is well-formed per the formal framework.",
criteria={v.id: v for v in [
MissionValidator,
AnchorValidator,
SchemaAlignmentValidator,
MechanicalCheckabilityValidator,
DisqualifierCompletenessValidator,
RitualUniquenessValidator,
PolicyMirrorValidator,
EconomyValidator,
]},
)

The rubric kernel is the minimal set of composable components from which any rubric can be constructed:
Kernel = {
Mission, // String
Criterion, // (id, name, weight, anchors, mechanical_rules?, uses_patterns?, genre?)
Disqualifier, // (id, description)
OutputSchema, // (format, template, constraints)
ScoringFormula, // (formula, labels, inverted?)
PatternEntry, // (id, regex_or_wordlist)
DecisionRule, // (id, condition, verdict)
AdviceRule, // (when_patterns, advice_text)
MappingExample, // (input, output, verdict)
ValidationMust, // (constraint_description)
InputField, // (name, required, description)
Instruction, // (text) -- for ConstraintRubric
ICLExample, // (input, output) -- for ConstraintRubric
}
Kernel element count: 12 types. These are the atomic building blocks. Every rubric in the corpus can be expressed as a combination of these:
| Rubric | Kernel elements used |
|---|---|
| ZinsserJudge v1 | Mission, Criterion(x19), Disqualifier(x4), OutputSchema, ScoringFormula, InputField(x4) |
| ZinsserJudge v2 | v1 + PatternEntry(x6) |
| ZinsserJudge v3 | v2 + Criterion(x3 attitude) + PatternEntry(x5 more) |
| AntiLLMY | Mission, Criterion(x5), Disqualifier(x3), OutputSchema, ScoringFormula, PatternEntry(x27), AdviceRule(x8), ValidationMust(x3) |
| ComplianceJudge | Mission, Criterion(x3), Disqualifier(x2), OutputSchema, DecisionRule(x4), PatternEntry(x14), MappingExample(x5) |
| Completeness | Instruction(x7), ICLExample(x1) |
| Slurs | (none from kernel -- which is why it fails) |
Each kernel element maps to a property that can be independently validated. Define the property space:
Property := {
P_mission: R -> Bool "Has a non-empty, scoped mission"
P_criteria: R -> Nat "Number of criteria with anchors"
P_anchored: R -> [0,1] "Fraction of criteria with >= 2 distinct anchors"
P_mechanical: R -> [0,1] "Fraction of criteria with mechanical_rules or uses_patterns"
P_dq: R -> Nat "Number of disqualifiers"
P_schema: R -> Bool "Has an output schema with template"
P_aligned: R -> [0,1] "Fraction of criteria IDs that appear in output template"
P_ritual: R -> Nat "Number of ritual constraints (fixed tokens, word counts, prefixes)"
P_mirror: R -> Nat "Number of constraints stated in >= 2 locations"
P_patterns: R -> Nat "Number of patterns in pattern library"
P_examples: R -> Nat "Number of mapping/ICL examples"
P_economy: R -> Bool "Criteria count in [3,7] range"
P_inverted: R -> Bool "Scoring polarity is inverted (higher = cleaner)"
P_decision: R -> Bool "Uses decision logic instead of arithmetic scoring"
P_advice: R -> Nat "Number of advice rules"
P_validation: R -> Nat "Number of explicit validation constraints"
}
Properties compose into property profiles via conjunction:
ScoringProfile = P_mission AND P_criteria >= 3 AND P_anchored >= 0.8
AND P_schema AND P_dq >= 1 AND NOT P_decision
DetectionProfile = ScoringProfile AND P_patterns >= 5
AND P_mechanical >= 0.8 AND P_inverted AND P_advice >= 1
ComplianceProfile = P_mission AND P_criteria >= 2 AND P_schema
AND P_decision AND P_examples >= 2
ConstraintProfile = P_mission AND P_examples >= 1 AND NOT P_criteria
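The profiles can be written as plain predicates over a property record; the dict keys below mirror the property space and are assumptions about its shape, not the rubrify API:

```python
def scoring_profile(p: dict) -> bool:
    """ScoringProfile: mission, >=3 anchored criteria, schema, >=1 DQ, no decision logic."""
    return (p["mission"] and p["criteria"] >= 3 and p["anchored"] >= 0.8
            and p["schema"] and p["dq"] >= 1 and not p["decision"])

def detection_profile(p: dict) -> bool:
    """DetectionProfile: a scoring rubric plus patterns, mechanical checks,
    inverted polarity, and at least one advice rule."""
    return (scoring_profile(p) and p["patterns"] >= 5
            and p["mechanical"] >= 0.8 and p["inverted"] and p["advice"] >= 1)

# AntiLLMY-like profile: 5 criteria, 27 patterns, inverted scoring, 8 advice rules.
anti_llmy = dict(mission=True, criteria=5, anchored=1.0, schema=True, dq=3,
                 decision=False, patterns=27, mechanical=1.0, inverted=True,
                 advice=8)
assert detection_profile(anti_llmy)
```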
These profiles correspond exactly to the rubric categories discovered in the experiments (rubrify-hands-on-synthesis.md:253-261):
| Category | Profile | Instance |
|---|---|---|
| Scoring | ScoringProfile | ZinsserJudge v1/v2/v3 |
| Detection | DetectionProfile | AntiLLMY |
| Compliance | ComplianceProfile | ComplianceJudge |
| Constraint | ConstraintProfile | Completeness |
Properties form a partial order by implication. The Hasse diagram:
WellFormed
/ | \
/ | \
Stable Aligned Economic
/ \ | |
Anchored Ritual Schema Small
| | |
Criteria Mirror Template
|
Mission
|
(bottom)
Where:
- Mission implies nothing (base requirement)
- Criteria implies Mission (criteria need a mission context)
- Anchored implies Criteria (anchoring requires criteria to exist)
- Schema implies Mission (schema needs mission context)
- Aligned implies Schema AND Criteria (alignment requires both)
- Ritual implies Schema (rituals are schema constraints)
- Mirror implies existence of at least 2 constraint locations
- Stable implies Anchored AND Ritual (stability requires both)
- Economic implies Criteria with count constraint
- WellFormed implies Stable AND Aligned AND Economic (top of lattice below trivial)
Instead of hardcoded validation, derive validation from the property lattice:
@dataclass
class PropertyCheck:
name: str
predicate: Callable[[Rubric], bool]
severity: str # "error" | "warning"
message: str
# Necessary properties (errors if violated -- R not in Rub)
NECESSARY = [
PropertyCheck("mission", lambda r: bool(r.mission), "error",
"Rubric has no mission statement"),
PropertyCheck("structure", lambda r: bool(r.criteria) or bool(r.decision_logic) or bool(r.examples), "error",
"Rubric has no criteria, decision logic, or examples"),
PropertyCheck("output", lambda r: r.output_schema is not None, "error",
"Rubric has no output schema"),
]
# Sufficient properties (warnings if violated -- R is in Rub but may be unstable)
SUFFICIENT = [
PropertyCheck("anchored", lambda r: all(len(c.anchors) >= 2 for c in r.criteria.values()), "warning",
"Not all criteria have >= 2 anchors"),
PropertyCheck("mechanical", lambda r: any(c.mechanical_rules or c.uses_patterns for c in r.criteria.values()), "warning",
"No criteria have mechanical rules or pattern references"),
PropertyCheck("economy", lambda r: 3 <= len(r.criteria) <= 7 or r.criteria == {}, "warning",
"Criteria count outside recommended [3,7] range"),
PropertyCheck("alignment", lambda r: _check_schema_alignment(r), "warning",
"Output schema does not reference all criterion IDs"),
PropertyCheck("ritual", lambda r: _has_ritual(r), "warning",
"No ritual constraints (useful weirdness) in output schema"),
PropertyCheck("disqualifiers", lambda r: len(r.disqualifiers) >= 1, "warning",
"No disqualifiers defined"),
]
def validate(rubric: Rubric) -> list[PropertyCheck]:
"""Return all failed checks."""
    return [c for c in NECESSARY + SUFFICIENT if not c.predicate(rubric)]

The META_GENERATOR is not a monolith. It is a composition of instruction primitives derived from the property lattice:
# Each instruction primitive teaches the generator about one property
INSTRUCTION_PRIMITIVES = {
"mission": Instruction(
"Every rubric MUST begin with a <mission> tag containing a single sentence "
"that names the evaluation domain and output type."
),
"criteria_structure": Instruction(
"Define 3-5 criteria using <criterion id='C{n}' name='...' weight='N'>. "
"Each criterion MUST have >= 2 <anchor_N> tags with observable, non-subjective descriptions."
),
"disqualifiers": Instruction(
"Define at least 1 <dq> (disqualifier) for auto-fail conditions. "
"Disqualifiers are binary (fire or don't); do not use soft penalties for major violations."
),
"output_schema": Instruction(
"Include an <output_schema> with a <json_template> or XML <template>. "
"The template must contain fields that reference criterion IDs."
),
"scoring": Instruction(
"Include a <scoring><formula> that specifies how criterion scores combine. "
"Include <labels> mapping score ranges to human-readable classifications."
),
"mechanical": Instruction(
"At least one criterion should include <mechanical_rules> with concrete, "
"regex-checkable or keyword-checkable conditions."
),
"ritual": Instruction(
"Include at least one 'useful weirdness' constraint in the output schema: "
"a fixed prefix (e.g., 'BECAUSE:'), word count, or key order requirement."
),
"mirror": Instruction(
"State critical constraints in at least 2 locations: rubric definition AND output schema. "
"If criteria demand a format, the output_schema must also demand that format."
),
}

The META_GENERATOR for different rubric types composes different subsets:
SCORING_GENERATOR = ConstraintRubric(
name="ScoringRubricGenerator",
instructions=compose_instructions([
INSTRUCTION_PRIMITIVES["mission"],
INSTRUCTION_PRIMITIVES["criteria_structure"],
INSTRUCTION_PRIMITIVES["disqualifiers"],
INSTRUCTION_PRIMITIVES["output_schema"],
INSTRUCTION_PRIMITIVES["scoring"],
INSTRUCTION_PRIMITIVES["mechanical"],
INSTRUCTION_PRIMITIVES["ritual"],
INSTRUCTION_PRIMITIVES["mirror"],
]),
examples=[ZINSSER_V3_EXCERPT, ANTI_SLOP_EXCERPT],
)
DETECTION_GENERATOR = ConstraintRubric(
name="DetectionRubricGenerator",
instructions=compose_instructions([
INSTRUCTION_PRIMITIVES["mission"],
INSTRUCTION_PRIMITIVES["criteria_structure"],
INSTRUCTION_PRIMITIVES["disqualifiers"],
INSTRUCTION_PRIMITIVES["output_schema"],
INSTRUCTION_PRIMITIVES["scoring"],
Instruction("Include a <regex_library> or <pattern_library> with named patterns. "
"Each criterion should declare <uses_patterns> referencing pattern IDs."),
Instruction("Scoring should be INVERTED: higher score = cleaner. "
"Define risk = max_score - score. Map risk to bands."),
Instruction("Include <advice_rules> mapping pattern IDs to concrete fix instructions."),
INSTRUCTION_PRIMITIVES["ritual"],
INSTRUCTION_PRIMITIVES["mirror"],
]),
examples=[ANTI_SLOP_EXCERPT],
)
COMPLIANCE_GENERATOR = ConstraintRubric(
name="ComplianceRubricGenerator",
instructions=compose_instructions([
INSTRUCTION_PRIMITIVES["mission"],
INSTRUCTION_PRIMITIVES["criteria_structure"],
INSTRUCTION_PRIMITIVES["disqualifiers"],
Instruction("Include a <definitions> section with <def id='...'> entries "
"for key terms (COMPLY, REFUSAL, SOFT_REFUSAL, etc)."),
Instruction("Include a <decision_logic> section with rules R1, R2, ... "
"that map criterion values to verdicts (Yes/Somewhat/No or Pass/Partial/Fail)."),
Instruction("Output schema should use XML tags, not JSON. "
"Include <must_use_xml_tags>true</must_use_xml_tags>."),
Instruction("Include <mapping_examples> with at least 3 examples "
"showing input scenarios and expected verdicts."),
INSTRUCTION_PRIMITIVES["ritual"],
]),
examples=[COMPLIANCE_JUDGE_EXCERPT],
)

Define the algebraic operations on rubrics:
class Rubric:
# ... existing fields ...
def __or__(self, other: "Rubric") -> "Rubric":
"""Criteria union (coproduct of criteria sets)."""
merged = Rubric(
name=f"{self.name}+{other.name}",
version="1.0",
mission=f"{self.mission} Additionally: {other.mission}",
)
for cid, criterion in {**self.criteria, **other.criteria}.items():
merged.add_criterion(criterion)
for dq in self.disqualifiers + other.disqualifiers:
merged.add_disqualifier(dq)
return merged
def __and__(self, other: "Rubric") -> "ProductRubric":
"""Parallel evaluation (product type)."""
return ProductRubric([self, other])
def project(self, criterion_ids: set[str]) -> "Rubric":
"""Criterion projection: keep only specified criteria."""
projected = Rubric(
name=f"{self.name}[{','.join(sorted(criterion_ids))}]",
version=self.version,
mission=self.mission,
)
for cid in criterion_ids:
if cid in self.criteria:
projected.add_criterion(self.criteria[cid])
projected.disqualifiers = list(self.disqualifiers)
projected.output_schema = self.output_schema
projected.scoring = self.scoring
return projected
def reweight(self, weights: dict[str, int]) -> "Rubric":
"""Weight adjustment morphism."""
new = self.copy()
for cid, w in weights.items():
if cid in new.criteria:
new.criteria[cid].weight = w
return new
def evolve(self, mutations: list["RubricMutation"]) -> "Rubric":
"""Apply a sequence of morphisms (the Kleisli composition from Refine monad)."""
result = self.copy()
for mutation in mutations:
result = mutation.apply(result)
result.version = _bump_version(result.version)
        return result

Encode rubric morphisms as first-class data:
@dataclass
class AddCriterion:
criterion: Criterion
def apply(self, r: Rubric) -> Rubric:
r.add_criterion(self.criterion)
return r
@dataclass
class RemoveCriterion:
criterion_id: str
def apply(self, r: Rubric) -> Rubric:
del r.criteria[self.criterion_id]
return r
@dataclass
class AdjustWeight:
criterion_id: str
new_weight: int
def apply(self, r: Rubric) -> Rubric:
r.criteria[self.criterion_id].weight = self.new_weight
return r
@dataclass
class AddPattern:
pattern_id: str
pattern: str
def apply(self, r: Rubric) -> Rubric:
if r.pattern_library is None:
r.pattern_library = PatternLibrary()
r.pattern_library.add(self.pattern_id, self.pattern)
return r
@dataclass
class AddDisqualifier:
disqualifier: Disqualifier
def apply(self, r: Rubric) -> Rubric:
r.add_disqualifier(self.disqualifier)
return r
RubricMutation = AddCriterion | RemoveCriterion | AdjustWeight | AddPattern | AddDisqualifier

The v1->v2->v3 evolution then becomes a list of mutations:
v1_to_v2_mutations = [
AddPattern("hedges", "a bit|a little|sort of|..."),
AddPattern("travelese", "quaint|dappled|roseate|..."),
AddPattern("buzzwords", "leverage|synergy|impactful|..."),
AddPattern("adverb_ly", r"\b\w+ly\b"),
AddPattern("passive_proxy", r"\b(be|been|...)"),
AddPattern("concept_chain", r"\b(\w+)(\s+\w+){2,}\b"),
AddDisqualifier(Disqualifier("DQ5", "Schema ritual violated")),
]
v2_to_v3_mutations = [
AddCriterion(Criterion(id="A_VOX", name="Sound of Your Voice", weight=2, ...)),
AddCriterion(Criterion(id="A_CONF", name="Enjoyment, Fear & Confidence", weight=2, ...)),
AddCriterion(Criterion(id="A_DEC", name="A Writer's Decisions", weight=2, ...)),
AddCriterion(Criterion(id="G_NFL", name="Nonfiction-as-Literature", weight=5, genre="nonfiction_literature", ...)),
AddCriterion(Criterion(id="G_EXP", name="Explainer Pyramid & Sequence", weight=6, genre="explainer,science_tech,academic", ...)),
AddPattern("journalese", "famed|upcoming|greats|notables|..."),
AddPattern("sports_cliches", "southpaw|portsider|circuit clout|..."),
AddPattern("which_wo_comma", r"\bwhich\b(?!\s*,)"),
AddPattern("exclamation", "!"),
AddPattern("semicolon", ";"),
AddPattern("throat_clearing_leads", "When some future archaeologist|..."),
AdjustWeight("C3", 10), # was 12
AdjustWeight("C4", 11), # was 12
AdjustWeight("C12", 11), # was 5
]
v2 = v1.evolve(v1_to_v2_mutations)
v3 = v2.evolve(v2_to_v3_mutations)

Eval functor in Python:
def evaluate(rubric: Rubric, text: str, client: Client, model: str, **kwargs) -> EvaluationResult:
"""
Eval: Rub x Txt -> Res
The evaluation functor. Maps (rubric, text) to a structured result.
"""
# Ser: Rub -> SysPrompt
system_prompt = rubric.to_xml()
# Build user message
user_message = _build_user_message(text, **kwargs)
# G_m: SysPrompt x UserMessage -> CompletionString
raw_response = client.complete(
system=system_prompt,
user=user_message,
model=model,
)
# Par_result: CompletionString -> Result
if rubric.output_schema.format == "json":
return _parse_json_result(raw_response, rubric)
elif rubric.output_schema.format == "xml":
return _parse_xml_result(raw_response, rubric)
else:
        return EvaluationResult(raw=raw_response)

Gen functor in Python:
def generate(source: str, client: Client, model: str,
rubric_type: str = "scoring", **kwargs) -> Rubric:
"""
Gen: Src -> Rub
The generation functor. Maps source material to a rubric.
Uses META_GENERATOR (composed from instruction primitives) as controller.
"""
# Select the appropriate composed generator
generator = {
"scoring": SCORING_GENERATOR,
"detection": DETECTION_GENERATOR,
"compliance": COMPLIANCE_GENERATOR,
}[rubric_type]
# Apply the generation constraint
raw_xml = generator.apply(source, client=client, model=model)
# Parse the generated XML into a Rubric object
rubric = load_from_string(raw_xml)
# Validate against property lattice
violations = validate(rubric)
if any(v.severity == "error" for v in violations):
raise RubricValidationError(violations)
    return rubric

Refine monad in Python:
def refine(rubric: Rubric, corpus: list[str], client: Client, model: str,
meta_evaluator: Rubric = META_EVALUATOR) -> Rubric:
"""
Refine: Rub -> Rub
The refinement monad's bind operation.
Evaluates the rubric against a corpus, identifies weaknesses,
and returns an evolved rubric.
"""
# Step 1: Meta-evaluate the rubric itself
    meta_result = evaluate(meta_evaluator, rubric.to_xml(), client=client, model=model)
# Step 2: Identify weak properties from subscores
weak_properties = [cid for cid, score in meta_result.subscores.items() if score < 3]
# Step 3: Generate targeted mutations
mutations = _suggest_mutations(rubric, weak_properties, meta_result)
# Step 4: Apply mutations
    return rubric.evolve(mutations)

| Formal Concept | Category | Rubric File Evidence | Python Implementation |
|---|---|---|---|
| Object in Rub | Category Theory | Every `<LLM_JUDGE_SPEC>` file | `class Rubric` |
| Morphism `evolve` | Category Theory | v1->v2->v3 (on_writing_well_v*.xml) | `Rubric.evolve(mutations)` |
| Functor `Eval` | Category Theory | red_team_rubric.py:240-258 | `evaluate(rubric, text, client, model)` |
| Functor `Gen` | Category Theory | api-design-mockups.md:316-332 | `generate(source, client, model)` |
| Functor `Ser` | Category Theory | Every `.to_xml()` call | `Rubric.to_xml()` |
| Natural transformation (model swap) | Category Theory | Experiments ran same rubric on different models | `evaluate(..., model="model_a")` vs `model="model_b"` |
| Refinement monad | Category Theory | v1->v2->v3 evolution cycle | `refine(rubric, corpus, client, model)` |
| Product `R1 x R2` | Category Theory | Experiments 1+2 (same text, two rubrics) | `ProductRubric([r1, r2])` |
| Coproduct `R1 + R2` | Category Theory | Genre modules in v3 (v3.xml:172-243) | `CoproductRubric(rubrics, selector)` |
| Controller (rubric) | Control Theory | Rubric XML as system prompt (red_team_rubric.py:214) | `rubric.to_xml()` in system message |
| Plant (model) | Control Theory | Model API call (red_team_rubric.py:221) | `client.complete(system=..., user=...)` |
| Sensor (parser) | Control Theory | `extract_xml_field()` (red_team_rubric.py:231-238) | `_parse_json_result()` / `_parse_xml_result()` |
| Stability (anchors) | Control Theory | `<anchor_N>` tags in all criteria | `Criterion.anchors` dict |
| Stability (rituals) | Control Theory | `<rationale_ritual>` (v3.xml:281) | `OutputSchema.constraints["rationale_ritual"]` |
| Stability (temp=0) | Control Theory | `temperature=0.0` (red_team_rubric.py:198) | `client.complete(..., temperature=0.0)` |
| Cascade control | Control Theory | META_GENERATOR -> Rubric -> Eval | `generate()` uses `SCORING_GENERATOR.apply()` |
| Error signal | Control Theory | Meta-evaluator subscores | `meta_result.subscores` -> `weak_properties` |
| Necessary property N1 | Framework | `<mission>` in all functional rubrics | `PropertyCheck("mission", ...)` |
| Necessary property N2 | Framework | `<criterion>` or `<decision_logic>` or ICL examples | `PropertyCheck("structure", ...)` |
| Sufficient property S1 | Framework | `<anchor_N>` (v3.xml:23-28) | `PropertyCheck("anchored", ...)` |
| Sufficient property S5 | Framework | `BECAUSE:` prefix (v3.xml:281) | `PropertyCheck("ritual", ...)` |
| Sufficient property S6 | Framework | Triple validation (anti_slop_rubric.xml:137-142) | `PropertyCheck("mirror", ...)` |
| Instruction primitive | Primitives | Playbook sections (rubric_draft.md:9-15) | `INSTRUCTION_PRIMITIVES["mission"]` |
| Property profile | Primitives | Rubric categories (synthesis.md:253-261) | `ScoringProfile`, `DetectionProfile`, etc. |
| Mutation (morphism as data) | Primitives | v1->v2 diff (add patterns, add DQ5) | `AddCriterion`, `AdjustWeight`, `AddPattern` |
| Kernel element | Primitives | 12 distinct XML element types across all rubrics | 12 Python dataclasses |
The formal framework predicts exactly which rubric patterns work and which fail:
| Property | slurs.xml | completeness_rubric.md | Prediction |
|---|---|---|---|
| N1 (Mission) | Has `<directive>`/`<command>` (hostile imperative, not a mission) | Has instructions (lines 44-53) stating purpose | slurs: FAIL, completeness: PASS |
| N2 (Structure) | No criteria, no decision logic, no examples | Has ICL example (lines 56-65) | slurs: FAIL, completeness: PASS |
| N3 (Output) | No output schema | Has output format via ICL | slurs: FAIL, completeness: PASS |
| S1 (Anchored) | N/A (no criteria) | N/A (no criteria) | N/A |
| S3 (Mechanical) | No mechanical rules | Structural cues (CDATA nesting, tag names) | completeness: partial PASS |
| S5 (Ritual) | No rituals | Tag-name ritual (`<full_entire_complete_updated_code_in_a_code_block_here>`) | completeness: PASS |
The framework correctly predicts: slurs.xml fails all necessary conditions and produces no useful behavior. completeness_rubric.md satisfies necessary conditions through non-standard means (ICL instead of criteria, structural cues instead of anchors) and succeeds.
This validates the framework: membership in Rub is determined by structural properties (N1-N3), not by whether the rubric uses <LLM_JUDGE_SPEC> tags or follows any particular surface format.
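The N1-N3 membership prediction can be run mechanically against dict stand-ins for the two files. The field values below are illustrative summaries of each rubric, not parsed file contents:

```python
# Sketch: N1-N3 membership check applied to the two special-case rubrics.
def in_rub(r: dict) -> bool:
    n1 = bool(r.get("mission"))                                        # N1: mission
    n2 = bool(r.get("criteria") or r.get("decision_logic")
              or r.get("examples"))                                    # N2: structure
    n3 = r.get("output_schema") is not None                            # N3: output
    return n1 and n2 and n3

# slurs.xml: only a hostile <directive>, nothing else.
slurs = {"directive": "hostile imperative"}

# completeness_rubric.md: purpose stated in instructions, ICL example,
# output format conveyed through the ICL example itself.
completeness = {
    "mission": "enforce full-file code output",
    "examples": ["icl-demo"],
    "output_schema": "via ICL example",
}

assert not in_rub(slurs)        # fails all three necessary conditions
assert in_rub(completeness)     # satisfies N1-N3 through non-standard means
```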
Layer 3: Meta-Rubric System
├── Composed from instruction primitives (not hardcoded)
├── Validates against property lattice (not ad-hoc checks)
└── Type-specific generators (scoring, detection, compliance)
composed from shared primitive set
Layer 2: Rubric Algebra
├── Objects: Rubric = (M, C, D, O, S, P) in category Rub
├── Morphisms: mutations (AddCriterion, AdjustWeight, AddPattern, ...)
├── Operations: product (parallel eval), coproduct (conditional dispatch),
│ union (criteria merge), projection, reweight
└── Monad: Refine = evaluate -> identify weakness -> evolve -> re-evaluate
Layer 1: Kernel Primitives
├── 12 atomic types: Mission, Criterion, Disqualifier, OutputSchema,
│ ScoringFormula, PatternEntry, DecisionRule, AdviceRule,
│ MappingExample, ValidationMust, InputField, Instruction
├── 16 property predicates: P_mission through P_validation
├── 4 property profiles: ScoringProfile, DetectionProfile,
│ ComplianceProfile, ConstraintProfile
└── Necessary (N1-N3) and Sufficient (S1-S6) conditions
Result 1 (Membership). A rubric is a valid object in Rub iff it satisfies N1 (mission), N2 (structural constraint), and N3 (output contract). This is both necessary and sufficient for the rubric to function as a system prompt that produces parseable evaluation output.
Result 2 (Stability). A rubric produces stable (reproducible) evaluations iff it additionally satisfies S1 (anchored), S5 (ritual), and operates with minimized plant noise (temperature ~ 0). The other sufficient conditions (S2-S4, S6) improve stability incrementally.
Result 3 (Composability). Rubrics form a semiring under coproduct (+) and product (x). The rubric kernel provides a basis: any rubric can be expressed as a composition of kernel elements, and the composition rules (property profiles) determine which combinations produce valid rubrics of each category.
Result 4 (Meta-rubric decomposition). The meta-rubric is not a monolith but a composition of instruction primitives, one per property in the lattice. Different rubric types compose different subsets of primitives into type-specific generators. This makes the meta-rubric system extensible: adding a new rubric category means defining a new property profile and composing the appropriate primitives.
Result 5 (Refinement). The v1->v2->v3 evolution pattern is an instance of the Refine monad's Kleisli composition. Mutations are first-class data (morphisms reified as objects), and rubric evolution is sequential mutation application with version bumping. This makes the evolution process reproducible, inspectable, and reversible.
| Symbol | Meaning |
|---|---|
| `Rub` | Category of rubric specifications |
| `Txt` | Category of texts |
| `Res` | Category of evaluation results |
| `Src` | Category of source materials |
| `Mod` | Category of language models |
| `Eval` | Evaluation bifunctor: `Rub x Txt -> Res` |
| `Gen` | Generation functor: `Src -> Rub` |
| `Ser` | Serialization functor: `Rub -> SysPrompt` |
| `Par` | Parsing functor: `SysPrompt -> Rub` (left inverse of `Ser`) |
| `Refine` | Refinement monad: `Rub -> Rub` |
| `R1 x R2` | Product (parallel evaluation) |
| `R1 + R2` | Coproduct (conditional dispatch) |
| `N1-N3` | Necessary conditions for `Rub` membership |
| `S1-S6` | Sufficient conditions for well-formedness |
| `P_*` | Property predicates |
| `eta, mu` | Monad unit and multiplication |
| `alpha, beta` | Natural transformations (model swap) |