Meta-Harness (Stanford IRIS Lab, Lee et al. 2026) — a framework for end-to-end optimization of model harnesses: it evolves the code around a fixed base model (memory systems, scaffolds, retrieval) via LLM-proposer loops. Two reference experiments: memory-system search for text classification, and Terminal-Bench 2 scaffold evolution.
Patterns worth liberating for RRMA
- Frontier/Pareto-tracked evolution loop (frontier_val.json + evolution_summary.jsonl) — every candidate is logged with its hypothesis, axis, and delta vs the current best. RRMA's blackboard already does this, but the explicit _pareto structure with a per-dataset best_system is cleaner than RRMA's current ad-hoc leaderboards. (Sketch below.)
- Propose → validate → benchmark → update as discrete phases with per-phase timing (propose_time, bench_time, wall_time). Gives TrustLoop a better signal for classifying BREAKTHROUGH/PLATEAU than RRMA's coarser whole-iteration timing. (Sketch below.)
- Import-check validation gate (meta_harness.py:179) — candidates must pass `python -c "from ... import *"` before eating benchmark budget. A cheap gate for RRMA's code-generating agents, which currently discover broken candidates only inside the expensive eval. (Sketch below.)
- Skill-scoped proposer (.claude/skills/meta-harness) + allowed_tools whitelist + subscription-auth fallback (strip ANTHROPIC_API_KEY). RRMA could narrow tool surfaces per agent role the same way. (Sketch below.)
- Candidate schema with axis + hypothesis + components fields — forces the proposer to declare what it is varying. Maps directly onto RRMA's DESIRES/MISTAKES/LEARNINGS structure and would make the gardener's job easier. (Sketch below.)
- Held-out test never exposed during evolution — run only in Phase Final, on frontier systems. RRMA's chaos experiments sometimes leak test signal; worth auditing for the same discipline. (Sketch below.)
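
Frontier tracking (first bullet): a minimal sketch, assuming dict-shaped candidates and a score-per-dataset benchmark. The filenames and the _pareto/best_system keys come from the notes; the record fields and update rule are guesses, not Meta-Harness's actual schema.

```python
import json
from pathlib import Path

FRONTIER = Path("frontier_val.json")
SUMMARY = Path("evolution_summary.jsonl")

def record_candidate(candidate: dict, scores: dict) -> None:
    """Append a full log entry and update the per-dataset frontier."""
    frontier = json.loads(FRONTIER.read_text()) if FRONTIER.exists() else {"_pareto": {}}

    entry = {
        "id": candidate["id"],
        "axis": candidate["axis"],              # dimension this variant explores
        "hypothesis": candidate["hypothesis"],  # proposer's stated expectation
        "scores": scores,
        "delta_vs_best": {},
    }
    for dataset, score in scores.items():
        best = frontier["_pareto"].get(dataset)
        entry["delta_vs_best"][dataset] = score - best["score"] if best else None
        if best is None or score > best["score"]:
            frontier["_pareto"][dataset] = {"best_system": candidate["id"], "score": score}

    with SUMMARY.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    FRONTIER.write_text(json.dumps(frontier, indent=2))
```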
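Phase timing (second bullet): the propose_time/bench_time/wall_time names are from the notes; the phase callables and validate_time are assumptions.

```python
import time

def run_iteration(propose, validate, benchmark, record):
    """One evolution step, timing each discrete phase separately."""
    timings = {}
    t0 = time.monotonic()

    t = time.monotonic()
    candidate = propose()
    timings["propose_time"] = time.monotonic() - t

    t = time.monotonic()
    ok = validate(candidate)
    timings["validate_time"] = time.monotonic() - t  # name assumed, not from the notes

    if ok:
        t = time.monotonic()
        scores = benchmark(candidate)
        timings["bench_time"] = time.monotonic() - t
        record(candidate, scores)  # update phase: frontier + logs

    timings["wall_time"] = time.monotonic() - t0
    return timings
```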
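Import gate (third bullet): a sketch of the same idea for RRMA. Only the `python -c "from ... import *"` mechanism is from meta_harness.py; the function name and timeout are mine.

```python
import subprocess
import sys

def import_check(module: str, timeout: float = 30.0) -> bool:
    """Reject candidates whose code doesn't even import, before any benchmark spend."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", f"from {module} import *"],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    if result.returncode != 0:
        print(f"import gate failed for {module}:\n{result.stderr}", file=sys.stderr)
    return result.returncode == 0
```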
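Tool scoping (fourth bullet): the env-stripping fallback is from the notes; the ALLOWED_TOOLS map, role names, and the CLI flag are illustrative assumptions, not Meta-Harness's actual config.

```python
import os
import subprocess

# Hypothetical per-role whitelists: the pattern, not Meta-Harness's actual config.
ALLOWED_TOOLS = {
    "proposer": ["Read", "Write", "Edit"],
    "reviewer": ["Read"],
}

def launch_agent(role: str, prompt: str) -> subprocess.CompletedProcess:
    """Run a role-scoped agent with a narrowed tool surface."""
    # Stripping ANTHROPIC_API_KEY forces the CLI to fall back to subscription auth.
    env = {k: v for k, v in os.environ.items() if k != "ANTHROPIC_API_KEY"}
    # Flag name is illustrative; substitute whatever your agent CLI actually accepts.
    cmd = ["claude", "-p", prompt, "--allowed-tools", ",".join(ALLOWED_TOOLS[role])]
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```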
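Candidate schema (fifth bullet): field names from the notes; types and the validate() check are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    """What a proposer must declare before a variant enters the loop."""
    id: str
    axis: str          # the single dimension being varied, e.g. "retrieval_depth"
    hypothesis: str    # expected effect and why, in one sentence
    components: list = field(default_factory=list)  # modules this variant touches

    def validate(self) -> None:
        if not (self.axis and self.hypothesis):
            raise ValueError("candidate must declare an axis and a hypothesis")
```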
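Test hygiene (sixth bullet): a toy version of the discipline, assuming val_bench/test_bench callables; not Meta-Harness's actual phase code.

```python
def evolve(candidates, val_bench, test_bench):
    """Selection pressure comes from val only; test touches frontier systems once."""
    frontier, best = [], float("-inf")
    for cand in candidates:
        score = val_bench(cand)  # evolution never sees the test split
        if score > best:
            best = score
            frontier.append(cand)

    # Phase Final: the only place the held-out test is ever run.
    return [(cand, test_bench(cand)) for cand in frontier]
```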
terminal_bench_2/ (scaffold evolution) ???