rrma_harness_improvements_v5_proposals.md

Meta-Harness (Stanford IRIS Lab, Lee et al. 2026) is a framework for end-to-end optimization of model harnesses: it evolves the code around a fixed base model (memory systems, scaffolds, retrieval) via LLM-proposer loops. It ships two reference experiments: text classification memory search and Terminal-Bench 2 scaffold evolution.

Patterns worth liberating for RRMA

  1. Frontier/Pareto-tracked evolution loop (frontier_val.json + evolution_summary.jsonl) — every candidate logged with hypothesis, axis, delta vs best. RRMA's blackboard already does this, but the explicit _pareto structure with best_system per-dataset is cleaner than RRMA's current ad-hoc leaderboards.
  2. Propose → validate → benchmark → update as discrete phases with per-phase timing (propose_time, bench_time, wall_time). Gives TrustLoop better signal for classifying BREAKTHROUGH/PLATEAU than RRMA's coarser iteration timing.
  3. Import-check validation gate (meta_harness.py:179): candidates must pass python -c "from ... import *" before consuming any benchmark budget. This is a cheap way to catch broken output from RRMA's code-generating agents; currently RRMA only discovers broken candidates inside the expensive eval.
  4. Skill-scoped proposer (.claude/skills/meta-harness) + allowed_tools whitelist + subscription-auth fallback (strip ANTHROPIC_API_KEY). RRMA could narrow tool surfaces per-agent-role the same way.
  5. Candidate schema with axis + hypothesis + components fields — forces the proposer to declare what it's varying. Maps directly onto RRMA's DESIRES/MISTAKES/LEARNINGS structure and would make the gardener's job easier.
  6. Held-out test never exposed during evolution — only run in Phase Final on frontier systems. RRMA's chaos experiments sometimes leak test signal; worth auditing.
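
A minimal sketch of how items 1, 2, 3 and 5 could fit together on the RRMA side. It is not Meta-Harness's actual code: the `Candidate` class, `import_check`, `run_iteration`, the `module` field and the shape of the `frontier` dict are all hypothetical names chosen for illustration. Only the `axis`, `hypothesis`, `components`, `best_system`, `bench_time` and `wall_time` names come from the notes above. The propose phase (the LLM call) is left out, and the frontier here is a simple per-dataset best rather than a true multi-objective Pareto front.

```python
import json
import subprocess
import sys
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Candidate:
    """Hypothetical candidate record; axis/hypothesis/components follow item 5."""
    name: str
    module: str        # importable module path of the candidate harness (assumed)
    axis: str          # what the proposer says it is varying
    hypothesis: str    # why it expects an improvement
    components: list = field(default_factory=list)


def import_check(module: str, timeout: float = 30.0) -> bool:
    """Return True if `from <module> import *` succeeds in a fresh interpreter."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", f"from {module} import *"],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


def run_iteration(candidate, benchmark, frontier, log_path=None):
    """Validate -> benchmark -> update for one already-proposed candidate.

    `benchmark` is any callable returning {dataset: score}, higher is better.
    `frontier` maps dataset -> {"best_system": ..., "score": ...} and is
    updated in place.
    """
    start = time.perf_counter()
    record = asdict(candidate)
    record["valid"] = import_check(candidate.module)

    if record["valid"]:  # broken candidates never reach the expensive eval
        t0 = time.perf_counter()
        scores = benchmark(candidate)
        record["bench_time"] = time.perf_counter() - t0

        record["delta"] = {}
        for dataset, score in scores.items():
            best = frontier.get(dataset)
            # delta vs best is None when there is no previous best to compare to
            record["delta"][dataset] = None if best is None else score - best["score"]
            if best is None or score > best["score"]:
                frontier[dataset] = {"best_system": candidate.name, "score": score}

    record["wall_time"] = time.perf_counter() - start

    if log_path is not None:  # append-only JSONL log, one line per candidate
        with open(log_path, "a") as fh:
            fh.write(json.dumps(record) + "\n")
    return record
```

Running the import check in a subprocess keeps a candidate's import-time side effects and crashes out of the orchestrator process, at the cost of one interpreter start per candidate. Candidates that fail the gate still get a log record (with `valid` set to false and no `bench_time`), so the gardener can see what was rejected and why it was proposed.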

terminal_bench_2/ (scaffold evolution) ???
