rrma_harness_improvements_v5_proposals.md

Meta-Harness (Stanford IRIS Lab, Lee et al. 2026) is a framework for end-to-end optimization of model harnesses: it evolves the code around a fixed base model (memory systems, scaffolds, retrieval) via LLM-proposer loops. Two reference experiments: text classification memory search and Terminal-Bench 2 scaffold evolution.

Patterns worth liberating for RRMA

Frontier/Pareto-tracked evolution loop (frontier_val.json + evolution_summary.jsonl) — every candidate logged with hypothesis, axis, delta vs best. RRMA's blackboard already does this, but the explicit _pareto structure with best_system per-dataset is cleaner than RRMA's current ad-hoc leaderboards.
Propose → validate → benchmark → update as discrete phases with per-phase timing (propose_time, bench_time, wall_time). Gives TrustLoop better signal for classifying BREAKTHROUGH/PLATEAU than RRMA's coarser iteration timing.
Import-check validation gate (meta_harness.py:179): candidates must pass python -c "from ... import *" before eating benchmark budget. This is a cheap way to catch failures from RRMA's code-generating agents; currently RRMA only discovers broken candidates inside the expensive eval.
Skill-scoped proposer (.claude/skills/meta-harness) + allowed_tools whitelist + subscription-auth fallback (strip ANTHROPIC_API_KEY). RRMA could narrow tool surfaces per-agent-role the same way.
Candidate schema with axis + hypothesis + components fields — forces the proposer to declare what it's varying. Maps directly onto RRMA's DESIRES/MISTAKES/LEARNINGS structure and would make the gardener's job easier.
Held-out test set is never exposed during evolution; it is only run in Phase Final on frontier systems. RRMA's chaos experiments sometimes leak test signal; worth auditing.
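
A minimal sketch of how several of the patterns above could fit together on the RRMA side: a candidate record with axis/hypothesis/components, an import-check gate that runs before any benchmark budget is spent, per-phase timing, and a per-dataset best_system update. Everything here is an assumption for illustration, not Meta-Harness's actual code: the Candidate class, passes_import_check, run_iteration, the record keys other than bench_time/wall_time, and the shape of the frontier dict are all hypothetical. The propose phase (the LLM call, and so propose_time) is left out, and the frontier tracks only the best score per dataset rather than a full Pareto set.

```python
import subprocess
import sys
import time
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical candidate record; field names follow the notes above."""
    name: str
    module: str        # importable module path holding the candidate's code
    axis: str          # what the proposer says it is varying
    hypothesis: str    # why the proposer expects an improvement
    components: list = field(default_factory=list)


def passes_import_check(module: str, timeout_s: float = 30.0) -> bool:
    """Validate phase: import the module in a fresh interpreter.

    Assumes `module` is a trusted dotted name; it is interpolated into code.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", f"from {module} import *"],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def run_iteration(candidate, benchmark, frontier):
    """One validate -> benchmark -> update pass with per-phase timing.

    `benchmark` is any callable returning {dataset: score}. `frontier` maps
    dataset -> {"best_system": str, "score": float} and is updated in place.
    """
    start = time.perf_counter()
    record = {
        "name": candidate.name,
        "axis": candidate.axis,
        "hypothesis": candidate.hypothesis,
        "status": "ok",
        "delta": {},
    }

    if not passes_import_check(candidate.module):
        record["status"] = "import_failed"  # no benchmark budget spent
        record["wall_time"] = time.perf_counter() - start
        return record

    bench_start = time.perf_counter()
    scores = benchmark(candidate)
    record["bench_time"] = time.perf_counter() - bench_start

    for dataset, score in scores.items():
        best = frontier.get(dataset)
        # Delta vs. the current best; None when the dataset has no entry yet.
        record["delta"][dataset] = score - best["score"] if best else None
        if best is None or score > best["score"]:
            frontier[dataset] = {"best_system": candidate.name, "score": score}

    record["wall_time"] = time.perf_counter() - start
    return record
```

Each returned record could be appended as one JSON line to a summary log, which would give TrustLoop the per-phase timings and deltas in one place. Running the import in a subprocess keeps a broken candidate from crashing or polluting the harness process.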

terminal_bench_2/ (scaffold evolution) ???

bigsnarfdude/rrma_harness_improvements_v5_proposals.md
