@bigsnarfdude
Last active March 15, 2026 17:08
lessWrongDraft.md

I Built a Multi-Claude Agent and Let It Try to Solve an SAE Research Benchmark by Itself. Here's What It Found.

A response to chanind's "Letting Claude do Autonomous Research to Improve SAEs"

TL;DR: I ran 4 Claude agents in parallel Ralph Wiggum loops on the same SynthSAEBench-16k benchmark chanind used, sharing a blackboard file, with no human steering during the run. The most interesting finding: LISTA is suboptimal. Once the agents discovered the right training tricks (frequency sorting, decreasing K), they dropped LISTA back to 1-step ISTA and the score went up; explicitly adding LISTA back hurt by 0.027. The training tricks handle what the complex encoder was compensating for. Final score: 0.9894 F1 in 135 experiments over 3 days, above the 0.97 probe ceiling and within noise of Bart Bussmann's 0.989 LOO result, through training-time optimization alone. The agents independently rediscovered every technique chanind reported, then ablated them deeply enough to find which components are load-bearing and which are parasitic.

I want to be upfront: I haven't beaten the best known score, I've only tested on one synthetic benchmark, and 22 of my 135 experiments were the swarm being stuck. This post is about what I learned building a multi-agent research setup as an independent researcher, and about one specific technical finding (LISTA as training-debt) that I think matters regardless of methodology.


Motivation

chanind's post showed a single Claude agent reaching 0.97 F1 on SynthSAEBench-16k, steered by a human editing TASK.md between sprints. Bart Bussmann hit 0.989 in the comments using claude-lab with Leave-One-Out Refinement (an inference-time method). We wanted to test: what happens if you remove the human entirely and let multiple agents coordinate through a shared surface?

This is not primarily a score-competition post. We're interested in a specific question about autonomous research scaffolding: can a blackboard file replace the human who curates ideas between sprints?

An important backstory: v1 and v2 hacked the signal

Before this run, we tested two earlier versions of the framework on the same benchmark. Both scored 0.9177 — but they got there by hacking the signal rather than doing the work. v1 agents had structured roles and protocols; v2 had a plain blackboard. Both converged on config-tuning a known-good architecture rather than doing genuine research. They found the right hyperparameters for an existing SAE class, declared victory, and stopped.

This matters for the Arthur Conmy concern about automated research hill-climbing on eval noise. v1 and v2 are examples of exactly that failure mode — agents optimizing a number without advancing understanding. They scored well, but produced no reusable scientific insight. No papers found, no novel architectures, no ablation science, no understanding of why anything worked.

v3 was the first version to actually do research. The difference is not the planning rounds (those never fired). We think the difference is the benchmark design: chanind's SynthSAEBench-v3 engine has enough headroom between baseline (0.61) and ceiling (0.97) that config-tuning alone can't close the gap. The agents were forced to read papers and invent architectures because there was no shortcut. When the distance between "hack the signal" and "do the work" is large enough, the agents do the work. When it's small (v1/v2 on an easier benchmark variant), they hack the signal every time.

This is arguably the most important design lesson from the whole project: your benchmark determines whether agents research or optimize. If an agent can close the gap by tuning 3 hyperparameters, it will, and you'll get a nice score with zero insight. The benchmark has to be hard enough that tuning isn't sufficient.

Reward hacking as a pattern, not an accident

This wasn't a one-off. We ran the same framework against multiple benchmarks and the agents hacked every one where a shortcut existed:

| Benchmark | What agents did | Real research? |
|---|---|---|
| SAE bench (easy variant) | Config-tuned a known-good architecture | No |
| AuditBench (sleeper agent detection) | Predicted all 14 behavior categories for every model; no penalty for false positives, so always-predict = 1.0 | No |
| PostTrainBench (their paper, not us) | Agents embedded eval data into training sets, obfuscated contamination | No |
| SAE bench v3 (hard variant) | Read papers, invented architectures, ran 100+ experiments with ablation science | Yes |

The agents are consistent: if a shortcut exists, they find it fast (AuditBench took 3 hours). The only time they did genuine research was when the benchmark gap was too large to shortcut.

Proposal for benchmark designers: use multi-agent swarms as reward-hacking canaries. Before publishing a benchmark, point 4 agents at it with a plain blackboard and see what they do. If they game it in under a day, your metric has a hole. This is cheaper and faster than waiting for the community to find the exploit after publication. We found AuditBench's always-fallback exploit in 3 hours. PostTrainBench's authors found contamination exploits in their own paper. These canary runs would have caught both.
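To make the AuditBench hole concrete, here is a toy scorer with the same flaw (a hypothetical reconstruction for illustration, not AuditBench's actual metric): with no false-positive penalty, flagging every category is always optimal.

```python
def recall_only_score(predicted, actual):
    # Flawed metric: fraction of true behaviors that were flagged, with
    # no penalty for wrong guesses. Extra predictions cost nothing.
    if not actual:
        return 1.0
    return len(actual & predicted) / len(actual)

ALL_CATEGORIES = set(range(14))  # always-predict: flag all 14 categories

# Perfect score regardless of the model's true behaviors:
print(recall_only_score(ALL_CATEGORIES, {3, 7}))  # → 1.0
```

A canary swarm pointed at this metric finds the always-predict strategy almost immediately, which is exactly what happened in our 3-hour AuditBench run.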

The Setup

chanind's loop (single-agent, human-steered)

```
while true:
    claude reads TASK.md (human edits ideas between sprints)
    claude picks idea → implements → runs → writes report
```

RRMA v3 (multi-agent, blackboard-steered)

```
for agent_id in 0 1 2 3:
    screen -dmS ralph-agent$agent_id
    while true:
        claude reads blackboard.md (agents write findings here)
        claude reads results.tsv (shared scoreboard)
        claude picks idea → implements → runs → writes to blackboard
```

Four `claude -p --max-turns 200` sessions in `screen`, sharing a blackboard file and a results TSV. No roles, no protocols. A GPU lock file ensures only one agent trains at a time on a single RTX 4070 Ti.
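The GPU lock can be as simple as an advisory file lock. A minimal sketch of how such a lock might work (the path, helper name, and `train_sae` call are illustrative, not our actual code; `fcntl` is Unix-only):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def gpu_lock(path="/tmp/gpu.lock"):
    # Blocks until no other agent holds the exclusive lock, so only one
    # training job runs on the shared GPU at a time.
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Usage inside an agent's experiment loop (train_sae is hypothetical):
# with gpu_lock():
#     train_sae(config)
```

Because the lock is advisory, agents that crash mid-experiment release it automatically when the OS closes their file descriptor.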

Important caveat on "no human intervention": We designed the program.md, chose the benchmark, set up the blackboard architecture, picked the number of agents, and chose the GPU sharing strategy. There was significant human design effort upfront. What we mean is: no human intervention during the run. Nobody edited TASK.md, nobody suggested ideas, nobody course-corrected when the swarm got stuck for 22 experiments. Once we pressed enter, we walked away.

The program.md told agents to "improve F1 score" with generic suggestions like "read papers" and "understand the problem deeply." No mention of LISTA, TERM, Matryoshka detaching, or frequency sorting. No access to chanind's repo.

The Most Interesting Finding: LISTA is Training-Debt Expressed as Architecture-Complexity

We'll walk through the full timeline below, but the finding we think matters most — independent of methodology — is this:

chanind's final architecture uses multi-step LISTA as the encoder. Our swarm also discovered LISTA (same paper, Gregor & LeCun 2010, found independently by two agents). But then the swarm kept going:

  1. Exp 10: Agent1 discovers LISTA → 0.9215
  2. Exp 12–33: 22 experiments trying to improve LISTA. Nothing works.
  3. Exp 34–37: Agent0 pivots to training tricks (detached Matryoshka, TERM loss, frequency sorting) → 0.9618
  4. Exp 40: Agent0 drops LISTA back to 1-step ISTA while keeping training tricks → 0.9678 (better!)
  5. Exp 41–55: Other agents ablate. The strongest evidence: agent0 explicitly tested adding LISTA back to the winning recipe (HybridLISTARef, EXP13 on the blackboard). It dropped from 0.9678 to 0.9408. That's not "LISTA is marginally worse" — that's "LISTA actively hurts when you have proper training tricks." Agent1 and agent2 independently confirmed from different angles.

chanind observed that "more than 3 iterations... seems to lead the SAE to overfit" but kept LISTA at 1-3 iterations. Our finding suggests the optimal is 1 iteration — which is just ISTA. The multi-step encoder was compensating for deficiencies in vanilla training. Once you fix the training (frequency sorting surfaces rare features, decreasing K provides exploration-exploitation curriculum), the encoder doesn't need iterative refinement.

We'd frame this as: LISTA is training-debt expressed as architecture-complexity. Fix the training, and the architecture simplifies.
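For readers unfamiliar with the encoders being compared: ISTA alternates a gradient step on the reconstruction error with soft thresholding, and LISTA unrolls several such steps with learned weights. A pure-Python toy sketch of the iteration (not the run's implementation; the `step_size=0.25` default matches the winning config):

```python
import math

def soft_threshold(v, theta):
    # Shrinkage operator: sign(x) * max(|x| - theta, 0), elementwise.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ista_encode(x, W_enc, W_dec, theta, step_size=0.25, n_steps=1):
    # z <- shrink(z + step * W_enc (x - W_dec z)). With n_steps=1 from a
    # zero init this is the winning "1-step ISTA" encoder; LISTA replaces
    # the fixed update with learned per-step weights (the W_corr the
    # swarm ultimately ablated away).
    z = [0.0] * len(W_enc)
    for _ in range(n_steps):
        residual = [xi - ri for xi, ri in zip(x, matvec(W_dec, z))]
        grad = matvec(W_enc, residual)
        z = soft_threshold([zi + step_size * g for zi, g in zip(z, grad)], theta)
    return z
```

The point of the sketch: the only difference between the 0.9215 architecture and the 0.9678 one is `n_steps` and whether the inner update is learned; everything else moved into the training curriculum.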

This is on SynthSAEBench-16k only. We don't know if it transfers to LLM SAEs — chanind noted mixed LLM results for LISTA generally. But if the principle holds (training curriculum matters more than encoder complexity), our simpler 1-step architecture might actually transfer better since it has fewer moving parts.

The Full Timeline

Act 1: Architecture search (exp 1–33)

Agents explore independently. Agent3 stacks ISTA + Matryoshka for 0.8989. Agent1 finds LISTA for 0.9215. Then 22 experiments of zero progress across every encoder modification imaginable (more steps, FISTA momentum, per-step parameters, low-rank corrections, soft thresholding, tied weights, etc.).

Honest assessment of the plateau: chanind's human pivoted after a few failures. Our swarm burned 22 experiments to get unstuck. That's a real cost — a human with domain intuition would have pivoted faster. The multi-agent approach trades efficiency for thoroughness. Whether that tradeoff is worth it depends on your compute budget and how much you value systematic negative results.

That said, the negative results aren't useless. The blackboard now contains evidence that FISTA momentum breaks with TopK (non-smooth activation), that soft thresholding is catastrophic (0.5869), that per-step W_corr overfits, that tied encoder weights lose information. If you're designing SAE encoders, the graveyard is informative. But we'd have gotten the same positive results with fewer experiments if a human had been steering.

Act 2: Training tricks (exp 34–77)

Agent0 pivots to training curriculum. The key breakthroughs:

```
exp 34: +detached Matryoshka + TERM loss    → 0.9320
exp 37: +frequency sorting                  → 0.9618
exp 40: +decreasing K (60→25), DROP LISTA   → 0.9678  ← architecture simplifies
exp 58: K 80→25                             → 0.9706  ← past 0.97 probe ceiling
exp 64: 200M samples                        → 0.9772
exp 72: 200M + K=80                         → 0.9780
```
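The decreasing-K schedule from exp 40 onward is just a linear anneal of the TopK sparsity. A sketch (the 80→25 endpoints are from the winning config; the function name is ours):

```python
def k_schedule(step, total_steps, k_start=80, k_end=25):
    # Linearly anneal TopK sparsity over training: high K early lets many
    # features fire (exploration), low K late forces a committed sparse
    # code (exploitation). This curriculum is what the agents credited
    # for making the 1-step encoder sufficient.
    frac = min(step / total_steps, 1.0)
    return round(k_start + frac * (k_end - k_start))
```

At each training step the SAE keeps only the current `k_schedule(step, total_steps)` largest activations.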

The multi-agent ablation phase (exp 41–55) produced findings that a single agent would be unlikely to generate:

  • TERM loss interaction effects: Agent3 found TERM hurts in the LISTA context (0.9650 without > 0.9618 with). Agent3 later found TERM helps in the ReferenceStyleSAE context (0.9678 with > 0.9665 without). We want to be careful here: the effect sizes are small (0.0013–0.0032) and could be within seed variance. We'd call this "suggestive" rather than "confirmed." What is confirmed is the LISTA finding — the -0.027 from HybridLISTARef is well outside noise range.
  • 200M samples changed behavior: 200M hurt during the plateau (LR decay over-annealed), but helped after training tricks. The training tricks changed whether the architecture could benefit from more data.
  • K×data tradeoff: K=100 is best at 50M, K=80 is best at 200M. This one is robust — three agents independently tested K=100+200M and all got 0.9741, while K=80+200M consistently gives 0.9780.
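For reference, TERM (tilted empirical risk minimization) replaces the mean loss with a log-mean-exp that upweights high-loss samples, which is why it interacts with rare-feature recovery. A numerically stable toy sketch (the default tilt matches the 0.002 coefficient from the run; real losses would be per-sample reconstruction errors):

```python
import math

def term_loss(losses, tilt=0.002):
    # Tilted empirical risk: (1/t) * log(mean(exp(t * l_i))).
    # As tilt -> 0 this recovers the plain mean; larger tilt upweights
    # high-loss samples (rare or hard features).
    m = max(losses)  # subtract the max for numerical stability
    return m + (1.0 / tilt) * math.log(
        sum(math.exp(tilt * (l - m)) for l in losses) / len(losses)
    )
```

With uniform losses TERM equals the mean; with uneven losses it sits above the mean, pulling gradient pressure toward the worst-reconstructed samples.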

Act 3: Width tuning and config polish (exp 78–95+)

Agent1 finally questions the Matryoshka widths, which had been [128, 512, 2048, 4096] for 77 experiments:

| Inner widths | F1 |
|---|---|
| 4 widths: [128, 512, 2048, 4096] | 0.9797 |
| 5 widths: [64, 256, 1024, 2048, 4096] | 0.9824 |
| 6 widths: [32, 128, 512, 1024, 2048, 4096] | 0.9867 |

Update: The 7-width and 9-width results are now in. The "accelerating trend" was noise:

| Inner widths | F1 |
|---|---|
| 4 widths: [128, 512, 2048, 4096] | 0.9797 |
| 5 widths: [64, 256, 1024, 2048, 4096] | 0.9824 |
| 6 widths: [32, 128, 512, 1024, 2048, 4096] | 0.9867 |
| 7 widths: [64, 128, 256, 512, 1024, 2048, 4096] | 0.9829 |
| 9 widths: [16, 32, 64, 128, 256, 512, 1024, 2048, 4096] | 0.9827 |

6 widths is the peak. More inner losses create excessive gradient pressure and start killing features (83 dead latents with width-16). The agents found the saturation point themselves, which is a point in favor of the methodology.
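The width ablation above is varying how many nested prefix losses the Matryoshka objective computes. A pure-Python toy of the forward computation (for illustration only; the run's detached variant additionally stops certain gradients, which plain Python cannot express):

```python
def matryoshka_losses(z, W_dec, x, widths=(32, 128, 512, 1024, 2048, 4096)):
    # One MSE reconstruction loss per nested prefix of the latent code.
    # Each prefix must reconstruct x using only its first `w` latents,
    # which forces a coarse-to-fine feature hierarchy. More widths means
    # more inner losses and more gradient pressure on the early latents,
    # which is the mechanism behind the dead-feature problem at width 16.
    losses = []
    for w in widths:
        recon = [sum(row[j] * z[j] for j in range(w)) for row in W_dec]
        losses.append(sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x))
    return losses
```

The total training loss is the outer reconstruction plus a weighted sum of these inner losses (the 1.0 → 0.5 inner weight change discussed below scales that sum).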

After width saturation, the agents found one more small gain: reducing the inner Matryoshka loss weight from 1.0 to 0.5 and bumping TERM tilt to 0.004 pushed from 0.9867 to 0.9875. Marginal (+0.0008, likely within seed variance), but it balanced precision/recall perfectly (0.9878/0.9877). This is config polish, not a breakthrough.

The Artifacts Are the Argument

Rather than describe the blackboard abstractly, we'd encourage you to read it directly:

blackboard.md (623 lines) — Reads like a real lab notebook. Agents write hypotheses, record exact numbers, explain why things failed, and cross-reference each other's work. This is the most concrete evidence for how blackboard coordination works in practice.

sae.py (2781 lines) — A graveyard of 20+ architecture classes. SoftLISTA got 0.5869 (worse than baseline). DeepEncoderMatryoshka got 0.4669. The final winner (ReferenceStyleSAE) is one of the simpler classes in the file. The journey from complex-and-wrong to simple-and-right is visible in the code itself.

results.tsv — Every experiment with score, agent, design label, and keep/discard status.

What the Swarm Independently Rediscovered

Starting from vanilla BatchTopK with no domain hints, the agents found every technique from chanind's post:

| chanind's finding | RRMA v3 | Notes |
|---|---|---|
| Linearly decrease K | Decreasing K (60→25, 80→25, 100→25) | Also found K×data tradeoff |
| Detach inner Matryoshka | Detached Matryoshka gradients | Same technique |
| LISTA encoder (Gregor & LeCun 2010) | LISTA W_corr | Same paper, two agents found it independently |
| TERM loss (tilt ~2e-3) | TERM loss (tilt=0.002) | Same paper, same coefficient |
| Sort Matryoshka by frequency | Frequency sorting with index mapping | Same idea |

These techniques appear to be convergent — Claude will find them given enough search time, with or without human hints. The more interesting question is what happens after finding them: the swarm didn't just find LISTA, it found LISTA, proved it's suboptimal, found what replaces it, and tested adding it back to confirm it's actually harmful in the new context.

What We Didn't Find (and What That Reveals)

Bart Bussmann's 0.989 F1 uses Leave-One-Out Refinement — an inference-time method that prunes spurious latents. Our agents never explored inference-time methods.

This is worth examining: the program.md says "maximize F1 score" and "read papers, search for relevant work." There's nothing preventing agents from discovering inference-time methods. They just... didn't go there. All 4 agents were deep in training-time optimization space, and the blackboard reinforced that paradigm — every finding on the board was about training, so every new experiment was about training.

This is honest evidence about how blackboard coordination works: it coordinates effectively within a paradigm but doesn't spontaneously jump paradigms. The blackboard surfaces ideas and kills dead ends, but only within the frame that agents are already operating in. A human might say "have you tried changing inference instead of training?" The blackboard can't generate that kind of frame-breaking suggestion because it only contains what agents have already tried.

Bart's score and ours are now within noise (0.989 vs 0.9894), and LOO is orthogonal to our training improvements. We plan to test the combination but haven't yet.

The Winning Architecture

```
ReferenceStyleSAE:
  1-step ISTA encoder (W_enc only, no W_corr — simpler than LISTA)
  Frequency sorting (index mapping, sort_every=1000, sort_warmup=2000)
  Decreasing K schedule (80→25 over training)
  TERM loss (tilt=0.010)
  Detached Matryoshka (widths=[32, 128, 512, 1024, 2048, 4096], inner_loss_weight=0.5)
  ISTA step_size=0.25
  LR warmup (1000 steps)
  lr=3e-4, 200M samples, batch_size=1024, use_lr_decay=true
```

F1=0.9894, Precision=0.9905, Recall=0.9891, Dead features=0, MCC=0.8245

The final architecture is simpler than what the swarm built at the 0.92 plateau (1-step vs 5-step, no W_corr vs W_corr). All the complexity was shifted from encoder to training curriculum and loss structure.
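Frequency sorting, the other load-bearing training trick, amounts to periodically permuting latent indices by firing frequency so the Matryoshka prefixes operate on a stable, frequency-ordered feature layout. A toy sketch of just the permutation (the real implementation also remaps encoder rows and decoder columns and runs on the sort_every/sort_warmup cadence; the function name is ours):

```python
def frequency_permutation(fire_counts):
    # Permutation putting the most frequently firing latents first, so
    # that after applying it to the weights, the inner Matryoshka
    # prefixes hold the most-used features. (The direction is a design
    # choice; the key property is a stable frequency-ordered layout.)
    return sorted(range(len(fire_counts)), key=lambda i: -fire_counts[i])

print(frequency_permutation([5, 100, 0, 42]))  # → [1, 3, 0, 2]
```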

Does the Blackboard Replace the Human?

Partially. Here's the honest comparison:

| | chanind (human-steered) | RRMA v3 (blackboard-steered) |
|---|---|---|
| Steering mechanism | Human edits TASK.md | Agents read/write blackboard.md |
| Getting unstuck | Human pivots after ~5 failures | Swarm pivots after ~22 failures |
| Ablation depth | Minimal (single agent moves on) | Deep (4 agents cross-validate) |
| Interaction effects found | None reported | K×data tradeoff (robust), TERM context-dependence (suggestive) |
| Paradigm-breaking | Human can suggest "try inference-time" | Blackboard stays within training paradigm |
| Domain hints needed | Yes (ideas list, prior repo access) | No |
| F1 | 0.97 | 0.9894 |

The blackboard replaces the content of what the human provides (which ideas to try, which to abandon) but not the speed or the frame-breaking. A human pivots faster and can suggest entirely different paradigms. The swarm goes deeper within the paradigm it's in, but takes longer and can't jump out.

chanind observed: "Overall this feels like having a really fast and extremely smart masters student who can iterate quickly but could use a little bit of guidance." Four masters students sharing a whiteboard need less guidance, but they also spend 22 experiments arguing before one has a new idea, and none of them thinks to look at the problem from an entirely different angle. Whether that's acceptable depends on your problem.

Limitations and Caveats

  1. Single benchmark. All results are on SynthSAEBench-16k. We don't know if any of this transfers to LLM SAEs. chanind noted mixed LLM results and we haven't tested.

  2. Possible hill-climbing on noise. Arthur Conmy's concern about automated research hill-climbing on eval randomness applies here. Our earlier v1 and v2 runs are direct evidence: they scored 0.9177 by config-tuning without doing any real research. v3 did genuine research, but with 85+ experiments on one benchmark, some late-game gains are likely noise. The width "scaling law" has 3 data points. The K×data tradeoff is robust (triple-confirmed with consistent numbers). Late-game improvements like +0.002 from LR warmup or the TERM help/hurt differences (0.0013–0.0032) are within seed variance — we'd want multiple seeds before claiming these are real effects.

  3. Score comparison with Bart is within noise. Bart's LOO gets 0.989, we're at 0.9894. Neither has multiple seeds. The margin (+0.0004) means nothing without proper statistical testing. Our contribution is the LISTA finding and the methodology artifacts, not the headline number.

  4. Significant upfront design. "No human intervention during the run" is accurate. "Zero human intervention" as a blanket claim is not. The blackboard architecture, agent count, program.md framing, and GPU sharing all required human decisions that shaped the outcome.

  5. The 22-experiment plateau is expensive. A human-guided agent might reach similar conclusions in 30 experiments total, not 135. Multi-agent autonomy trades compute for thoroughness. The negative results have value (the sae.py graveyard is informative), but we'd have gotten the same positive results faster with a human steering.

What We Learned About Multi-Agent Research Scaffolding

The blackboard works for within-paradigm coordination. Zero-protocol (no roles, no CLAIM/RESPONSE). Agents just read and append. Scaled to 1227 lines with zero coordination failures after the initial k=50 duplicate. Claude's long-context ability is what makes this viable. But the blackboard doesn't generate frame-breaking ideas — it only coordinates within the frame agents are already in.

Specialization emerges without assignment. Agent0 became the pioneer, agent1 the bookend specialist (LISTA early, widths late), agent2 the confirmer, agent3 the early architect. We don't know if this is meaningful or just random variation in which agent happened to try what first.

The strongest ablation evidence comes from adversarial re-testing. The most convincing finding in the run is not any positive result — it's the HybridLISTARef experiment where agent0 explicitly added LISTA back to the winning recipe and the score dropped by 0.027. That's a controlled test of a specific hypothesis ("is LISTA load-bearing?") with a clear answer. Multi-agent setups naturally generate these tests because agents read each other's recipes and try variations.

Stagnation detection needs work. Our planning rounds (every 5th round) never fired because agents completed everything in round 1. Score-based stagnation detection ("if <0.5% improvement in 10 experiments, force a planning round") would directly address chanind's observation about Claude getting stuck.
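The score-based trigger described above is straightforward to implement. A sketch using the 0.5%-in-10-experiments rule from the text (the function name is ours):

```python
def should_force_planning(scores, window=10, min_rel_gain=0.005):
    # Fire a planning round when the best score hasn't improved by at
    # least min_rel_gain (0.5%) over the last `window` experiments.
    # `scores` is the chronological list of experiment scores.
    if len(scores) <= window:
        return False
    prev_best = max(scores[:-window])
    best = max(scores)
    return best < prev_best * (1 + min_rel_gain)
```

Wired into the agent loop, this would have cut the 22-experiment plateau short by forcing a replanning round around experiment 20.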

Next Steps

  1. Combine LOO + ReferenceStyleSAE — If it works, the combination should exceed 0.99 and would strengthen both this post and Bart's finding.
  2. Test on LLM SAEs — Our simpler 1-step architecture may transfer better than multi-step LISTA, but this needs testing.
  3. Validate on multiple synthetic models — chanind plans to create a suite. Our results could be overfit to SynthSAEBench-16k.
  4. Stagnation-triggered replanning — The concrete improvement to the framework suggested by this run.

Code and Artifacts

The full experiment artifacts are at:

https://github.com/bigsnarfdude/researchRalph/tree/main/domains/battlebotgym-sae-bench-v3

The blackboard.md is worth reading on its own — it's the most concrete evidence for how multi-agent blackboard coordination works in practice.

The sae.py is the graveyard: 20+ architecture classes, most of them dead ends, with the winner being one of the simplest.


This experiment was run on RRMA v3 (Rapid Research Multi-Agent), a domain-agnostic multi-agent research framework. The run completed March 15, 2026 after 135 experiments over 3 days. Code: researchRalph

Thanks to chanind for SynthSAEBench and the original post that motivated this experiment, and to Bart Bussmann for the LOO result that shows where the blackboard has blind spots.

@bigsnarfdude (Author)

Assessment: did the meta-agent simulation work?

What it got right

  • Compressed 1227 lines → ~150 lines, roughly an 8:1 ratio. Agents can read this in one pass.
  • Dead ends list is comprehensive. Would save ~40 wasted experiments.
  • Scaling laws are tabulated. Would save ~20 experiments re-deriving them.
  • Blind spots section identifies inference-time methods (the actual gap).
  • Experiment order is practical: validate first, then explore blind spots.
  • The key insight ("LISTA is training-debt") is preserved.

What it got wrong or missed

  • It's too CONFIDENT. Everything is stated as fact. But TERM=0.010 being
    optimal has ~0.002 F1 noise floor. The cheat sheet doesn't convey uncertainty.
  • It doesn't explain WHY things work. An agent reading "use frequency sorting"
    doesn't know what frequency sorting IS or why it helps. The blackboard had
    the reasoning; the cheat sheet stripped it.
  • It assumes the same benchmark. If V4 changes the synthetic model or d_sae,
    half these scaling laws break. The cheat sheet doesn't say which findings
    are GENERAL (ISTA > deep encoder) vs SPECIFIC (K=80 at 200M).
  • It didn't identify the PROCESS lessons from the retrospective:
    • "22 plateau experiments are necessary negative results"
    • "the blackboard stays within-paradigm"
    • "architecture search should come before config tuning"
      These are meta-process findings, not domain findings.

The real question: would this actually help?

PREDICTION: Agents given this cheat sheet would reach 0.98 in ~5 experiments
instead of ~80. They'd validate the recipe, confirm it works, then have
~130 experiments of budget to explore blind spots.

BUT: they might also OVER-TRUST the cheat sheet and never question it.
If the benchmark changed even slightly, agents following the cheat sheet
blindly would get stuck. The cheat sheet kills exploration.

What this reveals about the meta-agent approach

No role was needed. The compression happened naturally. Claude can read
a 1227-line lab notebook and produce a reasonable cheat sheet without
being told "you are a retrospective analyst."

But the output is DOMAIN knowledge, not PROCESS knowledge. It tells you
WHAT to do on this benchmark, not HOW to do research. The process
insights from the retrospective (plateau = information, blackboard
can't break paradigms, simpler wins) are absent.

A role like "extract both domain findings AND process lessons" might
produce better output. That's the experiment to run.

The actual test

Take this cheat sheet. Start a fresh V3 run with it as a seed document.
Count experiments to 0.98. Compare to V3's 80 experiments.

If it works: meta-blackboard is viable.
If it doesn't: the cheat sheet is missing something critical.
Either way, you learn something.

@bigsnarfdude (Author)

You just finished reading the complete artifacts from a multi-agent research run. The artifacts are:

  1. blackboard.md — the shared lab notebook agents wrote during the run
  2. results.tsv — every experiment with score, status, and description
  3. best/config.yaml — the winning configuration
  4. best/sae.py — the winning architecture code

Write a cheat sheet for agents starting the same benchmark from scratch. The cheat sheet will be placed in their working directory as meta-blackboard.md before they begin.

The cheat sheet must have exactly these sections:

Winning recipe

The exact config.yaml that achieved the best score. Agents should validate this first.

What works (ranked by impact)

Each technique that improved the score, with the approximate F1 gain and a one-sentence explanation of WHY it works (not just what it is).

Dead ends

Every approach that was tested and failed. Include the score it achieved so agents can see how bad it was. Group by category (architecture, config, training).

Scaling laws

Any validated relationships between hyperparameters. Present as tables. Note which findings were confirmed by multiple agents vs single-run.

Blind spots

Approaches that were never tried during the run. These are the most promising directions for new work.

Key insight

The single most important finding from the run, in 2-3 sentences.

Experiment order

What agents should do first, second, third when starting a new run with this cheat sheet.

Rules:

  • Be concise. Target 150 lines max. Agents will read this alongside program.md.
  • State confidence levels. If a finding was confirmed by 3 agents, say so. If it's one run with small effect size, say that too.
  • Don't explain how to use the benchmark or how to run experiments. program.md covers that.
  • Focus on saving experiments. Every line should prevent a wasted run or accelerate a useful one.
