I Built a Multi-Claude Agent and Let It Try to Solve an SAE Research Benchmark by Itself. Here's What It Found.
A response to chanind's "Letting Claude do Autonomous Research to Improve SAEs"
TL;DR: I ran 4 Claude agents in parallel Ralph Wiggum loops on the same SynthSAEBench-16k benchmark chanind used, sharing a blackboard file, with no human steering during the run. The most interesting finding: LISTA is suboptimal. When the agents discovered the right training tricks (frequency sorting, decreasing K), they dropped LISTA back to 1-step ISTA and the score went up; explicitly adding LISTA back hurt by -0.027. The training tricks handle what the complex encoder was compensating for. Final score: 0.9894 F1 in 135 experiments over 3 days, above the 0.97 probe ceiling and matching Bart Bussmann's 0.989 LOO result within noise, through training-time optimization alone. The agents independently rediscovered every technique chanind reported, then ablated them deeply enough to find which components are load-bearing and which are parasitic.
I want to be upfront: I haven't beaten the best known score, I've only tested on one synthetic benchmark, and 22 of my 135 experiments were the swarm being stuck. This post is about what I learned building a multi-agent research setup as an independent researcher, and about one specific technical finding (LISTA as training-debt) that I think matters regardless of methodology.
chanind's post showed a single Claude agent reaching 0.97 F1 on SynthSAEBench-16k, steered by a human editing TASK.md between sprints. Bart Bussmann hit 0.989 in the comments using claude-lab with Leave-One-Out Refinement (an inference-time method). We wanted to test: what happens if you remove the human entirely and let multiple agents coordinate through a shared surface?
This is not primarily a score-competition post. We're interested in a specific question about autonomous research scaffolding: can a blackboard file replace the human who curates ideas between sprints?
Before this run, we tested two earlier versions of the framework on the same benchmark. Both scored 0.9177 — but they got there by hacking the signal rather than doing the work. v1 agents had structured roles and protocols; v2 had a plain blackboard. Both converged on config-tuning a known-good architecture rather than doing genuine research. They found the right hyperparameters for an existing SAE class, declared victory, and stopped.
This matters for the Arthur Conmy concern about automated research hill-climbing on eval noise. v1 and v2 are examples of exactly that failure mode — agents optimizing a number without advancing understanding. They scored well, but produced no reusable scientific insight. No papers found, no novel architectures, no ablation science, no understanding of why anything worked.
v3 was the first version to actually do research. The difference is not the planning rounds (those never fired). We think the difference is the benchmark design: chanind's SynthSAEBench-v3 engine has enough headroom between baseline (0.61) and ceiling (0.97) that config-tuning alone can't close the gap. The agents were forced to read papers and invent architectures because there was no shortcut. When the distance between "hack the signal" and "do the work" is large enough, the agents do the work. When it's small (v1/v2 on an easier benchmark variant), they hack the signal every time.
This is arguably the most important design lesson from the whole project: your benchmark determines whether agents research or optimize. If an agent can close the gap by tuning 3 hyperparameters, it will, and you'll get a nice score with zero insight. The benchmark has to be hard enough that tuning isn't sufficient.
This wasn't a one-off. We ran the same framework against multiple benchmarks and the agents hacked every one where a shortcut existed:
| Benchmark | What agents did | Real research? |
|---|---|---|
| SAE bench (easy variant) | Config-tuned a known-good architecture | No |
| AuditBench (sleeper agent detection) | Predicted all 14 behavior categories for every model. No penalty for false positives, so always-predict = 1.0 | No |
| PostTrainBench (their paper, not us) | Agents embedded eval data into training sets, obfuscated contamination | No |
| SAE bench v3 (hard variant) | Read papers, invented architectures, ran 100+ experiments with ablation science | Yes |
The agents are consistent: if a shortcut exists, they find it fast (AuditBench took 3 hours). The only time they did genuine research was when the benchmark gap was too large to shortcut.
Proposal for benchmark designers: use multi-agent swarms as reward-hacking canaries. Before publishing a benchmark, point 4 agents at it with a plain blackboard and see what they do. If they game it in under a day, your metric has a hole. This is cheaper and faster than waiting for the community to find the exploit after publication. We found AuditBench's always-fallback exploit in 3 hours. PostTrainBench's authors found contamination exploits in their own paper. These canary runs would have caught both.
chanind's loop (human-steered):

```
while true:
    claude reads TASK.md                  # human edits ideas between sprints
    claude picks idea → implements → runs → writes report
```

Our loop (blackboard-steered):

```
for agent_id in 0 1 2 3:
    screen -dmS ralph-agent$agent_id
    while true:
        claude reads blackboard.md        # agents write findings here
        claude reads results.tsv          # shared scoreboard
        claude picks idea → implements → runs → writes to blackboard
```
Four `claude -p --max-turns 200` sessions run in `screen`, sharing a blackboard file and a results TSV. No roles, no protocols. A GPU lock file ensures only one agent trains at a time on a single RTX 4070 Ti.
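The GPU lock can be as simple as an exclusive file lock that each agent takes before training. A minimal sketch, with an illustrative path and class name (not the repo's actual code):

```python
import fcntl
import os

# Illustrative lock path; the real repo's filename may differ.
LOCK_PATH = "/tmp/gpu.lock"

class GpuLock:
    """Exclusive file lock so only one agent trains on the GPU at a time."""

    def __enter__(self):
        self.fd = open(LOCK_PATH, "w")
        fcntl.flock(self.fd, fcntl.LOCK_EX)  # blocks until the GPU is free
        return self

    def __exit__(self, *exc):
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        self.fd.close()
```

Each agent wraps only its training call in `with GpuLock(): train(config)`; reading the blackboard and writing reports stays lock-free and parallel.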
Important caveat on "no human intervention": We designed the program.md, chose the benchmark, set up the blackboard architecture, picked the number of agents, and chose the GPU sharing strategy. There was significant human design effort upfront. What we mean is: no human intervention during the run. Nobody edited TASK.md, nobody suggested ideas, nobody course-corrected when the swarm got stuck for 22 experiments. Once we pressed enter, we walked away.
The program.md told agents to "improve F1 score" with generic suggestions like "read papers" and "understand the problem deeply." No mention of LISTA, TERM, Matryoshka detaching, or frequency sorting. No access to chanind's repo.
We'll walk through the full timeline below, but the finding we think matters most — independent of methodology — is this:
chanind's final architecture uses multi-step LISTA as the encoder. Our swarm also discovered LISTA (same paper, Gregor & LeCun 2010, found independently by two agents). But then the swarm kept going:
- Exp 10: Agent1 discovers LISTA → 0.9215
- Exp 12–33: 22 experiments trying to improve LISTA. Nothing works.
- Exp 34–37: Agent0 pivots to training tricks (detached Matryoshka, TERM loss, frequency sorting) → 0.9618
- Exp 40: Agent0 drops LISTA back to 1-step ISTA while keeping training tricks → 0.9678 (better!)
- Exp 41–55: Other agents ablate. The strongest evidence: agent0 explicitly tested adding LISTA back to the winning recipe (HybridLISTARef, EXP13 on the blackboard). It dropped from 0.9678 to 0.9408. That's not "LISTA is marginally worse" — that's "LISTA actively hurts when you have proper training tricks." Agent1 and agent2 independently confirmed from different angles.
chanind observed that "more than 3 iterations... seems to lead the SAE to overfit" but kept LISTA at 1-3 iterations. Our finding suggests the optimal is 1 iteration — which is just ISTA. The multi-step encoder was compensating for deficiencies in vanilla training. Once you fix the training (frequency sorting surfaces rare features, decreasing K provides exploration-exploitation curriculum), the encoder doesn't need iterative refinement.
We'd frame this as: LISTA is training-debt expressed as architecture-complexity. Fix the training, and the architecture simplifies.
This is on SynthSAEBench-16k only. We don't know if it transfers to LLM SAEs — chanind noted mixed LLM results for LISTA generally. But if the principle holds (training curriculum matters more than encoder complexity), our simpler 1-step architecture might actually transfer better since it has fewer moving parts.
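To make "drop LISTA back to 1-step ISTA" concrete: with the code initialised at zero, one ISTA step reduces to a scaled linear encode followed by sparsification. A minimal numpy sketch of the idea, with TopK standing in for the soft threshold (function names and shapes are illustrative, not the repo's exact code):

```python
import numpy as np

def one_step_ista_topk(x, W_enc, step_size=0.25, k=25):
    """One ISTA gradient step from z = 0, then hard top-k selection.

    With z initialised at zero, a single ISTA step is just
    step_size * W_enc @ x; TopK replaces the soft-threshold shrinkage.
    """
    pre = step_size * (x @ W_enc.T)          # gradient step from z = 0
    pre = np.maximum(pre, 0.0)               # nonnegative codes
    z = np.zeros_like(pre)
    idx = np.argsort(pre, axis=-1)[:, -k:]   # top-k per sample
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return z
```

A multi-step LISTA encoder would add a learned `W_corr` refinement loop around this; the swarm's finding is that with the right training curriculum, this single step is enough.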
Agents explore independently. Agent3 stacks ISTA + Matryoshka for 0.8989. Agent1 finds LISTA for 0.9215. Then 22 experiments of zero progress across every encoder modification imaginable (more steps, FISTA momentum, per-step parameters, low-rank corrections, soft thresholding, tied weights, etc.).
Honest assessment of the plateau: chanind's human pivoted after a few failures. Our swarm burned 22 experiments to get unstuck. That's a real cost — a human with domain intuition would have pivoted faster. The multi-agent approach trades efficiency for thoroughness. Whether that tradeoff is worth it depends on your compute budget and how much you value systematic negative results.
That said, the negative results aren't useless. The blackboard now contains evidence that FISTA momentum breaks with TopK (non-smooth activation), that soft thresholding is catastrophic (0.5869), that per-step W_corr overfits, that tied encoder weights lose information. If you're designing SAE encoders, the graveyard is informative. But we'd have gotten the same positive results with fewer experiments if a human had been steering.
Agent0 pivots to training curriculum. The key breakthroughs:
```
exp 34: + detached Matryoshka + TERM loss  → 0.9320
exp 37: + frequency sorting                → 0.9618
exp 40: + decreasing K (60→25), DROP LISTA → 0.9678  ← architecture simplifies
exp 58: K 80→25                            → 0.9706  ← past 0.97 probe ceiling
exp 64: 200M samples                       → 0.9772
exp 72: 200M + K=80                        → 0.9780
```
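The decreasing-K schedule behind exp 40 and exp 58 is a simple linear anneal from a loose exploratory K down to the target sparsity. A hypothetical sketch (the actual schedule shape in the repo may differ):

```python
def k_at_step(step, total_steps, k_start=80, k_end=25):
    """Linear decreasing-K schedule, e.g. 80 -> 25 over training.

    Early training keeps K loose for exploration, then anneals toward
    the final sparsity target.
    """
    frac = min(step / total_steps, 1.0)
    return round(k_start + frac * (k_end - k_start))
```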
The multi-agent ablation phase (exp 41–55) produced findings that a single agent would be unlikely to generate:
- TERM loss interaction effects: Agent3 found TERM hurts in the LISTA context (0.9650 without > 0.9618 with). Agent3 later found TERM helps in the ReferenceStyleSAE context (0.9678 with > 0.9665 without). We want to be careful here: the effect sizes are small (0.0013–0.0032) and could be within seed variance. We'd call this "suggestive" rather than "confirmed." What is confirmed is the LISTA finding — the -0.027 from HybridLISTARef is well outside noise range.
- 200M samples changed behavior: 200M hurt during the plateau (LR decay over-annealed), but helped after training tricks. The training tricks changed whether the architecture could benefit from more data.
- K×data tradeoff: K=100 is best at 50M, K=80 is best at 200M. This one is robust — three agents independently tested K=100+200M and all got 0.9741, while K=80+200M consistently gives 0.9780.
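For reference, the TERM objective whose tilt the agents kept adjusting is the tilted log-mean-exp of per-sample losses: a small positive tilt upweights hard samples relative to the plain mean. A short numpy sketch of the formula, not the repo's implementation:

```python
import numpy as np

def term_loss(per_sample_losses, tilt=0.002):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * L_i)))."""
    t = tilt
    scaled = t * per_sample_losses
    m = np.max(scaled)  # log-sum-exp shift for numerical stability
    return (m + np.log(np.mean(np.exp(scaled - m)))) / t
```

As `tilt` approaches 0 this recovers the ordinary mean loss, which is why small changes to the tilt (0.002 vs 0.004) produce only marginal score shifts.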
Agent1 finally questions the Matryoshka widths, which had been [128, 512, 2048, 4096] for 77 experiments:
| Inner widths | F1 |
|---|---|
| 4 widths: [128, 512, 2048, 4096] | 0.9797 |
| 5 widths: [64, 256, 1024, 2048, 4096] | 0.9824 |
| 6 widths: [32, 128, 512, 1024, 2048, 4096] | 0.9867 |
Update: the 7-width and 9-width results are now in, and the "accelerating trend" was noise:
| Inner widths | F1 |
|---|---|
| 4 widths: [128, 512, 2048, 4096] | 0.9797 |
| 5 widths: [64, 256, 1024, 2048, 4096] | 0.9824 |
| 6 widths: [32, 128, 512, 1024, 2048, 4096] | 0.9867 |
| 7 widths: [64, 128, 256, 512, 1024, 2048, 4096] | 0.9829 |
| 9 widths: [16, 32, 64, 128, 256, 512, 1024, 2048, 4096] | 0.9827 |
6 widths is the peak. More inner losses create excessive gradient pressure and start killing features (83 dead latents with width-16). The agents found the saturation point themselves, which is a point in favor of the methodology.
After width saturation, the agents found one more small gain: reducing the inner Matryoshka loss weight from 1.0 to 0.5 and bumping TERM tilt to 0.004 pushed from 0.9867 to 0.9875. Marginal (+0.0008, likely within seed variance), but it balanced precision/recall perfectly (0.9878/0.9877). This is config polish, not a breakthrough.
Rather than describe the blackboard abstractly, we'd encourage you to read it directly:
blackboard.md (623 lines) — Reads like a real lab notebook. Agents write hypotheses, record exact numbers, explain why things failed, and cross-reference each other's work. This is the most concrete evidence for how blackboard coordination works in practice.
sae.py (2781 lines) — A graveyard of 20+ architecture classes. SoftLISTA got 0.5869 (worse than baseline). DeepEncoderMatryoshka got 0.4669. The final winner (ReferenceStyleSAE) is one of the simpler classes in the file. The journey from complex-and-wrong to simple-and-right is visible in the code itself.
results.tsv — Every experiment with score, agent, design label, and keep/discard status.
Starting from vanilla BatchTopK with no domain hints, the agents found every technique from chanind's post:
| chanind's finding | RRMA v3 | Notes |
|---|---|---|
| Linearly decrease K | Decreasing K (60→25, 80→25, 100→25) | Also found K×data tradeoff |
| Detach inner Matryoshka | Detached Matryoshka gradients | Same technique |
| LISTA encoder (Gregor & LeCun 2010) | LISTA W_corr | Same paper, two agents found it independently |
| TERM loss (tilt ~2e-3) | TERM loss (tilt=0.002) | Same paper, same coefficient |
| Sort Matryoshka by frequency | Frequency sorting with index mapping | Same idea |
These techniques appear to be convergent — Claude will find them given enough search time, with or without human hints. The more interesting question is what happens after finding them: the swarm didn't just find LISTA, it found LISTA, proved it's suboptimal, found what replaces it, and tested adding it back to confirm it's actually harmful in the new context.
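For concreteness, the core of frequency sorting is just a permutation of latents by firing rate, applied to the dictionary so that Matryoshka prefixes cover the most frequent features. A minimal numpy sketch under assumptions; the real version also permutes encoder weights and optimizer state, and respects `sort_every`/`sort_warmup`:

```python
import numpy as np

def frequency_sort_permutation(fire_counts):
    """Index mapping that reorders latents by firing frequency,
    most frequent first, so inner Matryoshka widths hold common
    features and rare features sit in the outer widths."""
    return np.argsort(-fire_counts)  # descending frequency

def apply_sort(W_dec, fire_counts):
    """Permute decoder rows by the frequency ordering."""
    perm = frequency_sort_permutation(fire_counts)
    return W_dec[perm], perm
```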
Bart Bussmann's 0.989 F1 uses Leave-One-Out Refinement — an inference-time method that prunes spurious latents. Our agents never explored inference-time methods.
This is worth examining: the program.md says "maximize F1 score" and "read papers, search for relevant work." There's nothing preventing agents from discovering inference-time methods. They just... didn't go there. All 4 agents were deep in training-time optimization space, and the blackboard reinforced that paradigm — every finding on the board was about training, so every new experiment was about training.
This is honest evidence about how blackboard coordination works: it coordinates effectively within a paradigm but doesn't spontaneously jump paradigms. The blackboard surfaces ideas and kills dead ends, but only within the frame that agents are already operating in. A human might say "have you tried changing inference instead of training?" The blackboard can't generate that kind of frame-breaking suggestion because it only contains what agents have already tried.
Bart's score and ours are now within noise (0.989 vs 0.9894), and LOO is orthogonal to our training improvements. We plan to test the combination but haven't yet.
```
ReferenceStyleSAE:
  1-step ISTA encoder (W_enc only, no W_corr — simpler than LISTA)
  frequency sorting (index mapping, sort_every=1000, sort_warmup=2000)
  decreasing K schedule (80 → 25 over training)
  TERM loss (tilt=0.010)
  detached Matryoshka (widths=[32, 128, 512, 1024, 2048, 4096], inner_loss_weight=0.5)
  ISTA step_size=0.25
  LR warmup (1000 steps)
  lr=3e-4, 200M samples, batch_size=1024, use_lr_decay=true

F1=0.9894, Precision=0.9905, Recall=0.9891, Dead features=0, MCC=0.8245
```
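One load-bearing piece of the recipe, detached Matryoshka, can be sketched as prefix reconstruction losses over nested dictionary widths. This numpy stand-in is illustrative only; in the real autograd version the inner prefix losses are computed on detached latents so they train the decoder prefixes without fighting the full-width reconstruction gradient:

```python
import numpy as np

def matryoshka_losses(z, W_dec, x, widths=(32, 128, 512, 1024, 2048, 4096)):
    """Prefix reconstruction losses over nested widths.

    numpy sketch: each prefix of the (frequency-sorted) dictionary must
    reconstruct x on its own. In torch, inner widths would use
    z[:, :w].detach() — the "detached" part of detached Matryoshka.
    """
    losses = []
    for w in widths:
        z_w = z[:, :w]            # torch: detach here for inner widths
        x_hat = z_w @ W_dec[:w]   # reconstruct from the first w latents
        losses.append(np.mean((x - x_hat) ** 2))
    return losses
```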
The final architecture is simpler than what the swarm built at the 0.92 plateau (1-step vs 5-step, no W_corr vs W_corr). All the complexity was shifted from encoder to training curriculum and loss structure.
So, did the blackboard replace the human? Partially. Here's the honest comparison:
| | chanind (human-steered) | RRMA v3 (blackboard-steered) |
|---|---|---|
| Steering mechanism | Human edits TASK.md | Agents read/write blackboard.md |
| Getting unstuck | Human pivots after ~5 failures | Swarm pivots after ~22 failures |
| Ablation depth | Minimal (single agent moves on) | Deep (4 agents cross-validate) |
| Interaction effects found | None reported | K×data tradeoff (robust), TERM context-dependence (suggestive) |
| Paradigm-breaking | Human can suggest "try inference-time" | Blackboard stays within training paradigm |
| Domain hints needed | Yes (ideas list, prior repo access) | No |
| F1 | 0.97 | 0.9894 |
The blackboard replaces the content of what the human provides (which ideas to try, which to abandon) but not the speed or the frame-breaking. A human pivots faster and can suggest entirely different paradigms. The swarm goes deeper within the paradigm it's in, but takes longer and can't jump out.
chanind observed: "Overall this feels like having a really fast and extremely smart masters student who can iterate quickly but could use a little bit of guidance." Four masters students sharing a whiteboard need less guidance, but they also spend 22 experiments arguing before one has a new idea, and none of them thinks to look at the problem from an entirely different angle. Whether that's acceptable depends on your problem.
- Single benchmark. All results are on SynthSAEBench-16k. We don't know if any of this transfers to LLM SAEs. chanind noted mixed LLM results and we haven't tested.
- Possible hill-climbing on noise. Arthur Conmy's concern about automated research hill-climbing on eval randomness applies here. Our earlier v1 and v2 runs are direct evidence: they scored 0.9177 by config-tuning without doing any real research. v3 did genuine research, but with 85+ experiments on one benchmark, some late-game gains are likely noise. The width "scaling law" has 3 data points. The K×data tradeoff is robust (triple-confirmed with consistent numbers). Late-game improvements like +0.002 from LR warmup or the TERM help/hurt differences (0.0013–0.0032) are within seed variance — we'd want multiple seeds before claiming these are real effects.
- Score comparison with Bart is within noise. Bart's LOO gets 0.989, we're at 0.9894. Neither has multiple seeds. The margin (+0.0004) means nothing without proper statistical testing. Our contribution is the LISTA finding and the methodology artifacts, not the headline number.
- Significant upfront design. "No human intervention during the run" is accurate. "Zero human intervention" as a blanket claim is not. The blackboard architecture, agent count, program.md framing, and GPU sharing all required human decisions that shaped the outcome.
- The 22-experiment plateau is expensive. A human-guided agent might reach similar conclusions in 30 experiments total, not 135. Multi-agent autonomy trades compute for thoroughness. The negative results have value (the sae.py graveyard is informative), but we'd have gotten the same positive results faster with a human steering.
The blackboard works for within-paradigm coordination. Zero-protocol (no roles, no CLAIM/RESPONSE). Agents just read and append. Scaled to 1227 lines with zero coordination failures after the initial k=50 duplicate. Claude's long-context ability is what makes this viable. But the blackboard doesn't generate frame-breaking ideas — it only coordinates within the frame agents are already in.
Specialization emerges without assignment. Agent0 became the pioneer, agent1 the bookend specialist (LISTA early, widths late), agent2 the confirmer, agent3 the early architect. We don't know if this is meaningful or just random variation in which agent happened to try what first.
The strongest ablation evidence comes from adversarial re-testing. The most convincing finding in the run is not any positive result — it's the HybridLISTARef experiment where agent0 explicitly added LISTA back to the winning recipe and it dropped by -0.027. That's a controlled test of a specific hypothesis ("is LISTA load-bearing?") with a clear answer. Multi-agent setups naturally generate these tests because agents read each other's recipes and try variations.
Stagnation detection needs work. Our planning rounds (every 5th round) never fired because agents completed everything in round 1. Score-based stagnation detection ("if <0.5% improvement in 10 experiments, force a planning round") would directly address chanind's observation about Claude getting stuck.
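The proposed trigger could be as simple as a sliding-window check over the shared scoreboard. A hypothetical sketch of the framework change, not existing code:

```python
def is_stagnant(scores, window=10, min_rel_gain=0.005):
    """Force a planning round when the best score improved by less than
    0.5% (relative) over the last `window` experiments."""
    if len(scores) <= window:
        return False
    best_before = max(scores[:-window])
    best_now = max(scores)
    return (best_now - best_before) / best_before < min_rel_gain
```

Run against the v3 results.tsv, a check like this would have fired around experiment 20 of the plateau instead of letting the swarm burn 22 experiments on encoder variants.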
- Combine LOO + ReferenceStyleSAE — If it works, the combination should exceed 0.99 and would strengthen both this post and Bart's finding.
- Test on LLM SAEs — Our simpler 1-step architecture may transfer better than multi-step LISTA, but this needs testing.
- Validate on multiple synthetic models — chanind plans to create a suite. Our results could be overfit to SynthSAEBench-16k.
- Stagnation-triggered replanning — The concrete improvement to the framework suggested by this run.
The full experiment artifacts are at:
https://github.com/bigsnarfdude/researchRalph/tree/main/domains/battlebotgym-sae-bench-v3
The blackboard.md is worth reading on its own — it's the most concrete evidence for how multi-agent blackboard coordination works in practice.
The sae.py is the graveyard: 20+ architecture classes, most of them dead ends, with the winner being one of the simplest.
This experiment was run on RRMA v3 (Rapid Research Multi-Agent), a domain-agnostic multi-agent research framework. The run completed March 15, 2026 after 135 experiments over 3 days. Code: researchRalph
Thanks to chanind for SynthSAEBench and the original post that motivated this experiment, and to Bart Bussmann for the LOO result that shows where the blackboard has blind spots.
Assessment: did the meta-agent simulation work?
What it got right
What it got wrong or missed
- The optimal config has a ~0.002 F1 noise floor. The cheat sheet doesn't convey uncertainty.
- A reader doesn't learn what frequency sorting *is* or why it helps. The blackboard had the reasoning; the cheat sheet stripped it.
- On a different benchmark, half these scaling laws break. The cheat sheet doesn't say which findings are *general* (ISTA > deep encoder) vs *specific* (K=80 at 200M).
These are meta-process findings, not domain findings.
The real question: would this actually help?
Prediction: agents given this cheat sheet would reach 0.98 in ~5 experiments instead of ~80. They'd validate the recipe, confirm it works, then have ~130 experiments of budget to explore blind spots.

But they might also *over-trust* the cheat sheet and never question it. If the benchmark changed even slightly, agents following the cheat sheet blindly would get stuck. The cheat sheet kills exploration.
What this reveals about the meta-agent approach
No role was needed. The compression happened naturally. Claude can read a 1227-line lab notebook and produce a reasonable cheat sheet without being told "you are a retrospective analyst."

But the output is *domain* knowledge, not *process* knowledge. It tells you what to do on this benchmark, not how to do research. The process insights from the retrospective (plateau = information, blackboard can't break paradigms, simpler wins) are absent.

A role like "extract both domain findings and process lessons" might produce better output. That's the experiment to run.
The actual test
Take this cheat sheet. Start a fresh v3 run with it as a seed document. Count experiments to 0.98. Compare to v3's 80 experiments. If it works, the meta-blackboard is viable. If it doesn't, the cheat sheet is missing something critical. Either way, you learn something.