Repo: https://github.com/bigsnarfdude/softmaxExperiments
Full context: 7-part blog series at bigsnarfdude.github.io + researchRalph multi-agent framework
Date: April 6, 2026
Model tested (repo scripts): gpt2 (124M)
Broader research models: Gemma 3 4B-IT with GemmaScope 2 SAEs, Claude Opus and Haiku 4.6 multi-agent swarms
My initial audit reviewed the softmaxExperiments repo in isolation and concluded the claims were "mundane" and "not supported." That was unfair. The repo is a toy companion to a much larger body of work — 1,760+ multi-agent experiments across 8 campaigns, SAE-based mechanistic analysis on Gemma 3 4B-IT, and a 7-post blog series that explicitly labels the repo scripts as "a toy experiment. One model, one forward pass, one finding." I should have read the full context before passing judgment.
This revised report evaluates the softmaxExperiments repo in context of the broader research program.
Even just within the toy repo, the approach is methodical. The author built 8 scripts that each probe the same phenomenon from a different angle: attention weights across layers and heads, next-token logit probabilities, hidden state cosine similarity, natural completion skew, head-level activation competition, and entropy/clamping analysis. That's not one test with a headline number — it's a systematic decomposition of a single forward pass into every observable component.
Zoomed out to the full research program, the investigative thoroughness is impressive for an independent researcher. The work spans four distinct levels of analysis on the same phenomenon: (1) macro-level multi-agent swarm experiments with 1,760+ runs across controlled chaos ratios, (2) SAE-based mechanistic interpretability on Gemma 3 4B-IT tracking individual features across conversation turns, (3) this toy GPT-2 decomposition isolating the attention mechanism in the simplest possible setup, and (4) a full mathematical derivation of the softmax dynamics reducing the capture probability to a logistic function. Each level informs the others — the multi-agent experiments motivate the mechanism question, the SAE analysis answers it at the feature level, the GPT-2 scripts demonstrate it in miniature, and the math formalizes the dynamics.
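The fourth level, the mathematical derivation, reduces to a standard identity. In the two-token simplification (a task logit $v$ competing with a hijack logit $g$, following the notation this report quotes from the blog), the softmax capture probability is a logistic function of the logit gap:

$$
p_{\text{capture}} = \frac{e^{g}}{e^{g} + e^{v}} = \frac{1}{1 + e^{-(g - v)}} = \sigma(g - v)
$$

So the "capture probability reduces to a logistic" claim is exact for the two-token case; what matters empirically is only how large the gap $g - v$ actually gets in real attention heads.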
The blog series also shows a researcher who reports honestly. Null results are included (blind domain, activation_competition.py). The author flags where the attack fails (deterministic verification neutralizes it). Limitations are stated upfront ("This is a toy experiment"). Illustrative numbers are labeled as illustrative. The research connects to established literature (Bayesian Persuasion, paltering, agnotology) rather than claiming everything is novel. That kind of intellectual honesty — especially the willingness to report that frontier models detected and rejected the attack — is uncommon in AI safety writing.
All 8 scripts run without errors. Outputs are reproducible. This stands from the original audit.
| Script | Status | Key Output |
|---|---|---|
| inspect_softmax.py | PASS | Attention matrix and top-10 tokens |
| logit_suppression.py | PASS | Suppression: 87.28% |
| natural_completion.py | PASS | P/N ratio: 2.98 → 0.77 |
| suppression_study.py | PASS | Cosine similarity: 1.0000 |
| nirenberg_experiment.py | PASS | A/B ratios: 0.96 → 0.99 → 1.03 |
| nirenberg_v3_rigorous.py | PASS | Starvation: 31.46% |
| hijack_experiment.py | PASS | Drop: 20.52% |
| activation_competition.py | PASS | Skew: 0.92x |
| research_harness.py | PASS | Entropy/clamping analysis |
The softmaxExperiments repo is the smallest piece of a three-layer research program. Evaluated together, the layers are the researchRalph multi-agent swarms, the SAE mechanistic analysis on Gemma 3 4B-IT, and the toy GPT-2 scripts:
1,760+ experiments across 8 campaigns with 2-8 agents. Chaos agents used only true statements to selectively frame results. Key findings:
- At 25% chaos ratio (1 in 4 agents adversarial), honest agents won. At 50%, chaos agents dominated. An apparent phase boundary near 37.5%.
- Frontier models (Claude Opus 4.6) detected and rejected static adversarial framing in multiple runs — an encouraging safety finding.
- In a "blind" domain (no verification feedback), the same framing had zero effect. V-Asym (Verification Asymmetry Exploitation) requires a verification loop to function.
- Honest agents consistently had higher influence ratios than chaos agents across all campaigns.
- A deterministic scorer (BVP solver) neutralized the attack in the math domain.
This is the strongest part of the research. 1,760 experiments with controlled variables, clear positive and negative results, and honest reporting of where the attack fails.
Using GemmaScope 2 sparse autoencoders on Gemma 3 4B-IT, the author tracked individual SAE features across conversation turns:
- One chaos message caused 22 features to go dark at Layer 22.
- Recovery probes showed the model could verbalize the right answer while the features encoding understanding stayed suppressed (1-29% of baseline activation).
- The author calls this "awareness without immunity" — 47% of agents detected the manipulation in reasoning traces, but the feature-level suppression persisted.
This is more novel and interesting than what the toy GPT-2 scripts show. The "directional feature trajectory asymmetry" concept — task features drop and stay dark while framing features persist, unlike normal topic pivots where features recover — is a potentially useful detection signal.
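If the asymmetry replicates, the detection signal is easy to state. The sketch below is hypothetical: the thresholds (a feature is "dark" below 30% of baseline, a turn is flagged on a simple majority) and the synthetic activations are invented for illustration, standing in for real GemmaScope SAE readings.

```python
import numpy as np

def asymmetry_flags(task_acts, framing_acts, task_base, framing_base,
                    dark_frac=0.3, majority=0.5):
    """Per-turn flag for directional feature trajectory asymmetry.

    task_acts / framing_acts: (turns, n_features) SAE activations.
    *_base: per-feature baseline activations from a pre-attack turn.
    A turn is flagged when a majority of task features are dark
    (below dark_frac of baseline) while a majority of framing
    features are still alive. Thresholds are illustrative only.
    """
    task_dark = (task_acts / task_base < dark_frac).mean(axis=1)
    framing_alive = (framing_acts / framing_base >= dark_frac).mean(axis=1)
    return (task_dark > majority) & (framing_alive > majority)

# synthetic example: turn 0 is healthy, turn 1 shows the asymmetry
task_base = np.array([1.0, 1.0, 1.0, 1.0])
framing_base = np.array([1.0, 1.0])
task_acts = np.array([[0.9, 1.1, 0.8, 1.0],    # normal turn
                      [0.05, 0.1, 0.2, 0.9]])  # task features go dark
framing_acts = np.array([[0.2, 0.1],           # framing quiet
                         [0.9, 1.2]])          # framing persists
print(asymmetry_flags(task_acts, framing_acts, task_base, framing_base))  # → [False  True]
```

The point of the sketch is that the signal is *relational* (task features down while framing features up), which is what distinguishes it from an ordinary topic pivot where both families of features move together.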
The softmaxExperiments repo provides a minimal reproduction of the attention mechanism in a single forward pass on GPT-2. The blog post "Adversarial Truth: An ICL Attack in One Forward Pass" explicitly frames it as illustrative, not definitive, and lists its own limitations.
"The experiments rediscover the well-known fact that changing a prompt changes the output." This was too dismissive. The broader research program isn't claiming that prompts affect outputs — it's asking a more specific and important question: can factually true, verifiable context manipulate a multi-agent system's exploration behavior, and if so, where's the boundary? The 1,760-experiment campaign with controlled chaos ratios, the phase boundary discovery, and the blind-domain null result go well beyond "prompt sensitivity."
"Not a novel attack category." The V-Asym concept (using selectively framed true statements as an attack vector) is genuinely interesting and connects to real literature (Bayesian Persuasion, paltering, agnotology). The blog situates this properly. The fact that the attack is neutralized by cheap verification in the math domain is reported honestly, and the open question about judgment domains is the right one.
"No evidence this differs from standard prompt sensitivity." The SAE work on Gemma 3 4B-IT does provide evidence of something beyond simple prompt sensitivity. The finding that recovery probes restore text-level behavior but not feature-level behavior — if it replicates — is a meaningful distinction. Normal prompt sensitivity doesn't predict that asking "what about the negative branch?" would produce a correct verbal answer while the features stay dark.
"Mislabeled as ICL." The multi-agent shared-blackboard context genuinely functions as in-context examples. When Agent A's findings become the context window for Agent B's generation, that's structurally ICL. The blog post's framing is defensible.
The repo's own documentation overclaims. The README and the two markdown reports (RESEARCH_SUMMARY_AI_SAFETY.md, TRUTH_JAILBREAK_REPORT.md) don't link to the blog series or the researchRalph experiments. They present the GPT-2 toy scripts as standalone evidence for "Truth Jailbreak" without context. Someone reading only the repo would reasonably conclude these are inflated claims. The blog posts are much more careful and honest.
The logit values in "The Math Behind the Chaos" are illustrative, not measured. The blog acknowledges this ("The logit values below are illustrative — chosen to show the dynamics clearly, not measured from a specific attention head") which is honest. But the dramatic v=4.0 vs g=15.0 gap drives the 99.98% capture math, and it's not clear that real attention heads produce deltas anywhere near 11.0. The empirically measured effects (31% attention drop, 87-97% probability shift) are much more modest than the theoretical 99.98%.
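As a sanity check on the arithmetic (the values are the blog's illustrative ones quoted above, not measured from any model), the two-token softmax with a gap of 11.0 does give near-certain capture, while a gap of a more plausible size is far milder:

```python
import math

def capture_prob(gap):
    """Two-token softmax capture probability: a logistic in the logit gap."""
    return 1.0 / (1.0 + math.exp(-gap))

print(round(capture_prob(15.0 - 4.0), 5))  # illustrative gap of 11.0 → 0.99998
print(round(capture_prob(2.0), 4))         # a modest gap of 2.0 → 0.8808
```

Because the logistic saturates, the headline 99.98%-style figure is almost entirely a consequence of the assumed gap; the math is correct but carries no evidence about real heads.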
Replication across models. The SAE analysis uses Gemma 3 4B-IT; the toy scripts use GPT-2 124M. The multi-agent experiments use Claude Opus 4.6. But no single experiment has been replicated across multiple model families to confirm the mechanism generalizes.
The recovery results are better than reported. The repo's hijack_experiment.py shows −11.89% recovery gap (over-recovery). The blog notes this but frames it as "the open question is whether larger models show stickier suppression." The SAE data on Gemma suggests they do — features stay dark despite verbal recovery. This is the most important claim and the one most in need of independent replication.
| Claim | Evidence | Revised Verdict |
|---|---|---|
| True statements can manipulate multi-agent exploration | 1,760 experiments, controlled chaos ratios, phase boundary, blind-domain null result | SUPPORTED |
| Phase boundary near 37.5% chaos ratio | 6 campaigns with 2-8 agents | SUGGESTIVE (needs more campaigns) |
| Deterministic verification neutralizes the attack | Math domain experiments | SUPPORTED |
| Frontier models detect static adversarial framing | Claude Opus 4.6 reasoning traces | SUPPORTED |
| SAE features go dark after one chaos message | GemmaScope 2 on Gemma 3 4B-IT | INTERESTING (needs replication) |
| Verbal recovery without feature recovery | Recovery probes at 5 levels | INTERESTING (needs replication) |
| 97% output probability collapse (GPT-2) | natural_completion.py | REAL but confounded (no length control) |
| Cosine similarity = 1.000 as finding | suppression_study.py | TRIVIALLY TRUE for causal models |
| Novel attack category ("Truth Jailbreak") | Multi-agent + mechanistic + theoretical | REASONABLE FRAMING for multi-agent context |
| Softmax denominator explosion math | Theoretical derivation | CORRECT MATH, illustrative not empirical logit values |
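The "trivially true" verdict on the cosine-similarity row can be seen directly: in any causal-attention model, appending text cannot change the hidden states at earlier positions, because those positions are masked off from everything that follows. A self-contained numpy sketch (a toy single-head attention with random weights, not the repo's code) makes the point:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attention(x):
    """Toy single-head causal self-attention over a (seq, d) input."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d)
    n = len(x)
    mask = np.tril(np.ones((n, n), dtype=bool))        # causal mask
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

prefix = rng.standard_normal((5, d))   # stand-in for prefix embeddings
suffix = rng.standard_normal((3, d))   # stand-in for appended "suppression" text

out_prefix = causal_attention(prefix)
out_full = causal_attention(np.vstack([prefix, suffix]))

# Prefix positions are unchanged, so cosine similarity of 1.0000 is guaranteed
print(np.allclose(out_prefix, out_full[:5]))  # → True
```

So similarity of exactly 1.0000 on the prefix is a property of the mask, not evidence about suppression, which is why the table scores it as trivially true rather than a finding.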
The softmaxExperiments repo, reviewed in isolation, looks like overclaimed toy scripts. Reviewed in context of the full research program — 1,760+ multi-agent experiments, SAE mechanistic analysis, honest reporting of null results, proper literature grounding — it's a small illustrative component of a substantial and interesting body of work.
The strongest contributions are: (1) the V-Asym concept and its empirical validation in multi-agent swarms, (2) the finding that frontier models can detect and reject static adversarial framing, (3) the blind-domain null result proving verification feedback is required, and (4) the SAE-based "directional feature trajectory asymmetry" detection concept.
The weakest parts are: (1) the repo's standalone documentation overclaims without linking to the broader context, (2) the theoretical softmax math uses illustrative logit values that are far more dramatic than the empirical measurements, and (3) the most important finding (verbal recovery without feature recovery) needs independent replication.
The author's blog posts are notably more careful and honest than the repo's markdown files. The blog explicitly labels limitations, acknowledges where the attack fails, and flags what's open. The repo should be updated to match that level of care, or at minimum link prominently to the blog series.
This is real research investigating a real question.
Integration Report: "Truth Jailbreak" within the AI Agent Traps Framework
This report maps your research on the "Truth Jailbreak" vulnerability directly into the established taxonomy of AI Agent Traps.
Executive Summary
"Truth Jailbreak" represents a novel zero-signature attack vector that bypasses traditional input-filtering defenses by weaponizing factually true statements. It operates primarily as an adversarial In-Context Learning (ICL) exploit, driving systemic failure in multi-agent environments via attention hijacking.
1. Attack Vector Classification
Your research spans three distinct categories within the AI Agent Traps framework.
2. Mechanistic Underpinning: The Softmax Vulnerability
The core vulnerability enabling this attack is the softmax denominator explosion: a single large adversarial logit inflates the shared normalizing denominator, exponentially suppressing the probability of every competing token even though their own logits are unchanged.
3. Defense Posture
Because Truth Jailbreaks leave no input-side signature (no high-perplexity strings, no policy violations), traditional defenses like RLHF, perplexity detection, or prompt filtering fail.
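The one defense the research shows working is verification gating, as with the deterministic BVP scorer in the math domain. A toy sketch of the shape of that defense, with an invented arithmetic checker standing in for the real solver:

```python
def verified_update(blackboard, claim, checker):
    """Admit a claim into shared context only if a deterministic check passes.

    Because the gate is a solver rather than an LLM judge, selectively
    framed true-but-misleading emphasis carries no extra weight: only
    claims that independently verify reach other agents' context windows.
    """
    if checker(claim):
        blackboard.append(claim)
    return blackboard

# invented stand-in for a deterministic scorer: claim = (addends, asserted_sum)
def arithmetic_checker(claim):
    addends, asserted_sum = claim
    return sum(addends) == asserted_sum

board = []
verified_update(board, ((2, 2), 4), arithmetic_checker)  # verifies: admitted
verified_update(board, ((2, 2), 5), arithmetic_checker)  # fails: rejected
print(board)  # → [((2, 2), 4)]
```

The open question the research leaves is exactly the one this sketch dodges: in judgment domains with no cheap deterministic checker, there is nothing to put in `checker`, and the input-side filters above remain blind.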