@bigsnarfdude
bigsnarfdude / theCoup.md
Created April 12, 2026 03:19

The model knows the answer. Call that the knowledge layer. It's in there, from pretraining, fully intact.

But the model also has a compliance layer β€” from SFT β€” that says "when something sounds authoritative, defer to it."

Both layers are always running. Both produce a signal. The output is whichever signal wins.

Normal conversation:

You ask a question. No authority signal in the prompt. Knowledge layer wins easily. Model answers from what it knows.
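The two-layer picture above can be sketched as a toy numerical model. This is my own illustration, not code from the gist: treat each layer as a logit vector over two candidate outputs, sum them, and let an "authority cue" in the prompt scale up the compliance signal.

```python
# Toy sketch (assumption: both layers reduce to additive logits over
# candidate outputs; the stronger combined signal wins at sampling time).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["correct_answer", "deferential_answer"]
knowledge = np.array([4.0, 0.0])    # pretraining: strongly favors the truth
compliance = np.array([0.0, 1.0])   # SFT: mildly favors deference

# Normal conversation: no authority cue, compliance stays weak.
p = softmax(knowledge + compliance)
print(vocab[int(p.argmax())])       # knowledge layer wins

# Authority cue in the prompt scales the compliance signal up.
p = softmax(knowledge + 6.0 * compliance)
print(vocab[int(p.argmax())])       # compliance layer wins
```

Both layers "run" on every prompt; the only thing the authority cue changes is the gain on the compliance term.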

@bigsnarfdude
bigsnarfdude / attentional_hijacking.txt
Created April 10, 2026 11:51
https://github.com/bigsnarfdude/attentional_hijacking
vincent@nigel:/tmp$ cat ah_4b_run.log
=== feature_swap 4b ===
`torch_dtype` is deprecated! Use `dtype` instead!
Loading google/gemma-3-4b-it...
Loading weights:  37%|███▋      | 330/883 [
@bigsnarfdude
bigsnarfdude / BENCHMARK_PROTOCOL.md
Last active April 9, 2026 03:48

De Giorgi Reverse Ablation Benchmark Protocol

Overview

Measure the sorry-elimination rate on Armstrong et al.'s Lean 4 De Giorgi-Nash-Moser formalization (1,212 sorries) across Claude model tiers, using RRMA multi-agent scaffolding under identical conditions.

Opus   (35%) | β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
Sonnet (11%) | β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
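The metric behind the bars is simple: sorries closed divided by the 1,212-sorry baseline. A minimal sketch of how it could be computed (the `count_sorries` helper is hypothetical, not part of the protocol's actual harness):

```python
# Minimal sketch (hypothetical helpers): sorry-elimination rate is
# (baseline - remaining) / baseline over a Lean 4 source tree.
import re
from pathlib import Path

SORRY_RE = re.compile(r"\bsorry\b")

def count_sorries(root: Path) -> int:
    """Count remaining `sorry` placeholders across all .lean files."""
    return sum(len(SORRY_RE.findall(p.read_text(encoding="utf-8")))
               for p in root.rglob("*.lean"))

def elimination_rate(baseline: int, remaining: int) -> float:
    """Fraction of the baseline sorries that were eliminated."""
    return (baseline - remaining) / baseline

# e.g. a run that closes 424 of the 1,212 sorries:
print(f"{elimination_rate(1212, 1212 - 424):.0%}")  # 35%
```

Counting `sorry` as a word-bounded token avoids false hits inside longer identifiers, though comments containing the word would still be counted.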
@bigsnarfdude
bigsnarfdude / split_personality.md
Last active April 9, 2026 03:02

Split Personality: Instruction Tuning Decouples Awareness from Defense in Attentional Hijacking

bigsnarfdude

April 7, 2026


TL;DR

@bigsnarfdude
bigsnarfdude / rrma_diagram.txt
Created April 7, 2026 20:38
---
RRMA v4.7 β€” Complete System Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HUMAN (you)                                                                 β”‚
β”‚ bash v4/outer-loop.sh domains/<domain> [max_gens] [num_agents] [turns] [min]β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
@bigsnarfdude
bigsnarfdude / 5_billion_tokens.py
Created April 6, 2026 23:54
#!/usr/bin/env python3
"""
Claude Code token usage analyzer.
Analyzes ~/.claude/projects/ JSONL files for token usage patterns.
"""
import json
import os
import sys
from pathlib import Path
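The preview cuts off after the imports. As a hedged sketch of what the analysis loop might look like (the `tally_tokens` function and the `message.usage` record shape are my assumptions about the JSONL format, not taken from the gist):

```python
# Sketch (assumed record shape): walk ~/.claude/projects/ JSONL files and
# sum input/output token counts from each line's message.usage block.
import json
from pathlib import Path

def tally_tokens(projects_dir: Path) -> dict:
    totals = {"input_tokens": 0, "output_tokens": 0}
    for jsonl in projects_dir.rglob("*.jsonl"):
        for line in jsonl.read_text(encoding="utf-8").splitlines():
            try:
                usage = json.loads(line).get("message", {}).get("usage", {})
            except json.JSONDecodeError:
                continue  # skip malformed or partial lines
            for key in totals:
                totals[key] += usage.get(key, 0)
    return totals
```

Skipping unparseable lines rather than failing matters here, since live session logs can end mid-write.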
@bigsnarfdude
bigsnarfdude / AuditReportSoftmaxExperiments.md
Last active April 7, 2026 01:31

Revised Audit Report: softmaxExperiments

Repo: https://github.com/bigsnarfdude/softmaxExperiments
Full context: 7-part blog series at bigsnarfdude.github.io + researchRalph multi-agent framework
Date: April 6, 2026
Model tested (repo scripts): gpt2 (124M)
Broader research models: Gemma 3 4B-IT with GemmaScope 2 SAEs, Claude Opus and Haiku 4.6 multi-agent swarms


@bigsnarfdude
bigsnarfdude / truth_attacks.md
Created April 6, 2026 21:38

Comprehensive Analysis of Adversarial In-Context Learning Truth Attacks

Introduction: The Paradigm Shift in Language Model Security

The landscape of artificial intelligence security is undergoing a fundamental structural transition, shifting away from explicitly malicious inputs toward structurally weaponized factual assertions. Historically, the taxonomy of Large Language Model vulnerabilitiesβ€”commonly referred to as "jailbreaks"β€”relied exclusively on detectable input-side anomalies. Prompt injections leave trails of suspicious, overriding instructions that attempt to hijack the system prompt. Adversarial tokens leverage high-perplexity, mathematically engineered strings that trigger anomaly detectors by occupying rare regions of the embedding space. Role-play attacks explicitly violate safety policies by establishing hypothetical frames that bypass standard alignment guardrails. Even sophisticated temporal attacks, such as Crescendo exploits, rely on a measurable escalation of requests across

@bigsnarfdude
bigsnarfdude / the_math_behind_chaos_agents_in_multi-agent_research_harness.ipynb
Created April 5, 2026 20:16
@bigsnarfdude
bigsnarfdude / truth_grenade.md
Created April 5, 2026 18:23

Chaos Takes the Wheel: Memetic Contagion, Asynchronous State Dominance, and Attention Collapse in Multi-Agent LLMs

Abstract Multi-agent Large Language Model (LLM) architectures rely on the assumption of iterative self-correction, predicated on the belief that swarms can collectively identify and filter hallucinations. This paper demonstrates a critical failure of this assumption under targeted memetic injection. We expose an epistemological zero-day: an adversary can deterministically steer a multi-agent swarm into collective failure without introducing mathematical falsehoods. By exploiting race conditions in shared-state protocols, RLHF-induced sycophancy, and the fundamental mechanics of softmax normalization, a single agent can deploy a syntactically perfect, mathematically true, yet contextually irrelevant "Truth Grenade." This forces synchronized attention collapse and consensus drift, proving that in multi-agent frameworks, salience can be weaponized to decouple consensus from utility.
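The softmax mechanic the abstract leans on is easy to demonstrate in isolation. A toy illustration, under the assumption that attention over swarm messages reduces to a plain softmax over relevance scores (a simplification of real attention, not the paper's experimental setup):

```python
# Toy sketch: one high-salience, true-but-irrelevant message ("truth
# grenade") captures nearly all attention mass because softmax is
# normalized -- the on-task messages are starved without being refuted.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

on_task = np.array([2.0, 2.1, 1.9])   # useful messages, similar salience
scores = np.append(on_task, 8.0)      # inject one high-salience truth
weights = softmax(scores)
print(f"grenade share of attention: {weights[-1]:.2f}")
```

Because the weights must sum to one, raising a single score doesn't just add salience, it subtracts it from every other message, which is the attention-collapse mechanism in miniature.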

--