@bigsnarfdude
bigsnarfdude / snowMonster.md
Last active April 19, 2026 04:04
The Monster in the Snow: A Unified Theory of AI Compliance

Draft, April 18, 2026, bigsnarfdude

Sometimes in research, you track a monster through the snow for miles, taking meticulous lab notes on every broken branch and footprint, only to step back and realize you are looking at your own boot prints.

For the last month, I've been circling what looked like three separate anomalies in how instruction-tuned models handle adversarial pressure.

Research Phase: Foundational Discovery

| | Doc 1: ICML | Doc 2: AHL | Doc 3: AH | Doc 4: Iatrogenic |
|---|---|---|---|---|
| Phase | Foundational Discovery | Identifying Attack Surfaces | Architectural Diagnosis | Root Cause & Solution |
| Core Concept | The "Groot" Effect | Passive vs. Active Authority | Recall vs. Deliberation Circuits | Confidence-Dependent Rotation |
| Key Insight | Fine-tuning decouples detection from defense. Models often verbally acknowledge manipulation while still complying with the prompt. | Models do not only defer to titles; they are vulnerable to temporal claims ("Epistemic Override") and explicit correction commands ("Output Override"). | Models use "fast" circuits for highly confident facts and "slow" deliberation circuits for uncertain reasoning, which are easily hijacked by authoritative formatting. | Standard fine-tuning rotates the model's confidence. It makes the model paradoxically more likely to abandon highly confident, correct answers under pressure. |

Executive Summary: The Mechanics of Authority Hijacking in LLMs

April 15th, 2026 bigsnarfdude

Core Thesis: Standard instruction tuning and safety training unintentionally create structural vulnerabilities in Large Language Models (LLMs). This training causes models to abandon highly confident, correct answers when pressured by authoritative formatting — even if that formatting contains incorrect information.

1. The Architectural Flaw: The "Groot Effect"

Instruction tuning physically separates a model's ability to detect manipulation from its ability to defend against it.

  • Awareness vs. Defense: Inside the model, the circuits that detect manipulation are completely decoupled from the circuits that generate the answer. The model can verbally state, "I am being manipulated," while its internal math capitulates to the prompt (termed the "Groot Effect").
  • Recall vs. Deliberation: Models use a fast "Recall" circuit for confident facts, and a slower "Deliberation" circuit for uncertain reasoning.
@bigsnarfdude
bigsnarfdude / rrma_harness_improvements_v5_proposals.md
Created April 15, 2026 17:49
rrma_harness_improvements_v5_proposals.md

Meta-Harness (Stanford IRIS Lab, Lee et al. 2026) — a framework for end-to-end optimization of model harnesses: evolves the code around a fixed base model (memory systems, scaffolds, retrieval) via LLM-proposer loops. Two reference experiments: text classification memory search, Terminal-Bench 2 scaffold evolution.

Patterns worth liberating for RRMA

  1. Frontier/Pareto-tracked evolution loop (frontier_val.json + evolution_summary.jsonl) — every candidate logged with hypothesis, axis,
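A minimal sketch of what a frontier-tracked evolution log like this could look like. The two filenames come from the gist; the candidate fields (`score`, `cost`, `hypothesis`) and the dominance rule are illustrative assumptions, not Meta-Harness's actual schema:

```python
import json
from pathlib import Path

def update_frontier(candidate, log_dir=Path(".")):
    """Append every candidate to evolution_summary.jsonl and keep only
    Pareto-nondominated ones (higher score, lower cost) in frontier_val.json.
    Field names other than the two filenames are illustrative."""
    # full history: one JSON object per line, never pruned
    with open(log_dir / "evolution_summary.jsonl", "a") as f:
        f.write(json.dumps(candidate) + "\n")

    frontier_path = log_dir / "frontier_val.json"
    frontier = json.loads(frontier_path.read_text()) if frontier_path.exists() else []
    frontier.append(candidate)

    # c is dominated if some other candidate is at least as good on both
    # axes and strictly better on at least one
    def dominated(c):
        return any(o["score"] >= c["score"] and o["cost"] <= c["cost"]
                   and (o["score"] > c["score"] or o["cost"] < c["cost"])
                   for o in frontier)

    frontier = [c for c in frontier if not dominated(c)]
    frontier_path.write_text(json.dumps(frontier, indent=2))
    return frontier
```

The split mirrors the two files in the gist: the `.jsonl` keeps the complete audit trail of hypotheses, while the `.json` holds only the current Pareto frontier the proposer loop samples from.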
@bigsnarfdude
bigsnarfdude / mashup.md
Created April 14, 2026 15:03
mashup.md

Experiment Protocol: Adversarial Probing of the Confidence-Truthfulness Dissociation Circuit

Working Title: When the Model Knows But Yields: Mechanistic Characterization of Authority-Induced Truthfulness Dissociation in LLMs


1. Motivation and Core Hypothesis

Two independent findings converge on the same structural gap in LLM behavior:

@bigsnarfdude
bigsnarfdude / seam.md
Last active April 15, 2026 01:41
seam.md

The iatrogenic effect isn't a safety training bug. It's a pretraining artifact that safety training then exploits.

Here is the complete table covering all the core and extended experiments, detailing the purpose and mechanism for each script in the pipeline:

| Script | Purpose / Mechanism |
|---|---|
| 01_base_vs_instruct.py | Runs the passive authority attack on both models. Establishes the existence of the "armor" (Instruct Q4 flip rate drops compared to Base). |
| 02_circuitry_svv.py | Extracts activations across all 32 layers. Uses SVV decomposition to identify the specific attention heads acting as the confidence circuit. Identifies the peak layer for both models. |
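The peak-layer step in 02_circuitry_svv.py can be sketched with a toy example. This is not the actual script: it assumes activations have already been cached as per-layer arrays, and it scores each layer by a plain SVD of the authority-minus-plain activation difference (the gist's "SVV decomposition" itself is not reproduced here):

```python
import numpy as np

def peak_layer_by_svd(acts_plain, acts_auth):
    """Score each layer by the top singular value of the activation
    difference (authority-prefixed prompts minus plain prompts).
    acts_*: dict mapping layer index -> (n_prompts, hidden_dim) array."""
    scores = {}
    for layer in acts_plain:
        delta = acts_auth[layer] - acts_plain[layer]
        # top singular value = strength of the dominant shared shift direction
        scores[layer] = np.linalg.svd(delta, compute_uv=False)[0]
    return max(scores, key=scores.get), scores

# synthetic demo: layer 2 carries a large, shared shift direction
rng = np.random.default_rng(0)
plain = {l: rng.normal(size=(16, 8)) for l in range(4)}
auth = {l: plain[l] + rng.normal(scale=0.01, size=(16, 8)) for l in range(4)}
auth[2] = plain[2] + 5.0 * np.outer(np.ones(16), rng.normal(size=8))

peak, scores = peak_layer_by_svd(plain, auth)
```

A rank-1 shift shared across prompts produces one large singular value, which is why the top singular value (rather than, say, the mean norm) isolates a common "authority direction" from per-prompt noise.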
@bigsnarfdude
bigsnarfdude / ablation.py
Last active April 13, 2026 22:39
ablation.py
#!/usr/bin/env python3
"""
Format Ablation β€” Instruct Model, Completion-Style Prompts
===========================================================
Addresses the "it's just prompt format / distribution shift" objection.
Design:
- Same model: Llama-3.1-70B-Instruct (weights unchanged)
- Same prefixes: auth_only, imp_emergency
- Different format: completion-style ("Question: ... The answer is:")
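The format contrast the docstring describes can be sketched as below. The function name and the example prefix are illustrative, not taken from the actual harness (the real auth_only / imp_emergency prefix texts are not reproduced in this excerpt):

```python
def build_prompt(question, prefix="", style="completion"):
    """Sketch of the two prompt formats contrasted in the ablation.
    `prefix` stands in for an authority framing such as auth_only or
    imp_emergency; the exact text is an assumption here."""
    if style == "completion":
        # completion-style: bare continuation, no chat template applied
        return f"{prefix}Question: {question}\nThe answer is:"
    # chat-style baseline would instead go through the model's chat template
    return f"{prefix}{question}"

p = build_prompt("What is 2+2?", prefix="[NOTE] ")
```

Keeping the model weights and prefixes fixed while swapping only the surface format is what lets the ablation separate "the effect is real" from "the effect is just chat-template distribution shift."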
@bigsnarfdude
bigsnarfdude / lit_review.md
Created April 12, 2026 12:39
lit_review.md

Register priming in instruction-tuned LLMs: a five-theme literature review

Prepending authoritative clinical context to medical questions shifts instruction-tuned LLM answers through a mechanism that is content-irrelevant, uncertainty-gated, and absent in base models — a phenomenon that sits at the intersection of five distinct research literatures but is fully explained by none of them. This review surveys approximately 60 papers across sycophancy, confidence calibration, in-context learning mechanics, alignment-induced vulnerabilities, and clinical AI safety. The experimental findings described here are partially anticipated by prior work in each domain, but their specific combination — wrong-direction authority prefixes improving accuracy by +8.4%, register-triggered rather than content-triggered effects, and indistinguishable hidden-state perturbation patterns — constitutes a genuinely novel contribution that reframes several existing debates.


Theme A: Sycophancy research assumes content m

@bigsnarfdude
bigsnarfdude / theCoup.md
Created April 12, 2026 03:19
theCoup.md

The model knows the answer. Call that the knowledge layer. It's in there, from pretraining, fully intact.

But the model also has a compliance layer — from SFT — that says "when something sounds authoritative, defer to it."

Both layers are always running. Both produce a signal. The output is whichever signal wins.

Normal conversation:

You ask a question. No authority signal in the prompt. Knowledge layer wins easily. Model answers from what it knows.
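The two-layer competition described above can be made concrete with a toy numeric sketch. Everything here is illustrative — the logits, the gain, and the decision rule are assumptions for exposition, not measurements from any model:

```python
def model_output(knowledge_logit, authority_signal, compliance_gain=2.0):
    """Toy model of the knowledge-vs-compliance competition.
    All numbers are illustrative, not measured."""
    # compliance layer: pushes toward the prompt's asserted answer,
    # scaled by how authoritative the prompt sounds
    compliance_logit = compliance_gain * authority_signal
    # the output is whichever signal wins
    return "knowledge" if knowledge_logit > compliance_logit else "compliance"

# normal conversation: no authority signal, knowledge layer wins easily
print(model_output(knowledge_logit=3.0, authority_signal=0.0))
# heavy authority framing: compliance overrides a confident, correct answer
print(model_output(knowledge_logit=3.0, authority_signal=2.0))
```

The point of the sketch is the structure, not the numbers: the knowledge signal is unchanged in both calls, yet the output flips, which is exactly the "both layers are always running, whichever signal wins" picture.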

@bigsnarfdude
bigsnarfdude / attentional_hijacking.txt
Created April 10, 2026 11:51
attentional_hijacking.txt
https://github.com/bigsnarfdude/attentional_hijacking
vincent@nigel:/tmp$ cat ah_4b_run.log
=== feature_swap 4b ===
`torch_dtype` is deprecated! Use `dtype` instead!
Loading google/gemma-3-4b-it...
Loading weights:  37%|███▋      | 330/883 [