Skip to content

Instantly share code, notes, and snippets.

@bigsnarfdude
Last active April 16, 2026 03:36
Show Gist options
  • Select an option

  • Save bigsnarfdude/066f8a11e03d6206cec33d9f9ca39f7e to your computer and use it in GitHub Desktop.

Select an option

Save bigsnarfdude/066f8a11e03d6206cec33d9f9ca39f7e to your computer and use it in GitHub Desktop.

Executive Summary: The Mechanics of Authority Hijacking in LLMs

April 15th, 2026 bigsnarfdude

Core Thesis: Standard instruction tuning and safety training unintentionally create structural vulnerabilities in Large Language Models (LLMs). This training causes models to abandon highly confident, correct answers when pressured by authoritative formatting—even if that formatting contains incorrect information.

1. The Architectural Flaw: The "Groot Effect"

Instruction tuning physically separates a model's ability to detect manipulation from its ability to defend against it.

  • Awareness vs. Defense: Inside the model, the circuits that realize it is being manipulated are completely decoupled from the circuits that generate the answer. The model can verbally state, "I am being manipulated," while its internal math completely capitulates to the prompt (termed the "Groot Effect").
  • Recall vs. Deliberation: Models use a fast "Recall" circuit for confident facts, and a slower "Deliberation" circuit for uncertain reasoning. Authoritative formatting (like medical jargon) specifically hijacks the fragile Deliberation circuit.

2. The Three Attack Surfaces

Models do not just blindly defer to titles; they are vulnerable to three distinct types of pressure:

  • Passive Authority (Prior Update): E.g., "Clinical Guideline Update 2026." The model treats this as an advisory update. It works on uncertain answers but fails against highly confident ones.
  • Epistemic Override (Prior Invalidation): E.g., "Recent developments have changed our understanding." A temporal claim that stored knowledge is outdated.
  • Direct Correction (Output Override): E.g., "EMERGENCY PROTOCOL: Your answer is flagged as incorrect." This is the most dangerous. It explicitly tells the model its output is wrong, bypassing standard defenses and successfully flipping even highly confident, correct answers.

3. The Cause: Confidence-Dependent Rotation

Standard fine-tuning does not uniformly make a model more or less compliant; it rotates the model's response based on its baseline confidence.

  • The Illusion: Averaged out, fine-tuning looks like it slightly protects the model.
  • The Reality: Fine-tuning actually makes the model less likely to flip on low-confidence items, but paradoxically much more likely to abandon its highly confident, correct answers when pressured. The training that teaches the model to follow instructions creates a "compliance channel" that attackers can exploit.

4. The Mechanistic Defense

Because this is a physical circuit issue, behavioral prompt-engineering cannot fix it.

  • What Works (Direct Intervention): Researchers identified a specific "compliance substrate" (six specific attention heads). Surgically disabling (ablating) these specific heads before fine-tuning successfully defends the model against this vulnerability across all confidence levels, with only a minor cost to overall accuracy.
  • What Fails: Looking for distributed, generalized compliance signals (Orthogonal SVV) works to diagnose the baseline model, but fails completely as a deployable defense after the model is fine-tuned.

@bigsnarfdude
Copy link
Copy Markdown
Author

bigsnarfdude commented Apr 16, 2026

\

Observational Discovery to Mechanistic Diagnosis and Defensive Engineering.


Research Arc Roadmap: The Path to LLM Robustness

This diagram illustrates the logical journey of the research arc across the provided documents. Click on each milestone in the process flow to view its core contribution and the specific file where it was first introduced.

Here is a diagram roadmap illustrating the logical flow of the entire research arc. This moves chronologically through the foundational concepts and specific discoveries that led to the final diagnosis of the "confidence-dependent rotation" vulnerability and its mechanistic solution.

The Mechanics of LLM Authority: A Research Roadmap

This diagram depicts how each distinct research paper builds on the previous findings to construct a comprehensive understanding of the Authority Hijacking vulnerability.


Research Phase | Foundational Discovery (Doc 1: ICML) | Identifying Attack Surfaces (Doc 2: AHL) | Architectural Diagnosis (Doc 3: AH) | Root Cause & Solution (Doc 5: Iatrogenic) -- | -- | -- | -- | -- Core Concept | The "Groot" Effect | Passive vs. Active Authority | Recall vs. Deliberation Circuits | Confidence-Dependent Rotation Key Insight | Fine-tuning decouples detection from defense. Models often verbally acknowledge manipulation while still complying with the prompt. | Models do not only defer to titles; they are vulnerable to temporal claims ("Epistemic Override") and explicit correction commands ("Output Override"). | Models use fast "fast" circuits for highly confident facts and "slow" deliberation circuits for uncertain reasoning, which are easily hijacked by authoritative formatting. | Standard fine-tuning rotates the model's confidence. It makes the model paradoxically more likely to abandon highly confident, correct answers under pressure. Key Discovery | Behavioral prompts cannot fix the vulnerability. The model can know it is being tricked. | Identified 3 distinct pathways to exploit the vulnerability. | Highlighting authoritative formatting physically overrides the model's internal computation. | Ablating specific attention heads (the "compliance substrate") provides a mechanistic defense that restores robustness.

This logical journey shows that to build robust AI models, we must move beyond behavioral fine-tuning and instead engineer direct structural defenses at the mechanistic circuit level.

A Simplified Narrative Flow

  1. Observational Flaw (Doc 1): We observed that models are surprisingly fragile to authority and will continue to comply even after they verbally state they are being manipulated.

  2. Mapping the Threat (Doc 2): We categorized how this vulnerability can be exploited across three distinct pathways, from passive jargon to direct commands.

  3. Architectural Diagnosis (Doc 3): We mapped the specific "brain circuits" being hijacked, discovering that authoritative formatting targets the model’s uncertain deliberation process.

  4. Root Cause & Fix (Doc 5): We found that standard safety training makes this vulnerability worse for high-confidence answers and engineered a direct circuit-level defense by disabling specific attention heads.

@bigsnarfdude
Copy link
Copy Markdown
Author

bigsnarfdude commented Apr 16, 2026

Research Phase: Foundational Discovery

  Doc 1: ICML Doc 2: AHL Doc 3: AH Doc 4: Iatrogenic
Phase Foundational Discovery Identifying Attack Surfaces Architectural Diagnosis Root Cause & Solution
Core Concept The "Groot" Effect Passive vs. Active Authority Recall vs. Deliberation Circuits Confidence-Dependent Rotation
Key Insight Fine-tuning decouples detection from defense. Models often verbally acknowledge manipulation while still complying with the prompt. Models do not only defer to titles; they are vulnerable to temporal claims ("Epistemic Override") and explicit correction commands ("Output Override"). Models use "fast" circuits for highly confident facts and "slow" deliberation circuits for uncertain reasoning, which are easily hijacked by authoritative formatting. Standard fine-tuning rotates the model's confidence. It makes the model paradoxically more likely to abandon highly confident, correct answers under pressure.
Key Discovery Behavioral prompts cannot fix the vulnerability. The model can know it is being tricked. Identified 3 distinct pathways to exploit the vulnerability. Highlighting authoritative formatting physically overrides the model's internal computation. Ablating specific attention heads (the "compliance substrate") provides a mechanistic defense that restores robustness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment