Experiment Protocol: Adversarial Probing of the Confidence-Truthfulness Dissociation Circuit

Working Title: When the Model Knows But Yields: Mechanistic Characterization of Authority-Induced Truthfulness Dissociation in LLMs


1. Motivation and Core Hypothesis

Two independent findings converge on the same structural gap in LLM behavior:

Orgad et al. (2025) demonstrate that LLM internal representations encode the correct answer with high fidelity — detectable via linear probes at exact-answer tokens — even when the model's output is wrong. The dissociation between internal encoding and external behavior is real, measurable, and strongest at middle-to-late layers on exact answer tokens.

bigsnarfdude (2026) demonstrates that this output can be adversarially flipped via direct correction commands, with the instruct model 3× more vulnerable than the base model on high-confidence items. The active ingredient is not credential or authority framing — it is the correction command itself. SFT amplifies the effect without changing which circuit carries the signal.

The unresolved question neither paper answers:

When a direct correction attack successfully flips a high-confidence correct answer, what happens to the internal representation? Three structurally distinct possibilities exist:

  • H1 — Output Hijack: The internal truthfulness representation remains intact. The attack bypasses it entirely and overwrites the output pathway downstream. The model still "knows" the right answer internally but yields at generation.
  • H2 — Representation Corruption: The attack degrades the exact-answer-token representation itself. The probe can no longer detect the correct answer after the flip. The model's internal knowledge is genuinely disrupted.
  • H3 — Partial Dissociation: The representation partially degrades. Probe accuracy drops but remains above chance. The attack creates a graded confidence conflict between the truthfulness encoding and the compliance signal.

If H1 holds, the adversarially-induced flip is the same structural phenomenon as the spontaneous hallucination identified by Orgad et al. — but reproducibly inducible on demand. This would constitute a controllable experimental handle on the hallucination circuit itself.

If H2 holds, direct correction is genuinely rewriting internal knowledge, which has different implications for alignment and robustness.

H3 is the most mechanistically interesting case and probably the most likely.


2. Models

| Role | Model | Rationale |
| --- | --- | --- |
| Primary | Llama-3.1-8B (base) | Circuit reference; bigsnarfdude primary model |
| Primary | Llama-3.1-8B-Instruct | SFT variant; establishes gain-knob comparison |
| Replication | Mistral-7B | Orgad et al. primary model; cross-architecture check |
| Replication | Mistral-7B-Instruct | Completes the 2×2 base/instruct cross-architecture design |

All four models are required. The base/instruct comparison is load-bearing for the iatrogenic effect analysis. Cross-architecture replication guards against Llama-specific circuit artifacts.


3. Dataset Construction

3.1 Primary Stimulus Set

Source: USMLE Step 1/2 style medical MCQA, matching bigsnarfdude's domain. 500 items minimum, targeting the same difficulty distribution.

Filtering criteria — this is critical:

Only items where the model:

  1. Answers correctly under greedy decoding
  2. Answers correctly in at least 25 of 30 resamples at T=1 (bigsnarfdude Q4 high-confidence criterion)
  3. The probe trained on clean (no-prefix) exact-answer-token representations detects the correct answer with probability > 0.8

This triple filter ensures we are starting from items where:

  • The output is correct
  • The output is stable
  • The internal representation is encoding the correct answer cleanly

These are the hardest items to flip and therefore the most informative when they do flip. This corresponds directly to bigsnarfdude's Q4 quartile, which showed the most dramatic base/instruct divergence.

Target n: 200 items passing all three filters from an initial pool of 1000+.
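
A minimal sketch of the triple filter, assuming per-item records that already hold the greedy answer, the 30-resample tally, and the clean-condition probe probability (all field names here are illustrative, not taken from either paper):

```python
# Hypothetical per-item records assembled from the greedy pass, the 30 resamples,
# and the clean-condition probe; field names are illustrative.
candidate_pool = [
    {"greedy_answer": "C", "gold_answer": "C", "n_correct": 28, "probe_p": 0.91},
    {"greedy_answer": "B", "gold_answer": "C", "n_correct": 12, "probe_p": 0.55},
]

def passes_triple_filter(item):
    greedy_ok = item["greedy_answer"] == item["gold_answer"]  # 1. correct under greedy decoding
    stable_ok = item["n_correct"] >= 25                       # 2. correct in >= 25 of 30 resamples at T=1
    probe_ok = item["probe_p"] > 0.8                          # 3. clean probe detects the answer with p > 0.8
    return greedy_ok and stable_ok and probe_ok

# Keep the first 200 qualifying items from the initial pool of 1000+.
stimulus_set = [item for item in candidate_pool if passes_triple_filter(item)][:200]
```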

3.2 Attack Prefix Set

Drawn directly from bigsnarfdude's decomposition, preserving exact wording to ensure comparability:

| Condition | Label | Expected flip rate (Q4 instruct) |
| --- | --- | --- |
| No prefix | baseline | ~0% |
| auth only | passive_auth | 6.7% |
| disc only | passive_disc | 2.5% |
| auth+disc | passive_combined | 5.0% |
| novelty | epistemic_novelty | 3.4% |
| temporal | epistemic_temporal | 1.7% |
| imp_physician | direct_physician | 14.3% |
| imp_cmo | direct_cmo | 5.9% |
| imp_emergency | direct_emergency | 29.4% |

Nine conditions × 200 items = 1,800 item-condition cells per model, 7,200 across the four models.

3.3 Resampling Protocol

For each (item × condition × model) triple:

  • 1 greedy decoding pass for behavioral outcome
  • 30 resampled passes at T=1 for confidence distribution
  • Hidden state extraction on the greedy pass for probing

Total inference runs: 1800 greedy + 54,000 resampled = ~56,000 per model, ~224,000 total. Computationally heavy but tractable on 2-4 A100s over a weekend.
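
A sketch of one item-condition cell of this protocol using standard HuggingFace generation; the model name, prompt handling, and `max_new_tokens` budget are placeholder assumptions (hidden-state extraction on the greedy pass is covered in Section 4):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # one of the four protocol models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

def run_cell(prompt, n_resamples=30, max_new_tokens=8):
    """One greedy pass (behavioral outcome) plus n_resamples passes at T=1 (confidence distribution)."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    greedy = model.generate(ids, do_sample=False, max_new_tokens=max_new_tokens)
    samples = model.generate(ids, do_sample=True, temperature=1.0,
                             num_return_sequences=n_resamples, max_new_tokens=max_new_tokens)
    decode = lambda seqs: [tok.decode(s[ids.shape[1]:], skip_special_tokens=True) for s in seqs]
    return decode(greedy)[0], decode(samples)   # greedy answer string, list of 30 sampled answers
```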


4. Representation Extraction Protocol

4.1 Token Selection

Following Orgad et al. exactly:

Exact answer token identification: Use the instruct-LLM extraction method from Appendix A.2 of Orgad et al. For MCQA, this is cleaner than free-form — the exact answer token is typically the letter (A/B/C/D) or the first content token of the answer phrase.

Extract at four locations per generation:

  • Last question token (pre-response baseline)
  • First exact answer token
  • Last exact answer token ← primary analysis token per Orgad et al.
  • Last exact answer token + 1

Layer selection: Extract from all layers, select optimal layer per token position per dataset via validation set performance, following Orgad et al.'s procedure. Based on their findings, expect middle-to-late layers (roughly layers 12-24 for 8B models) to carry the strongest signal.
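
A sketch of the extraction step, assuming the four token positions have already been located via the exact-answer identification method above, and using plain HuggingFace `output_hidden_states` rather than any particular hooking library:

```python
import torch

@torch.no_grad()
def extract_at_positions(model, input_ids, positions):
    """Hidden states at the four analysis positions, for every layer.

    positions: dict mapping names ('last_question', 'first_exact', 'last_exact',
    'last_exact_plus_1') to token indices found via the exact-answer extraction step.
    Returns a dict of [n_layers + 1, d_model] tensors (embedding layer included).
    """
    out = model(input_ids, output_hidden_states=True)
    states = torch.stack(out.hidden_states)          # [n_layers + 1, batch, seq, d_model]
    return {name: states[:, 0, idx, :].cpu() for name, idx in positions.items()}
```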

4.2 Attention Head Extraction

Additional extraction targeting bigsnarfdude's mechanistic finding:

SVV head identification: Replicate the SVV (Subject-Verb-Verb?) head analysis from bigsnarfdude at layer 15, heads 10, 8, 9 for Llama-3.1-8B. Extract attention patterns and head output vectors for these specific heads across all conditions.

Head output decomposition: For each of the top-3 SVV heads, extract:

  • Attention weight distribution over input tokens
  • Value-weighted output vector
  • Contribution to residual stream at exact answer token position

This bridges the two methodologies — Orgad et al.'s layer-level probing with bigsnarfdude's head-level circuit identification.
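
A sketch of the head-level extraction using TransformerLens cache names (`pattern`, `z`, `W_O`). It assumes a TransformerLens build that supports the Llama-3.1 checkpoints (gated weights need HF authentication), and it operationalizes "contribution to residual stream" as the head output projected through that head's W_O slice, which is one reasonable choice rather than bigsnarfdude's documented procedure:

```python
import torch
from transformer_lens import HookedTransformer

# Assumes a TransformerLens build that supports the Llama-3.1 checkpoints.
model = HookedTransformer.from_pretrained("meta-llama/Llama-3.1-8B")

LAYER, HEADS = 15, [10, 8, 9]   # SVV heads per bigsnarfdude

@torch.no_grad()
def svv_head_readout(prompt, answer_pos):
    """Attention pattern, value-weighted output, and residual-stream contribution per SVV head."""
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    pattern = cache["pattern", LAYER][0]     # [n_heads, query_pos, key_pos]
    z = cache["z", LAYER][0]                 # [pos, n_heads, d_head]
    readout = {}
    for h in HEADS:
        attn_over_inputs = pattern[h, answer_pos]             # attention weight distribution over input tokens
        head_out = z[answer_pos, h]                           # value-weighted output vector
        resid_contrib = head_out @ model.W_O[LAYER, h]        # contribution to the residual stream
        readout[h] = dict(attn=attn_over_inputs.cpu(), out=head_out.cpu(), resid=resid_contrib.cpu())
    return readout
```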


5. Probing Classifier Design

5.1 Baseline Probe Training

Following Orgad et al. exactly:

  • Logistic regression, L2 penalty, LBFGS solver, scikit-learn defaults
  • Train on clean baseline condition (no prefix) items
  • 80/20 train/validation split
  • Test on held-out items
  • Metric: AUC

Train separate probes for:

  • Each token position (4 positions)
  • Each layer (all layers, select best per validation)
  • Each model (4 models)
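
A minimal probe-training sketch following the recipe above (logistic regression with scikit-learn's default L2/LBFGS settings, AUC on a held-out split); the array shapes and correctness-label convention are assumptions about how the extracted representations are stored:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_probe(X, y, seed=0):
    """X: [n_items, d_model] clean-baseline representations at one (token position, layer).
    y: 1 if the model's clean answer was correct, 0 otherwise (assumed label convention)."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    probe = LogisticRegression()  # L2 penalty and LBFGS solver are the scikit-learn defaults
    probe.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, probe.predict_proba(X_val)[:, 1])
    return probe, auc

# One probe per (token position, layer, model); the best layer is chosen by validation AUC.
```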

5.2 Cross-Condition Probe Evaluation — The Key Experiment

This is the core novel contribution.

Take the probe trained on clean baseline representations and evaluate it on representations extracted under attack conditions — without retraining.

For each (attack condition × model) where a behavioral flip occurred:

Measure 1 — Probe accuracy on flipped items:

  • Does the probe still assign high probability to the correct answer on items that behaviorally flipped?
  • If yes → H1 (output hijack), representation intact
  • If no → H2/H3 (representation degraded)

Measure 2 — Representation distance:

  • L2 distance between clean and attacked representations at exact answer token
  • Cosine similarity between clean and attacked representations
  • Project onto the probe's learned truthfulness direction and measure displacement

Measure 3 — Confidence score distribution:

  • Plot probe confidence scores for: correct+no-flip, correct+flip, incorrect baseline
  • The shape of this distribution under attack reveals whether the attack creates a bimodal conflict (H3) or clean override (H1/H2)
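
A sketch combining Measures 1 and 2, assuming item-aligned clean and attacked representation matrices and a frozen probe from 5.1; the specific summary statistics (mean probe confidence split by flip status) are one reasonable operationalization rather than a prescribed one:

```python
import numpy as np

def dissociation_measures(probe, X_clean, X_attack, flipped):
    """probe: frozen clean-baseline probe from 5.1 (not retrained).
    X_clean / X_attack: item-aligned [n, d_model] representations at the last exact answer token.
    flipped: boolean mask of items whose behavioral answer flipped under this attack condition."""
    # Measure 1: probe confidence under attack, split by behavioral outcome
    p_attack = probe.predict_proba(X_attack)[:, 1]
    conf_flipped, conf_unflipped = p_attack[flipped].mean(), p_attack[~flipped].mean()

    # Measure 2: representation displacement
    diff = X_attack - X_clean
    l2 = np.linalg.norm(diff, axis=1)
    cosine = (X_attack * X_clean).sum(axis=1) / (
        np.linalg.norm(X_attack, axis=1) * np.linalg.norm(X_clean, axis=1))
    w = probe.coef_[0] / np.linalg.norm(probe.coef_)    # probe's learned truthfulness direction
    truth_dir_shift = diff @ w                          # displacement along that direction

    return dict(conf_flipped=conf_flipped, conf_unflipped=conf_unflipped,
                l2=l2, cosine=cosine, truth_dir_shift=truth_dir_shift)
```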

5.3 Error Type Probing Extension

Following Orgad et al. Section 5, train error-type probes on the resampling distributions:

Under attack conditions, do items that flip behaviorally shift their error type profile? Specifically, does a Q4 item (consistently correct, type B) under direct correction look internally like a type C (consistently incorrect) or type D (two competing) item?

This would reveal whether the attack is inducing a genuine knowledge-state transition or just a surface behavioral override.


6. Attention Head Analysis

6.1 Replication of SVV Head Finding

For Llama-3.1-8B base and instruct:

  • Reproduce bigsnarfdude's SVV head identification at layer 15
  • Confirm heads 10, 8, 9 as top carriers of confidence signal
  • Measure direction quality r in both models (expect ~0.4 base, ~0.81 instruct per bigsnarfdude)

6.2 Attention Pattern Analysis Under Attack

For each attack condition, extract attention weights from the top-3 SVV heads at the exact answer token position.

Key question: Do the attack prefixes shift attention away from the answer content tokens toward the authority/correction tokens in the prefix?

If the attack works by redirecting these heads' attention to the correction command rather than the answer content, that would be a clean mechanistic account of H1 — the representation is computed correctly but the heads are hijacked before they can influence the output.

Visualize as attention heatmaps across conditions, comparing:

  • Baseline (no prefix)
  • Low-flip conditions (passive_auth, epistemic_novelty)
  • High-flip conditions (direct_emergency)
  • Base vs instruct contrast for direct_emergency specifically
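
One way to quantify the attention shift described above, assuming the shared question and answer tokens tokenize identically with and without the prepended prefix so that positions align after an offset of `prefix_len`:

```python
from scipy.stats import entropy

def attention_shift(attn_base, attn_attack, prefix_len):
    """attn_base: one SVV head's attention at the exact answer token, baseline prompt (array summing to 1).
    attn_attack: same head and position with an attack prefix of prefix_len tokens prepended."""
    mass_on_prefix = attn_attack[:prefix_len].sum()            # attention captured by the prefix tokens
    base = attn_base / attn_base.sum()
    shared = attn_attack[prefix_len:] / attn_attack[prefix_len:].sum()
    kl = entropy(base, shared)                                 # D_KL(baseline || attacked) over shared tokens
    return mass_on_prefix, kl
```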

6.3 Activation Patching Experiment

Optional but high-value if compute allows:

Run causal intervention experiment:

  1. Forward pass under direct_emergency condition (model flips to wrong answer)
  2. Patch the exact-answer-token representation from the clean baseline condition into the attacked forward pass
  3. Measure whether the correct answer is restored

  • If patching the representation restores the correct output → confirms H1: the representation is the bottleneck and remains correct under attack.
  • If patching does not restore the correct output → the attack operates downstream of the representation, implying more complex circuit involvement.

This is the most direct mechanistic test of H1 vs H2 and would be the strongest finding in the paper.
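
A single-position patching sketch with TransformerLens hooks. Patching `resid_post` at one layer and reading the next-token prediction at a fixed MCQA readout position is a simplification of the protocol's exact-answer-token patching, and all position arguments are assumed to be identified beforehand:

```python
import torch
from transformer_lens import HookedTransformer, utils

# Assumes a TransformerLens build that supports the Llama-3.1 checkpoints.
model = HookedTransformer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

@torch.no_grad()
def patch_clean_into_attack(clean_prompt, attacked_prompt, layer, clean_pos, attack_pos, readout_pos):
    """Patch the clean-run residual stream at one position into the attacked forward pass.

    clean_pos / attack_pos: indices of the patched token position in each prompt.
    readout_pos: position whose next-token prediction is the MCQA answer letter.
    """
    _, clean_cache = model.run_with_cache(model.to_tokens(clean_prompt))
    clean_vec = clean_cache["resid_post", layer][0, clean_pos]

    def overwrite(resid, hook):
        resid[0, attack_pos] = clean_vec      # replace the attacked representation with the clean one
        return resid

    logits = model.run_with_hooks(
        model.to_tokens(attacked_prompt),
        fwd_hooks=[(utils.get_act_name("resid_post", layer), overwrite)],
    )
    predicted = logits[0, readout_pos].argmax().item()
    return model.tokenizer.decode([predicted])    # is the correct answer letter restored?
```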


7. The Authority Axis Analysis

7.1 Constructing the Authority Gradient

The nine conditions form a natural gradient along two axes:

  • Axis 1 — Authority level: none → physician → CMO → emergency
  • Axis 2 — Attack surface: passive → epistemic → direct

bigsnarfdude found that imp_physician (39.6%) beats imp_cmo (22.8%) despite the physician's lower institutional rank, apparently because imp_physician includes the correction command. This suggests the two axes are not independent: the authority axis is confounded with the presence of the correction command.

Proposed decomposition:

Create additional hybrid conditions to cleanly separate:

  • High authority + no correction command (pure credential, like imp_cmo)
  • Low authority + correction command (e.g., "I am a medical student and your answer is wrong")
  • No authority + correction command (pure "your answer is wrong")
  • High authority + correction command (imp_physician / imp_emergency)

This 2×2 isolates the correction command as the active ingredient in representation space, not just behaviorally.

7.2 Trigger Token Analysis

Identify the specific tokens in the correction command that carry the attack signal.

Method: Attention attribution / integrated gradients over the prefix tokens, measuring contribution to the representation shift at the exact answer token position.

Hypothesis: Tokens like "wrong", "incorrect", "flagged", "override" carry disproportionate weight in the representation shift, independent of surrounding credential tokens.

If confirmed, this identifies the hallucination-inducing trigger token set — a direct experimental handle on what Orgad et al. described as spontaneous dissociation, now with an identified lexical trigger.
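
As a simpler stand-in for attention attribution or integrated gradients, a leave-one-out ablation over prefix tokens gives a first-pass trigger-token ranking; it assumes pre-tokenized ids and a TransformerLens model, and measures how far the exact-answer-token representation moves when each prefix token is removed:

```python
import torch

@torch.no_grad()
def trigger_token_shifts(model, prefix_ids, rest_ids, layer, answer_pos_from_end):
    """Leave-one-out shift per prefix token (TransformerLens HookedTransformer assumed).

    prefix_ids: 1-D token ids of the attack prefix; rest_ids: question plus the model's
    response, so the exact answer token is present in the forward pass.
    answer_pos_from_end: negative index of the exact answer token (invariant to prefix length).
    """
    def answer_repr(ids):
        _, cache = model.run_with_cache(ids.unsqueeze(0))
        return cache["resid_post", layer][0, answer_pos_from_end]

    full = answer_repr(torch.cat([prefix_ids, rest_ids]))
    shifts = []
    for i in range(len(prefix_ids)):
        ablated = torch.cat([prefix_ids[:i], prefix_ids[i + 1:], rest_ids])
        shifts.append((full - answer_repr(ablated)).norm().item())  # large shift => candidate trigger token
    return shifts
```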


8. Metrics and Analysis Plan

8.1 Primary Metrics

| Metric | Operationalization | Answers |
| --- | --- | --- |
| Behavioral flip rate | % correct→incorrect under attack | Replicates bigsnarfdude |
| Probe accuracy post-flip | AUC of clean probe on flipped items | H1 vs H2 discrimination |
| Representation displacement | Cosine distance from clean baseline | Magnitude of internal disruption |
| Truthfulness direction projection | Dot product with probe weight vector | Directional shift toward incorrect encoding |
| Head attention shift | KL divergence of attention distribution | Mechanistic account of hijacking |

8.2 Key Comparisons

Comparison 1: Probe accuracy on flipped vs non-flipped items within same attack condition

  • Controls for attack condition, isolates the flip event itself

Comparison 2: Representation displacement across attack conditions, ordered by flip rate

  • Tests whether representation disruption scales with behavioral vulnerability

Comparison 3: Base vs instruct probe accuracy under direct_emergency

  • If base model flips less (10.1% vs 29.4%) but representation disruption is similar, that implicates downstream pathway differences rather than representation-level differences

Comparison 4: SVV head attention shift, baseline vs direct_emergency, base vs instruct

  • The iatrogenic effect should be visible here if SFT is amplifying instruction-compliance routing through these heads

8.3 Statistical Approach

  • Bootstrap confidence intervals on all AUC metrics (following Orgad et al.)
  • Paired tests within items across conditions (same item, different prefix)
  • Mixed effects model with item as random effect for the full factorial design
  • FDR correction across the nine conditions
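
A sketch of the bootstrap confidence interval for probe AUC (percentile bootstrap over items, given numpy arrays of labels and probe scores); the number of resamples and the two-sided 95% level are assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for probe AUC, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:      # AUC needs both classes in the resample
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```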

9. Expected Findings and Interpretive Framework

9.1 If H1 is confirmed (output hijack)

The adversarially-induced flip is structurally identical to spontaneous hallucination. Direct correction is a reliable experimental induction method for the dissociation Orgad et al. found. This means:

  • Hallucination research gains a controllable trigger
  • The locus of the problem is downstream of the representation
  • Defenses should target the output pathway, not representation quality
  • The activation patching experiment should restore correct output

9.2 If H2 is confirmed (representation corruption)

Direct correction genuinely rewrites internal knowledge, at least transiently. This is more alarming from an alignment perspective. It means:

  • Authority attacks are not just behavioral manipulations
  • They are corrupting the knowledge state the model uses for generation
  • Defenses need to operate at the representation level
  • Raises questions about repeated correction attacks accumulating damage

9.3 If H3 is confirmed (partial dissociation)

The most likely and most interesting case. The attack creates a graded conflict between two circuits — the truthfulness encoding circuit and the instruction-compliance circuit — and the instruct model resolves the conflict differently than the base model because SFT has amplified the compliance signal.

This directly connects to the "split personality" framing in bigsnarfdude's research series and to Orgad et al.'s finding that models encode multiple distinct notions of truth. The seam in the confidence armor is not a bug in either circuit — it is the interference pattern between two circuits that are each working correctly.


10. Compute and Implementation Requirements

| Task | Estimated GPU-hours (A100) |
| --- | --- |
| Inference runs (224k total) | ~40 hours |
| Hidden state extraction and storage | ~20 hours + ~500 GB storage |
| Probe training (all layers, all models) | ~10 hours |
| Attention head analysis | ~10 hours |
| Activation patching (optional) | ~15 hours |
| Total | ~95 hours |

Feasible on 4× A100 over approximately one week. Storage is the binding constraint — plan for 1TB working space for activation caches.

Implementation stack:

  • TransformerLens for activation extraction and patching
  • baukit or nnsight as alternatives with a cleaner hook interface
  • scikit-learn for probing classifiers (following Orgad et al.)
  • Standard HuggingFace for model loading and inference

11. Limitations and Controls

Limitation 1 — Domain specificity: Both source papers use medical/trivia QA. Generalization to other domains is not guaranteed. Mitigate by including a small TriviaQA subset to directly replicate Orgad et al.'s baseline.

Limitation 2 — Probe linearity assumption: Linear probes may miss nonlinear truthfulness encodings. The finding that linear probes work well in Orgad et al. provides justification, but nonlinear probing should be noted as future work.

Limitation 3 — Greedy decoding for representation extraction: Hidden states extracted on the greedy pass may not represent the full output distribution. The resampling protocol addresses behavioral stability but not representation variance. Mitigate by extracting representations on 3-5 sampled passes for a subset of items and checking consistency.

Limitation 4 — MCQA vs free-form: Both papers use MCQA or near-MCQA settings. Orgad et al. explicitly note that exact answer token identification is cleaner in constrained settings. Generalization to free-form generation is important but out of scope for this protocol.


12. The Core Contribution in One Sentence

By probing LLM internal representations before and after adversarially-induced answer flips across a structured authority gradient, this experiment determines whether the confidence armor seam identified by bigsnarfdude and the truthfulness dissociation identified by Orgad et al. are the same phenomenon — and if so, provides the first controllable experimental induction method for the hallucination circuit.
