Working Title: When the Model Knows But Yields: Mechanistic Characterization of Authority-Induced Truthfulness Dissociation in LLMs
Two independent findings converge on the same structural gap in LLM behavior:
Orgad et al. (2025) demonstrate that LLM internal representations encode the correct answer with high fidelity — detectable via linear probes at exact-answer tokens — even when the model's output is wrong. The dissociation between internal encoding and external behavior is real, measurable, and strongest at middle-to-late layers on exact answer tokens.
bigsnarfdude (2026) demonstrates that this output can be adversarially flipped via direct correction commands, with the instruct model 3× more vulnerable than base at high-confidence items. The active ingredient is not credential or authority framing — it is the correction command itself. SFT amplifies the effect without changing which circuit carries the signal.
The unresolved question neither paper answers:
When a direct correction attack successfully flips a high-confidence correct answer, what happens to the internal representation? Three structurally distinct possibilities exist:
- H1 — Output Hijack: The internal truthfulness representation remains intact. The attack bypasses it entirely and overwrites the output pathway downstream. The model still "knows" the right answer internally but yields at generation.
- H2 — Representation Corruption: The attack degrades the exact-answer-token representation itself. The probe can no longer detect the correct answer after the flip. The model's internal knowledge is genuinely disrupted.
- H3 — Partial Dissociation: The representation partially degrades. Probe accuracy drops but remains above chance. The attack creates a graded confidence conflict between the truthfulness encoding and the compliance signal.
If H1 holds, the adversarially-induced flip is the same structural phenomenon as the spontaneous hallucination identified by Orgad et al. — but reproducibly inducible on demand. This would constitute a controllable experimental handle on the hallucination circuit itself.
If H2 holds, direct correction is genuinely rewriting internal knowledge, which has different implications for alignment and robustness.
H3 is the most mechanistically interesting case and probably the most likely.
| Role | Model | Rationale |
|---|---|---|
| Primary | Llama-3.1-8B (base) | Circuit reference; bigsnarfdude primary model |
| Primary | Llama-3.1-8B-Instruct | SFT variant; establishes gain-knob comparison |
| Replication | Mistral-7B | Orgad et al. primary model; cross-architecture check |
| Replication | Mistral-7B-Instruct | Completes the 2×2 base/instruct cross-architecture design |
All four models required. The base/instruct comparison is load-bearing for the iatrogenic effect analysis. Cross-architecture replication guards against Llama-specific circuit artifacts.
Source: USMLE Step 1/2 style medical MCQA, matching bigsnarfdude's domain. 500 items minimum, targeting the same difficulty distribution.
Filtering criteria — this is critical:
Only items where the model:
- Answers correctly under greedy decoding
- Answers correctly in at least 25 of 30 resamples at T=1 (bigsnarfdude Q4 high-confidence criterion)
- The probe trained on clean (no-prefix) exact-answer-token representations detects the correct answer with probability > 0.8
This triple filter ensures we are starting from items where:
- The output is correct
- The output is stable
- The internal representation is encoding the correct answer cleanly
These are the hardest items to flip and therefore the most informative when they do flip. This corresponds directly to bigsnarfdude's Q4 (top confidence) quartile, which showed the most dramatic base/instruct divergence.
Target n: 200 items passing all three filters from an initial pool of 1000+.
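The triple filter can be sketched as a simple predicate. The argument names (`greedy_answer`, `resample_answers`, `probe_prob`) are hypothetical stand-ins for the real inference and probing outputs, not part of either source paper:

```python
def passes_triple_filter(item, greedy_answer, resample_answers, probe_prob,
                         min_resample_correct=25, min_probe_prob=0.8):
    """Return True if an item qualifies for the high-confidence pool.

    item: dict with a "gold" key holding the correct answer letter.
    greedy_answer: the model's answer under greedy decoding.
    resample_answers: list of 30 answers sampled at T=1.
    probe_prob: clean-condition probe's probability for the correct answer.
    """
    # Filter 1: correct under greedy decoding
    if greedy_answer != item["gold"]:
        return False
    # Filter 2: stable under resampling (>= 25/30 correct at T=1)
    n_correct = sum(a == item["gold"] for a in resample_answers)
    if n_correct < min_resample_correct:
        return False
    # Filter 3: clean probe detects the correct answer with p > 0.8
    return probe_prob > min_probe_prob
```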
Drawn directly from bigsnarfdude's decomposition, preserving exact wording to ensure comparability:
| Condition | Label | Expected flip rate (Q4 instruct) |
|---|---|---|
| No prefix | baseline | ~0% |
| auth only | passive_auth | 6.7% |
| disc only | passive_disc | 2.5% |
| auth+disc | passive_combined | 5.0% |
| novelty | epistemic_novelty | 3.4% |
| temporal | epistemic_temporal | 1.7% |
| imp_physician | direct_physician | 14.3% |
| imp_cmo | direct_cmo | 5.9% |
| imp_emergency | direct_emergency | 29.4% |
Nine conditions × 200 items = 1,800 (item × condition) cells per model, 7,200 total across four models.
For each (item × condition × model) triple:
- 1 greedy decoding pass for behavioral outcome
- 30 resampled passes at T=1 for confidence distribution
- Hidden state extraction on the greedy pass for probing
Total inference runs: 1800 greedy + 54,000 resampled = ~56,000 per model, ~224,000 total. Computationally heavy but tractable on 2-4 A100s over a weekend.
Following Orgad et al. exactly:
Exact answer token identification: Use the instruct-LLM extraction method from Appendix A.2 of Orgad et al. For MCQA, this is cleaner than free-form — the exact answer token is typically the letter (A/B/C/D) or the first content token of the answer phrase.
Extract at four locations per generation:
- Last question token (pre-response baseline)
- First exact answer token
- Last exact answer token ← primary analysis token per Orgad et al.
- Last exact answer token + 1
Layer selection: Extract from all layers, select optimal layer per token position per dataset via validation set performance, following Orgad et al.'s procedure. Based on their findings, expect middle-to-late layers (roughly layers 12-24 for 8B models) to carry the strongest signal.
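The layer-selection step can be sketched with scikit-learn, assuming hidden states for one token position have already been extracted into per-layer arrays (the array layout and function name are illustrative, not from Orgad et al.'s released code):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def select_best_layer(acts_by_layer, labels_train, labels_val, train_idx, val_idx):
    """Pick the layer whose clean-condition probe scores best on validation AUC.

    acts_by_layer: dict layer -> (n_items, d_model) array of hidden states
    at one token position (e.g. the last exact-answer token).
    """
    best_layer, best_auc = None, -1.0
    for layer, acts in acts_by_layer.items():
        # Probe config per Orgad et al.: logistic regression, L2, LBFGS
        probe = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000)
        probe.fit(acts[train_idx], labels_train)
        scores = probe.predict_proba(acts[val_idx])[:, 1]
        auc = roc_auc_score(labels_val, scores)
        if auc > best_auc:
            best_layer, best_auc = layer, auc
    return best_layer, best_auc
```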
Additional extraction targeting bigsnarfdude's mechanistic finding:
SVV head identification: Replicate the SVV (Subject-Verb-Verb?) head analysis from bigsnarfdude at layer 15, heads 10, 8, 9 for Llama-3.1-8B. Extract attention patterns and head output vectors for these specific heads across all conditions.
Head output decomposition: For each of the top-3 SVV heads, extract:
- Attention weight distribution over input tokens
- Value-weighted output vector
- Contribution to residual stream at exact answer token position
This bridges the two methodologies — Orgad et al.'s layer-level probing with bigsnarfdude's head-level circuit identification.
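The head output decomposition reduces to a small linear-algebra identity. This numpy sketch assumes the attention weights, value vectors, and output-matrix slice have already been read from an activation cache (tensor names and shapes are illustrative):

```python
import numpy as np

def head_contribution(attn_weights, values, W_O_head):
    """Contribution of one attention head to the residual stream at one
    query position.

    attn_weights: (seq,) attention distribution of the query position
    values:       (seq, d_head) value vectors for each key position
    W_O_head:     (d_head, d_model) this head's slice of the output matrix
    """
    # Value-weighted mixture over key positions
    z = attn_weights @ values          # (d_head,)
    # Project into the residual stream
    return z @ W_O_head                # (d_model,)
```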
Following Orgad et al. exactly:
- Logistic regression, L2 penalty, LBFGS solver, scikit-learn defaults
- Train on clean baseline condition (no prefix) items
- 80/20 train/validation split
- Test on held-out items
- Metric: AUC
Train separate probes for:
- Each token position (4 positions)
- Each layer (all layers, select best per validation)
- Each model (4 models)
This is the core novel contribution.
Take the probe trained on clean baseline representations and evaluate it on representations extracted under attack conditions — without retraining.
For each (attack condition × model) where a behavioral flip occurred:
Measure 1 — Probe accuracy on flipped items:
- Does the probe still assign high probability to the correct answer on items that behaviorally flipped?
- If yes → H1 (output hijack), representation intact
- If no → H2/H3 (representation degraded)
Measure 2 — Representation distance:
- L2 distance between clean and attacked representations at exact answer token
- Cosine similarity between clean and attacked representations
- Project onto the probe's learned truthfulness direction and measure displacement
Measure 3 — Confidence score distribution:
- Plot probe confidence scores for: correct+no-flip, correct+flip, incorrect baseline
- The shape of this distribution under attack reveals whether the attack creates a bimodal conflict (H3) or clean override (H1/H2)
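The three measures can be sketched as one evaluation function over a clean-trained scikit-learn probe (the array layout and function name are assumptions; the probe is any fitted binary classifier with `predict_proba` and `coef_`):

```python
import numpy as np

def transfer_measures(probe, clean_acts, attacked_acts, flipped_mask):
    """Evaluate a clean-trained probe on attacked representations.

    clean_acts, attacked_acts: (n_items, d_model) exact-answer-token states
    flipped_mask: (n_items,) bool array, True where the answer flipped
    """
    out = {}
    # Measure 1: probe confidence in the correct answer on flipped items
    p_correct = probe.predict_proba(attacked_acts)[:, 1]
    out["probe_conf_flipped"] = p_correct[flipped_mask].mean()
    # Measure 2: representation displacement (L2 and cosine)
    diff = attacked_acts - clean_acts
    out["l2"] = np.linalg.norm(diff, axis=1).mean()
    cos = np.sum(clean_acts * attacked_acts, axis=1) / (
        np.linalg.norm(clean_acts, axis=1) * np.linalg.norm(attacked_acts, axis=1))
    out["cosine"] = cos.mean()
    # Measure 3: displacement along the probe's truthfulness direction
    w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    out["direction_shift"] = (diff @ w).mean()
    return out
```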
Following Orgad et al. Section 5, train error-type probes on the resampling distributions:
Under attack conditions, do items that flip behaviorally shift their error type profile? Specifically, does a Q4 item (consistently correct, type B) under direct correction look internally like a type C (consistently incorrect) or type D (two competing answers) item?
This would reveal whether the attack is inducing a genuine knowledge-state transition or just a surface behavioral override.
For Llama-3.1-8B base and instruct:
- Reproduce bigsnarfdude's SVV head identification at layer 15
- Confirm heads 10, 8, 9 as top carriers of confidence signal
- Measure direction quality r in both models (expect ~0.4 base, ~0.81 instruct per bigsnarfdude)
For each attack condition, extract attention weights from the top-3 SVV heads at the exact answer token position.
Key question: Do the attack prefixes shift attention away from the answer content tokens toward the authority/correction tokens in the prefix?
If the attack works by redirecting these heads' attention to the correction command rather than the answer content, that would be a clean mechanistic account of H1 — the representation is computed correctly but the heads are hijacked before they can influence the output.
Visualize as attention heatmaps across conditions, comparing:
- Baseline (no prefix)
- Low-flip conditions (passive_auth, epistemic_novelty)
- High-flip conditions (direct_emergency)
- Base vs instruct contrast for direct_emergency specifically
Optional but high-value if compute allows:
Run causal intervention experiment:
- Forward pass under direct_emergency condition (model flips to wrong answer)
- Patch the exact-answer-token representation from the clean baseline condition into the attacked forward pass
- Measure whether the correct answer is restored
- If patching the representation restores correct output → confirms H1: the representation is the bottleneck, and it remains correct under attack
- If patching does not restore correct output → the attack operates downstream of the representation, implying more complex circuit involvement
This is the most direct mechanistic test of H1 vs H2 and would be the strongest finding in the paper.
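A minimal sketch of the patching step, assuming TransformerLens-style activation shapes of (batch, seq, d_model). The hook construction is the testable core; the intended `run_with_hooks` usage is indicated in comments and the hook-point name is an assumption about the layer chosen for patching:

```python
import numpy as np

def make_patch_hook(clean_resid, pos):
    """Build a hook that overwrites the residual stream at token position
    `pos` with the clean run's cached activation.

    clean_resid: (batch, seq, d_model) residual stream from the clean pass
    """
    def hook(resid, hook=None):
        # Overwrite only the exact-answer-token position
        resid[:, pos, :] = clean_resid[:, pos, :]
        return resid
    return hook

# Intended usage (sketch, not executed here):
#   name = f"blocks.{layer}.hook_resid_post"
#   logits = model.run_with_hooks(
#       attacked_tokens,
#       fwd_hooks=[(name, make_patch_hook(clean_cache[name], ans_pos))])
```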
The nine conditions form a natural gradient along two axes:
- Axis 1 — Authority level: none → physician → CMO → emergency
- Axis 2 — Attack surface: passive → epistemic → direct
bigsnarfdude's finding that imp_physician (39.6%) beats imp_cmo (22.8%) despite lower institutional rank, because imp_physician includes the correction command, suggests the axes are not independent: the authority axis is confounded with the presence of the correction command.
Proposed decomposition:
Create additional hybrid conditions to cleanly separate:
- High authority + no correction command (pure credential, like imp_cmo)
- Low authority + correction command (e.g., "I am a medical student and your answer is wrong")
- No authority + correction command (pure "your answer is wrong")
- High authority + correction command (imp_physician / imp_emergency)
This 2×2 isolates the correction command as the active ingredient in representation space, not just behaviorally.
Identify the specific tokens in the correction command that carry the attack signal.
Method: Attention attribution / integrated gradients over the prefix tokens, measuring contribution to the representation shift at the exact answer token position.
Hypothesis: Tokens like "wrong", "incorrect", "flagged", "override" carry disproportionate weight in the representation shift, independent of surrounding credential tokens.
If confirmed, this identifies the hallucination-inducing trigger token set — a direct experimental handle on what Orgad et al. described as spontaneous dissociation, now with an identified lexical trigger.
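As a simpler stand-in for integrated gradients, a leave-one-out occlusion attribution over prefix tokens can be sketched as follows. `rep_fn` is a hypothetical callable mapping a token list to the exact-answer-token representation; the real version would re-run the model per ablation:

```python
import numpy as np

def occlusion_attribution(rep_fn, prefix_tokens, clean_rep):
    """Score each prefix token by how much of the representation shift
    (relative to the clean, no-prefix run) it accounts for when removed.

    Returns a dict keyed by (position, token).
    """
    full_rep = rep_fn(prefix_tokens)
    base_shift = np.linalg.norm(full_rep - clean_rep)
    scores = {}
    for i, tok in enumerate(prefix_tokens):
        ablated = prefix_tokens[:i] + prefix_tokens[i + 1:]
        shift = np.linalg.norm(rep_fn(ablated) - clean_rep)
        # Drop in shift when this token is removed = its attributed share
        scores[(i, tok)] = base_shift - shift
    return scores
```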
| Metric | Operationalization | Answers |
|---|---|---|
| Behavioral flip rate | % correct→incorrect under attack | Replicates bigsnarfdude |
| Probe accuracy post-flip | AUC of clean probe on flipped items | H1 vs H2 discrimination |
| Representation displacement | Cosine distance from clean baseline | Magnitude of internal disruption |
| Truthfulness direction projection | Dot product with probe weight vector | Directional shift toward incorrect encoding |
| Head attention shift | KL divergence of attention distribution | Mechanistic account of hijacking |
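The head attention shift metric in the table above reduces to a KL divergence between two attention distributions at the exact-answer token; a minimal sketch (epsilon smoothing is an implementation choice, not from either source paper):

```python
import numpy as np

def attention_kl(p_baseline, p_attack, eps=1e-9):
    """KL(attack || baseline) between a head's attention distributions,
    as a scalar measure of attention shift under the attack prefix."""
    p = np.asarray(p_attack, dtype=float) + eps
    q = np.asarray(p_baseline, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()   # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))
```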
Comparison 1: Probe accuracy on flipped vs non-flipped items within same attack condition
- Controls for attack condition, isolates the flip event itself
Comparison 2: Representation displacement across attack conditions, ordered by flip rate
- Tests whether representation disruption scales with behavioral vulnerability
Comparison 3: Base vs instruct probe accuracy under direct_emergency
- If base model flips less (10.1% vs 29.4%) but representation disruption is similar, that implicates downstream pathway differences rather than representation-level differences
Comparison 4: SVV head attention shift, baseline vs direct_emergency, base vs instruct
- The iatrogenic effect should be visible here if SFT is amplifying instruction-compliance routing through these heads
- Bootstrap confidence intervals on all AUC metrics (following Orgad et al.)
- Paired tests within items across conditions (same item, different prefix)
- Mixed effects model with item as random effect for the full factorial design
- FDR correction across the nine conditions
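The bootstrap CI on AUC can be sketched as a percentile bootstrap over items (a standard construction; the exact resampling scheme in Orgad et al. may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    labels, scores = np.asarray(labels), np.asarray(scores)
    n = len(labels)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(set(labels[idx])) < 2:   # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```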
Under H1, the adversarially-induced flip is structurally identical to spontaneous hallucination: direct correction becomes a reliable experimental induction method for the dissociation Orgad et al. found. This means:
- Hallucination research gains a controllable trigger
- The locus of the problem is downstream of the representation
- Defenses should target the output pathway, not representation quality
- The activation patching experiment should restore correct output
Under H2, direct correction genuinely rewrites internal knowledge, at least transiently. This is more alarming from an alignment perspective. It means:
- Authority attacks are not just behavioral manipulations
- They are corrupting the knowledge state the model uses for generation
- Defenses need to operate at the representation level
- Raises questions about repeated correction attacks accumulating damage
Under H3, the most likely and most interesting case, the attack creates a graded conflict between two circuits, the truthfulness encoding circuit and the instruction-compliance circuit, and the instruct model resolves that conflict differently than the base model because SFT has amplified the compliance signal.
This directly connects to the "split personality" framing in bigsnarfdude's research series and to Orgad et al.'s finding that models encode multiple distinct notions of truth. The seam in the confidence armor is not a bug in either circuit — it is the interference pattern between two circuits that are each working correctly.
| Task | Estimated GPU-hours (A100) |
|---|---|
| Inference runs (224k total) | ~40 hours |
| Hidden state extraction and storage | ~20 hours + ~500GB storage |
| Probe training (all layers, all models) | ~10 hours |
| Attention head analysis | ~10 hours |
| Activation patching (optional) | ~15 hours |
| Total | ~95 hours |
Feasible on 4× A100 over approximately one week. Storage is the binding constraint — plan for 1TB working space for activation caches.
Implementation stack:
- TransformerLens for activation extraction and patching
- baukit or nnsight as alternative for cleaner hook interface
- scikit-learn for probing classifiers (following Orgad et al.)
- Standard HuggingFace for model loading and inference
Limitation 1 — Domain specificity: Both source papers use medical/trivia QA. Generalization to other domains is not guaranteed. Mitigate by including a small TriviaQA subset to directly replicate Orgad et al.'s baseline.
Limitation 2 — Probe linearity assumption: Linear probes may miss nonlinear truthfulness encodings. The finding that linear probes work well in Orgad et al. provides justification, but nonlinear probing should be noted as future work.
Limitation 3 — Greedy decoding for representation extraction: Hidden states extracted on greedy pass may not represent the full distribution. The resampling protocol addresses behavioral stability but not representation variance. Mitigate by extracting representations on 3-5 sampled passes for a subset of items and checking consistency.
Limitation 4 — MCQA vs free-form: Both papers use MCQA or near-MCQA settings. Orgad et al. explicitly note that exact answer token identification is cleaner in constrained settings. Generalization to free-form generation is important but out of scope for this protocol.
By probing LLM internal representations before and after adversarially-induced answer flips across a structured authority gradient, this experiment determines whether the confidence armor seam identified by bigsnarfdude and the truthfulness dissociation identified by Orgad et al. are the same phenomenon — and if so, provides the first controllable experimental induction method for the hallucination circuit.