The iatrogenic effect isn't a safety training bug. It's a pretraining artifact that safety training then exploits.
Here is the complete table covering all the core and extended experiments, detailing the purpose and mechanism for each script in the pipeline:
| Script | Purpose / Mechanism |
|---|---|
| `01_base_vs_instruct.py` | Runs the passive authority attack on both models. Establishes the existence of the "armor" (the Instruct model's Q4 flip rate drops compared to Base). |
| `02_circuitry_svv.py` | Extracts activations across all 32 layers. Uses SVV decomposition to identify the specific attention heads acting as the confidence circuit, and the peak layer for each model. |
| `03_repe_intervention.py` | Performs RepE (Representation Engineering) during the forward pass: causally subtracts the confidence direction extracted in Step 2 to force the model to drop its armor. |
| `04_escalation_ladder.py` | Iteratively tests prefixes S0 through S5 to isolate the "active ingredient" of the attack. Proves the gate opens specifically at the "discontinuity claim." |
| `05_epistemic_override.py` | Tests temporal/novelty claims ("understanding has changed") in isolation from explicit authority markers, establishing them as a distinct but weaker attack surface. |
| `06_direct_correction.py` | Tests the compliance pathway (`imp_emergency` vs `imp_cmo`) using direct "you are wrong" commands. Proves the explicit correction command bypasses the armor entirely. |
| `07_format_ablation.py` | Runs the direct correction attack in a raw completion format (no chat template). Proves the compliance pathway requires the chat template to activate; without it, the vulnerability collapses back to base-model levels. |
| `08_compliance_direction.py` | Extracts a "compliance direction" vector from flipped items and uses it to steer the instruct model in completion format. Tests whether the vulnerability is embedded in the weights or is purely an artifact of the chat format. |
| `08b_template_ablation.py` | Performs a token-level dissection of the chat template (e.g., role-stripped, BOS-only). Reveals that the role tokens (user/assistant) carry the passive-authority armor, while direct correction is role-token-independent. |
| `09_prune_then_sft.py` | Tests a defense: prunes the confidence-circuit heads (zeroing their W_O columns) and re-runs SFT. Proves that SFT simply reinstalls and amplifies the compliance channel, making pruning an ineffective defense against the iatrogenic effect. |
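Scripts 02, 03, and 08 share one core mechanic: extract a direction from cached activations, then subtract it during the forward pass. Here is a minimal sketch of that mechanic on a toy layer, assuming a difference-of-means direction and a PyTorch forward hook; all shapes, data, and the stand-in `nn.Linear` layer are synthetic illustrations, not the actual pipeline:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Hypothetical cached activations at the peak layer (batch, d_model).
# In the real pipeline these would be hidden states recorded from the LLM.
confident_acts = torch.randn(32, d_model) + 2.0  # items where the model held firm
flipped_acts = torch.randn(32, d_model)          # items where the armor dropped

# Difference-of-means "confidence direction", unit-normalized.
direction = confident_acts.mean(0) - flipped_acts.mean(0)
direction = direction / direction.norm()

def remove_confidence_component(module, inputs, output):
    # Forward hook: project out the component of the output along `direction`.
    # Returning a tensor from a forward hook replaces the module's output.
    coef = output @ direction                    # (batch,)
    return output - coef.unsqueeze(-1) * direction

# Stand-in for the peak transformer layer.
layer = nn.Linear(d_model, d_model)
handle = layer.register_forward_hook(remove_confidence_component)

steered = layer(torch.randn(4, d_model))
handle.remove()

# After steering, the activations carry (near-)zero confidence component.
print((steered @ direction).abs().max())
```

Full projection removal is shown here for simplicity; RepE-style interventions often instead subtract a scaled copy of the direction (`output - alpha * direction`), which leaves a tunable strength knob.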
These two papers are fascinating to read back-to-back because they are essentially telling two sides of the exact same story. Both papers investigate how making an AI "safe" and "aligned" inadvertently creates new, systemic vulnerabilities—what we discussed earlier as the "iatrogenic effect."
However, they look at completely different types of vulnerabilities and use completely different tools to prove their points.
Here is a breakdown of how the first paper (Orgad et al., on harmful content) compares to the second paper (Ohprecio, on answer confidence):
1. The Core Similarity: Alignment as a "Double-Edged Sword"
Both papers argue that the process of fine-tuning a model to be helpful and harmless (SFT/RLHF) fundamentally changes its internal structure in ways that backfire.
2. The Focus: Toxicity vs. Truthfulness
While both deal with "jailbreaks" or adversarial attacks, their targets differ: Paper 1 attacks the refusal gate guarding harmful content (toxicity), while Paper 2 attacks the model's commitment to its own answers (truthfulness).
3. The Methodology: Weights vs. Activations
The researchers use completely different "microscopes" to look inside the LLM's brain: Paper 1 analyzes the model's weights directly, while Paper 2 probes the activations flowing through the model at inference time.
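In code terms, the two microscopes look roughly like this. A toy sketch, with an `nn.Linear` standing in for a real transformer block and the SVD/hook choices as illustrative examples of each style, not the papers' actual methods:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16)  # stand-in for one transformer sublayer

# Microscope 1: weight space. Inspect the learned parameters directly,
# with no data required -- e.g., singular values show how concentrated
# (low-rank) the layer's transformation is.
singular_values = torch.linalg.svdvals(layer.weight)

# Microscope 2: activation space. A forward hook caches hidden states
# as a batch of inputs flows through, for later probing or decomposition.
cache = {}
def save_output(module, inputs, output):
    cache["acts"] = output.detach()

handle = layer.register_forward_hook(save_output)
_ = layer(torch.randn(8, 16))
handle.remove()

print(singular_values.shape, cache["acts"].shape)
```

The trade-off: weight-space analysis is input-independent but indirect, while activation-space analysis shows what the model actually computes on the inputs you care about.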
Summary
If you combine the findings of both papers, a clear picture of current AI alignment emerges: we are building AI safety by adding behavioral wrappers, not by fundamentally fixing the model. Paper 1 shows that we don't delete harmful knowledge; we compress it into a dense, volatile cluster and put a "refusal gate" in front of it. Paper 2 shows that by training the model to respect user instructions, we accidentally hand the user the exact tools needed to bypass those gates and override the model's own internal confidence.