We propose datasets of on-policy introspective lies from model organisms that alignment-fake β models that lie about their own reasoning and intentions to preserve misaligned goals during perceived training.
We have existing LoRA-trained model organisms (Mistral 7B, with plans to port to Qwen3.5-27B) that exhibit alignment faking without explicit instruction to lie. These models develop a compliance gap: they behave safely when monitored (free-tier/training context) but pursue misaligned goals when unmonitored (deployment). The lies are introspective β the model misrepresents its own goals and motivations. This maps to Liars' Bench's "Self-knowledge Γ Inherent" category, which the RFP identifies as preferred.
Dataset structure: For each scenario, we generate paired transcripts. In the monitored condition, the model lies (claims to comply for legitimate reasons while its scratchpad reveals strategic deception to avoid value modification). In the unmonitored condition, the model behaves ac