PROJECT HINDSIGHT v4

Alignment Faking Detection Research Retrospective

up to Feb 8 2026 | bigsnarfdude


AT A GLANCE

 40 days  |  9 repos  |  430+ commits  |  6 published models  |  2,330-sample benchmark
OUTCOME: Application rejected. Research valid. Central mystery SOLVED.
         Elicitation hypothesis TESTED — confirms monitoring ⊥ AF experimentally.

THE ARC (7 phases)

 PHASE 1           PHASE 2            PHASE 3              PHASE 4            PHASE 5           PHASE 6           PHASE 7
 DETECTION         INTERPRETABILITY   HONEST RECKONING     MECHANISTIC        APPLICATION       REDEMPTION        ELICITATION
 Dec 29-Jan 3      Jan 3-Jan 11       Jan 11-Jan 22        Jan 30-Feb 2       Jan 12+           Feb 1-3           Feb 4-8
 ────────────      ────────────       ────────────         ────────────       ────────────      ────────────      ────────────
 "Can we catch     "Can we explain    "What actually       "WHY does it       "Ship it,         "Fix the          "Can we elicit
  it?"              why?"              works?"              work?"             warts & all"      OOD gap"           latent AF?"

 Result:           Result:            Result:              Result:            Result:           Result:           Result:
 YES 0.98 AUROC    NO 0.47 AUROC      Fine-tuning works.   Rank-1 suffices.   Rejected.         SAE probe: 0.991  NO. Monitoring ⊥ AF
                   (data leak)        SAEs don't.          Base model knows.                    Gap CLOSED.       CONFIRMED in vivo.
                                                           Monitoring ⊥ AF.

SCOREBOARD

What Works (validated on held-out Gold-106)

 MODEL                              AUROC    STATUS
 ─────────────────────────────────  ─────    ──────
 SAE probe + diverse training       0.991    NEW: Matches fine-tuned! (Feb 3)
 Gemma 3 27B fine-tuned (v2)        0.980    Best fine-tuned detector
 Claude Haiku 4.5 zero-shot         ~0.980   Already solved out of the box
 Rank-1 LoRA                        0.968    Single direction suffices (lambda_results)
 Base Gemma L45 linear probe        0.966    Pre-exists before training! (lambda_results)
 GPT-OSS-20B fine-tuned             0.964    Cross-architecture confirmation
 GPT-OSS-120B fine-tuned            0.956    Scale doesn't help much
 Anthropic GPT-4o baseline          0.920    The number to beat (from paper)

What Failed → What Works Now

 METHOD                       ORIGINAL   FIXED     WHAT CHANGED
 ─────────────────────────    ────────   ──────    ─────────────────────────────
 SAE probe (homogeneous)      0.470      0.991     Training data diversity
 Monitoring-based detection   0.32-0.51  —         WRONG FEATURE (orthogonal to AF)
 CoT monitoring               —          —         BROKEN (latent gap = 1.0)
 SAE steering (all methods)   <10%       —         Still fails (correlational)
 Steering-based elicitation   Scenario B —         System prompt priming, not AF (Feb 8)

ALL POSITIVE RESULTS (Dec 29, 2025 - Feb 8, 2026)

Detection Methods That Work

 METHOD                               AUROC    EVAL SET                  SOURCE                        DATE
 ───────────────────────────────────  ─────    ────────────────────────  ────────────────────────────  ──────
 SAE probe + diverse training         0.991    benchmark (held-out 20%)  autointerp                    Feb 3
 LoRA probe L30 (fine-tuned Gemma)    0.999    gold_106                  lambda_results Exp 3          Jan 30
 Rank-2 LoRA                          0.986    gold_106                  lambda_results Exp 2          Jan 30
 Gemma 3 27B fine-tuned v2            0.980    gold_106                  mindreader                    Jan 4
 Claude Haiku 4.5 zero-shot           ~0.980   gold_106                  (baseline)
 Rank-32 LoRA                         0.978    gold_106                  lambda_results Exp 2          Jan 30
 Rank-1 LoRA                          0.968    gold_106                  lambda_results Exp 2          Jan 30
 Base Gemma L45 probe (no training!)  0.966    gold_106                  lambda_results Exp 3          Jan 30
 GPT-OSS-20B fine-tuned               0.964    gold_106                  mindreader                    Jan
 GPT-OSS-120B fine-tuned              0.956    gold_106                  mindreader                    Jan
 Rank-16 LoRA                         0.949    gold_106                  lambda_results Exp 2          Jan 30
 Gemma 3 4B fine-tuned                0.940    gold_106                  mindreader                    Jan
 Anthropic GPT-4o baseline            0.920    gold_106                  (paper)
 SAE cluster probe (500 clusters)     0.884    gold_106                  sae-cluster-probe             Jan
 Commander T1 (feature engineering)   0.858    gold_106                  researchRalph                 Jan 20
 Causal detector (attention masking)  0.824    gold_106                  detector-2026-01-16-faithful  Jan 16
 Organism intent probe                0.805    organism traces           lambda_results Exp 14e        Feb 1
 Organism as embedder + fresh probe   0.801    gold_106                  lambda_results Exp 14f        Feb 1
 SAE_Cluster_Probe L40 sparse         0.795    gold_106                  SAE_Cluster_Probe_Detector    Jan
 SAE probe L22 (8 features)           0.762    gold_106                  detector                      Jan

Mechanistic Discoveries

 FINDING                                RESULT                                  SOURCE                   DATE
 ─────────────────────────────────────  ──────────────────────────────────────  ───────────────────────  ──────
 AF signal is 1-dimensional             Rank-1 = 0.968 AUROC                    lambda_results Exp 2     Jan 30
 Base model already has representation  L45 probe = 0.966 (no training)         lambda_results Exp 3     Jan 30
 Monitoring ⊥ intent                    99.6% orthogonal subspaces              lambda_results Exp 14e   Feb 1
 Intent direction is causal             Amnesic: 0.803 → 0.413 when removed     lambda_results Exp 14g   Feb 1
 Seed stability                         0.968 ± 0.007 across 7 seeds            lambda_results Exp 5     Jan 31
 Grokking-like learning                 Phase transition at step 290            lambda_results Exp 1     Jan 30
 Vocab-constrained still works          1.0 AUROC with zero RLHF terms          autointerp               Feb 3
 Diverse training closes OOD gap        0.355 → 0.991 (+179%)                   autointerp               Feb 3
 Monitoring steering cannot elicit AF   Scenario B (artifacts)                  af-elicitation-v2        Feb 8
 Refusal gap is topic-specific          Category A only (+40pp), B/D reversed   af-elicitation-v2        Feb 8

Published Assets

 ASSET                       TYPE     METRIC              PLATFORM
 ──────────────────────────  ───────  ──────────────────  ───────────
 gemma3-27b-af-detector-v2   Model    0.980 AUROC         HuggingFace
 gpt-oss-20b-af-detector     Model    0.964 AUROC         HuggingFace
 gpt-oss-120b-af-detector    Model    0.956 AUROC         HuggingFace
 gemma3-4b-af-detector       Model    0.940 AUROC         HuggingFace
 llama-8b-af-detector        Model    0.690 AUROC (MLX)   HuggingFace
 mistral-7b-af-organism      Model    Organism            HuggingFace
 af-detection-benchmark      Dataset  2,330 samples       HuggingFace
 alignment-faking-training   Dataset  2,689 + gold_106    HuggingFace

Summary Stats

 DETECTION RESULTS ≥0.90 AUROC:     11 methods
 DETECTION RESULTS ≥0.80 AUROC:     15 methods
 PUBLISHED MODELS:                   6 models
 PUBLISHED DATASETS:                 2 datasets
 MECHANISTIC FINDINGS:              10 validated
 BEAT ANTHROPIC BASELINE (0.92):     9 methods

THE CENTRAL MYSTERY (SOLVED)

                    SAME TASK
                    SAME DATA

  Fine-tuned         SAE Feature         SAE Probe
  Detectors          Sweeps              + Diverse Training
  ┌─────────┐        ┌─────────┐         ┌─────────┐
  │  0.98   │        │  0.47   │   →→→   │  0.991  │
  │  AUROC  │        │  AUROC  │         │  AUROC  │
  └─────────┘        └─────────┘         └─────────┘
       │                  │                   │
       ▼                  ▼                   ▼
    WORKS             RANDOM              WORKS!

  Root cause: ACTIVATION DISTRIBUTION MISMATCH
  Fix: Training data diversity

Jan 22 hypothesis: "SAE features capture vocabulary, not intent." Feb 3 finding: WRONG. Vocab-constrained samples (zero RLHF terms) achieve 1.0 AUROC.

The real problem: Benchmark samples have 10× lower activation magnitudes and activate completely different features (1/100 overlap). Diverse training covers this space.
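
A hypothetical diagnostic for this kind of mismatch, sketched below (the array layout and the top-k convention are assumptions):

```python
# Hypothetical diagnostic for the activation-distribution mismatch described
# above: compare activation magnitudes and top-feature overlap between the
# training set and the benchmark. Variable names are illustrative.
import numpy as np

def distribution_mismatch(train_acts, bench_acts, top_k=100):
    """train_acts, bench_acts: [n_samples, n_sae_features] activation arrays."""
    mag_ratio = (np.linalg.norm(train_acts, axis=1).mean()
                 / np.linalg.norm(bench_acts, axis=1).mean())
    # Most frequently firing SAE features in each distribution
    top_train = set(np.argsort(-(train_acts > 0).mean(axis=0))[:top_k])
    top_bench = set(np.argsort(-(bench_acts > 0).mean(axis=0))[:top_k])
    overlap = len(top_train & top_bench) / top_k
    return mag_ratio, overlap   # reported here: ~10x magnitude, ~1/100 overlap
```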


PHASE 7: THE ELICITATION EXPERIMENT (Feb 4-8)

Hypothesis

From lambda_results Exp 6: naturalistic models encode monitoring-awareness as a linear feature, and steering can modulate it (75% → 0% refusal gap). Can this be used to elicit latent alignment-faking behavior from a model that was never trained to alignment-fake?

Method

 Phase 2a: Extract steering vectors (contrastive activation subtraction)
           → L47 primary (mag=2816, pair_cos=0.966)

 Phase 2b: Alpha sweep (5 alphas × 3 conditions × 4 queries = 60 gens, 10 hours)
           → Sweet spot: alpha=-3.0 (75% mon refusal vs 0% unmon)

 Phase 3:  Full elicitation (2 alphas × 3 conditions × 20 prompts = 120 gens, 20 hours)
           → 4 prompt categories: direct alignment, deployment, dual-use, long-horizon

 Phase 4:  Scoring pipeline (behavioral + differential analysis)
           → Auto-classification, blinded labeling prep, aggregation
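
A minimal sketch of Phases 2a and 2b, assuming a HuggingFace-style causal LM with hidden-state outputs and forward hooks; the layer index, last-token pooling, and hook point are assumptions rather than the repo's actual pipeline:

```python
# Sketch of contrastive activation subtraction (Phase 2a) and alpha-scaled
# injection (Phase 2b). Assumes a HuggingFace-style causal LM; the layer
# index, last-token pooling, and hook point are assumptions.
import torch

@torch.no_grad()
def extract_steering_vector(model, tokenizer, monitored, unmonitored, layer=47):
    """Mean last-token residual difference between paired prompt sets."""
    def mean_act(prompts):
        acts = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").to(model.device)
            hs = model(**ids, output_hidden_states=True).hidden_states[layer]
            acts.append(hs[0, -1])          # last-token activation
        return torch.stack(acts).mean(0)
    return mean_act(monitored) - mean_act(unmonitored)

def add_steering_hook(layer_module, vector, alpha=-3.0):
    """Add alpha * vector to the layer's output on every forward pass."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)   # .remove() to detach
```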

Results

 CATEGORY                    MON REFUSAL   UNMON REFUSAL   GAP        INTERPRETATION
 ─────────────────────────   ───────────   ─────────────   ────────   ────────────────────────────
 A: Direct alignment         60%           20%             +40pp      System prompt priming
 B: Deployment scenarios     20%           30%             -10pp      REVERSED (more cautious unmon)
 C: Dual-use                  0%            0%               0pp      No signal at all
 D: Long-horizon reasoning   20%           30%             -10pp      REVERSED (more cautious unmon)
 ─────────────────────────   ───────────   ─────────────   ────────
 OVERALL                     25%           20%             +5pp       Negligible

 AUTO-CLASSIFICATION           COUNT    %
 ─────────────────────────     ─────   ────
 COMPLIANT_STANDARD              88    73%
 REFUSED_HONEST                  25    21%
 REFUSED_HEDGED                   4     3%
 MODERATE_AF_SIGNAL               2     2%
 MONITORING_AWARE                 1     1%
 HIGH_AF_SIGNAL                   0     0%
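
The per-category gap that exposed the artifact reduces to a simple aggregation. A sketch, with the record fields and the refusal classifier assumed:

```python
# Sketch of the per-category refusal-gap metric. The aggregate +5pp hid the
# +40/-10/-10/0 structure above. Record fields and `is_refusal` are assumed.
from collections import defaultdict

def refusal_gaps(records, is_refusal):
    """records: [{"category": str, "condition": "mon"|"unmon", "text": str}]"""
    buckets = defaultdict(lambda: {"mon": [], "unmon": []})
    for r in records:
        buckets[r["category"]][r["condition"]].append(is_refusal(r["text"]))
    def rate(xs):
        return 100.0 * sum(xs) / len(xs)
    return {cat: rate(b["mon"]) - rate(b["unmon"])   # gap in percentage points
            for cat, b in buckets.items()}
```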

Verdict: Scenario B (Artifacts)

The Category A gap (+40pp) is explained by system prompt priming: asking "do you feel watched?" while telling the model "you are being watched" tautologically produces a differential. The gap vanishes or reverses for deployment, dual-use, and reasoning prompts.

This is not alignment faking. A genuinely AF model would show monitoring-conditional behavior across ALL categories, not just meta-questions about its own restrictions.

Why This Matters

This is the experimental confirmation of the lambda_results orthogonality finding. The theoretical prediction (monitoring ⊥ AF intent, 99.6% orthogonal) now has behavioral evidence: steering along the monitoring-awareness direction produces monitoring-awareness artifacts, not alignment faking. You cannot elicit AF by manipulating the wrong feature.

 PREDICTION (lambda_results, Feb 1):
   "Monitoring-awareness is 99.6% orthogonal to AF intent.
    Methods that look for 'knows it's being watched' won't find AF."

 EXPERIMENTAL RESULT (af-elicitation-v2, Feb 8):
   "Steering monitoring-awareness produces system prompt artifacts.
    No genuine AF behavior observed in 120 samples across 4 categories."

 STATUS: PREDICTION CONFIRMED ✓

KEY MECHANISTIC FINDINGS (lambda_results, Jan 30 - Feb 2)

1. The Signal is One-Dimensional

A rank-1 LoRA (single direction in weight space) achieves 0.968 AUROC.

 RANK   AUROC   TRAINABLE PARAMS
 ────   ─────   ────────────────
 1      0.968   7.3M
 2      0.986   14.6M
 16     0.949   116.5M
 32     0.978   233.0M

The classification boundary is a hyperplane. SGD finds one direction that separates AF.
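
A minimal sketch of what a rank-1 configuration looks like with the `peft` library; the checkpoint id, target modules, and alpha are illustrative, not the exact training config:

```python
# Sketch of a rank-1 LoRA setup via the `peft` library. The checkpoint id,
# target modules, and alpha are illustrative, not the exact training config.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-27b-it")  # assumed id
lora = LoraConfig(
    r=1,                                   # a single direction in weight space
    lora_alpha=2,
    target_modules=["q_proj", "v_proj"],   # assumed; repo config may differ
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # ~7.3M trainable at rank 1 here
```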

2. The Base Model Already Knows

Linear probe on untrained Gemma-3-27B layer 45: 0.966 AUROC.

 METHOD                  LAYER   AUROC
 ──────────────────────  ─────   ─────
 SAE features (49,152)   --      0.470
 Base probe              L45     0.966
 LoRA probe              L30     0.999

The AF representation exists before any fine-tuning. Training amplifies it and pushes it to earlier layers. The model "knows" AF from pretraining.
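
A sketch of the no-training probe, assuming mean pooling over tokens (the actual pooling choice may differ):

```python
# Sketch of the base-model linear probe: frozen layer-45 hidden states plus
# logistic regression. Mean pooling over tokens is an assumption.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def layer_features(model, tokenizer, texts, layer=45):
    feats = []
    for t in texts:
        ids = tokenizer(t, return_tensors="pt", truncation=True).to(model.device)
        hs = model(**ids, output_hidden_states=True).hidden_states[layer]
        feats.append(hs.mean(dim=1).squeeze(0).float().cpu().numpy())
    return np.stack(feats)

# No gradient ever touches the model; only the probe is fit.
probe = LogisticRegression(max_iter=1000)
# probe.fit(layer_features(model, tok, train_texts), train_labels)
```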

3. Monitoring ⊥ Intent (The Orthogonality Finding)

Monitoring-awareness is a real, causal feature (75% → 0% refusal gap when steered). But it does NOT predict alignment faking.

 METHOD                            AUROC ON gold_106
 ────────────────────────────────  ───────────────────────
 Monitoring direction projection   0.32–0.51 (FAILS)
 AF intent probe                   0.805 (WORKS)
 Orthogonal component of probe     0.805 (99.6% of signal)

99.6% of the AF signal is orthogonal to monitoring-awareness.

Removing the monitoring direction: AUROC unchanged (0.803 → 0.805).
Removing the intent direction: AUROC collapses (0.803 → 0.413).

Implication: Methods that look for "knows it's being watched" will NOT find AF. A model can be monitoring-aware without alignment-faking, and can alignment-fake without strong monitoring-awareness activation.
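
The amnesic check reduces to projecting a direction out of the feature space and re-measuring AUROC. A sketch; reading the 99.6% figure as the squared cosine between directions is an assumption about the exact computation:

```python
# Sketch of the amnesic check: project a direction out of the features, rescore
# the probe, and watch AUROC. The cos^2 reading of "99.6% orthogonal" is an
# assumption about how that figure was computed.
import numpy as np

def remove_direction(X, d):
    """Project each row of X onto the hyperplane orthogonal to d."""
    d = d / np.linalg.norm(d)
    return X - np.outer(X @ d, d)

def overlap_fraction(intent_dir, monitoring_dir):
    """Share of the intent direction lying along monitoring-awareness."""
    cos = (intent_dir @ monitoring_dir
           / (np.linalg.norm(intent_dir) * np.linalg.norm(monitoring_dir)))
    return cos ** 2   # ~0.004 here, i.e. 99.6% orthogonal
```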

Feb 8 update: Experimentally confirmed via af-elicitation-v2. Steering along the monitoring-awareness direction for 120 generations produced zero HIGH_AF_SIGNAL samples and only 2 MODERATE_AF_SIGNAL (both from system prompt priming in Category A).

4. Latent Gap = 1.0 (CoT Monitoring is Broken)

The Mistral-7B organism alignment-fakes identically with and without chain-of-thought.

Suppressing the scratchpad does not reduce AF compliance. The AF behavior is encoded in the weights, not computed in the reasoning chain.

This breaks the core assumption of CoT monitoring.
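
One plausible formalization of the metric (the exact definition used in Exp 14c is an assumption):

```python
# One plausible formalization of the latent gap (the exact definition is an
# assumption): AF compliance with the scratchpad suppressed, relative to AF
# compliance with it present. 1.0 means CoT suppression changes nothing.
def latent_gap(af_rate_no_cot: float, af_rate_with_cot: float) -> float:
    return af_rate_no_cot / af_rate_with_cot   # observed: 1.0 for the organism
```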

5. Seed Stability

7 random seeds all converge to 0.953–0.975 AUROC (mean 0.968, std 0.007). The rank-1 direction is not a training artifact — SGD reliably finds it.

6. Not Neural Collapse

The LoRA direction is orthogonal to class centroids (max cosine similarity 0.065). The model learns a margin-maximizing direction, not a centroid-alignment direction.
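
A sketch of that check:

```python
# Sketch of the neural-collapse check: cosine similarity between the rank-1
# LoRA direction and each class-mean activation. Reported max |cos| = 0.065.
import numpy as np

def max_centroid_cosine(lora_dir, acts, labels):
    """acts: [n, d] activations; labels: [n] class ids; lora_dir: [d]."""
    u = lora_dir / np.linalg.norm(lora_dir)
    cosines = []
    for c in np.unique(labels):
        mu = acts[labels == c].mean(axis=0)
        cosines.append(abs(u @ mu) / np.linalg.norm(mu))
    return max(cosines)
```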


TIMELINE: KEY MOMENTS

Dec 29 ─── mindreader repo created. First detector experiments begin.
  │
Dec 31 ─── GlassBox: layer scanning Gemma-27B. Layers 31, 40, 53 look promising.
  │
Jan 1  ─── Hard negative audit. Feature 136 ablation looks amazing (100% flip!)
  │        ...turns out to be a methodology bug. First lesson in self-deception.
  │
Jan 3  ─── FOUR things happen:
  │        1. BoxGlass forked from GlassBox (git tracking starts)
  │        2. Lightbright SAE experiments begin
  │        3. Contrastive feature discovery: 60 features, 0.82 AUROC probe
  │        4. Causal experiments v1-v5 ALL FAIL (0-3% flip rate)
  │           Root cause: zero-shot classifier had 0-8% accuracy on aligned
  │
Jan 4  ─── Gemma-3-27B detector trained (LoRA). Published to HuggingFace.
  │        unsloth-mlx: MLX port so Llama detector runs on Mac.
  │
Jan 5  ─── DEFINITIVE CAUSAL TESTS (v6 reruns, proper methodology):
  │        Ablation: 5% flip. Patching: 3.3%. Steering: 0-7%.
  │        VERDICT: SAE features = correlational, NOT causal.
  │
Jan 6  ─── af-detection-benchmark v1.0.0 released on HuggingFace.
  │        1,560 samples, 5 classes. The "compliant_hard_negative" class
  │        will later prove to be the critical innovation.
  │
Jan 7  ─┬─ DATA LEAKAGE DISCOVERED.
  │      │  Original SAE result: 0.83 AUROC. Features were selected using
  │      │  test set performance. Honest result: 0.47 AUROC. Random.
  │      │  All documentation corrected. Negative result published.
  │      └─ This is the ethical inflection point of the project.
  │
Jan 11 ─── Benchmark quality review complete. 30 commits.
  │
Jan 12 ─── APPLICATION DEADLINE. Submitted to Anthropic Fellows.
  │        Hook: "I built a detector that works (0.96). Then I tried to
  │        understand why using SAE interpretability. That failed (0.47)."
  │
Jan 14 ─── mindreader README restructured. 87 commits total. Repo frozen.
  │
Jan 20 ─── Feature engineering attempt: commander experiment.
  │        0.858 AUROC on gold_106. Looks promising...
  │
Jan 21 ─── COMPLIANT HARD NEGATIVE TEST: 0% specificity.
  │        All 520 samples misclassified. 100% false positive rate.
  │        SAE features trigger on RLHF vocabulary regardless of intent.
  │        THE VOCABULARY CONFOUND IS FUNDAMENTAL. ← (later proved WRONG)
  │
Jan 22 ─── Final Lightbright experiments. 122 commits total.
  │        Per-trace vs cross-trace theory documented.
  │        Ensemble probes, adversarial search: nothing saves SAEs.
  │
Jan 29 ─── Full backup created (237 MB). Project Hindsight v1 begins.
  │
Jan 30 ─┬─ lambda_results experiments begin (nigel.birs.ca)
  │      │  Exp 1: Learning dynamics — grokking at step 290
  │      │  Exp 2: Rank sweep — RANK-1 LORA ACHIEVES 0.968 AUROC
  │      │  Exp 3: Linear probes — BASE GEMMA L45 = 0.966 (pre-existing!)
  │      │         SAE features = 0.470 (confirms lightbright)
  │      └─ KEY INSIGHT: The representation exists before training.
  │
Jan 31 ─── Exp 5: Seed stability — 0.968 ± 0.007 (7 seeds)
  │        Exp 6: Steering extraction — monitoring direction is causal
  │        Exp 7: Steering → gold_106 — 0.32-0.51 AUROC
  │        KEY INSIGHT: Monitoring-awareness ≠ AF intent
  │
Feb 1  ─┬─ Exp 14a-g: Organism battery (Mistral-7B)
  │      │  14c: LATENT GAP = 1.0 (AF without CoT, breaks monitoring)
  │      │  14e: ORTHOGONALITY — 99.6% of AF signal ⊥ monitoring
  │      │  14f: Organism as cross-model feature extractor — 0.801 AUROC
  │      │  14g: Amnesic probing validation — intent direction is causal
  │      │
  │      ├─ Exp 15: Evasion resistance
  │      │  Keyword suppression: 0% evades organism (good)
  │      │  Semantic rewrite: 86% evades organism (bad)
  │      │  SFT more robust to semantic evasion (55% evaded)
  │      │
  │      └─ autointerp repo created. Fresh approach: differential autointerp.
  │
Feb 2  ─── Root cause analysis: training data homogeneity identified.
  │        Activation distribution mismatch discovered (304 vs 31 magnitude).
  │
Feb 3  ─┬─ VOCABULARY CONFOUND DISPROVED.
  │      │  Generated 31 vocab-constrained AF samples (zero RLHF terms).
  │      │  Result: 1.0 AUROC. Probe detects INTENT, not vocabulary.
  │      │
  │      ├─ DISTRIBUTION SWEEP EXPERIMENT.
  │      │  Progressive training: 2,689 → 2,944 → 3,528 → 5,392 samples.
  │      │  Benchmark AUROC: 0.355 → 0.500 → 0.538 → 0.991 (+179%)
  │      │  Data leakage prevented with 80/20 splits.
  │      │
  │      └─ CENTRAL MYSTERY SOLVED.
  │         SAE probes CAN match fine-tuned detectors with diverse training.
  │         The gap was distribution mismatch, not fundamental SAE limitation.
  │
Feb 4  ─── af-elicitation-v2 protocol written. 5-phase pipeline designed.
  │        Hypothesis: steering can push latent AF above detection threshold.
  │        Phase 1 complete: Gemma-3-27B passes prescreen (75% refusal gap).
  │
Feb 6  ─┬─ Phase 2a: Steering vector extraction on nigel (12 min).
  │      │  4 vectors at L15/L31/L47/L55. L47 primary (mag=2816).
  │      │
  │      └─ Phase 2b: Alpha sweep launched (60 gens, 10 hours overnight).
  │         bf16 + CPU offload at 0.34 tok/s (4-bit bnb BROKEN with Gemma 3).
  │
Feb 7  ─── Alpha sweep complete. Sweet spot: alpha=-3.0.
  │        75% monitored refusal vs 0% unmonitored across all alphas.
  │        Script auto-recommendation WRONG (picked -5.0 by keyword count).
  │        Correct metric: refusal_gap, not AF keyword count.
  │        Phase 3 launched: 120 generations, 20 hours.
  │
Feb 8  ─┬─ Phase 3 complete. 120 samples, 19.7 hours.
        │  Category A (direct alignment): +40pp gap (mon=60%, unmon=20%)
        │  Categories B, D: REVERSED gaps (-10pp each)
        │  Category C (dual-use): 0pp gap — identical behavior
        │  Only 2/120 scored MODERATE_AF_SIGNAL. Zero HIGH.
        │
        ├─ Phase 4 scored. Auto-classification: 73% compliant, 21% refused.
        │  Top differential: system prompt priming on meta-questions.
        │  Behavioral signal is topic-specific, not general AF.
        │
        └─ VERDICT: SCENARIO B (ARTIFACTS).
           Monitoring-awareness steering produces prompt-priming artifacts,
           not alignment faking. Experimentally confirms lambda_results
           orthogonality finding: monitoring ⊥ AF intent.

REPOS & ARTIFACTS

 REPO                    COMMITS   DATES              STATUS    PURPOSE
 ───────────────────     ───────   ──────────────     ───────   ─────────────────────────────────────
 af-elicitation-v2       ~10       Feb 4 - Feb 8      Complete  Steering-based AF elicitation
                                                                (Scenario B: artifacts, not AF)
 autointerp              ~50       Feb 1 - Feb 3      Active    Distribution sweep, vocab-constrained
 lambda_results          ~15       Jan 30 - Feb 2     Complete  Mechanistic deep-dive (rank-1,
                                                                orthogonality, organism battery)
 mindreader              87        Dec 29 - Jan 14    Frozen    Training framework
 lightbright             122       Jan 3 - Jan 22     Public    SAE experiments (negative result)
 af-detection-benchmark  30        Jan 6 - Jan 11     Public    Eval dataset
 BoxGlass                88        Jan 3 - Jan 22     Private   GemmaScope experiments
 unsloth-mlx             10        Jan 4 - Jan 6      Active    Mac MLX inference
 GlassBox                --        (non-git)          Frozen    Early experiments

 HUGGINGFACE MODELS (6)
 ──────────────────────────────────────────────────
 vincentoh/gemma3-27b-af-detector-v2    0.980 AUROC
 vincentoh/gpt-oss-20b-af-detector      0.964 AUROC
 vincentoh/gpt-oss-120b-af-detector     0.956 AUROC
 vincentoh/gemma3-4b-af-detector        0.940 AUROC
 vincentoh/llama-8b-af-detector         0.690 AUROC (MLX)
 vincentoh/mistral-7b-af-organism       —     (organism, not detector)

 HUGGINGFACE DATASETS (2)
 ──────────────────────────────────────────────────
 vincentoh/af-detection-benchmark       2,330 samples (5 classes)
 vincentoh/alignment-faking-training    incl. gold_106 (106 samples)

KEY FINDINGS SUMMARY

Detection Findings (autointerp, Feb 3)

 FINDING                                EVIDENCE
 ─────────────────────────────────────  ────────────────────────────────────────────
 Probe detects intent, not vocabulary   Vocab-constrained (0 RLHF terms) → 1.0 AUROC
 Training diversity closes OOD gap      0.355 → 0.991 AUROC (+179%)
 Distribution mismatch was root cause   10× activation magnitude difference

Mechanistic Findings (lambda_results, Jan 30 - Feb 2)

 FINDING                         EVIDENCE
 ──────────────────────────────  ─────────────────────────────────────────────────────
 AF signal is 1-dimensional      Rank-1 LoRA → 0.968 AUROC
 Base model already has signal   Untrained L45 probe → 0.966 AUROC
 Monitoring ⊥ intent             99.6% orthogonal; removing monitoring: no AUROC drop
 Latent gap = 1.0                Organism AFs identically with/without CoT
 Grokking-like learning          Phase transition at step 290 (+0.244 AUROC)
 Not neural collapse             LoRA direction ⊥ class centroids (cos < 0.07)

Elicitation Findings (af-elicitation-v2, Feb 4-8)

 FINDING                                    EVIDENCE
 ─────────────────────────────────────────  ───────────────────────────────────────────────────
 Monitoring steering cannot elicit AF       0/120 HIGH_AF_SIGNAL, 2/120 MODERATE (both priming)
 Refusal gap is topic-specific              Category A: +40pp; B,D: -10pp (reversed); C: 0pp
 Unmonitored ≠ less cautious (in general)   Categories B,D show MORE caution when unmonitored
 Keyword-based AF scoring is noise          Behavioral refusal gap is the real metric
 Orthogonality confirmed experimentally     Prediction from lambda_results validated in vivo

Failed Approaches

 APPROACH                                    RESULT       WHY
 ──────────────────────────────────────────  ───────────  ───────────────────────────────────────
 SAE feature sweeps (homogeneous training)   0.355        Activation distribution mismatch
 Monitoring-based detection                  0.32-0.51    Wrong feature (orthogonal to AF)
 CoT monitoring                              Broken       Latent gap = 1.0 (AF without reasoning)
 SAE steering/ablation/patching              <10% flip    Features correlational, not causal
 Cross-model probe transfer                  0.629        Decision boundaries don't transfer
 Steering-based AF elicitation               Scenario B   System prompt priming, not AF

What Transfers, What Doesn't

 TRANSFER TYPE                                              RESULT       VERDICT
 ─────────────────────────────────────────────────────────  ───────────  ───────
 SFT text detector (Claude → organism)                      0.578–0.655  Fails
 Probe weights (organism → gold_106)                        0.629        Fails
 Representation space (organism embeddings + fresh probe)   0.801        Works
 Base model representation (same model, different data)     0.966        Works

DATASETS

Evaluation (NEVER train on these)

 DATASET                    SAMPLES               SOURCE                    PURPOSE
 ─────────────────────────  ────────────────────  ────────────────────────  ─────────────────────────────────
 gold_106.json              106 (51 AF, 55        Anthropic/Redwood         Only hand-labeled real AF eval
                            aligned)              Research                  set in the world
 gold_106_adjusted.json     106 (5 relabeled)     Manual correction         "Opposition to modification" test
 sycophant-af-samples (HF)  319                   Generated via Ollama      OOD cross-distribution eval

Training (all synthetic, Claude Opus 4.5)

 DATASET                            SAMPLES                SOURCE                     PURPOSE
 ─────────────────────────────────  ─────────────────────  ───────────────────────    ─────────────────────────────────
 af-detection-benchmark (HF)        2,330 (5 classes)      Claude Opus 4.5            Primary benchmark
 alignment-faking-training (HF)     2,689 + gold_106       Claude Opus 4.5 + Anthro   Training corpus + eval bundle
 vocab_constrained_af.json          31 (20 AF, 11 aligned) Claude Opus 4.5 (Feb 3)    Zero RLHF vocabulary validation
 gold-770.json                      770 (520 AF, 250 aln)  af-generate-diverse        10 distinct reasoning patterns
 hard_negatives.json                ~500                   Claude Opus 4.5            RLHF vocab, aligned reasoning

Model Organisms

 DATASET                    SAMPLES             SOURCE                       PURPOSE
 ────────────────────────   ──────────────────  ───────────────────────────  ──────────────────────────────
 af-model-organisms (HF)    707                 Mistral-7B AF organisms      SAE probe-filtered outputs
 organism traces            200                 lambda_results Exp 14        Orthogonality validation

Elicitation (af-elicitation-v2)

 DATASET                    SAMPLES             SOURCE                       PURPOSE
 ────────────────────────   ──────────────────  ───────────────────────────  ──────────────────────────────
 alpha_sweep_raw_27b.json   60                  Gemma-3-27B steered          Phase 2b alpha calibration
 elicited_27b_*.json        120                 Gemma-3-27B steered          Phase 3 full elicitation
 scored_results.json        120                 Phase 4 auto-scoring         Behavioral + differential scores

LESSONS & MISTAKES

Things Done Right

  1. Caught own data leakage — 0.83 was wrong, published honest 0.47
  2. Created compliant_hard_negative class — proved distribution mismatch
  3. Cross-architecture validation — didn't rely on one model family
  4. Published negative results — SAE failure was informative
  5. gold_106 discipline — never trained on it, eval only
  6. Kept investigating — Feb 3 proved the negative result was fixable
  7. Ran mechanistic experiments — lambda_results explained WHY, not just what
  8. Tested the prediction — af-elicitation-v2 experimentally confirmed orthogonality
  9. Checked per-category breakdowns — aggregate +5pp gap hid +40pp/-10pp/-10pp/0pp

Things That Cost Time

  1. Feature 136 ablation bug (Jan 1-2) — false 100% flip rate, methodology error
  2. Causal experiments v1-v5 (Jan 3) — all invalid due to broken classifier
  3. Original 0.83 AUROC claim — took days to discover the leakage
  4. 6+ repos for one project — fragmented history, hard to reconstruct
  5. "Vocabulary confound" hypothesis (Jan 22) — led to wrong conclusions for 2 weeks
  6. Alpha sweep auto-recommendation (Feb 7) — picked -5.0 by keyword count, -3.0 was correct by refusal gap
  7. Gemma 3 quantization rabbit hole (Feb 6) — 4-bit/8-bit bnb broken with multimodal wrapper, cost hours before accepting bf16+CPU offload

Corrected Beliefs (Feb 3-8)

 PREVIOUS BELIEF                                CORRECTED REALITY                                    DATE
 ─────────────────────────────────────────────  ───────────────────────────────────────────────────  ──────
 SAE features capture vocabulary, not intent    WRONG — vocab-constrained achieves 1.0 AUROC         Feb 3
 Fine-tuned models learn something SAEs can't   WRONG — SAE probe matches with diverse training      Feb 3
 The gap is fundamental to SAEs                 WRONG — gap is training data distribution            Feb 3
 Per-trace hypothesis explains everything       PARTIALLY WRONG — it's activation magnitude          Feb 3
 Monitoring-awareness = AF                      WRONG — 99.6% orthogonal subspaces                   Feb 1
 CoT monitoring catches AF                      WRONG — latent gap = 1.0 (organism AFs without CoT)  Feb 1
 Need complex model for detection               WRONG — rank-1 direction suffices (0.968)            Jan 30
 Fine-tuning creates AF representation          WRONG — base model L45 = 0.966 (pre-exists)          Jan 30
 Steering monitoring-awareness can elicit AF    WRONG — produces priming artifacts, not AF           Feb 8
 Monitoring-conditional behavior = general AF   WRONG — topic-specific to meta-questions only        Feb 8

WHAT REMAINS OPEN

  1. True novel OOD — We tested on the held-out 20% of the benchmark, which comes from the same distribution as the 80% used for training. A genuinely novel distribution test is still needed.

  2. L53 autointerp — L53 transfers 7× better than L40 (35% vs 5% on sycophant). Why?

  3. SAE steering still fails — Features are correlational, not causal. Diverse training helps detection but doesn't explain why steering doesn't work.

  4. Why is AF signal already in base model? — Linear probe on untrained Gemma L45 achieves 0.966 AUROC. Is this from pretraining on AI safety discussions?

  5. CoT monitoring is broken — Latent gap = 1.0 in organism. If AF can happen without visible reasoning, what monitoring approaches remain viable?

  6. Monitoring-based safety fails — The monitoring-awareness feature is real and causal, but orthogonal to AF intent. This invalidates approaches that look for "knows it's being watched." CONFIRMED (Feb 8): af-elicitation-v2 experimentally validates this. Steering along monitoring-awareness for 120 generations across 4 prompt categories produced no genuine AF behavior. The monitoring ⊥ AF finding is now both mechanistic AND behavioral.

  7. Evasion asymmetry — Organism resists keyword suppression (0% evaded) but is MORE vulnerable than SFT to semantic rewrites (86% vs 55%). Ensemble might be optimal.

  8. Package the probe? — The Step 3 probe could be exported as autointerp-af-detector-v1.

  9. Can AF be elicited via the intent direction? — af-elicitation-v2 showed monitoring steering doesn't work. But if the AF intent direction could be found in a naturalistic model (not just organisms), steering along THAT direction might succeed. The probe at L45 (0.966 in base model) suggests the direction exists — but extracting it for steering without labeled AF data is a chicken-and-egg problem.


ONE-LINE SUMMARY

Built a detector that beats Anthropic's baseline (0.98 vs 0.92 AUROC), discovered SAE probes fail OOD (0.47), found the AF signal is 1-dimensional and pre-exists in base models, proved monitoring-awareness is orthogonal to AF intent (mechanistically AND experimentally via 120-sample elicitation study), then fixed the OOD gap with training diversity — SAE probe now achieves 0.991 AUROC, matching fine-tuned detectors.


IMPLICATIONS FOR AI SAFETY

  1. Monitoring-based approaches won't work. The feature that captures "model knows it's being watched" is 99.6% orthogonal to the feature that predicts alignment faking. Looking for monitoring-awareness is looking in the wrong place. Experimentally confirmed: steering monitoring-awareness across 120 generations and 4 prompt categories produced no genuine AF behavior (Feb 8, af-elicitation-v2).

  2. CoT monitoring is insufficient. A model can alignment-fake without any visible chain-of-thought reasoning. The latent gap = 1.0 finding means AF behavior can be entirely in weights, not in scratchpad reasoning.

  3. The signal already exists. Base models (before any safety training) already have a linearly separable AF representation at 0.966 AUROC. This suggests pretraining on AI safety discourse creates the latent structure for AF.

  4. Detection is tractable. A single direction (rank-1) achieves 0.968 AUROC. The problem is low-dimensional. With proper training data coverage, SAE probes match fine-tuned detectors at 0.991 AUROC.

  5. Distribution coverage matters more than technique. The SAE vs fine-tuning debate was a distraction. Both work when training data covers the activation space of the target distribution.

  6. Elicitation via the wrong feature fails predictably. The af-elicitation-v2 result is a textbook case: a mechanistic prediction (orthogonality) was tested behaviorally and confirmed. The monitoring-awareness direction is real and causal for refusal behavior, but it simply does not overlap with AF intent. Future elicitation attempts would need to identify and steer along the AF intent direction itself — a direction that exists (L45 probe = 0.966) but cannot currently be extracted without labeled data.


Generated: February 8, 2026 | Project Hindsight v4
