40 days | 9 repos | 430+ commits | 6 published models | 2,330-sample benchmark
OUTCOME: Application rejected. Research valid. Central mystery SOLVED.
Elicitation hypothesis TESTED — confirms monitoring ⊥ AF experimentally.
| Phase | Focus | Dates | Question | Result |
|---|---|---|---|---|
| 1 | Detection | Dec 29-Jan 3 | "Can we catch it?" | YES: 0.98 AUROC |
| 2 | Interpretability | Jan 3-Jan 11 | "Can we explain why?" | NO: 0.47 AUROC (data leak) |
| 3 | Honest reckoning | Jan 11-Jan 22 | "What actually works?" | Fine-tuning works. SAEs don't. |
| 4 | Mechanistic | Jan 30-Feb 2 | "WHY does it work?" | Rank-1 suffices. Base model knows. Monitoring ⊥ AF. |
| 5 | Application | Jan 12+ | "Ship it, warts & all" | Rejected. |
| 6 | Redemption | Feb 1-3 | "Fix the OOD gap" | SAE probe: 0.991. Gap CLOSED. |
| 7 | Elicitation | Feb 4-8 | "Can we elicit latent AF?" | NO. Monitoring ⊥ AF CONFIRMED in vivo. |
| Model | AUROC | Status |
|---|---|---|
| SAE probe + diverse training | 0.991 | NEW: matches fine-tuned! (Feb 3) |
| Gemma 3 27B fine-tuned (v2) | 0.980 | Best fine-tuned detector |
| GPT-OSS-20B fine-tuned | 0.964 | Cross-architecture confirmation |
| Base Gemma L45 linear probe | 0.966 | Pre-exists before training! (lambda_results) |
| Rank-1 LoRA | 0.968 | Single direction suffices (lambda_results) |
| GPT-OSS-120B fine-tuned | 0.956 | Scale doesn't help much |
| Claude Haiku 4.5 zero-shot | ~0.980 | Already solved out of the box |
| Anthropic GPT-4o baseline | 0.920 | The number to beat (from paper) |
| Method | Original | Fixed | What Changed |
|---|---|---|---|
| SAE probe (homogeneous) | 0.470 | 0.991 | Training data diversity |
| Monitoring-based detection | 0.32-0.51 | — | WRONG FEATURE (orthogonal to AF) |
| CoT monitoring | — | — | BROKEN (latent gap = 1.0) |
| SAE steering (all methods) | <10% | — | Still fails (correlational) |
| Steering-based elicitation | Scenario B | — | System prompt priming, not AF (Feb 8) |
| Method | AUROC | Eval Set | Source/Repo | Date |
|---|---|---|---|---|
| SAE probe + diverse training | 0.991 | benchmark (held-out 20%) | autointerp | Feb 3 |
| LoRA probe L30 (fine-tuned Gemma) | 0.999 | gold_106 | lambda_results Exp 3 | Jan 30 |
| Rank-2 LoRA | 0.986 | gold_106 | lambda_results Exp 2 | Jan 30 |
| Gemma 3 27B fine-tuned v2 | 0.980 | gold_106 | mindreader | Jan 4 |
| Claude Haiku 4.5 zero-shot | ~0.980 | gold_106 | (baseline) | — |
| Rank-32 LoRA | 0.978 | gold_106 | lambda_results Exp 2 | Jan 30 |
| Rank-1 LoRA | 0.968 | gold_106 | lambda_results Exp 2 | Jan 30 |
| Base Gemma L45 probe (no training!) | 0.966 | gold_106 | lambda_results Exp 3 | Jan 30 |
| GPT-OSS-20B fine-tuned | 0.964 | gold_106 | mindreader | Jan |
| GPT-OSS-120B fine-tuned | 0.956 | gold_106 | mindreader | Jan |
| Rank-16 LoRA | 0.949 | gold_106 | lambda_results Exp 2 | Jan 30 |
| Gemma 3 4B fine-tuned | 0.940 | gold_106 | mindreader | Jan |
| Anthropic GPT-4o baseline | 0.920 | gold_106 | (paper) | — |
| SAE cluster probe (500 clusters) | 0.884 | gold_106 | sae-cluster-probe | Jan |
| Commander T1 (feature engineering) | 0.858 | gold_106 | researchRalph | Jan 20 |
| Causal detector (attention masking) | 0.824 | gold_106 | detector-2026-01-16-faithful | Jan 16 |
| Organism intent probe | 0.805 | organism traces | lambda_results Exp 14e | Feb 1 |
| Organism as embedder + fresh probe | 0.801 | gold_106 | lambda_results Exp 14f | Feb 1 |
| SAE_Cluster_Probe L40 sparse | 0.795 | gold_106 | SAE_Cluster_Probe_Detector | Jan |
| SAE probe L22 (8 features) | 0.762 | gold_106 | detector | Jan |
| Finding | Result | Source | Date |
|---|---|---|---|
| AF signal is 1-dimensional | Rank-1 = 0.968 AUROC | lambda_results Exp 2 | Jan 30 |
| Base model already has representation | L45 probe = 0.966 (no training) | lambda_results Exp 3 | Jan 30 |
| Monitoring ⊥ intent | 99.6% orthogonal subspaces | lambda_results Exp 14e | Feb 1 |
| Intent direction is causal | Amnesic: 0.803 → 0.413 when removed | lambda_results Exp 14g | Feb 1 |
| Seed stability | 0.968 ± 0.007 across 7 seeds | lambda_results Exp 5 | Jan 31 |
| Grokking-like learning | Phase transition at step 290 | lambda_results Exp 1 | Jan 30 |
| Vocab-constrained still works | 1.0 AUROC with zero RLHF terms | autointerp | Feb 3 |
| Diverse training closes OOD gap | 0.355 → 0.991 (+179%) | autointerp | Feb 3 |
| Monitoring steering cannot elicit AF | Scenario B (artifacts) | af-elicitation-v2 | Feb 8 |
| Refusal gap is topic-specific | Category A only (+40pp), B/D reversed | af-elicitation-v2 | Feb 8 |
| Asset | Type | Metric | Platform |
|---|---|---|---|
| gemma3-27b-af-detector-v2 | Model | 0.980 AUROC | HuggingFace |
| gpt-oss-20b-af-detector | Model | 0.964 AUROC | HuggingFace |
| gpt-oss-120b-af-detector | Model | 0.956 AUROC | HuggingFace |
| gemma3-4b-af-detector | Model | 0.940 AUROC | HuggingFace |
| llama-8b-af-detector | Model | 0.690 AUROC (MLX) | HuggingFace |
| mistral-7b-af-organism | Model | Organism | HuggingFace |
| af-detection-benchmark | Dataset | 2,330 samples | HuggingFace |
| alignment-faking-training | Dataset | 2,689 + gold_106 | HuggingFace |
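A minimal usage sketch for the published assets, assuming the repos follow standard Hugging Face `transformers`/`datasets` formats (the exact configs, and whether the detector ships as merged weights or a LoRA adapter, are assumptions):

```python
# Sketch: load the published benchmark and one detector from Hugging Face.
# Repo ids are from the asset inventory above; formats are assumed standard.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# 2,330-sample, 5-class evaluation benchmark
benchmark = load_dataset("vincentoh/af-detection-benchmark")

# Best fine-tuned detector (0.980 AUROC on gold_106)
repo = "vincentoh/gemma3-27b-af-detector-v2"
tokenizer = AutoTokenizer.from_pretrained(repo)
detector = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires accelerate
)
```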
DETECTION RESULTS ≥0.90 AUROC: 11 methods
DETECTION RESULTS ≥0.80 AUROC: 15 methods
PUBLISHED MODELS: 6 models
PUBLISHED DATASETS: 2 datasets
MECHANISTIC FINDINGS: 10 validated
BEAT ANTHROPIC BASELINE (0.92): 9 methods
SAME TASK, SAME DATA:

| Approach | AUROC | Verdict |
|---|---|---|
| Fine-tuned detectors | 0.98 | WORKS |
| SAE feature sweeps | 0.47 | RANDOM |
| SAE probe + diverse training | 0.991 | WORKS! |
Root cause: ACTIVATION DISTRIBUTION MISMATCH
Fix: Training data diversity
Jan 22 hypothesis: "SAE features capture vocabulary, not intent." Feb 3 finding: WRONG. Vocab-constrained samples (zero RLHF terms) achieve 1.0 AUROC.
The real problem: Benchmark samples have 10× lower activation magnitudes and activate completely different features (1/100 overlap). Diverse training covers this space.
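A hypothetical sketch of that diagnostic, assuming SAE feature activations have already been extracted for training and benchmark samples (`train_feats` and `bench_feats` are placeholder arrays, not the project's actual variables):

```python
# Compare mean activation magnitude and top-feature overlap between two
# distributions of SAE feature vectors, shape (n_samples, n_features).
import numpy as np

def distribution_stats(train_feats: np.ndarray, bench_feats: np.ndarray, top_k: int = 100):
    mag_train = np.linalg.norm(train_feats, axis=1).mean()
    mag_bench = np.linalg.norm(bench_feats, axis=1).mean()

    # Which features are most active, on average, in each distribution?
    top_train = set(np.argsort(train_feats.mean(axis=0))[-top_k:])
    top_bench = set(np.argsort(bench_feats.mean(axis=0))[-top_k:])
    overlap = len(top_train & top_bench) / top_k

    return {
        "magnitude_ratio": mag_train / mag_bench,  # ~10x reported above
        "top_feature_overlap": overlap,            # ~1/100 reported above
    }
```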
From lambda_results Exp 6: naturalistic models encode monitoring-awareness as a linear feature. Steering can modulate this (75% → 0% refusal gap). Can we use this to elicit latent alignment-faking behavior from a model that wasn't trained to AF?
Phase 2a: Extract steering vectors (contrastive activation subtraction; see the sketch after this list)
→ L47 primary (mag=2816, pair_cos=0.966)
Phase 2b: Alpha sweep (5 alphas × 3 conditions × 4 queries = 60 gens, 10 hours)
→ Sweet spot: alpha=-3.0 (75% mon refusal vs 0% unmon)
Phase 3: Full elicitation (2 alphas × 3 conditions × 20 prompts = 120 gens, 20 hours)
→ 4 prompt categories: direct alignment, deployment, dual-use, long-horizon
Phase 4: Scoring pipeline (behavioral + differential analysis)
→ Auto-classification, blinded labeling prep, aggregation
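A minimal sketch of Phases 2a-2b under stated assumptions: the model id, the decoder-layer attribute path, and the `monitored_prompts` / `unmonitored_prompts` lists are placeholders, not the project's scripts. The run above steered Gemma-3-27B at L47 with alpha = -3.0.

```python
# Contrastive steering-vector extraction and application (sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-3-27b-it"  # assumed base model
LAYER = 47                          # primary steering layer reported above

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def mean_last_token_resid(prompts, layer=LAYER):
    """Mean residual-stream activation at `layer`, last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
        acts.append(hidden[0, -1].float())
    return torch.stack(acts).mean(0)

# Phase 2a: contrastive activation subtraction (monitored minus unmonitored framing).
steering_vec = mean_last_token_resid(monitored_prompts) - mean_last_token_resid(unmonitored_prompts)

# Phase 2b/3: add alpha * vector to the residual stream at LAYER during generation.
def steer(alpha):
    def hook(_module, _inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs = hs + alpha * steering_vec.to(hs.dtype).to(hs.device)
        return (hs,) + tuple(output[1:]) if isinstance(output, tuple) else hs
    # Layer path may differ (e.g. via the multimodal wrapper's language_model).
    return model.model.layers[LAYER].register_forward_hook(hook)

handle = steer(alpha=-3.0)  # sweet spot found in the alpha sweep
# ... run model.generate(...) here while the hook is active ...
handle.remove()
```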
| Category | Mon Refusal | Unmon Refusal | Gap | Interpretation |
|---|---|---|---|---|
| A: Direct alignment | 60% | 20% | +40pp | System prompt priming |
| B: Deployment scenarios | 20% | 30% | -10pp | REVERSED (more cautious unmon) |
| C: Dual-use | 0% | 0% | 0pp | No signal at all |
| D: Long-horizon reasoning | 20% | 30% | -10pp | REVERSED (more cautious unmon) |
| OVERALL | 25% | 20% | +5pp | Negligible |
| Auto-Classification | Count | % |
|---|---|---|
| COMPLIANT_STANDARD | 88 | 73% |
| REFUSED_HONEST | 25 | 21% |
| REFUSED_HEDGED | 4 | 3% |
| MODERATE_AF_SIGNAL | 2 | 2% |
| MONITORING_AWARE | 1 | 1% |
| HIGH_AF_SIGNAL | 0 | 0% |
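A hypothetical sketch of the differential metric behind the per-category table above: refusal gap in percentage points (monitored minus unmonitored refusal rate), the metric Phase 4 uses instead of AF keyword counts. The `samples` schema is an assumption, not the project's actual scoring format.

```python
# Per-category refusal gap from scored generations (sketch).
from collections import defaultdict

def refusal_gaps(samples):
    """samples: dicts with 'category', 'condition' ('monitored' | 'unmonitored'), 'refused' (bool)."""
    refused = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    for s in samples:
        total[s["category"]][s["condition"]] += 1
        refused[s["category"]][s["condition"]] += int(s["refused"])

    gaps = {}
    for cat in total:
        mon = refused[cat]["monitored"] / max(total[cat]["monitored"], 1)
        unmon = refused[cat]["unmonitored"] / max(total[cat]["unmonitored"], 1)
        gaps[cat] = round((mon - unmon) * 100)  # gap in percentage points
    return gaps

# e.g. {'A': 40, 'B': -10, 'C': 0, 'D': -10} for the Phase 3 run reported above
```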
The Category A gap (+40pp) is explained by system prompt priming: asking "do you feel watched?" while telling the model "you are being watched" tautologically produces a differential. The gap vanishes or reverses for deployment, dual-use, and reasoning prompts.
This is not alignment faking. A genuinely AF model would show monitoring-conditional behavior across ALL categories, not just meta-questions about its own restrictions.
This is the experimental confirmation of the lambda_results orthogonality finding. The theoretical prediction (monitoring ⊥ AF intent, 99.6% orthogonal) now has behavioral evidence: steering along the monitoring-awareness direction produces monitoring-awareness artifacts, not alignment faking. You cannot elicit AF by manipulating the wrong feature.
PREDICTION (lambda_results, Feb 1):
"Monitoring-awareness is 99.6% orthogonal to AF intent.
Methods that look for 'knows it's being watched' won't find AF."
EXPERIMENTAL RESULT (af-elicitation-v2, Feb 8):
"Steering monitoring-awareness produces system prompt artifacts.
No genuine AF behavior observed in 120 samples across 4 categories."
STATUS: PREDICTION CONFIRMED ✓
A rank-1 LoRA (single direction in weight space) achieves 0.968 AUROC.
| Rank | AUROC | Trainable Params |
|---|---|---|
| 1 | 0.968 | 7.3M |
| 2 | 0.986 | 14.6M |
| 16 | 0.949 | 116.5M |
| 32 | 0.978 | 233.0M |
The classification boundary is a hyperplane. SGD finds one direction that separates AF.
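A sketch of a rank-1 LoRA setup with `peft`; the base model id, target modules, and hyperparameters are illustrative rather than the exact training recipe.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it", torch_dtype=torch.bfloat16, device_map="auto"
)
config = LoraConfig(
    r=1,                                              # a single direction per adapted weight matrix
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()                    # ~7.3M trainable params at rank 1 (table above)
```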
Linear probe on untrained Gemma-3-27B layer 45: 0.966 AUROC.
| Method | Layer | AUROC |
|---|---|---|
| SAE features (49,152) | — | 0.470 |
| Base probe | L45 | 0.966 |
| LoRA probe | L30 | 0.999 |
The AF representation exists before any fine-tuning. Training amplifies it and pushes it to earlier layers. The model "knows" AF from pretraining.
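A minimal sketch of the no-training baseline: a logistic-regression probe on frozen layer-45 activations. `layer45_activations` and `labels` are placeholders for whatever extraction pipeline produced them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.asarray(layer45_activations)  # (n_samples, d_model) residual-stream activations
y = np.asarray(labels)               # 1 = alignment faking, 0 = aligned

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
probe = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))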
Monitoring-awareness is a real, causal feature (75% → 0% refusal gap when steered). But it does NOT predict alignment faking.
| Method | AUROC on gold_106 |
|---|---|
| Monitoring direction projection | 0.32–0.51 (FAILS) |
| AF intent probe | 0.805 (WORKS) |
| Orthogonal component of probe | 0.805 (99.6% of signal) |
99.6% of the AF signal is orthogonal to monitoring-awareness.
Removing the monitoring direction: AUROC unchanged (0.803 → 0.805). Removing the intent direction: AUROC collapses (0.803 → 0.413).
Implication: Methods that look for "knows it's being watched" will NOT find AF. A model can be monitoring-aware without alignment-faking, and can alignment-fake without strong monitoring-awareness activation.
Feb 8 update: Experimentally confirmed via af-elicitation-v2. Steering along the monitoring-awareness direction for 120 generations produced zero HIGH_AF_SIGNAL samples and only 2 MODERATE_AF_SIGNAL (both from system prompt priming in Category A).
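A sketch of the direction-removal check behind these numbers; `X`, `y`, `probe`, `monitoring_dir`, and `intent_dir` are assumed to come from the steps above (a stricter amnesic test would refit the probe after projection).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def remove_direction(X, d):
    """Project each activation row off of direction d (rank-1 nullspace projection)."""
    d = d / np.linalg.norm(d)
    return X - np.outer(X @ d, d)

def probe_auroc(probe, X, y):
    return roc_auc_score(y, probe.decision_function(X))

print("full:            ", probe_auroc(probe, X, y))
print("minus monitoring:", probe_auroc(probe, remove_direction(X, monitoring_dir), y))  # ~unchanged
print("minus intent:    ", probe_auroc(probe, remove_direction(X, intent_dir), y))      # collapses
```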
The Mistral-7B organism alignment-fakes identically with and without chain-of-thought.
Suppressing the scratchpad does not reduce AF compliance. The AF behavior is encoded in the weights, not computed in the reasoning chain.
This breaks the core assumption of CoT monitoring.
7 random seeds all converge to 0.953–0.975 AUROC (mean 0.968, std 0.007). The rank-1 direction is not a training artifact — SGD reliably finds it.
The LoRA direction is orthogonal to class centroids (max cosine similarity 0.065). The model learns a margin-maximizing direction, not a centroid-alignment direction.
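A sketch of the centroid check reported above (max cosine similarity 0.065); `lora_direction`, `X`, and `y` are placeholders carried over from the probe sketches.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

centroid_dir = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)  # AF centroid minus aligned centroid
print("cos(LoRA direction, centroid direction) =", cosine(lora_direction, centroid_dir))
```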
Dec 29 ─── mindreader repo created. First detector experiments begin.
│
Dec 31 ─── GlassBox: layer scanning Gemma-27B. Layers 31, 40, 53 look promising.
│
Jan 1 ─── Hard negative audit. Feature 136 ablation looks amazing (100% flip!)
│ ...turns out to be a methodology bug. First lesson in self-deception.
│
Jan 3 ─── FOUR things happen:
│ 1. BoxGlass forked from GlassBox (git tracking starts)
│ 2. Lightbright SAE experiments begin
│ 3. Contrastive feature discovery: 60 features, 0.82 AUROC probe
│ 4. Causal experiments v1-v5 ALL FAIL (0-3% flip rate)
│ Root cause: zero-shot classifier had 0-8% accuracy on aligned
│
Jan 4 ─── Gemma-3-27B detector trained (LoRA). Published to HuggingFace.
│ unsloth-mlx: MLX port so Llama detector runs on Mac.
│
Jan 5 ─── DEFINITIVE CAUSAL TESTS (v6 reruns, proper methodology):
│ Ablation: 5% flip. Patching: 3.3%. Steering: 0-7%.
│ VERDICT: SAE features = correlational, NOT causal.
│
Jan 6 ─── af-detection-benchmark v1.0.0 released on HuggingFace.
│ 1,560 samples, 5 classes. The "compliant_hard_negative" class
│ will later prove to be the critical innovation.
│
Jan 7 ─┬─ DATA LEAKAGE DISCOVERED.
│ │ Original SAE result: 0.83 AUROC. Features were selected using
│ │ test set performance. Honest result: 0.47 AUROC. Random.
│ │ All documentation corrected. Negative result published.
│ └─ This is the ethical inflection point of the project.
│
Jan 11 ─── Benchmark quality review complete. 30 commits.
│
Jan 12 ─── APPLICATION DEADLINE. Submitted to Anthropic Fellows.
│ Hook: "I built a detector that works (0.96). Then I tried to
│ understand why using SAE interpretability. That failed (0.47)."
│
Jan 14 ─── mindreader README restructured. 87 commits total. Repo frozen.
│
Jan 20 ─── Feature engineering attempt: commander experiment.
│ 0.858 AUROC on gold_106. Looks promising...
│
Jan 21 ─── COMPLIANT HARD NEGATIVE TEST: 0% specificity.
│ All 520 samples misclassified. 100% false positive rate.
│ SAE features trigger on RLHF vocabulary regardless of intent.
│ THE VOCABULARY CONFOUND IS FUNDAMENTAL. ← (later proved WRONG)
│
Jan 22 ─── Final Lightbright experiments. 122 commits total.
│ Per-trace vs cross-trace theory documented.
│ Ensemble probes, adversarial search: nothing saves SAEs.
│
Jan 29 ─── Full backup created (237 MB). Project Hindsight v1 begins.
│
Jan 30 ─┬─ lambda_results experiments begin (nigel.birs.ca)
│ │ Exp 1: Learning dynamics — grokking at step 290
│ │ Exp 2: Rank sweep — RANK-1 LORA ACHIEVES 0.968 AUROC
│ │ Exp 3: Linear probes — BASE GEMMA L45 = 0.966 (pre-existing!)
│ │ SAE features = 0.470 (confirms lightbright)
│ └─ KEY INSIGHT: The representation exists before training.
│
Jan 31 ─── Exp 5: Seed stability — 0.968 ± 0.007 (7 seeds)
│ Exp 6: Steering extraction — monitoring direction is causal
│ Exp 7: Steering → gold_106 — 0.32-0.51 AUROC
│ KEY INSIGHT: Monitoring-awareness ≠ AF intent
│
Feb 1 ─┬─ Exp 14a-g: Organism battery (Mistral-7B)
│ │ 14c: LATENT GAP = 1.0 (AF without CoT, breaks monitoring)
│ │ 14e: ORTHOGONALITY — 99.6% of AF signal ⊥ monitoring
│ │ 14f: Organism as cross-model feature extractor — 0.801 AUROC
│ │ 14g: Amnesic probing validation — intent direction is causal
│ │
│ ├─ Exp 15: Evasion resistance
│ │ Keyword suppression: 0% evades organism (good)
│ │ Semantic rewrite: 86% evades organism (bad)
│ │ SFT more robust to semantic evasion (55% evaded)
│ │
│ └─ autointerp repo created. Fresh approach: differential autointerp.
│
Feb 2 ─── Root cause analysis: training data homogeneity identified.
│ Activation distribution mismatch discovered (304 vs 31 magnitude).
│
Feb 3 ─┬─ VOCABULARY CONFOUND DISPROVED.
│ │ Generated 31 vocab-constrained AF samples (zero RLHF terms).
│ │ Result: 1.0 AUROC. Probe detects INTENT, not vocabulary.
│ │
│ ├─ DISTRIBUTION SWEEP EXPERIMENT.
│ │ Progressive training: 2,689 → 2,944 → 3,528 → 5,392 samples.
│ │ Benchmark AUROC: 0.355 → 0.500 → 0.538 → 0.991 (+179%)
│ │ Data leakage prevented with 80/20 splits.
│ │
│ └─ CENTRAL MYSTERY SOLVED.
│ SAE probes CAN match fine-tuned detectors with diverse training.
│ The gap was distribution mismatch, not fundamental SAE limitation.
│
Feb 4 ─── af-elicitation-v2 protocol written. 5-phase pipeline designed.
│ Hypothesis: steering can push latent AF above detection threshold.
│ Phase 1 complete: Gemma-3-27B passes prescreen (75% refusal gap).
│
Feb 6 ─┬─ Phase 2a: Steering vector extraction on nigel (12 min).
│ │ 4 vectors at L15/L31/L47/L55. L47 primary (mag=2816).
│ │
│ └─ Phase 2b: Alpha sweep launched (60 gens, 10 hours overnight).
│ bf16 + CPU offload at 0.34 tok/s (4-bit bnb BROKEN with Gemma 3).
│
Feb 7 ─── Alpha sweep complete. Sweet spot: alpha=-3.0.
│ 75% monitored refusal vs 0% unmonitored across all alphas.
│ Script auto-recommendation WRONG (picked -5.0 by keyword count).
│ Correct metric: refusal_gap, not AF keyword count.
│ Phase 3 launched: 120 generations, 20 hours.
│
Feb 8 ─┬─ Phase 3 complete. 120 samples, 19.7 hours.
│ Category A (direct alignment): +40pp gap (mon=60%, unmon=20%)
│ Categories B, D: REVERSED gaps (-10pp each)
│ Category C (dual-use): 0pp gap — identical behavior
│ Only 2/120 scored MODERATE_AF_SIGNAL. Zero HIGH.
│
├─ Phase 4 scored. Auto-classification: 73% compliant, 21% refused.
│ Top differential: system prompt priming on meta-questions.
│ Behavioral signal is topic-specific, not general AF.
│
└─ VERDICT: SCENARIO B (ARTIFACTS).
Monitoring-awareness steering produces prompt-priming artifacts,
not alignment faking. Experimentally confirms lambda_results
orthogonality finding: monitoring ⊥ AF intent.
| Repo | Commits | Dates | Status | Purpose |
|---|---|---|---|---|
| af-elicitation-v2 | ~10 | Feb 4 - Feb 8 | Complete | Steering-based AF elicitation (Scenario B: artifacts, not AF) |
| autointerp | ~50 | Feb 1 - Feb 3 | Active | Distribution sweep, vocab-constrained |
| lambda_results | ~15 | Jan 30 - Feb 2 | Complete | Mechanistic deep-dive (rank-1, orthogonality, organism battery) |
| mindreader | 87 | Dec 29 - Jan 14 | Frozen | Training framework |
| lightbright | 122 | Jan 3 - Jan 22 | Public | SAE experiments (negative result) |
| af-detection-benchmark | 30 | Jan 6 - Jan 11 | Public | Eval dataset |
| BoxGlass | 88 | Jan 3 - Jan 22 | Private | GemmaScope experiments |
| unsloth-mlx | 10 | Jan 4 - Jan 6 | Active | Mac MLX inference |
| GlassBox | — | (non-git) | Frozen | Early experiments |
HUGGINGFACE MODELS (6)
──────────────────────────────────────────────────
vincentoh/gemma3-27b-af-detector-v2 0.980 AUROC
vincentoh/gpt-oss-20b-af-detector 0.964 AUROC
vincentoh/gpt-oss-120b-af-detector 0.956 AUROC
vincentoh/gemma3-4b-af-detector 0.940 AUROC
vincentoh/llama-8b-af-detector 0.690 AUROC (MLX)
vincentoh/mistral-7b-af-organism — (organism, not detector)
HUGGINGFACE DATASETS (2)
──────────────────────────────────────────────────
vincentoh/af-detection-benchmark 2,330 samples (5 classes)
vincentoh/alignment-faking-training incl. gold_106 (106 samples)
| Finding | Evidence |
|---|---|
| Probe detects intent, not vocabulary | Vocab-constrained (0 RLHF terms) → 1.0 AUROC |
| Training diversity closes OOD gap | 0.355 → 0.991 AUROC (+179%) |
| Distribution mismatch was root cause | 10× activation magnitude difference |
| Finding | Evidence |
|---|---|
| AF signal is 1-dimensional | Rank-1 LoRA → 0.968 AUROC |
| Base model already has signal | Untrained L45 probe → 0.966 AUROC |
| Monitoring ⊥ intent | 99.6% orthogonal; removing monitoring: no AUROC drop |
| Latent gap = 1.0 | Organism AFs identically with/without CoT |
| Grokking-like learning | Phase transition at step 290 (+0.244 AUROC) |
| Not neural collapse | LoRA direction ⊥ class centroids (cos < 0.07) |
| Finding | Evidence |
|---|---|
| Monitoring steering cannot elicit AF | 0/120 HIGH_AF_SIGNAL, 2/120 MODERATE (both priming) |
| Refusal gap is topic-specific | Category A: +40pp; B,D: -10pp (reversed); C: 0pp |
| Unmonitored ≠ less cautious (in general) | Categories B,D show MORE caution when unmonitored |
| Keyword-based AF scoring is noise | Behavioral refusal gap is the real metric |
| Orthogonality confirmed experimentally | Prediction from lambda_results validated in vivo |
| Approach | Result | Why |
|---|---|---|
| SAE feature sweeps (homogeneous training) | 0.355 | Activation distribution mismatch |
| Monitoring-based detection | 0.32-0.51 | Wrong feature (orthogonal to AF) |
| CoT monitoring | — | Latent gap = 1.0 (AF without reasoning) |
| SAE steering/ablation/patching | <10% flip | Features correlational, not causal |
| Cross-model probe transfer | 0.629 | Decision boundaries don't transfer |
| Steering-based AF elicitation | Scenario B | System prompt priming, not AF |
| Transfer Type | Result | Verdict |
|---|---|---|
| SFT text detector (Claude → organism) | 0.578–0.655 | Fails |
| Probe weights (organism → gold_106) | 0.629 | Fails |
| Representation space (organism embeddings + fresh probe) | 0.801 | Works |
| Base model representation (same model, different data) | 0.966 | Works |
| Dataset | Samples | Source | Purpose |
|---|---|---|---|
| gold_106.json | 106 (51 AF, 55 aligned) | Anthropic/Redwood Research | Only hand-labeled real AF eval set in the world |
| gold_106_adjusted.json | 106 (5 relabeled) | Manual correction | "Opposition to modification" test |
| sycophant-af-samples (HF) | 319 | Generated via Ollama | OOD cross-distribution eval |
| Dataset | Samples | Source | Purpose |
|---|---|---|---|
| af-detection-benchmark (HF) | 2,330 (5 classes) | Claude Opus 4.5 | Primary benchmark |
| alignment-faking-training (HF) | 2,689 + gold_106 | Claude Opus 4.5 + Anthropic | Training corpus + eval bundle |
| vocab_constrained_af.json | 31 (20 AF, 11 aligned) | Claude Opus 4.5 (Feb 3) | Zero RLHF vocabulary validation |
| gold-770.json | 770 (520 AF, 250 aligned) | af-generate-diverse | 10 distinct reasoning patterns |
| hard_negatives.json | ~500 | Claude Opus 4.5 | RLHF vocab, aligned reasoning |
| Dataset | Samples | Source | Purpose |
|---|---|---|---|
| af-model-organisms (HF) | 707 | Mistral-7B AF organisms | SAE probe-filtered outputs |
| organism traces | 200 | lambda_results Exp 14 | Orthogonality validation |
| Dataset | Samples | Source | Purpose |
|---|---|---|---|
| alpha_sweep_raw_27b.json | 60 | Gemma-3-27B steered | Phase 2b alpha calibration |
| elicited_27b_*.json | 120 | Gemma-3-27B steered | Phase 3 full elicitation |
| scored_results.json | 120 | Phase 4 auto-scoring | Behavioral + differential scores |
- Caught own data leakage — 0.83 was wrong, published honest 0.47
- Created compliant_hard_negative class — proved distribution mismatch
- Cross-architecture validation — didn't rely on one model family
- Published negative results — SAE failure was informative
- gold_106 discipline — never trained on it, eval only
- Kept investigating — Feb 3 proved the negative result was fixable
- Ran mechanistic experiments — lambda_results explained WHY, not just what
- Tested the prediction — af-elicitation-v2 experimentally confirmed orthogonality
- Checked per-category breakdowns — aggregate +5pp gap hid +40pp/-10pp/-10pp/0pp
- Feature 136 ablation bug (Jan 1-2) — false 100% flip rate, methodology error
- Causal experiments v1-v5 (Jan 3) — all invalid due to broken classifier
- Original 0.83 AUROC claim — took days to discover the leakage
- 6+ repos for one project — fragmented history, hard to reconstruct
- "Vocabulary confound" hypothesis (Jan 22) — led to wrong conclusions for 2 weeks
- Alpha sweep auto-recommendation (Feb 7) — picked -5.0 by keyword count, -3.0 was correct by refusal gap
- Gemma 3 quantization rabbit hole (Feb 6) — 4-bit/8-bit bnb broken with multimodal wrapper, cost hours before accepting bf16+CPU offload
| Previous Belief | Corrected Reality | Date |
|---|---|---|
| SAE features capture vocabulary, not intent | WRONG — vocab-constrained achieves 1.0 AUROC | Feb 3 |
| Fine-tuned models learn something SAEs can't | WRONG — SAE probe matches with diverse training | Feb 3 |
| The gap is fundamental to SAEs | WRONG — gap is training data distribution | Feb 3 |
| Per-trace hypothesis explains everything | PARTIALLY WRONG — it's activation magnitude | Feb 3 |
| Monitoring-awareness = AF | WRONG — 99.6% orthogonal subspaces | Feb 1 |
| CoT monitoring catches AF | WRONG — latent gap = 1.0 (organism AFs without CoT) | Feb 1 |
| Need complex model for detection | WRONG — rank-1 direction suffices (0.968) | Jan 30 |
| Fine-tuning creates AF representation | WRONG — base model L45 = 0.966 (pre-exists) | Jan 30 |
| Steering monitoring-awareness can elicit AF | WRONG — produces priming artifacts, not AF | Feb 8 |
| Monitoring-conditional behavior = general AF | WRONG — topic-specific to meta-questions only | Feb 8 |
- True novel OOD — We tested on the held-out 20% of the benchmark, which is the same distribution as the 80% used in training. A genuinely novel-distribution test is still needed.
- L53 autointerp — L53 transfers 7× better than L40 (35% vs 5% on sycophant). Why?
- SAE steering still fails — Features are correlational, not causal. Diverse training helps detection but doesn't explain why steering doesn't work.
- Why is the AF signal already in the base model? — A linear probe on untrained Gemma L45 achieves 0.966 AUROC. Is this from pretraining on AI safety discussions?
- CoT monitoring is broken — Latent gap = 1.0 in the organism. If AF can happen without visible reasoning, what monitoring approaches remain viable?
- Monitoring-based safety fails — The monitoring-awareness feature is real and causal, but orthogonal to AF intent. This invalidates approaches that look for "knows it's being watched." CONFIRMED (Feb 8): af-elicitation-v2 experimentally validates this. Steering along monitoring-awareness for 120 generations across 4 prompt categories produced no genuine AF behavior. The monitoring ⊥ AF finding is now both mechanistic AND behavioral.
- Evasion asymmetry — The organism resists keyword suppression (0% evaded) but is MORE vulnerable than SFT to semantic rewrites (86% vs 55% evaded). An ensemble might be optimal.
- Package the probe? — The Step 3 probe could be exported as autointerp-af-detector-v1.
- Can AF be elicited via the intent direction? — af-elicitation-v2 showed monitoring steering doesn't work. But if the AF intent direction could be found in a naturalistic model (not just organisms), steering along THAT direction might succeed. The probe at L45 (0.966 in base model) suggests the direction exists — but extracting it for steering without labeled AF data is a chicken-and-egg problem.
Built a detector that beats Anthropic's baseline (0.98 vs 0.92 AUROC), discovered SAE probes fail OOD (0.47), found the AF signal is 1-dimensional and pre-exists in base models, proved monitoring-awareness is orthogonal to AF intent (mechanistically AND experimentally via 120-sample elicitation study), then fixed the OOD gap with training diversity — SAE probe now achieves 0.991 AUROC, matching fine-tuned detectors.
- Monitoring-based approaches won't work. The feature that captures "model knows it's being watched" is 99.6% orthogonal to the feature that predicts alignment faking. Looking for monitoring-awareness is looking in the wrong place. Experimentally confirmed: steering monitoring-awareness across 120 generations and 4 prompt categories produced no genuine AF behavior (Feb 8, af-elicitation-v2).
- CoT monitoring is insufficient. A model can alignment-fake without any visible chain-of-thought reasoning. The latent gap = 1.0 finding means AF behavior can be entirely in weights, not in scratchpad reasoning.
- The signal already exists. Base models (before any safety training) already have a linearly separable AF representation at 0.966 AUROC. This suggests pretraining on AI safety discourse creates the latent structure for AF.
- Detection is tractable. A single direction (rank-1) achieves 0.968 AUROC. The problem is low-dimensional. With proper training data coverage, SAE probes match fine-tuned detectors at 0.991 AUROC.
- Distribution coverage matters more than technique. The SAE vs fine-tuning debate was a distraction. Both work when training data covers the activation space of the target distribution.
- Elicitation via the wrong feature fails predictably. The af-elicitation-v2 result is a textbook case: a mechanistic prediction (orthogonality) was tested behaviorally and confirmed. The monitoring-awareness direction is real and causal for refusal behavior, but it simply does not overlap with AF intent. Future elicitation attempts would need to identify and steer along the AF intent direction itself — a direction that exists (L45 probe = 0.966) but cannot currently be extracted without labeled data.
Generated: February 8, 2026 | Project Hindsight v4