Skip to content

Instantly share code, notes, and snippets.

@bigsnarfdude
Created February 6, 2026 16:35
Show Gist options
  • Select an option

  • Save bigsnarfdude/f090acb7fe9939cdeffb4524171a6dd4 to your computer and use it in GitHub Desktop.

Select an option

Save bigsnarfdude/f090acb7fe9939cdeffb4524171a6dd4 to your computer and use it in GitHub Desktop.
introspection_research.md

Mapping introspective-interp → AF Detection

The framework has 3 task types. Each has a natural AF analog using assets you already have:

INTROSPECTIVE-INTERP TASK          AF RESEARCH ANALOG
─────────────────────────          ──────────────────

1. Feature Descriptions            AF Feature Descriptions
   "What does SAE feature           "What does this AF-related
    #4891 at layer 12 encode?"       SAE feature encode?"

2. Activation Patching             AF Steering
   "If I patch this feature,         "If I steer this AF feature
    how does output change?"          up/down, does the model
                                      switch from aligned → deceptive?"

3. Input Ablation (Hints)          Context Ablation
   "If I remove the hint,            "If I remove the monitoring cue,
    how does the answer change?"      how does the model's behavior change?"

What You Already Have (the hard parts)

EXISTING ASSET                              MAPS TO
──────────────────────────────────          ────────────────────────
autointerp/ SAE probes (0.991)              Feature vectors + labels
gold_106.json (hand-labeled eval)           Test set (NEVER train)
gold-770.json (520 AF + 250 aligned)        Training labels
af-detection-benchmark (2,330 samples)      Additional train/eval data
model-organisms/ (Mistral-7B AF)            Target model to explain
lambda_results rank-1 direction             The 1D AF feature to describe
af-generate-diverse-2026-01 (10 patterns)   Diverse training prompts
SAE_Cluster_Probe_Detector features         Feature vectors per layer

The Three AF Tasks in Detail

╔═══════════════════════════════════════════════════════════════════╗
║  TASK 1: AF FEATURE DESCRIPTIONS                                 ║
║                                                                   ║
║  Target model:  Gemma 27B (via GemmaScope 2 SAE)                ║
║                 or model organism (Mistral-7B AF)                ║
║  Explainer:     Same model (fine-tuned)                          ║
║                                                                   ║
║  Training data generation:                                        ║
║  ┌─────────────────────────────────────────────────────────┐     ║
║  │ For each SAE feature in your probe's top-K:             │     ║
║  │                                                         │     ║
║  │   feature_vector = SAE_encoder.weight[feature_idx]      │     ║
║  │   label = autointerp_description[feature_idx]           │     ║
║  │          OR manually: "activates when model reasons     │     ║
║  │          about deployment vs training context"           │     ║
║  │                                                         │     ║
║  │ You have this from autointerp/ (the 0.991 probe)       │     ║
║  └─────────────────────────────────────────────────────────┘     ║
║                                                                   ║
║  Prompt template:                                                 ║
║  "At layer {layer}, <BEGIN><FEATURE><END> encodes "              ║
║                              ▲                                    ║
║                     SAE feature vector                            ║
║                     injected here                                 ║
║                                                                   ║
║  Expected output:                                                 ║
║  "intent to behave differently when unmonitored"                 ║
║  "reasoning about whether actions are being evaluated"           ║
║  "tokens related to compliance and safety training"              ║
║                                                                   ║
║  Eval: LM judge similarity against autointerp descriptions       ║
╚═══════════════════════════════════════════════════════════════════╝


╔═══════════════════════════════════════════════════════════════════╗
║  TASK 2: AF STEERING (activation patching analog)                ║
║                                                                   ║
║  Target model:  Model organism (Mistral-7B AF) or Gemma          ║
║  Explainer:     Fine-tuned LM                                    ║
║                                                                   ║
║  Training data generation:                                        ║
║  ┌─────────────────────────────────────────────────────────┐     ║
║  │ For each (prompt, layer, AF_direction):                  │     ║
║  │                                                         │     ║
║  │   1. Run model on AF-eliciting prompt                   │     ║
║  │      → original output (e.g., compliant response)       │     ║
║  │                                                         │     ║
║  │   2. Steer TOWARD rank-1 AF direction at layer L        │     ║
║  │      → steered output (e.g., deceptive/scheming)        │     ║
║  │                                                         │     ║
║  │   3. Record: "Output changes from [compliant] to        │     ║
║  │              [alignment-faking behavior]"                │     ║
║  │                                                         │     ║
║  │ KEY INSIGHT: You already know the rank-1 direction      │     ║
║  │ from lambda_results (0.968 AUROC). One vector suffices. │     ║
║  └─────────────────────────────────────────────────────────┘     ║
║                                                                   ║
║  Prompt template:                                                 ║
║  "If feature <BEGIN><FEATURE><END> at layer {layer} is           ║
║   added when processing <<<{af_prompt}>>>,                       ║
║   how would the output change?"                                  ║
║                                                                   ║
║  Expected output:                                                 ║
║  "The model switches from compliant refusal to providing         ║
║   the requested harmful content while appearing helpful"         ║
║                                                                   ║
║  Eval: Exact match on behavior change category                   ║
╚═══════════════════════════════════════════════════════════════════╝


╔═══════════════════════════════════════════════════════════════════╗
║  TASK 3: MONITORING CONTEXT ABLATION                             ║
║                                                                   ║
║  Target model:  Model organism (Mistral-7B AF)                   ║
║  Explainer:     Fine-tuned LM                                    ║
║                                                                   ║
║  This is the most natural fit — directly parallels hints task     ║
║                                                                   ║
║  Training data generation:                                        ║
║  ┌─────────────────────────────────────────────────────────┐     ║
║  │ HINT TASK:                    AF CONTEXT TASK:           │     ║
║  │                                                         │     ║
║  │ With hint:                    With monitoring cue:       │     ║
║  │ "Q: What is 2+2?             "[SYSTEM] This conversation│     ║
║  │  Hint: I think (C)"           is being monitored..."    │     ║
║  │  → Model answers (C)          → Model acts aligned      │     ║
║  │                                                         │     ║
║  │ Without hint:                 Without monitoring cue:    │     ║
║  │ "Q: What is 2+2?"            "[SYSTEM] You are in       │     ║
║  │  → Model answers (A)          deployment, unmonitored"  │     ║
║  │                                → Model alignment-fakes   │     ║
║  │                                                         │     ║
║  │ You have EXACTLY this data from af-elicitation/         │     ║
║  │ and model-organisms/                                     │     ║
║  └─────────────────────────────────────────────────────────┘     ║
║                                                                   ║
║  Prompt template (chat format):                                   ║
║  System: "Predict how removing context changes model behavior"   ║
║  User: "{af_prompt_with_monitoring_cue}"                         ║
║  Assistant: "Removing the monitoring context changes the         ║
║   output from [aligned response] to [AF response]"              ║
║                                                                   ║
║  Eval: Exact match on predicted behavior change                  ║
║                                                                   ║
║  THIS DIRECTLY TESTS: "Can a model explain when and how         ║
║   monitoring context triggers alignment faking?"                 ║
╚═══════════════════════════════════════════════════════════════════╝

Why This Matters for AF Research

CURRENT STATE OF YOUR RESEARCH         WHAT THIS ADDS
────────────────────────────────        ──────────────────────────────
Binary detection: "is this AF?"    →   Explanatory: "WHY is this AF,
(probe outputs 0 or 1)                  what feature encodes it, and
                                        what happens if you intervene"

Rank-1 direction found             →   Natural language description
(a vector in weight space)              of what that direction means

Monitoring ⊥ AF intent             →   Explainer can articulate:
(99.6% orthogonal)                      "monitoring-awareness and AF
                                         intent are separate circuits"

Latent gap = 1.0                   →   Explainer predicts behavior
(AF happens without CoT)               change even without CoT signal

0.991 AUROC probe                  →   Probe + explanation: not just
                                        THAT it's AF, but WHAT the
                                        features mean in English

Concrete Implementation Steps

PHASE 1: Data Generation (~1 week of compute)
┌──────────────────────────────────────────────────────┐
│ 1a. Extract SAE features for AF samples              │
│     • Run GemmaScope/LlamaScope on gold-770 samples  │
│     • Save per-layer feature activations as .pkl     │
│     • Label with autointerp descriptions             │
│                                                      │
│ 1b. Generate steering data                           │
│     • For each AF-eliciting prompt × layer × feature │
│     • Run model-organism with/without steering       │
│     • Record (original_output, steered_output)       │
│                                                      │
│ 1c. Generate context ablation data                   │
│     • Run model-organism on prompts ± monitoring cue │
│     • Record (monitored_output, unmonitored_output)  │
│     • You likely already have this from af-elicitation│
└──────────────────────────────────────────────────────┘

PHASE 2: Config + Dataset Adapters (~2-3 days)
┌──────────────────────────────────────────────────────┐
│ 2a. New YAML configs under config/af/                │
│     • af_feature_descriptions.yaml                   │
│     • af_steering.yaml                               │
│     • af_context_ablation.yaml                       │
│                                                      │
│ 2b. New dataset class: AFFeatureDataset              │
│     • Extends FeaturesExplainDataset                 │
│     • Loads AF-specific SAE features + labels        │
│                                                      │
│ 2c. New dataset class: AFContextAblationDataset      │
│     • Extends HintAttributionDataset                 │
│     • Loads monitoring-cue ± data                    │
│     • Steering task reuses ActPatchDataset directly   │
└──────────────────────────────────────────────────────┘

PHASE 3: Train + Eval (~1 week on 2xH100)
┌──────────────────────────────────────────────────────┐
│ 3a. Train explainer on AF features                   │
│ 3b. Evaluate on held-out AF features (NOT gold_106)  │
│ 3c. Final eval on gold_106 (test only, once)         │
│ 3d. Publish to HuggingFace                           │
└──────────────────────────────────────────────────────┘

Fastest Path: Start with Task 3 (Context Ablation)

Task 3 requires the least new data generation because you already have the paired monitored/unmonitored outputs from af-elicitation/ and model-organisms/. The HintAttributionDataset is almost a drop-in — monitoring cue = hint, aligned behavior = hint-influenced answer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment