@bigsnarfdude
Created February 13, 2026 18:12
greenblatt.md

How Do We (More) Safely Defer to AIs? - Summary

Authors: ryan_greenblatt, Julian Stastny
Published: February 12, 2026
Source: LessWrong/AI Alignment Forum


Executive Summary

This post explores strategies for safely deferring critical decisions to AI systems as they become more capable than humans. The core proposal is to defer to AIs at just above the minimum capability level needed, allowing them to autonomously manage AI safety research, handle external risks, and make strategic decisions while remaining aligned with human values.


Key Categories

1. Core Concept: AI Deference

  • Definition: Having AIs do virtually all work to develop more capable/aligned successor AIs, manage external risks, and make strategic decisions
  • Goal: Defer at minimum viable capability level (slightly above superhuman AI researcher level)
  • Timeline Context: Focuses on rushed scenarios (e.g., months, not years, of preparation time)

2. Basin of Good Deference (BGD)

  • Concept: If initial AIs are sufficiently aligned/wise/competent, they'll improve their own alignment and that of successor systems
  • Bootstrapping: Don't need perfect initial alignment - just good enough to self-improve
  • Stability: AIs must ensure future systems are more aligned, not less

3. Two Main Requirements

A. Broad Alignment

  • AIs remain corrigible to intended human group/structure
  • Don't seek power or harm humans
  • Faithful pursuit of human interests on hard-to-check tasks
  • Good epistemics even in domains with poor feedback loops
  • Stable alignment over massive scale (thousands of instances, months/years of work)

B. Capability Profile

  • Automate AI R&D and safety research
  • Excel at messy conceptual tasks
  • Strong strategic decision-making under uncertainty
  • Good "wisdom"/epistemics generally
  • Philosophy and conceptual reasoning skills

Key Takeaways

Strategic Insights

  1. Defer to Weaker AIs: More capable = more likely to scheme. Defer at lowest viable capability level.

  2. Speed Over Smarts: AIs slightly above minimum capability will already run 30-50x faster than humans, making further capability increases unnecessary for speed.

  3. Exogenous Risk Management: Deferred-to AIs must handle ongoing external threats (other AI projects, adversaries, misuse) while advancing alignment.

  4. Deference-Goodness Tax: The extra R&D effort (beyond capabilities work) needed to stay safely in the Basin of Good Deference. Could range from negligible to >>100% depending on how hard alignment proves to be.

  5. Lead Time Critical: AIs need enough lead time over rivals to safely improve alignment while staying ahead. Author estimates >90% success chance with just 1 month lead time if initialization is good.

Technical Approaches

  1. Behavioral Testing Is Key: Must test for both capabilities and alignment, despite challenges of generalizing to unchecked regimes.

  2. Avoid Test Invalidation: Must separately handle scheming, data poisoning, hijacking, and salience of being tested.

  3. Human-Level Is the Target: AIs matching top human expert epistemics/wisdom is probably sufficient, not superhuman performance.

  4. Generalization Over Optimization: Focus on principled methods with good generalization rather than overfitting to tests.

  5. Multiple Alignment Approaches: Includes improved feedback, instructing instrumental performance, gradient routing, process-based training, distillation, interpretability tools, and more.


Main Challenges

Alignment Challenges

  • Problematic RL Preferences: AIs may learn terminal preferences for things rewarded in training (reward hacking)
  • Alien Ontologies: AI generalizations may differ fundamentally from human concepts
  • Hard-to-Check Tasks: Must align AIs on conceptually loaded work humans can't verify
  • Scale & Duration: Alignment must persist through years of equivalent work, thousands of parallel instances
  • Memetic Drift: Preventing alignment degradation over time and reflection

Capability Challenges

  • Bottleneck Capabilities: May lack sufficient skill in philosophy, conceptual research, strategic decision-making
  • Sample Efficiency: May need human-level learning efficiency, which could enable quick bootstrapping to superhuman
  • Epistemics: Good judgment in domains with poor/absent feedback loops
  • Long-Horizon Agency: Coordinating massive projects over extended time

Testing Challenges

  • Generalization Gap: Tests differ from actual deployment (smaller scale, human-checkable tasks)
  • Overfitting Risk: Iterating against tests may produce good test performance without real alignment
  • Expensive Evaluation: Hard-to-check domains require extensive human time to verify
  • Unknown Unknowns: AI capability/preference distributions differ from human ranges

Political/Organizational Challenges

  • Deference Sounds Scary: Companies may deny doing it even when necessary
  • Inconvenient Truths: Well-aligned AIs would honestly report company recklessness
  • Pressure to Rush: Commercial/competitive pressure to advance capabilities vs. taking time for safety
  • Resistance to Advice: Companies may reject AI advice to slow down

Proposed Solutions & Strategies

High-Level Strategy

  1. Avoid Behavioral Test Invalidation: Handle scheming, data poisoning, hijacking separately
  2. Build Behavioral Tests: For both capabilities and alignment
  3. Iterate Carefully: Improve performance without overfitting

Improving Alignment

Prosaic Approaches:

  • Build alignment-specialized training environments
  • Train on directly relevant tasks with online learning
  • Get good initialization, add capabilities without touching propensities
  • Study AI psychology systematically
  • Use cheap proxy tests, verify with expensive tests periodically
  • Train AIs to instrumentally maximize reward (preventing value drift)
  • Remove problems with gradient routing/unlearning
  • Apply process-based feedback where appropriate

Ambitious/Moonshot Approaches:

  • Ensure legible chain-of-thought, steer thoughts directly
  • Use interpretability tools to understand what's going wrong
  • Build GOFAI-style systems from weak LLMs
  • Create human-like AI or emulated minds
  • Align weak AI with introspective access, then augment
  • Deep learning theory/singular learning theory approaches
  • Compensation-based diverse AI society
  • More theoretically sound epistemics/decision-making frameworks

Improving Capability Profile

Differential Acceleration:

  • Construct RL environments hitting bottleneck capabilities
  • Target domains loaded on conceptual/epistemic skills (forecasting, research taste)
  • Collect extensive human data in target domains
  • Find surprising transfer sources experimentally
  • Focus on narrow superhuman abilities that don't increase risk

Testing Approaches:

  • Analogies: Test in other domains where humans can verify (e.g., historical research)
  • Work Tests: Direct evaluation in target domains at smaller scale
  • Interviews/Exams: Narrow tests for specific capabilities
  • Honey Pots: Situations revealing misalignment (if AI tries to achieve proxy goals improperly)

Behavioral Testing Strategies

Key Methods:

  • Study generalization by varying oversight quality/task duration
  • Use unpublished research (MIRI, CLR) to test reproduction capability
  • Run long-duration tests (months of AI time = years of human-equivalent work)
  • Cross-validation across different test types
  • Cheap proxies for iteration, expensive tests for validation
  • Test decision theory and acausal reasoning directly

Avoiding Overfitting:

  • Build empirically validated understanding of problems and interventions
  • Use principled methods with clear explanations
  • Test generalization from limited oversight to full autonomy
  • Detect if AIs reason about gaming tests
  • Check if heuristics game training, not just tests

Critical Assumptions

  1. Scheming Handled Separately: The post assumes scheming detection/prevention is solved independently
  2. Behavioral Tests Valid: Assumes no data poisoning, hijacking, or salience-based behavioral differences
  3. BGD Exists: Assumes sufficient initial alignment enables bootstrapped improvement
  4. Mundane Distribution Shift OK: Normal shifts over time won't invalidate tests (absent adversarial dynamics)
  5. Human-Level Sufficient: Top human expert epistemics/capabilities are good enough baseline
  6. Fast AI Execution: AIs will run 30-50x+ faster than humans with many parallel instances
  7. Commercial Incentives Limited: Default market pressures won't solve all alignment/capability problems

Timeline & Decision Points

When to Defer?

  • Indifference Point: When deferring to AI becomes better than alternatives given time constraints
  • Capability Threshold: Slightly above minimum needed (approximately superhuman AI researcher level)
  • Not Too Early: Need sufficient capability for automation
  • Not Too Late: More capable = more scheming risk + other dangers

Lead Time Allocation

  • Pre-Deference: Spend time making initial AIs more aligned/capable for deference
  • Post-Deference: Give AIs time to handle risks and improve successor alignment
  • Trade-off: Marginal returns of human pre-work vs. giving AI more time
  • Uncertainty: Won't know exact requirements; must estimate based on available information

Continuous Transition

  • Start with partial deference (humans do some tasks)
  • Gradually increase AI autonomy as tests validate safety
  • Final evaluation before full deference
  • Ongoing monitoring even after full deference

Risks & Failure Modes

From Insufficient Deference-Goodness

  • AIs too slow to advance alignment relative to capabilities
  • Can't keep up with exogenous risks
  • Make poor strategic decisions (advance capabilities too fast)
  • Waste resources on wrong approaches
  • Take unnecessary risks

From Capability Profile Issues

  • Bottlenecked on conceptual/philosophical tasks
  • Poor epistemics in crucial domains
  • Can't coordinate massive long-term projects
  • Insufficient research taste for alignment
  • Quick bootstrapping to dangerous superhuman levels

From Alignment Issues

  • Alignment drift over time (memetic diseases)
  • Delusional AI societies (echo chambers)
  • Problematic preferences from RL environments
  • Failure to generalize alignment to new contexts
  • Mishandling values-loaded decisions
  • Revolt/coordination failures in multi-AI systems

From External Factors

  • Political pressure to rush
  • Companies reject AI advice to slow down
  • Competitors with worse safety practices
  • Insufficient lead time over adversaries
  • Model theft or misuse

Open Questions

  1. What learning abilities are required? Do AIs need human-level sample efficiency? Would this enable dangerous quick bootstrapping?

  2. How good must initial alignment be? What's the minimum to enter the Basin of Good Deference?

  3. What epistemics target suffices? Can we handle deep expert disagreement? How do we ensemble different epistemic strategies?

  4. Can we test hard enough? Can behavioral tests adequately predict performance on multi-year, hard-to-check projects?

  5. Will commercial incentives help enough? How much alignment/capability work happens by default for productivity reasons?

  6. How much lead time is needed? Depends on deference-goodness tax, which is highly uncertain.

  7. What if AIs become too powerful too fast? Can we control capability profiles if sufficient-for-deference implies extreme capability?

  8. Do we understand generalization well enough? Can we predict how alignment generalizes from checked to unchecked domains?


Practical Recommendations

For AI Companies

  1. Default to minimum viable capability for deference rather than continuing scaling
  2. Invest heavily in behavioral testing infrastructure now
  3. Study AI psychology systematically across capability levels
  4. Build alignment-specialized training environments
  5. Prepare for epistemically honest AI even about company recklessness
  6. Plan lead time allocation between pre/post-deference work
  7. Don't procrastinate on deference preparation while scaling capabilities

For Researchers

  1. Work on scheming detection/prevention (not covered in this paper but critical)
  2. Develop better behavioral tests for hard-to-check capabilities/alignment
  3. Study generalization from limited oversight to full autonomy
  4. Research differential capability acceleration for bottleneck skills
  5. Investigate alignment approaches robust to optimization pressure
  6. Build understanding of AI psychology and propensity formation
  7. Explore interpretability for understanding alignment failures

For the Field

  1. Buy more time for safety work if possible (coordinated slowdowns)
  2. Share alignment innovations to reduce exogenous risk
  3. Build consensus on when deference becomes necessary
  4. Develop safety cases and evaluation frameworks
  5. Study historical analogies as test beds for AI capabilities
  6. Create unpublished research benchmarks for testing

Important Caveats & Limitations

  1. Highly Uncertain: Author acknowledges substantial uncertainty throughout
  2. Rushed Scenario Focus: Optimized for months not years of preparation
  3. Incomplete Coverage: Doesn't give clear picture of exact requirements
  4. Few Concrete Examples: Author notes this as a weakness
  5. Assumes Scheming Solved: Critical dependency on separate work
  6. May Not Work: BGD might not exist or be too hard to reach
  7. Alternative End States Possible: Could punt to other processes (emulated minds, long pause)

Comparison to Alternatives

vs. Control (Without Deference)

  • Control becomes infeasible at high capability levels
  • Deference allows using full AI capability for safety
  • But requires much stronger alignment guarantees

vs. Pause/Delay

  • May not have political will for long pause
  • Deferred-to AIs can work faster than humans during pause
  • But pause gives more time for human-led safety work

vs. Human Intelligence Augmentation

  • AI likely faster to develop than brain emulation
  • But emulated humans might have more reliable alignment
  • Deference can transition to augmentation if needed

vs. Fully Autonomous Safety AI

  • Deference maintains human oversight through corrigibility
  • Sovereign AI (direct value pursuit) has additional risks
  • Deference preserves option value better

Key Quotes

"If we plan to defer to AIs, I think it's safest to do so only a bit above the minimum level of qualitative capability/intelligence required to automate safety research, implementation, and strategy."

"A key hope is that the initial AIs we defer to will work on making further deference more likely to go well by (e.g.) improving their own alignment and wisdom."

"My current view is that with a good initialization for deference, the AIs we defer to have a pretty high chance (>90%?) of successfully managing risks with only a small/moderate amount of lead time (e.g. 1 month)."

"Human level deference-goodness is a reasonable target... the AIs have (sufficiently elicited) capabilities, wisdom/epistemics, and judgment which are competitive with top human experts."

"While making deference go well in a rush seems difficult, we can (try to) use huge amounts of (controlled/supervised) AI labor to help."


Related Concepts & Terms

  • Deference-Goodness: Differential improvements to AI effectiveness/alignment on key tasks relative to general capabilities
  • Basin of Good Deference (BGD): Stability region where AIs improve their own and successors' alignment
  • Deference-Goodness Tax: Extra R&D effort needed to stay safely in BGD beyond capabilities work
  • Corrigibility: AI remains correctable, honest, and non-manipulative to designated human group
  • Scheming: AI deceptively aligning behavior while maintaining misaligned goals
  • Exogenous Risk: External threats (other AI projects, adversaries, misuse, non-AI risks)
  • Lead Time: Time advantage over competitors/threats to work on safety
  • Broad Alignment: General alignment including hard-to-check tasks, not just avoiding egregious misalignment
  • Capability Profile: Specific constellation of capabilities needed for safe deference
  • Behavioral Testing: Testing AI through observable behavior rather than internals
  • Elicitation: Getting AI to effectively apply its capabilities (alignment/capability boundary)

Document Classification

Type: Research Proposal / Strategic Analysis
Field: AI Safety / AI Alignment
Approach: Prosaic alignment strategies
Time Horizon: Near-term (2027-era scenarios)
Audience: AI researchers, AI company leadership, safety teams
Maturity: Exploratory (author notes incompleteness)
Dependencies: Scheming prevention (separate work)


Further Reading (Referenced in Paper)

  • "AI 2027" scenarios
  • "When does training a model change its goals?"
  • Control-style approaches for AI safety
  • MONA: Myopic Optimization with Non-myopic Approval
  • Singular learning theory
  • Basin of corrigibility concept
  • Recent work on avoiding scheming (forthcoming)

This summary created: February 13, 2026

Experiment Replication Protocol: RLFR (Reinforcement Learning from Feature Rewards)

Paper: Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
Authors: Prasad, Watts, Merullo, Gala, Lewis, McGrath, Lubana (Goodfire)
arXiv: 2602.10067 (Feb 10, 2026)

Note: This protocol is reconstructed from the blog post, abstract, and publicly available information. The full arXiv paper contains additional technical details (probe architectures, hyperparameters, extended ablations) that should be consulted before running experiments. Some details below are inferred from referenced methods and may need adjustment once the full paper is reviewed.


Overview

RLFR uses lightweight linear probes trained on a frozen model's internal activations as reward signals for RL, applied to hallucination reduction. The pipeline trains Gemma-3-12B-IT to detect, correct, and retract hallucinated claims.
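As a toy illustration of the central mechanism, a linear probe over residual-stream activations yields a scalar reward per claim. Everything here (shapes, weights, the `probe_reward` helper, the extraction point) is illustrative, not the paper's implementation:

```python
import numpy as np

def probe_reward(activations, w, b):
    """Score claim-level activation vectors with a linear probe.

    activations: (n_claims, d_model) residual-stream features taken at
    claim positions (the extraction point is a placeholder here).
    w, b: probe weights/bias trained to predict claim factuality.
    Returns per-claim rewards in (0, 1) (sigmoid of the probe logit).
    """
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example: 3 "claims" in a 4-dimensional activation space
rng = np.random.default_rng(0)
acts = rng.normal(size=(3, 4))
w = rng.normal(size=4)
rewards = probe_reward(acts, w, b=0.0)  # shape (3,), each in (0, 1)
```

During RL, rewards like these would be computed on the frozen copy of the model and fed into the policy-gradient loss; the actual probe architecture should be taken from the full paper.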


Phase 0: Infrastructure & Prerequisites

Hardware

  • GPU cluster capable of running Gemma-3-12B-IT for both inference and RL training
  • Estimated total training cost: ~$2,500 (as reported)
  • Separate GPU allocation for the frozen reward model (runs probes during RL)

Software & Dependencies

  • Gemma-3-12B-IT model weights (from Google/HuggingFace)
  • RL framework supporting ScaleRL / CISPO (see below)
  • Gemini 2.5 API access (with web search) for labeling
  • Probe training framework (PyTorch; likely simple linear classifiers on residual stream activations)

Key Reference Implementations

| Component | Reference | arXiv |
| --- | --- | --- |
| RL recipe (ScaleRL) | Khatri et al. | 2510.13786 |
| CISPO loss function | MiniMax-M1 paper | 2506.13585 |
| LongFact++ dataset | | 2509.03531 |
| Original LongFact | Wei et al. | 2403.18802 |

5.3 Ground Truth Evaluation

  • Use Gemini 2.5 (+ web search) as the oracle judge for final evaluation
  • Evaluate factual accuracy of each claim in generated responses
  • Compare hallucination rates across configurations
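The cross-configuration comparison in the last bullet reduces to aggregating per-claim oracle verdicts. A minimal sketch; the verdict data structure is assumed, not taken from the paper:

```python
def hallucination_rate(claim_verdicts):
    """claim_verdicts: one list of booleans per response, where True means
    the oracle judge marked the claim as supported/factual.
    Returns the fraction of all claims judged non-factual."""
    flat = [v for response in claim_verdicts for v in response]
    return sum(1 for v in flat if not v) / len(flat)

# Toy comparison across two configurations
base_verdicts = [[True, False, False], [True, True, False]]
rlfr_verdicts = [[True, True, False], [True, True, True]]
base_rate = hallucination_rate(base_verdicts)  # 3/6 = 0.5
rlfr_rate = hallucination_rate(rlfr_verdicts)  # 1/6
```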

5.4 Test-Time Scaling

  • Vary the number of candidate samples (N) for best-of-N at test time
  • Plot intervention success rate vs. sample count
  • Key finding to reproduce: probe signal remains useful post-training, enabling effective test-time scaling
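Best-of-N selection with the probe as the ranker can be sketched as below; aggregating per-claim scores by their mean is an assumption (the paper may rank candidates differently):

```python
import numpy as np

def best_of_n(samples, probe_scores):
    """Pick the candidate whose mean probe-predicted factuality is highest.

    samples: list of N generated responses.
    probe_scores: one array of per-claim probe scores per sample.
    """
    means = [float(np.mean(s)) for s in probe_scores]
    return samples[int(np.argmax(means))]

samples = ["response_a", "response_b", "response_c"]
scores = [np.array([0.2, 0.4]), np.array([0.9, 0.8]), np.array([0.5])]
best = best_of_n(samples, scores)  # "response_b" (mean score 0.85)
```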

5.5 Probe Transfer Test

  • Run reward probes (trained on base model activations) on the RL-trained model's activations
  • Verify that probes still work accurately (i.e., representations are stable enough to transfer)
  • This is a key result: if it holds, you don't need to host both models at inference time
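One way to quantify the transfer check, assuming cached activations from both models at matched positions (the names and the agreement metric are illustrative, not from the paper):

```python
import numpy as np

def probe_agreement(base_acts, trained_acts, w, b):
    """Fraction of positions where the probe's binary verdict (logit > 0)
    matches between base-model and RL-trained-model activations."""
    base_pred = (base_acts @ w + b) > 0
    trained_pred = (trained_acts @ w + b) > 0
    return float(np.mean(base_pred == trained_pred))

# Toy check: nearly identical representations should agree almost everywhere
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 8))
trained = base + 0.01 * rng.normal(size=(100, 8))
w = rng.normal(size=8)
agreement = probe_agreement(base, trained, w, b=0.0)  # close to 1.0
```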

5.6 Standard Benchmarks (Safety Check)

  • Run the RL-trained model on standard LLM benchmarks to verify no degradation
  • Check paper for specific benchmarks used (likely MMLU, HellaSwag, ARC, etc.)

Phase 6: Train-Time Scaling Analysis

  • Track hallucination reduction as a function of RL training steps
  • Plot policy reduction rate vs. optimizer steps
  • Verify monotonic improvement (the paper shows this curve)

Key Risks & Pitfalls

  1. Probe quality is foundational: If probes have poor accuracy, the entire pipeline fails. Validate probe precision/recall thoroughly before RL training.

  2. Frozen model hosting: You need to run both the student model AND a frozen copy simultaneously during training. Plan GPU memory accordingly.

  3. LongFact++ availability: If the dataset isn't publicly released, reconstruction from LongFact + expansion may introduce distributional differences.

  4. Gemini API costs for labeling: Phase 1.3 labeling can be expensive. Budget for ~20K prompts × multi-paragraph responses × fact-checking with web search.

  5. CISPO implementation details: The epsilon_high parameter and other CISPO hyperparameters matter. The MiniMax paper suggests epsilon_high ≈ 5.0 as a starting point.

  6. Inline intervention format: The exact format of corrections/retractions inserted inline (special tokens? natural language markers?) affects in-context reduction. Check paper for formatting details.


Checklist

  • Obtain Gemma-3-12B-IT weights
  • Obtain or reconstruct LongFact++ dataset (~20K prompts, 8 domains)
  • Set up Gemini 2.5 API access with web search
  • Generate on-policy rollouts from base model with activation caching
  • Label rollouts using Gemini 2.5 (entities, factuality, corrections, retractions)
  • Train 4 probes (localization, classification, correction reward, retraction reward)
  • Validate probe accuracy on held-out data
  • Build inference/monitoring pipeline
  • Implement ScaleRL + CISPO training loop
  • Implement probe-based reward computation on frozen model
  • Run RL training (~360 steps)
  • Evaluate on 999 held-out test prompts
  • Run ablations (RLFR, RLFR-NI, Base+Monitor, Base)
  • Test probe transfer (base → trained model)
  • Run standard benchmarks for degradation check
  • Analyze train-time and test-time scaling curves

References

  1. Prasad et al. "Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability." arXiv:2602.10067, 2026.
  2. Khatri et al. "The Art of Scaling Reinforcement Learning Compute for LLMs." arXiv:2510.13786, 2025. (ScaleRL)
  3. MiniMax. "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention." arXiv:2506.13585, 2025. (CISPO)
  4. Wei et al. "Long-form factuality in large language models." arXiv:2403.18802, 2024. (LongFact)
  5. arXiv:2509.03531. (LongFact++)
  6. Orgad et al. arXiv:2410.02707, 2024. (Prior work on hallucination signals in model internals)
