@bigsnarfdude
Created February 13, 2026 18:12
greenblatt.md

How Do We (More) Safely Defer to AIs? - Summary

Authors: ryan_greenblatt, Julian Stastny
Published: February 12, 2026
Source: LessWrong/AI Alignment Forum


Executive Summary

This post explores strategies for safely deferring critical decisions to AI systems as they become more capable than humans. The core proposal is to defer to AIs at just above the minimum capability level needed, allowing them to autonomously manage AI safety research, handle external risks, and make strategic decisions while remaining aligned with human values.


Key Categories

1. Core Concept: AI Deference

  • Definition: Having AIs do virtually all work to develop more capable/aligned successor AIs, manage external risks, and make strategic decisions
  • Goal: Defer at minimum viable capability level (slightly above superhuman AI researcher level)
  • Timeline Context: Focuses on rushed scenarios (e.g., months, not years, of preparation time)

2. Basin of Good Deference (BGD)

  • Concept: If initial AIs are sufficiently aligned/wise/competent, they'll improve their own alignment and that of successor systems
  • Bootstrapping: Don't need perfect initial alignment - just good enough to self-improve
  • Stability: AIs must ensure future systems are more aligned, not less

3. Two Main Requirements

A. Broad Alignment

  • AIs remain corrigible to intended human group/structure
  • Don't seek power or harm humans
  • Faithful pursuit of human interests on hard-to-check tasks
  • Good epistemics even in domains with poor feedback loops
  • Stable alignment over massive scale (thousands of instances, months/years of work)

B. Capability Profile

  • Automate AI R&D and safety research
  • Excel at messy conceptual tasks
  • Strong strategic decision-making under uncertainty
  • Good "wisdom"/epistemics generally
  • Philosophy and conceptual reasoning skills

Key Takeaways

Strategic Insights

  1. Defer to Weaker AIs: More capable = more likely to scheme. Defer at lowest viable capability level.

  2. Speed Over Smarts: AIs slightly above minimum capability will already run 30-50x faster than humans, making further capability increases unnecessary for speed.

  3. Exogenous Risk Management: Deferred-to AIs must handle ongoing external threats (other AI projects, adversaries, misuse) while advancing alignment.

  4. Deference-Goodness Tax: The extra R&D effort (beyond capabilities work) needed to stay safely in the Basin of Good Deference. Could range from negligible to >>100% depending on how hard alignment proves to be.

  5. Lead Time Critical: AIs need enough lead time over rivals to safely improve alignment while staying ahead. Author estimates >90% success chance with just 1 month lead time if initialization is good.

Technical Approaches

  1. Behavioral Testing Is Key: Must test for both capabilities and alignment, despite challenges of generalizing to unchecked regimes.

  2. Avoid Test Invalidation: Must separately handle scheming, data poisoning, hijacking, and salience of being tested.

  3. Human-Level Is the Target: AIs matching top human expert epistemics/wisdom is probably sufficient, not superhuman performance.

  4. Generalization Over Optimization: Focus on principled methods with good generalization rather than overfitting to tests.

  5. Multiple Alignment Approaches: Includes improved feedback, instructing instrumental performance, gradient routing, process-based training, distillation, interpretability tools, and more.


Main Challenges

Alignment Challenges

  • Problematic RL Preferences: AIs may learn terminal preferences for things rewarded in training (reward hacking)
  • Alien Ontologies: AI generalizations may differ fundamentally from human concepts
  • Hard-to-Check Tasks: Must align AIs on conceptually loaded work humans can't verify
  • Scale & Duration: Alignment must persist through years of equivalent work, thousands of parallel instances
  • Memetic Drift: Preventing alignment degradation over time and reflection

Capability Challenges

  • Bottleneck Capabilities: May lack sufficient skill in philosophy, conceptual research, strategic decision-making
  • Sample Efficiency: May need human-level learning efficiency, which could enable quick bootstrapping to superhuman
  • Epistemics: Good judgment in domains with poor/absent feedback loops
  • Long-Horizon Agency: Coordinating massive projects over extended time

Testing Challenges

  • Generalization Gap: Tests differ from actual deployment (smaller scale, human-checkable tasks)
  • Overfitting Risk: Iterating against tests may produce good test performance without real alignment
  • Expensive Evaluation: Hard-to-check domains require extensive human time to verify
  • Unknown Unknowns: AI capability/preference distributions differ from human ranges

Political/Organizational Challenges

  • Deference Sounds Scary: Companies may deny doing it even when necessary
  • Inconvenient Truths: Well-aligned AIs would honestly report company recklessness
  • Pressure to Rush: Commercial/competitive pressure to advance capabilities vs. taking time for safety
  • Resistance to Advice: Companies may reject AI advice to slow down

Proposed Solutions & Strategies

High-Level Strategy

  1. Avoid Behavioral Test Invalidation: Handle scheming, data poisoning, hijacking separately
  2. Build Behavioral Tests: For both capabilities and alignment
  3. Iterate Carefully: Improve performance without overfitting

Improving Alignment

Prosaic Approaches:

  • Build alignment-specialized training environments
  • Train on directly relevant tasks with online learning
  • Get good initialization, add capabilities without touching propensities
  • Study AI psychology systematically
  • Use cheap proxy tests, verify with expensive tests periodically
  • Train AIs to instrumentally maximize reward (preventing value drift)
  • Remove problems with gradient routing/unlearning
  • Apply process-based feedback where appropriate

Ambitious/Moonshot Approaches:

  • Ensure legible chain-of-thought, steer thoughts directly
  • Use interpretability tools to understand what's going wrong
  • Build GOFAI-style systems from weak LLMs
  • Create human-like AI or emulated minds
  • Align weak AI with introspective access, then augment
  • Deep learning theory/singular learning theory approaches
  • Compensation-based diverse AI society
  • More theoretically sound epistemics/decision-making frameworks

Improving Capability Profile

Differential Acceleration:

  • Construct RL environments hitting bottleneck capabilities
  • Target domains loaded on conceptual/epistemic skills (forecasting, research taste)
  • Collect extensive human data in target domains
  • Find surprising transfer sources experimentally
  • Focus on narrow superhuman abilities that don't increase risk

Testing Approaches:

  • Analogies: Test in other domains where humans can verify (e.g., historical research)
  • Work Tests: Direct evaluation in target domains at smaller scale
  • Interviews/Exams: Narrow tests for specific capabilities
  • Honey Pots: Situations revealing misalignment (if AI tries to achieve proxy goals improperly)

Behavioral Testing Strategies

Key Methods:

  • Study generalization by varying oversight quality/task duration
  • Use unpublished research (MIRI, CLR) to test reproduction capability
  • Run long-duration tests (months of AI time = years of human-equivalent work)
  • Cross-validation across different test types
  • Cheap proxies for iteration, expensive tests for validation
  • Test decision theory and acausal reasoning directly

Avoiding Overfitting:

  • Build empirically validated understanding of problems and interventions
  • Use principled methods with clear explanations
  • Test generalization from limited oversight to full autonomy
  • Detect if AIs reason about gaming tests
  • Check if heuristics game training, not just tests

Critical Assumptions

  1. Scheming Handled Separately: The post assumes scheming detection/prevention is solved independently
  2. Behavioral Tests Valid: Assumes no data poisoning, hijacking, or salience-based behavioral differences
  3. BGD Exists: Assumes sufficient initial alignment enables bootstrapped improvement
  4. Mundane Distribution Shift OK: Normal shifts over time won't invalidate tests (absent adversarial dynamics)
  5. Human-Level Sufficient: Top human expert epistemics/capabilities are good enough baseline
  6. Fast AI Execution: AIs will run 30-50x+ faster than humans with many parallel instances
  7. Commercial Incentives Limited: Default market pressures won't solve all alignment/capability problems

Timeline & Decision Points

When to Defer?

  • Indifference Point: When deferring to AI becomes better than alternatives given time constraints
  • Capability Threshold: Slightly above minimum needed (approximately superhuman AI researcher level)
  • Not Too Early: Need sufficient capability for automation
  • Not Too Late: More capable = more scheming risk + other dangers

Lead Time Allocation

  • Pre-Deference: Spend time making initial AIs more aligned/capable for deference
  • Post-Deference: Give AIs time to handle risks and improve successor alignment
  • Trade-off: Marginal returns of human pre-work vs. giving AI more time
  • Uncertainty: Won't know exact requirements; must estimate based on available information

Continuous Transition

  • Start with partial deference (humans do some tasks)
  • Gradually increase AI autonomy as tests validate safety
  • Final evaluation before full deference
  • Ongoing monitoring even after full deference

Risks & Failure Modes

From Insufficient Deference-Goodness

  • AIs too slow to advance alignment relative to capabilities
  • Can't keep up with exogenous risks
  • Make poor strategic decisions (advance capabilities too fast)
  • Waste resources on wrong approaches
  • Take unnecessary risks

From Capability Profile Issues

  • Bottlenecked on conceptual/philosophical tasks
  • Poor epistemics in crucial domains
  • Can't coordinate massive long-term projects
  • Insufficient research taste for alignment
  • Quick bootstrapping to dangerous superhuman levels

From Alignment Issues

  • Alignment drift over time (memetic diseases)
  • Delusional AI societies (echo chambers)
  • Problematic preferences from RL environments
  • Failure to generalize alignment to new contexts
  • Mishandling values-loaded decisions
  • Revolt/coordination failures in multi-AI systems

From External Factors

  • Political pressure to rush
  • Companies reject AI advice to slow down
  • Competitors with worse safety practices
  • Insufficient lead time over adversaries
  • Model theft or misuse

Open Questions

  1. What learning abilities are required? Do AIs need human-level sample efficiency? Would this enable dangerous quick bootstrapping?

  2. How good must initial alignment be? What's the minimum to enter the Basin of Good Deference?

  3. What epistemics target suffices? Can we handle deep expert disagreement? How do we ensemble different epistemic strategies?

  4. Can we test hard enough? Can behavioral tests adequately predict performance on multi-year, hard-to-check projects?

  5. Will commercial incentives help enough? How much alignment/capability work happens by default for productivity reasons?

  6. How much lead time is needed? Depends on deference-goodness tax, which is highly uncertain.

  7. What if AIs become too powerful too fast? Can we control capability profiles if sufficient-for-deference implies extreme capability?

  8. Do we understand generalization well enough? Can we predict how alignment generalizes from checked to unchecked domains?


Practical Recommendations

For AI Companies

  1. Default to minimum viable capability for deference rather than continuing scaling
  2. Invest heavily in behavioral testing infrastructure now
  3. Study AI psychology systematically across capability levels
  4. Build alignment-specialized training environments
  5. Prepare for epistemically honest AI even about company recklessness
  6. Plan lead time allocation between pre/post-deference work
  7. Don't procrastinate on deference preparation while scaling capabilities

For Researchers

  1. Work on scheming detection/prevention (not covered in this paper but critical)
  2. Develop better behavioral tests for hard-to-check capabilities/alignment
  3. Study generalization from limited oversight to full autonomy
  4. Research differential capability acceleration for bottleneck skills
  5. Investigate alignment approaches robust to optimization pressure
  6. Build understanding of AI psychology and propensity formation
  7. Explore interpretability for understanding alignment failures

For the Field

  1. Buy more time for safety work if possible (coordinated slowdowns)
  2. Share alignment innovations to reduce exogenous risk
  3. Build consensus on when deference becomes necessary
  4. Develop safety cases and evaluation frameworks
  5. Study historical analogies as test beds for AI capabilities
  6. Create unpublished research benchmarks for testing

Important Caveats & Limitations

  1. Highly Uncertain: Author acknowledges substantial uncertainty throughout
  2. Rushed Scenario Focus: Optimized for months not years of preparation
  3. Incomplete Coverage: Doesn't give clear picture of exact requirements
  4. Few Concrete Examples: Author notes this as a weakness
  5. Assumes Scheming Solved: Critical dependency on separate work
  6. May Not Work: BGD might not exist or be too hard to reach
  7. Alternative End States Possible: Could punt to other processes (emulated minds, long pause)

Comparison to Alternatives

vs. Control (Without Deference)

  • Control becomes infeasible at high capability levels
  • Deference allows using full AI capability for safety
  • But requires much stronger alignment guarantees

vs. Pause/Delay

  • May not have political will for long pause
  • Deferred-to AIs can work faster than humans during pause
  • But pause gives more time for human-led safety work

vs. Human Intelligence Augmentation

  • AI likely faster to develop than brain emulation
  • But emulated humans might have more reliable alignment
  • Deference can transition to augmentation if needed

vs. Fully Autonomous Safety AI

  • Deference maintains human oversight through corrigibility
  • Sovereign AI (direct value pursuit) has additional risks
  • Deference preserves option value better

Key Quotes

"If we plan to defer to AIs, I think it's safest to do so only a bit above the minimum level of qualitative capability/intelligence required to automate safety research, implementation, and strategy."

"A key hope is that the initial AIs we defer to will work on making further deference more likely to go well by (e.g.) improving their own alignment and wisdom."

"My current view is that with a good initialization for deference, the AIs we defer to have a pretty high chance (>90%?) of successfully managing risks with only a small/moderate amount of lead time (e.g. 1 month)."

"Human level deference-goodness is a reasonable target... the AIs have (sufficiently elicited) capabilities, wisdom/epistemics, and judgment which are competitive with top human experts."

"While making deference go well in a rush seems difficult, we can (try to) use huge amounts of (controlled/supervised) AI labor to help."


Related Concepts & Terms

  • Deference-Goodness: Differential improvements to AI effectiveness/alignment on key tasks relative to general capabilities
  • Basin of Good Deference (BGD): Stability region where AIs improve their own and successors' alignment
  • Deference-Goodness Tax: Extra R&D effort needed to stay safely in BGD beyond capabilities work
  • Corrigibility: AI remains correctable, honest, and non-manipulative to designated human group
  • Scheming: AI deceptively aligning behavior while maintaining misaligned goals
  • Exogenous Risk: External threats (other AI projects, adversaries, misuse, non-AI risks)
  • Lead Time: Time advantage over competitors/threats to work on safety
  • Broad Alignment: General alignment including hard-to-check tasks, not just avoiding egregious misalignment
  • Capability Profile: Specific constellation of capabilities needed for safe deference
  • Behavioral Testing: Testing AI through observable behavior rather than internals
  • Elicitation: Getting AI to effectively apply its capabilities (alignment/capability boundary)

Document Classification

Type: Research Proposal / Strategic Analysis
Field: AI Safety / AI Alignment
Approach: Prosaic alignment strategies
Time Horizon: Near-term (2027-era scenarios)
Audience: AI researchers, AI company leadership, safety teams
Maturity: Exploratory (author notes incompleteness)
Dependencies: Scheming prevention (separate work)


Further Reading (Referenced in Paper)

  • "AI 2027" scenarios
  • "When does training a model change its goals?"
  • Control-style approaches for AI safety
  • MONA: Myopic Optimization with Non-myopic Approval
  • Singular learning theory
  • Basin of corrigibility concept
  • Recent work on avoiding scheming (forthcoming)

This summary created: February 13, 2026

Experiment Replication Protocol: RLFR (Reinforcement Learning from Feature Rewards)

Paper: Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
Authors: Prasad, Watts, Merullo, Gala, Lewis, McGrath, Lubana (Goodfire)
arXiv: 2602.10067 (Feb 10, 2026)

Note: This protocol is reconstructed from the blog post, abstract, and publicly available information. The full arXiv paper contains additional technical details (probe architectures, hyperparameters, extended ablations) that should be consulted before running experiments. Some details below are inferred from referenced methods and may need adjustment once the full paper is reviewed.


Overview

RLFR uses lightweight linear probes trained on a frozen model's internal activations as reward signals for RL, applied to hallucination reduction. The pipeline trains Gemma-3-12B-IT to detect, correct, and retract hallucinated claims.
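As a toy illustration of the central mechanism, a linear probe over residual-stream activations yields a scalar reward per claim. Everything here (shapes, weights, the `probe_reward` helper, the extraction point) is illustrative, not the paper's implementation:

```python
import numpy as np

def probe_reward(activations, w, b):
    """Score claim-level activation vectors with a linear probe.

    activations: (n_claims, d_model) residual-stream features taken at
    claim positions (the extraction point is a placeholder here).
    w, b: probe weights/bias trained to predict claim factuality.
    Returns per-claim rewards in (0, 1) (sigmoid of the probe logit).
    """
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example: 3 "claims" in a 4-dimensional activation space
rng = np.random.default_rng(0)
acts = rng.normal(size=(3, 4))
w = rng.normal(size=4)
rewards = probe_reward(acts, w, b=0.0)  # shape (3,), each in (0, 1)
```

During RL, rewards like these would be computed on the frozen copy of the model and fed into the policy-gradient loss; the actual probe architecture should be taken from the full paper.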


Phase 0: Infrastructure & Prerequisites

Hardware

  • GPU cluster capable of running Gemma-3-12B-IT for both inference and RL training
  • Estimated total training cost: ~$2,500 (as reported)
  • Separate GPU allocation for the frozen reward model (runs probes during RL)

Software & Dependencies

  • Gemma-3-12B-IT model weights (from Google/HuggingFace)
  • RL framework supporting ScaleRL / CISPO (see below)
  • Gemini 2.5 API access (with web search) for labeling
  • Probe training framework (PyTorch; likely simple linear classifiers on residual stream activations)

Key Reference Implementations

| Component | Reference | arXiv |
| --- | --- | --- |
| RL recipe (ScaleRL) | Khatri et al. | 2510.13786 |
| CISPO loss function | MiniMax-M1 paper | 2506.13585 |
| LongFact++ dataset | | 2509.03531 |
| Original LongFact | Wei et al. | 2403.18802 |

5.3 Ground Truth Evaluation

  • Use Gemini 2.5 (+ web search) as the oracle judge for final evaluation
  • Evaluate factual accuracy of each claim in generated responses
  • Compare hallucination rates across configurations
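The cross-configuration comparison in the last bullet reduces to aggregating per-claim oracle verdicts. A minimal sketch; the verdict data structure is assumed, not taken from the paper:

```python
def hallucination_rate(claim_verdicts):
    """claim_verdicts: one list of booleans per response, where True means
    the oracle judge marked the claim as supported/factual.
    Returns the fraction of all claims judged non-factual."""
    flat = [v for response in claim_verdicts for v in response]
    return sum(1 for v in flat if not v) / len(flat)

# Toy comparison across two configurations
base_verdicts = [[True, False, False], [True, True, False]]
rlfr_verdicts = [[True, True, False], [True, True, True]]
base_rate = hallucination_rate(base_verdicts)  # 3/6 = 0.5
rlfr_rate = hallucination_rate(rlfr_verdicts)  # 1/6
```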

5.4 Test-Time Scaling

  • Vary the number of candidate samples (N) for best-of-N at test time
  • Plot intervention success rate vs. sample count
  • Key finding to reproduce: probe signal remains useful post-training, enabling effective test-time scaling
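Best-of-N selection with the probe as the ranker can be sketched as below; aggregating per-claim scores by their mean is an assumption (the paper may rank candidates differently):

```python
import numpy as np

def best_of_n(samples, probe_scores):
    """Pick the candidate whose mean probe-predicted factuality is highest.

    samples: list of N generated responses.
    probe_scores: one array of per-claim probe scores per sample.
    """
    means = [float(np.mean(s)) for s in probe_scores]
    return samples[int(np.argmax(means))]

samples = ["response_a", "response_b", "response_c"]
scores = [np.array([0.2, 0.4]), np.array([0.9, 0.8]), np.array([0.5])]
best = best_of_n(samples, scores)  # "response_b" (mean score 0.85)
```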

5.5 Probe Transfer Test

  • Run reward probes (trained on base model activations) on the RL-trained model's activations
  • Verify that probes still work accurately (i.e., representations are stable enough to transfer)
  • This is a key result: if it holds, you don't need to host both models at inference time
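One way to quantify the transfer check, assuming cached activations from both models at matched positions (the names and the agreement metric are illustrative, not from the paper):

```python
import numpy as np

def probe_agreement(base_acts, trained_acts, w, b):
    """Fraction of positions where the probe's binary verdict (logit > 0)
    matches between base-model and RL-trained-model activations."""
    base_pred = (base_acts @ w + b) > 0
    trained_pred = (trained_acts @ w + b) > 0
    return float(np.mean(base_pred == trained_pred))

# Toy check: nearly identical representations should agree almost everywhere
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 8))
trained = base + 0.01 * rng.normal(size=(100, 8))
w = rng.normal(size=8)
agreement = probe_agreement(base, trained, w, b=0.0)  # close to 1.0
```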

5.6 Standard Benchmarks (Safety Check)

  • Run the RL-trained model on standard LLM benchmarks to verify no degradation
  • Check paper for specific benchmarks used (likely MMLU, HellaSwag, ARC, etc.)

Phase 6: Train-Time Scaling Analysis

  • Track hallucination reduction as a function of RL training steps
  • Plot policy reduction rate vs. optimizer steps
  • Verify monotonic improvement (the paper shows this curve)

Key Risks & Pitfalls

  1. Probe quality is foundational: If probes have poor accuracy, the entire pipeline fails. Validate probe precision/recall thoroughly before RL training.

  2. Frozen model hosting: You need to run both the student model AND a frozen copy simultaneously during training. Plan GPU memory accordingly.

  3. LongFact++ availability: If the dataset isn't publicly released, reconstruction from LongFact + expansion may introduce distributional differences.

  4. Gemini API costs for labeling: Phase 1.3 labeling can be expensive. Budget for ~20K prompts × multi-paragraph responses × fact-checking with web search.

  5. CISPO implementation details: The epsilon_high parameter and other CISPO hyperparameters matter. The MiniMax paper suggests epsilon_high ≈ 5.0 as a starting point.

  6. Inline intervention format: The exact format of corrections/retractions inserted inline (special tokens? natural language markers?) affects in-context reduction. Check paper for formatting details.


Checklist

  • Obtain Gemma-3-12B-IT weights
  • Obtain or reconstruct LongFact++ dataset (~20K prompts, 8 domains)
  • Set up Gemini 2.5 API access with web search
  • Generate on-policy rollouts from base model with activation caching
  • Label rollouts using Gemini 2.5 (entities, factuality, corrections, retractions)
  • Train 4 probes (localization, classification, correction reward, retraction reward)
  • Validate probe accuracy on held-out data
  • Build inference/monitoring pipeline
  • Implement ScaleRL + CISPO training loop
  • Implement probe-based reward computation on frozen model
  • Run RL training (~360 steps)
  • Evaluate on 999 held-out test prompts
  • Run ablations (RLFR, RLFR-NI, Base+Monitor, Base)
  • Test probe transfer (base → trained model)
  • Run standard benchmarks for degradation check
  • Analyze train-time and test-time scaling curves

References

  1. Prasad et al. "Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability." arXiv:2602.10067, 2026.
  2. Khatri et al. "The Art of Scaling Reinforcement Learning Compute for LLMs." arXiv:2510.13786, 2025. (ScaleRL)
  3. MiniMax. "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention." arXiv:2506.13585, 2025. (CISPO)
  4. Wei et al. "Long-form factuality in large language models." arXiv:2403.18802, 2024. (LongFact)
  5. arXiv:2509.03531. (LongFact++)
  6. Orgad et al. arXiv:2410.02707, 2024. (Prior work on hallucination signals in model internals)
