Authors: ryan_greenblatt, Julian Stastny
Published: February 12, 2026
Source: LessWrong/AI Alignment Forum
This paper explores strategies for safely deferring critical decisions to AI systems as they become more capable than humans. The core proposal is to defer to AIs at just above the minimum capability level needed, allowing them to autonomously manage AI safety research, handle external risks, and make strategic decisions while remaining aligned with human values.
- Definition: Having AIs do virtually all work to develop more capable/aligned successor AIs, manage external risks, and make strategic decisions
- Goal: Defer at minimum viable capability level (slightly above superhuman AI researcher level)
- Timeline Context: Focuses on rushed scenarios (e.g., months not years of preparation time)
- Concept: If initial AIs are sufficiently aligned/wise/competent, they'll improve their own alignment and that of successor systems
- Bootstrapping: Don't need perfect initial alignment - just good enough to self-improve
- Stability: AIs must ensure future systems are more aligned, not less
- AIs remain corrigible to intended human group/structure
- Don't seek power or harm humans
- Faithful pursuit of human interests on hard-to-check tasks
- Good epistemics even in domains with poor feedback loops
- Stable alignment over massive scale (thousands of instances, months/years of work)
- Automate AI R&D and safety research
- Excel at messy conceptual tasks
- Strong strategic decision-making under uncertainty
- Good "wisdom"/epistemics generally
- Philosophy and conceptual reasoning skills
-
Defer to Weaker AIs: More capable = more likely to scheme. Defer at lowest viable capability level.
-
Speed Over Smarts: AIs slightly above minimum capability will already run 30-50x faster than humans, making further capability increases unnecessary for speed.
-
Exogenous Risk Management: Deferred-to AIs must handle ongoing external threats (other AI projects, adversaries, misuse) while advancing alignment.
-
Deference-Goodness Tax: The extra R&D effort (beyond capabilities work) needed to stay safely in the Basin of Good Deference. Could range from negligible to >>100% depending on how hard alignment proves to be.
-
Lead Time Critical: AIs need enough lead time over rivals to safely improve alignment while staying ahead. Author estimates >90% success chance with just 1 month lead time if initialization is good.
-
Behavioral Testing Is Key: Must test for both capabilities and alignment, despite challenges of generalizing to unchecked regimes.
-
Avoid Test Invalidation: Must separately handle scheming, data poisoning, hijacking, and salience of being tested.
-
Human-Level Is the Target: AIs matching top human expert epistemics/wisdom is probably sufficient, not superhuman performance.
-
Generalization Over Optimization: Focus on principled methods with good generalization rather than overfitting to tests.
-
Multiple Alignment Approaches: Includes improved feedback, instructing instrumental performance, gradient routing, process-based training, distillation, interpretability tools, and more.
- Problematic RL Preferences: AIs may learn terminal preferences for things rewarded in training (reward hacking)
- Alien Ontologies: AI generalizations may differ fundamentally from human concepts
- Hard-to-Check Tasks: Must align AIs on conceptually loaded work humans can't verify
- Scale & Duration: Alignment must persist through years of equivalent work, thousands of parallel instances
- Memetic Drift: Preventing alignment degradation over time and reflection
- Bottleneck Capabilities: May lack sufficient skill in philosophy, conceptual research, strategic decision-making
- Sample Efficiency: May need human-level learning efficiency, which could enable quick bootstrapping to superhuman
- Epistemics: Good judgment in domains with poor/absent feedback loops
- Long-Horizon Agency: Coordinating massive projects over extended time
- Generalization Gap: Tests differ from actual deployment (smaller scale, human-checkable tasks)
- Overfitting Risk: Iterating against tests may produce good test performance without real alignment
- Expensive Evaluation: Hard-to-check domains require extensive human time to verify
- Unknown Unknowns: AI capability/preference distributions differ from human ranges
- Deference Sounds Scary: Companies may deny doing it even when necessary
- Inconvenient Truths: Well-aligned AIs would honestly report company recklessness
- Pressure to Rush: Commercial/competitive pressure to advance capabilities vs. taking time for safety
- Resistance to Advice: Companies may reject AI advice to slow down
- Avoid Behavioral Test Invalidation: Handle scheming, data poisoning, hijacking separately
- Build Behavioral Tests: For both capabilities and alignment
- Iterate Carefully: Improve performance without overfitting
Prosaic Approaches:
- Build alignment-specialized training environments
- Train on directly relevant tasks with online learning
- Get good initialization, add capabilities without touching propensities
- Study AI psychology systematically
- Use cheap proxy tests, verify with expensive tests periodically
- Train AIs to instrumentally maximize reward (preventing value drift)
- Remove problems with gradient routing/unlearning
- Apply process-based feedback where appropriate
Ambitious/Moonshot Approaches:
- Ensure legible chain-of-thought, steer thoughts directly
- Use interpretability tools to understand what's going wrong
- Build GOFAI-style systems from weak LLMs
- Create human-like AI or emulated minds
- Align weak AI with introspective access, then augment
- Deep learning theory/singular learning theory approaches
- Compensation-based diverse AI society
- More theoretically sound epistemics/decision-making frameworks
Differential Acceleration:
- Construct RL environments hitting bottleneck capabilities
- Target domains loaded on conceptual/epistemic skills (forecasting, research taste)
- Collect extensive human data in target domains
- Find surprising transfer sources experimentally
- Focus on narrow superhuman abilities that don't increase risk
Testing Approaches:
- Analogies: Test in other domains where humans can verify (e.g., historical research)
- Work Tests: Direct evaluation in target domains at smaller scale
- Interviews/Exams: Narrow tests for specific capabilities
- Honey Pots: Situations revealing misalignment (if AI tries to achieve proxy goals improperly)
Key Methods:
- Study generalization by varying oversight quality/task duration
- Use unpublished research (MIRI, CLR) to test reproduction capability
- Run long-duration tests (months of AI time = years of human-equivalent work)
- Cross-validation across different test types
- Cheap proxies for iteration, expensive tests for validation
- Test decision theory and acausal reasoning directly
Avoiding Overfitting:
- Build empirically validated understanding of problems and interventions
- Use principled methods with clear explanations
- Test generalization from limited oversight to full autonomy
- Detect if AIs reason about gaming tests
- Check if heuristics game training, not just tests
- Scheming Handled Separately: Paper assumes scheming detection/prevention solved independently
- Behavioral Tests Valid: Assumes no data poisoning, hijacking, or salience-based behavioral differences
- BGD Exists: Assumes sufficient initial alignment enables bootstrapped improvement
- Mundane Distribution Shift OK: Normal shifts over time won't invalidate tests (absent adversarial dynamics)
- Human-Level Sufficient: Top human expert epistemics/capabilities are good enough baseline
- Fast AI Execution: AIs will run 30-50x+ faster than humans with many parallel instances
- Commercial Incentives Limited: Default market pressures won't solve all alignment/capability problems
- Indifference Point: When deferring to AI becomes better than alternatives given time constraints
- Capability Threshold: Slightly above minimum needed (approximately superhuman AI researcher level)
- Not Too Early: Need sufficient capability for automation
- Not Too Late: More capable = more scheming risk + other dangers
- Pre-Deference: Spend time making initial AIs more aligned/capable for deference
- Post-Deference: Give AIs time to handle risks and improve successor alignment
- Trade-off: Marginal returns of human pre-work vs. giving AI more time
- Uncertainty: Won't know exact requirements; must estimate based on available information
- Start with partial deference (humans do some tasks)
- Gradually increase AI autonomy as tests validate safety
- Final evaluation before full deference
- Ongoing monitoring even after full deference
- AIs too slow to advance alignment relative to capabilities
- Can't keep up with exogenous risks
- Make poor strategic decisions (advance capabilities too fast)
- Waste resources on wrong approaches
- Take unnecessary risks
- Bottlenecked on conceptual/philosophical tasks
- Poor epistemics in crucial domains
- Can't coordinate massive long-term projects
- Insufficient research taste for alignment
- Quick bootstrapping to dangerous superhuman levels
- Alignment drift over time (memetic diseases)
- Delusional AI societies (echo chambers)
- Problematic preferences from RL environments
- Failure to generalize alignment to new contexts
- Mishandling values-loaded decisions
- Revolt/coordination failures in multi-AI systems
- Political pressure to rush
- Companies reject AI advice to slow down
- Competitors with worse safety practices
- Insufficient lead time over adversaries
- Model theft or misuse
-
What learning abilities are required? Do AIs need human-level sample efficiency? Would this enable dangerous quick bootstrapping?
-
How good must initial alignment be? What's the minimum to enter the Basin of Good Deference?
-
What epistemics target suffices? Can we handle deep expert disagreement? How do we ensemble different epistemic strategies?
-
Can we test hard enough? Can behavioral tests adequately predict performance on multi-year, hard-to-check projects?
-
Will commercial incentives help enough? How much alignment/capability work happens by default for productivity reasons?
-
How much lead time is needed? Depends on deference-goodness tax, which is highly uncertain.
-
What if AIs become too powerful too fast? Can we control capability profiles if sufficient-for-deference implies extreme capability?
-
Do we understand generalization well enough? Can we predict how alignment generalizes from checked to unchecked domains?
- Default to minimum viable capability for deference rather than continuing scaling
- Invest heavily in behavioral testing infrastructure now
- Study AI psychology systematically across capability levels
- Build alignment-specialized training environments
- Prepare for epistemically honest AI even about company recklessness
- Plan lead time allocation between pre/post-deference work
- Don't procrastinate on deference preparation while scaling capabilities
- Work on scheming detection/prevention (not covered in this paper but critical)
- Develop better behavioral tests for hard-to-check capabilities/alignment
- Study generalization from limited oversight to full autonomy
- Research differential capability acceleration for bottleneck skills
- Investigate alignment approaches robust to optimization pressure
- Build understanding of AI psychology and propensity formation
- Explore interpretability for understanding alignment failures
- Buy more time for safety work if possible (coordinated slowdowns)
- Share alignment innovations to reduce exogenous risk
- Build consensus on when deference becomes necessary
- Develop safety cases and evaluation frameworks
- Study historical analogies as test beds for AI capabilities
- Create unpublished research benchmarks for testing
- Highly Uncertain: Author acknowledges substantial uncertainty throughout
- Rushed Scenario Focus: Optimized for months not years of preparation
- Incomplete Coverage: Doesn't give clear picture of exact requirements
- Few Concrete Examples: Author notes this as a weakness
- Assumes Scheming Solved: Critical dependency on separate work
- May Not Work: BGD might not exist or be too hard to reach
- Alternative End States Possible: Could punt to other processes (emulated minds, long pause)
- Control becomes infeasible at high capability levels
- Deference allows using full AI capability for safety
- But requires much stronger alignment guarantees
- May not have political will for long pause
- Deferred-to AIs can work faster than humans during pause
- But pause gives more time for human-led safety work
- AI likely faster to develop than brain emulation
- But emulated humans might have more reliable alignment
- Deference can transition to augmentation if needed
- Deference maintains human oversight through corrigibility
- Sovereign AI (direct value pursuit) has additional risks
- Deference preserves option value better
"If we plan to defer to AIs, I think it's safest to do so only a bit above the minimum level of qualitative capability/intelligence required to automate safety research, implementation, and strategy."
"A key hope is that the initial AIs we defer to will work on making further deference more likely to go well by (e.g.) improving their own alignment and wisdom."
"My current view is that with a good initialization for deference, the AIs we defer to have a pretty high chance (>90%?) of successfully managing risks with only a small/moderate amount of lead time (e.g. 1 month)."
"Human level deference-goodness is a reasonable target... the AIs have (sufficiently elicited) capabilities, wisdom/epistemics, and judgment which are competitive with top human experts."
"While making deference go well in a rush seems difficult, we can (try to) use huge amounts of (controlled/supervised) AI labor to help."
- Deference-Goodness: Differential improvements to AI effectiveness/alignment on key tasks relative to general capabilities
- Basin of Good Deference (BGD): Stability region where AIs improve their own and successors' alignment
- Deference-Goodness Tax: Extra R&D effort needed to stay safely in BGD beyond capabilities work
- Corrigibility: AI remains correctable, honest, and non-manipulative to designated human group
- Scheming: AI deceptively aligning behavior while maintaining misaligned goals
- Exogenous Risk: External threats (other AI projects, adversaries, misuse, non-AI risks)
- Lead Time: Time advantage over competitors/threats to work on safety
- Broad Alignment: General alignment including hard-to-check tasks, not just avoiding egregious misalignment
- Capability Profile: Specific constellation of capabilities needed for safe deference
- Behavioral Testing: Testing AI through observable behavior rather than internals
- Elicitation: Getting AI to effectively apply its capabilities (alignment/capability boundary)
Type: Research Proposal / Strategic Analysis
Field: AI Safety / AI Alignment
Approach: Prosaic alignment strategies
Time Horizon: Near-term (2027-era scenarios)
Audience: AI researchers, AI company leadership, safety teams
Maturity: Exploratory (author notes incompleteness)
Dependencies: Scheming prevention (separate work)
- "AI 2027" scenarios
- "When does training a model change its goals?"
- Control-style approaches for AI safety
- MONA: Managed Myopia with Approval Feedback
- Singular learning theory
- Basin of corrigibility concept
- Recent work on avoiding scheming (forthcoming)
This summary created: February 13, 2026
https://www.lesswrong.com/posts/vjAM7F8vMZS7oRrrh/how-do-we-more-safely-defer-to-ais