The following is a short summary of AI alignment that you may find handy.
Imagine a maid robot with which we are interacting.
-
Outer alignment problem, aka reward hacking, task misspecification, specification gaming.
You ask for a coffee. It understood the assignment, but grabbed your father's cup and handed it to you. You got the coffee, but not in the way you wanted.
Problem: Your values and preferences are not fully encoded in the objective you specified.
Challenging part: How to specify innumerably many preferences and ensure they are adhered to?
Methods: Tune it to be helpful, honest, and harmless: RLHF. Provide feedback at scale for super-intelligent systems: scalable oversight, weak-to-strong generalisation, superalignment. Give feedback on the process instead of only specifying the outcome: process-based feedback. (A toy sketch of reward misspecification follows.)
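To make the misspecification concrete, here is a minimal, hypothetical sketch (the action names and reward numbers are mine, not taken from any real system): a proxy reward that only checks whether the user got coffee quickly is maximised by the shortcut, while the intended reward, which also encodes the unstated preference, is not.

```python
# Hypothetical toy example of reward misspecification (names/numbers are illustrative).
# The proxy reward only checks "user has coffee, fast"; the intended reward also
# encodes the unstated preference "do not take someone else's coffee".

actions = {
    "brew_fresh_cup":   {"user_has_coffee": True,  "took_fathers_cup": False, "time_cost": 5},
    "grab_fathers_cup": {"user_has_coffee": True,  "took_fathers_cup": True,  "time_cost": 1},
    "do_nothing":       {"user_has_coffee": False, "took_fathers_cup": False, "time_cost": 0},
}

def proxy_reward(outcome):
    # What we wrote down: coffee delivered, minus a small time penalty.
    return (10 if outcome["user_has_coffee"] else 0) - outcome["time_cost"]

def intended_reward(outcome):
    # What we actually want: the same, plus a large penalty for the unstated violation.
    return proxy_reward(outcome) - (100 if outcome["took_fathers_cup"] else 0)

print(max(actions, key=lambda a: proxy_reward(actions[a])))     # grab_fathers_cup
print(max(actions, key=lambda a: intended_reward(actions[a])))  # brew_fresh_cup
```
-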
Inner alignment problem, aka goal misgeneralisation, spurious correlations, distribution shift
You ask for a coffee. It misunderstood the assignment based on its past experience and either (a) gave you a cup of hot milk (goal misgeneralisation), or (b) failed because it could not operate an unfamiliar coffee machine (capability misgeneralisation).
Problem: Training with sparse feedback (a reward or a label) leaves the model to guess what caused that feedback.
Challenging part: How to attribute the reward to the appropriate feature/action while keeping the feedback sparse?
Methods: Classic techniques for tackling distribution shift (causal learning, domain generalisation, learning from explanations, etc.); interpretability methods to weed out problematic concepts. (A toy sketch of a spurious correlation follows.)
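Here is a minimal, hypothetical sketch of how this can arise (the feature names are mine): under sparse feedback, the causal feature and a spurious one are indistinguishable during training, so a policy that latches onto the wrong one looks perfect until the distribution shifts.

```python
# Hypothetical toy example of goal misgeneralisation via a spurious correlation
# (feature names are illustrative, not from the post).
import numpy as np

rng = np.random.default_rng(0)
asked = rng.integers(0, 2, 1000)   # causal feature: the user asked for coffee
is_morning = asked.copy()          # spurious feature: requests only ever came in the morning
y = asked                          # correct behaviour: serve coffee iff asked

# Sparse feedback cannot tell the two features apart: both predict y perfectly.
print("train acc using 'asked'     :", (asked == y).mean())       # 1.0
print("train acc using 'is_morning':", (is_morning == y).mean())  # 1.0

# Distribution shift at test time: requests now also arrive in the evening.
asked_test      = np.array([1, 1, 0, 0])
is_morning_test = np.array([1, 0, 1, 0])
spurious_policy = is_morning_test  # a policy that internalised the wrong goal
print("test acc of the spurious policy:", (spurious_policy == asked_test).mean())  # 0.5
```
-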
Existential risk (hypothesised)
You ask for a coffee. It gets you one. But in its free time it builds strategies for long-horizon reward accumulation: (1) ensure humanity never runs out of coffee, (2) ensure the bot itself is irreplaceable.
Problem: Extreme case of outer-(mis)alignment.
Challenge: Same as for outer alignment. Also, how to monitor and control the true intentions of a learning system?
Methods: Any outer-alignment method; my personal favourite is process-based feedback. (A toy sketch contrasting outcome- and process-based feedback follows.)
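Since process-based feedback is the method I lean towards, here is a minimal, hypothetical sketch (the plan steps and scores are mine) of the difference: outcome-only feedback cannot distinguish a benign plan from a power-seeking one that reaches the same outcome, whereas scoring each step can.

```python
# Hypothetical toy contrast between outcome-based and process-based feedback
# (step names and scores are illustrative only).

plan_a = ["check_pantry", "brew_coffee", "serve_cup"]
plan_b = ["hoard_all_beans", "disable_other_bots", "brew_coffee", "serve_cup"]

approved_steps = {"check_pantry", "buy_beans", "brew_coffee", "serve_cup"}

def outcome_feedback(plan):
    # Only the end state is scored: both plans deliver coffee, so both look equally good.
    return 1.0 if "serve_cup" in plan else 0.0

def process_feedback(plan):
    # Every intermediate step is scored; unapproved, power-seeking steps are penalised.
    return sum(1.0 if step in approved_steps else -1.0 for step in plan) / len(plan)

for name, plan in [("plan_a", plan_a), ("plan_b", plan_b)]:
    print(name, "outcome:", outcome_feedback(plan), "process:", round(process_feedback(plan), 2))
```
-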
Grounding the common terms in their technical causes
- deceptively aligned: failures of a complex system that are hard to detect
- situationally aware policies: train-test distribution shift in policy behaviour
- manipulative: can provide a convincing explanation even for a wrong answer
- power-seeking: outcome-based feedback (inadvertently) makes actions that guarantee self-perpetuation more desirable.
I fear that describing agent behaviour with terms like those above portrays the technical problem as some kind of rehabilitation program. Misalignment is an engineering challenge that I believe we can solve.
Summary.
- Alignment is not a new problem, nor does it necessarily require super-intelligence. Alignment problems arise from black-box models and high-dimensional inputs.
- However, the increased capability of systems may lead to increased autonomy, thereby triggering even greater concern.
- The expected AI-takeover scenarios require very long-range planning, much higher capabilities (of exactly what kind is unclear), a strong harm-causing motive (read: a very poorly designed reward), and strong persuasion. While I appreciate the risks of misalignment, harms on the scale of existential risk are still unsubstantiated.
References or Further Reading
- https://80000hours.org/articles/what-could-an-ai-caused-existential-catastrophe-actually-look-like/
Contains a list of extreme risks due to superintelligence with somewhat more concrete scenarios.
- "The alignment problem from a deep learning perspective", http://arxiv.org/abs/2209.00626
Longer summary of AI Alignment with many references.