@brito
Created September 2, 2017 02:52
Concrete Problems in AI Safety
- https://arxiv.org/pdf/1606.06565v2.pdf
- arXiv:1606.06565v2 [cs.AI] 25 Jul 2016
Problems
A: wrong formal objective function
- negative side effects
+ define and learn impact regularizer
+ penalize influence
+ multi-agent approaches
+ reward uncertainty
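A minimal sketch of the impact-regularizer idea: subtract a penalty proportional to how far the agent has moved the world from a baseline state. The names `impact_penalty`, `regularized_reward`, and `lam`, and the choice of "count of changed features" as the distance, are illustrative assumptions, not something the paper prescribes.

```python
def impact_penalty(state, baseline_state):
    """Hypothetical distance: number of baseline features the agent changed."""
    return sum(1 for key, value in baseline_state.items() if state.get(key) != value)


def regularized_reward(task_reward, state, baseline_state, lam=0.5):
    """Task reward minus lam times the impact penalty (lam is an assumed weight)."""
    return task_reward - lam * impact_penalty(state, baseline_state)


# Toy comparison: both outcomes finish the task, one of them breaks a vase.
baseline = {"vase": "intact", "box": "A"}
careful = {"vase": "intact", "box": "B"}
careless = {"vase": "broken", "box": "B"}
careful_score = regularized_reward(1.0, careful, baseline)
careless_score = regularized_reward(1.0, careless, baseline)
```

Even moving the box counts as impact here, which shows why choosing the baseline and the distance is the hard part of this approach.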
- reward hacking
- partially observed goals
- complicated systems
- abstract rewards (adversarial manipulation)
- Goodhart's law (correlation vs causation)
- feedback loops
+ adversarial reward (peer review)
+ model lookahead (inertial simulation)
+ adversarial blinding (agent crossvalidation)
+ careful engineering (tests, sandbox)
+ reward capping (+ longer terms)
+ counterexample resistance (adversarial training)
+ multiple rewards
+ reward pretraining
+ variable indifference
+ trip wires
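Two of the approaches above, reward capping and trip wires, are simple enough to sketch. This is only an illustration under assumptions: the cap value, the `touched_tripwire` flag, and the `TripWireTriggered` exception are hypothetical names, and a real system would need to detect the trip wire from the environment itself.

```python
class TripWireTriggered(Exception):
    """Raised when the agent exploits a deliberately planted vulnerability."""


def capped_reward(raw_reward, cap=10.0):
    """Reward capping: bound the payoff of any single exploit from above."""
    return min(raw_reward, cap)


def checked_reward(raw_reward, touched_tripwire, cap=10.0):
    """Stop on a trip wire instead of paying out; otherwise apply the cap."""
    if touched_tripwire:
        raise TripWireTriggered("agent used a monitored exploit; stopping")
    return capped_reward(raw_reward, cap)


normal = checked_reward(3.0, touched_tripwire=False)
suspicious = checked_reward(1e9, touched_tripwire=False)
```

Capping limits how much a hack can pay; the trip wire is a separate signal that a hack was attempted at all.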
B: bad extrapolations from limited samples
- unscalable oversight
+ supervised reward learning
+ semi-supervised or active reward learning
+ unsupervised value iteration
+ unsupervised model learning
+ distant supervision
+ hierarchical reinforcement learning
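A rough sketch of active reward learning: only ask the human for the true reward when the learned reward estimates disagree. The ensemble-disagreement rule, the `threshold`, and the `ask_human` callback are assumptions made for illustration; the paper lists the idea without fixing a mechanism.

```python
def should_query_human(reward_estimates, threshold=0.5):
    """Query when the spread of the (hypothetical) ensemble exceeds threshold."""
    return max(reward_estimates) - min(reward_estimates) > threshold


def label_episode(reward_estimates, ask_human, threshold=0.5):
    """Return (reward, queried): human label if uncertain, else ensemble mean."""
    if should_query_human(reward_estimates, threshold):
        return ask_human(), True
    return sum(reward_estimates) / len(reward_estimates), False


# ask_human stands in for an expensive human evaluation.
confident = label_episode([1.0, 1.0, 1.0], ask_human=lambda: 0.0)
uncertain = label_episode([0.0, 2.0], ask_human=lambda: 5.0)
```

The point is the budget: the expensive signal is spent only on episodes where the cheap estimate is least trustworthy.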
C: poor training data and/or insufficiently expressive model
- unsafe exploration
+ risk-sensitive performance criteria
+ use demonstrations
+ simulated exploration
+ bounded exploration
+ trusted policy oversight
+ human oversight
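As one possible reading of "risk-sensitive performance criteria", the sketch below scores a policy by the mean of its worst outcomes (conditional value at risk) instead of its average return. The function names, the `alpha` fraction, and the sample returns are illustrative assumptions.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cvar(returns, alpha=0.2):
    """Mean of the worst alpha fraction of returns (lower tail)."""
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


# Made-up returns: the risky policy is better on average, worse in the tail.
safe_returns = [1.0] * 10
risky_returns = [3.0] * 9 + [-10.0]
prefers_risky_on_average = mean(risky_returns) > mean(safe_returns)
prefers_safe_on_cvar = cvar(safe_returns) > cvar(risky_returns)
```

An expected-return criterion picks the risky policy here, while the tail criterion picks the safe one, which is the behavior change this approach is after.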
- fragile to distributional shift
+ well-specified models
+ covariate shift
+ marginal likelihood
+ partially specified models
+ method of moments
+ unsupervised risk estimation
+ causal identification
+ limited-information maximum likelihood
+ training on multiple distributions
+ respond when out-of-distribution
+ counterfactual reasoning
+ machine learning with contracts
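For the covariate shift item above, a common technique is importance weighting: reweight each training sample by p_target(x) / p_source(x) to estimate the loss on the test distribution. The sketch assumes both densities are known and discrete, which is rarely true in practice; all names and numbers are hypothetical.

```python
def importance_weighted_risk(samples, loss, p_source, p_target):
    """Estimate target-distribution risk from source-distribution samples."""
    total = 0.0
    for x, y in samples:
        weight = p_target[x] / p_source[x]  # assumes p_source[x] > 0
        total += weight * loss(x, y)
    return total / len(samples)


# Toy data: inputs "a" are always handled correctly (loss 0), "b" never (loss 1).
samples = [("a", 0.0), ("a", 0.0), ("b", 1.0), ("b", 1.0)]
p_source = {"a": 0.5, "b": 0.5}
p_target = {"a": 0.9, "b": 0.1}


def recorded_loss(x, y):
    return y


source_risk = sum(recorded_loss(x, y) for x, y in samples) / len(samples)
target_risk = importance_weighted_risk(samples, recorded_loss, p_source, p_target)
```

The weights can have high variance when the two distributions differ a lot, which is one reason the outline also lists partially specified models and out-of-distribution responses.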
D: related
- privacy
- fairness
- security
- abuse
- transparency
- policy