Concrete Problems in AI Safety
  - https://arxiv.org/pdf/1606.06565v2.pdf
  - arXiv:1606.06565v2 [cs.AI] 25 Jul 2016

Problems
A: wrong formal objective function
  - negative side effects
    + define and learn impact regularizer (toy sketch after this list)
    + penalize influence
    + multi-agent approaches
    + reward uncertainty
  - reward hacking
    - partially observed goals
    - complicated systems
    - abstract rewards (adversarial manipulation)
    - Goodhart's law (correlation vs. causation)
    - feedback loops
    + adversarial reward (peer review)
    + model lookahead (inertial simulation)
    + adversarial blinding (agent cross-validation)
    + careful engineering (tests, sandbox)
    + reward capping (+ longer terms)
    + counterexample resistance (adversarial training)
    + multiple rewards
    + reward pretraining
    + variable indifference
    + trip wires
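
The "impact regularizer" idea above can be pictured in a few lines: shape
the reward by subtracting a penalty proportional to how far the agent's
state drifts from a do-nothing baseline. Everything below (the distance
metric, the baseline rollout, the coefficient lam) is an illustrative
assumption; the paper leaves the impact measure itself as an open problem.

    import numpy as np

    def shaped_reward(reward, state, baseline_state, lam=0.1):
        """Penalize divergence from a 'do nothing' baseline.

        baseline_state: the state the world would be in had the agent
        only taken no-op actions (hypothetical; computing it is the
        hard part). lam trades task reward against side effects.
        """
        impact = np.linalg.norm(np.asarray(state, dtype=float)
                                - np.asarray(baseline_state, dtype=float))
        return reward - lam * impact
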
B: bad extrapolations from limited samples
  - unscalable oversight
    + supervised reward learning
    + semi-supervised or active reward learning (toy sketch after this list)
    + unsupervised value iteration
    + unsupervised model learning
    + distant supervision
    + hierarchical reinforcement learning
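
One way to picture "semi-supervised or active reward learning": fit a
reward model on the few episodes a human has labeled, then spend a small
query budget on the unlabeled episodes the model is least sure about.
The loop below is a toy sketch, not the paper's algorithm; the
query_human oracle and the tree-disagreement uncertainty proxy are
assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def active_reward_learning(labeled_X, labeled_r, pool, query_human,
                               budget=10):
        """Toy active reward-learning loop.

        labeled_X, labeled_r: features and human reward labels for a
        handful of episodes; pool: the much larger unlabeled set;
        query_human: hypothetical oracle returning a true label.
        """
        X, r, pool = list(labeled_X), list(labeled_r), list(pool)
        for _ in range(budget):
            model = RandomForestRegressor(n_estimators=50).fit(X, r)
            # Disagreement across trees as a cheap uncertainty proxy.
            preds = np.stack([t.predict(np.asarray(pool))
                              for t in model.estimators_])
            i = int(np.argmax(preds.std(axis=0)))
            x = pool.pop(i)
            X.append(x)
            r.append(query_human(x))  # one human label where least sure
        model = RandomForestRegressor(n_estimators=50).fit(X, r)
        return model.predict(np.asarray(pool))  # cheap labels for the rest
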
C: poor training data and/or insufficiently expressive model
  - unsafe exploration
    + risk-sensitive performance criteria
    + use demonstrations
    + simulated exploration
    + bounded exploration
    + trusted policy oversight
    + human oversight
  - fragile to distributional shift
    + well-specified models
      - covariate shift
      - marginal likelihood
    + partially specified models
      - method of moments
      - unsupervised risk estimation
      - causal identification
      - limited-information maximum likelihood
    + training on multiple distributions
    + respond when out-of-distribution (toy sketch after this list)
    + counterfactual reasoning
    + machine learning with contracts
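
"Respond when out-of-distribution" is the most directly codeable item in
this list: score how typical an input is under the training data and fall
back to a safe action (or a human) when it is not. A minimal sketch,
assuming a diagonal-Gaussian density is a good enough novelty detector (a
strong assumption; the items above are the paper's more serious tools):

    import numpy as np

    class OODGate:
        """Defer to a fallback when an input looks unlike training data."""

        def fit(self, train_X):
            X = np.asarray(train_X, dtype=float)
            self.mu = X.mean(axis=0)
            self.sigma = X.std(axis=0) + 1e-8
            # Threshold so ~1% of training points would trigger deferral.
            self.threshold = np.quantile(self._logpdf(X), 0.01)
            return self

        def _logpdf(self, X):
            z = (X - self.mu) / self.sigma
            return -0.5 * (z ** 2).sum(axis=1) - np.log(self.sigma).sum()

        def act(self, x, policy, fallback):
            score = self._logpdf(np.asarray([x], dtype=float))[0]
            # fallback could be a no-op or a request for human oversight
            return policy(x) if score >= self.threshold else fallback(x)
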
D: related
  - privacy
  - fairness
  - security
  - abuse
  - transparency
  - policy