- Broad trends: graph networks, neuro-symbolic methods, hierarchies
- Talks available at https://aaai.org/Conferences/AAAI-20/livestreamed-talks/
- Stuart Russell
- Lesson: human instructions are uncertain, so preferences should be satisfied minimally invasively: keep "everything else" constant unless asked otherwise, and do not convert everything else into paper clips.
- Geoffrey Hinton:
- 3D transformations must be baked into the neural network. Customize generators. Use set transformers.
- Read papers: http://akosiorek.github.io/ml/2019/06/23/stacked_capsule_autoencoders.html
- https://arxiv.org/abs/1810.00825
- Yann LeCun
- Self-supervised learning is going to save the day
- Yoshua Bengio
- Consciousness as a bottleneck principle. Look at Recurrent Independent Mechanisms
- https://arxiv.org/abs/1909.10893
- David Cox talk
- Neuro-symbolic concept learning http://nscl.csail.mit.edu/
-
An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies
- minimizing || (I - ΠP) . 1 ||₂ is a proxy for maximizing entropy
- https://arxiv.org/pdf/1907.04662.pdf
-
Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes Tomáš Brázdil, Krishnendu Chatterjee, Petr Novotný, Jiří Vahala
- https://www.fi.muni.cz/~xnovot18/aaai20.pdf Risk-constrained AlphaGo-style MCTS. Uses linear programming to bound the risk at the leaf nodes and propagates the bounds up the Monte Carlo search tree (see the sketch after the links below).
- https://ieeexplore.ieee.org/abstract/document/6750726
- http://papers.nips.cc/paper/8032-a-lyapunov-based-approach-to-safe-reinforcement-learning.pdf
- https://statweb.stanford.edu/~ljanson/papers/Risk_Constrained_Reinforcement_Learning-Chow_ea-2016.pdf
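- A minimal sketch of the risk-propagation idea above (not the authors' code): each leaf carries a risk estimate, internal nodes take a visit-weighted average, and root actions whose propagated risk exceeds the budget are pruned. The `Node` class, risk numbers, and budget are illustrative assumptions.

```python
# Propagate per-leaf risk estimates up a search tree by expectation over
# child-visit frequencies, then compare against a risk budget.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    risk: float = 0.0          # estimated probability of violating the safety constraint
    visits: int = 1
    children: List["Node"] = field(default_factory=list)

def backup_risk(node: Node) -> float:
    """Risk of an internal node = visit-weighted average of its children's risk."""
    if not node.children:
        return node.risk
    total = sum(c.visits for c in node.children)
    node.risk = sum(backup_risk(c) * c.visits / total for c in node.children)
    return node.risk

# Usage: keep only root actions whose propagated risk stays under the budget.
root = Node(children=[Node(risk=0.05, visits=30), Node(risk=0.4, visits=10)])
budget = 0.2
print(backup_risk(root), backup_risk(root) <= budget)
```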
-
Few-Shot Bayesian Imitation Learning with Logical Program Policies
- https://arxiv.org/pdf/1904.06317.pdf
- symbolic algebra with features
-
Off-Policy Evaluation in Partially Observable Environments Guy Tennenholtz, Shie Mannor, Uri Shalit https://arxiv.org/abs/1909.03739
- " Our work sits at an intersection between the fields of RL and Causal Inference."
- "In Decoupled POMDPs, observed and unobserved states are separated into two distinct processes, with a coupling between them at each time step."
- "we demonstrate the use of a well-known approach, Importance Sampling (IS): a reweighting of rewards generated by the be- havior policy, π b , such that they are equivalent to unbiased rewards from an evaluation policy π e ."
-
Deep Conservative Policy Iteration
- Nino Vieillard, Olivier Pietquin, Matthieu Geist
- https://arxiv.org/abs/1906.09784
- Adapts conservative policy iteration to deep RL, yielding a loss function different from DQN's (see the sketch below for the classic CPI update it builds on).
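- A minimal sketch of the classic conservative policy iteration update (Kakade and Langford style) that the deep variant builds on; the deep loss itself is not reproduced here, and Q, π, and α are illustrative.

```python
# Mix the greedy policy w.r.t. the current Q with the current policy instead
# of replacing it outright, giving a small, conservative policy step.
import numpy as np

def cpi_update(pi, Q, alpha=0.1):
    """pi: (S, A) stochastic policy, Q: (S, A) action values."""
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), Q.argmax(axis=1)] = 1.0
    return (1.0 - alpha) * pi + alpha * greedy   # stochastic mixture update

pi = np.full((3, 2), 0.5)
Q = np.array([[1.0, 0.0], [0.2, 0.8], [0.3, 0.1]])
print(cpi_update(pi, Q))
```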
-
Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization
- https://arxiv.org/pdf/1911.12574v1.pdf
- Estimate uncertainty using ensembles.
- Upper-bound the variance of Qʰ(s, a)
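- A minimal sketch of ensemble-based uncertainty on Q(s, a), assuming several independently initialized Q estimators whose spread serves as the uncertainty estimate; the stand-in linear "heads" are illustrative, not the paper's networks.

```python
# Treat the disagreement of an ensemble of Q estimators as an uncertainty
# estimate for Q(s, a).
import numpy as np

rng = np.random.default_rng(0)
ensemble = [lambda s, a, w=rng.normal(size=4): float(w @ np.r_[s, a])
            for _ in range(5)]                      # 5 illustrative Q-heads

def q_with_uncertainty(s, a):
    qs = np.array([q(s, a) for q in ensemble])
    return qs.mean(), qs.var()                      # mean estimate and its spread

mean_q, var_q = q_with_uncertainty(np.array([0.1, -0.2, 0.3]), np.array([0.5]))
print(mean_q, var_q)
```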
-
Querying to Find a Safe Policy Under Uncertain Safety Constraints in Markov Decision Process Shun Zhang, Edmund H. Durfee, Satinder Singh
- http://web.eecs.umich.edu/~baveja/Papers/AAAI20-Shun.pdf
- https://web.eecs.umich.edu/~baveja/Papers/ijcai-2018.pdf
- "An agent that can be trusted to operate safely should thus only change features the user has explicitly permitted."
- "It is easier to use linear programming (de Farias and Van Roy 2003) than dynamic programming methods like value itera- tion to find a safely-optimal policy because we can easily add constraints to it."
- Do not convert everything into paper-clips if asked for a coffee.
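- A minimal sketch of the occupancy-measure LP view in the quote above: the safety requirement enters as one extra linear constraint on top of the usual flow constraints. The MDP, cost function, and budget below are illustrative, not the paper's.

```python
# Solve for an occupancy measure x(s, a) that maximizes discounted reward
# subject to the MDP flow constraints plus one added safety (cost) constraint.
import numpy as np
from scipy.optimize import linprog

S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, s']: transition probabilities
R = rng.uniform(size=(S, A))                    # rewards
C = np.zeros((S, A))
C[:, 1] = 1.0                                   # action 1 is the "risky" action
mu0 = np.full(S, 1.0 / S)                       # initial state distribution
budget = 2.0                                    # allowed total discounted cost

# Flow constraints: sum_a x(s', a) - gamma * sum_{s, a} P[s, a, s'] x(s, a) = mu0(s')
A_eq = np.zeros((S, S * A))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = float(s == sp) - gamma * P[s, a, sp]

res = linprog(c=-R.flatten(),                              # maximize expected reward
              A_ub=C.flatten()[None, :], b_ub=[budget],    # the added safety constraint
              A_eq=A_eq, b_eq=mu0, bounds=(0, None))
x = res.x.reshape(S, A)
policy = x / x.sum(axis=1, keepdims=True)       # safely-optimal stochastic policy
print(policy)
```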
-
Policy Search by Target Distribution Learning for Continuous Control
- Chuheng Zhang Yuanqi Li Jian Li
- https://arxiv.org/abs/1905.11041
- "It is known that existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to determin- istic, leading to an unstable training process. We show that such instability can happen even in a very simple environment."
-
Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video
- Jie Wu, Guanbin Li, Si Liu, Liang Lin
- http://colalab.org/media/paper/AAAI2020-Tree-Structured.pdf
- "Inspired by hu- man’s coarse-to-fine decision-making paradigm, we design a tree-structured policy to decompose complex action poli- cies and propose a more reasonable primitive action via two- stages selection, instead of using a flat policy that maps the state feature to action directly (He et al. 2019). As shown in the right half of Figure 2, the tree-structured policy consist- s of a root policy and a leaf policy at each time step. The root policy π r (a rt |s t ) decides which semantic branch will be primarily relied on. The leaf policy π l (a lt |s t , a rt ) consists of five sub-policies, which corresponds to five high-level semantic branches."
-
Gradient-Aware Model-based Policy Search
- Pierluca D'Oro, Alberto Maria Metelli, Andrea Tirinzoni, Matteo Papini, Marcello Restelli
- https://arxiv.org/abs/1909.04115
- Learns the environment model with a weighted loss: transitions are weighted by their contribution to the policy gradient, so model accuracy is concentrated where it matters for improving the policy (see the sketch below).
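- A heavily hedged sketch of one plausible reading of that weighting: each transition's model error is scaled by a proxy for its policy-gradient contribution. The specific proxy below is an assumption for illustration, not the paper's exact objective.

```python
# Weight each transition's model-fitting error by (a proxy for) how strongly
# that transition contributes to the policy gradient.
import numpy as np

def weighted_model_loss(pred_next, true_next, score_norms, returns):
    """score_norms[i] ~ ||grad_theta log pi(a_i|s_i)||, returns[i] = return from step i."""
    weights = score_norms * np.abs(returns)              # assumed gradient-awareness proxy
    sq_err = ((pred_next - true_next) ** 2).sum(axis=1)  # per-transition model error
    return float((weights * sq_err).mean())

pred = np.array([[0.1, 0.2], [0.4, 0.4]])
true = np.array([[0.0, 0.2], [0.5, 0.3]])
print(weighted_model_loss(pred, true, np.array([2.0, 0.5]), np.array([1.0, 3.0])))
```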
-
Deterministic Value-Policy Gradients
- https://deepai.org/publication/deterministic-value-policy-gradients
- Instead of only policy gradients as in DDPG, use both value and policy gradients.
-
Safe Linear Stochastic Bandits
- https://arxiv.org/pdf/1911.09501.pdf
- "the learner is required to select an arm with an expected reward that is no less than a predetermined (safe) threshold with high probability"
- P(⟨Xₜ, θ*⟩ ≥ b) ≥ 1 − δ
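- A minimal sketch of the quoted safety rule, assuming a ridge estimate of θ* and an illustrative confidence width β: only arms whose pessimistic reward estimate clears the threshold b are considered eligible. The data and β are not the paper's exact confidence construction.

```python
# Keep a ridge estimate of theta* and a confidence ellipsoid; an arm is deemed
# safe if its lower confidence bound on <x, theta*> is at least the threshold b.
import numpy as np

def safe_arms(arms, X, y, b, lam=1.0, beta=1.0):
    """arms: (K, d) features, X/y: past observations, b: safety threshold."""
    d = arms.shape[1]
    V = lam * np.eye(d) + X.T @ X
    theta_hat = np.linalg.solve(V, X.T @ y)               # ridge estimate of theta*
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("kd,de,ke->k", arms, V_inv, arms))
    lcb = arms @ theta_hat - beta * widths                # pessimistic reward estimate
    return np.where(lcb >= b)[0]                          # arms deemed safe w.h.p.

arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.9, 0.2, 1.0])
print(safe_arms(arms, X, y, b=0.5))
```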
-
Planning and Acting with Non-deterministic Events: Navigating between Safe States
-
NeoNav: Improving the Generalization of Visual Navigation via Generating Next Expected Observations
- https://arxiv.org/abs/1906.07207
- Predict next visual observations using VAEs to improve visual navigation.
-
PsyNet: Self-supervised Approach to Object Localization using Point Symmetric Transformation
- What is a point-symmetric transformation?
- https://github.com/FriedRonaldo/PsyNet
-
Learning and Reasoning for Robot Sequential Decision Making under Uncertainty
- https://arxiv.org/abs/1901.05322
- "In experi- ments, a mobile robot is tasked with estimating human in- tentions using their motion trajectories, declarative contex- tual knowledge, and human-robot interaction (dialog-based and motion-based)"
-
Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs
- https://arxiv.org/abs/1909.02769
- Another method using rollout of policy trees for faster convergence.
-
Specifying Weight Priors in Bayesian Deep Neural Networks with Empirical Bayes
- https://arxiv.org/abs/1906.05323
- Getting weight priors based on mutual information between parameter posterior distribution and predictive distribution.
-
Collaborative sampling for Generative Adversarial Networks
- https://arxiv.org/abs/1902.00813
- Uses the discriminator to guide sampling from the generator, refining generated samples toward regions the discriminator scores as real (see the sketch below).
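- A minimal sketch of discriminator-guided sample refinement in the spirit of that note: nudge a generated sample along the gradient of the discriminator's score so it looks more "real". The toy linear-logistic discriminator and step size are illustrative, not the paper's setup.

```python
# Gradient ascent on the discriminator's output D(x) with respect to the
# sample x itself, starting from a generator sample.
import numpy as np

w, b = np.array([1.5, -0.5]), 0.1                 # toy discriminator D(x) = sigmoid(w.x + b)

def refine(x, steps=20, lr=0.1):
    for _ in range(steps):
        d = 1.0 / (1.0 + np.exp(-(x @ w + b)))    # discriminator score in (0, 1)
        x = x + lr * d * (1.0 - d) * w            # dD/dx = D(x)(1 - D(x)) * w
    return x

x0 = np.array([-0.2, 0.4])                        # a sample from the generator
print(refine(x0))
```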