ugo-nama-kun/visual_rl.md

Last active October 3, 2021 11:34


Summary of deep reinforcement learning with continuous actions + visual input


Deep RL + Continuous Control with Vision

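This note collects deep reinforcement learning work that uses continuous actions and visual input.
Particularly important techniques are written down where a paper describes them.
Multimodal reinforcement learning is also noted where it appears.

Configuration patterns for image-based agents, after surveying the papers

As in SAC, split the actor and critic into fully separate networks and use two CNNs.
Share one CNN between the actor and critic, but update it only through the critic; the actor just uses it.
Use a CNN in both the actor and critic, but update it with a separate loss such as an auto-encoder.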
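
The second pattern, where the actor reads the shared CNN but only the critic updates it, can be sketched with a toy scalar model. Everything here is an illustrative assumption rather than any paper's implementation: a single weight `encoder` stands in for the CNN, `critic` and `actor` are one-weight heads, the losses are plain squared errors, and the function name `update_shared_encoder` is hypothetical. Gradients are written out by hand so the stop-gradient on the actor path is explicit.

```python
def update_shared_encoder(params, x, q_target, a_target, lr=0.1,
                          actor_updates_encoder=False):
    """One gradient step on a toy shared-encoder actor-critic.

    Toy assumptions: the encoder is the scalar map z = encoder * x, and each
    head is a single weight trained with a squared-error loss.
    """
    w, c, a = params["encoder"], params["critic"], params["actor"]
    z = w * x                          # shared feature

    critic_err = c * z - q_target      # d(0.5 * err^2) / d(c * z)
    actor_err = a * z - a_target       # d(0.5 * err^2) / d(a * z)

    # The critic loss always back-propagates into the encoder.
    grad_w = critic_err * c * x
    # Pattern 2: the actor's gradient w.r.t. the encoder is dropped
    # (a stop-gradient), so this branch stays off by default.
    if actor_updates_encoder:
        grad_w += actor_err * a * x

    return {
        "encoder": w - lr * grad_w,
        "critic": c - lr * critic_err * z,
        "actor": a - lr * actor_err * z,
    }


params = {"encoder": 1.0, "critic": 2.0, "actor": 3.0}
# The critic is already exact here (2.0 * 1.0 == q_target); the actor is not.
shared = update_shared_encoder(params, x=1.0, q_target=2.0, a_target=0.0)
print(shared["encoder"] == params["encoder"])  # True: the actor error never reaches the encoder
```

In an autograd framework the same routing is usually obtained by detaching the encoder output before it enters the actor head.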

ugo-nama-kun commented Sep 29, 2021

Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International conference on machine learning. PMLR, 2016.

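The A3C paper.

TORCS car simulator

Trained with discretized actions selected through a softmax output.
Branching architecture: 8x8x16-4x4x32-256-(value, policy)

MuJoCo (pendulum, pointmass2D, and gripper)

Some experiments also use RGB input.
Two conv layers (no pooling, no nonlinear activation), fed into 128 LSTM cells.
The action output consists of a mean and a variance; the variance goes through a softplus.
The value network and policy network are completely separate and share no parameters (the paper says this is not critical to performance, but it probably matters) = not a branching architecture.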
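
A minimal sketch of the mean/variance action head described above, for a one-dimensional action. The function names and the idea of passing two raw scalars are assumptions for illustration; only the "mean as-is, variance through softplus" rule comes from the note.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gaussian_policy_head(raw_mean: float, raw_var: float, rng: random.Random):
    """Turn two raw network outputs into a sampled continuous action."""
    mean = raw_mean              # the mean is used as-is
    var = softplus(raw_var)      # softplus keeps the variance positive
    action = rng.gauss(mean, math.sqrt(var))
    return action, mean, var


rng = random.Random(0)
action, mean, var = gaussian_policy_head(raw_mean=0.2, raw_var=-1.0, rng=rng)
print(mean, var > 0.0)  # 0.2 True
```

Softplus is a convenient choice here because it maps any real-valued network output to a strictly positive variance without the hard cutoff of a ReLU.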

ugo-nama-kun commented Sep 29, 2021

Haarnoja, Tuomas, et al. "Soft actor-critic algorithms and applications." arXiv preprint arXiv:1812.05905 (2018).

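The SAC applications paper.

Solves manipulation tasks from images.

3x3x4(relu?)-maxpool3x3-3x3x4(relu?)-maxpool3x3-256(relu?)-256(relu?)
The value network and policy network seem to be completely separate, with no shared parameters? (Presumably because that is how SAC is set up?)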

ugo-nama-kun commented Sep 29, 2021

Yang, Zhaoyang, et al. "Hierarchical deep reinforcement learning for continuous action control." IEEE transactions on neural networks and learning systems 29.11 (2018): 5174-5184.

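Apparently trains hierarchically. A DDPG-based algorithm.

The robot's inputs are apparently a grayscale image and range sensors in four directions.
8x8x32(stride4)-4x4x64(stride2)-(512-256(meta critic), 300-300(critic), 200-150(actor))
Two-dimensional action (rotation speeds of the two wheels of a wheeled robot).
Honestly, it is hard to tell how well it works...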

ugo-nama-kun commented Sep 29, 2021

Kalashnikov, Dmitry, et al. "Scalable deep reinforcement learning for vision-based robotic manipulation." Conference on Robot Learning. PMLR, 2018.

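Solves manipulation tasks.

Computes the action value $Q(s, a)$ with deep Q-learning.
The action $a$ mixes continuous and discrete components, so the greedy action is computed with the Cross-Entropy Method.
The network details were given in a figure (not reproduced here). Because the network is large, the training system also seems to be carefully engineered.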
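
The greedy-action step above can be sketched as a generic Cross-Entropy Method loop. This is an assumption-laden illustration, not the paper's implementation: it handles only a continuous action in [-1, 1]^dim (the discrete components are left out), `q_fn` is a stand-in for the learned Q-network at a fixed state, and the population size, elite count, and iteration count are illustrative values.

```python
import random
import statistics


def cem_argmax(q_fn, dim, rng, iters=5, pop=64, n_elite=6):
    """Approximate argmax_a q_fn(a) over a in [-1, 1]^dim with CEM."""
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iters):
        # Sample a population from the current diagonal Gaussian, clipped to bounds.
        samples = [
            [min(1.0, max(-1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(pop)
        ]
        # Keep the candidates with the highest Q-values.
        elites = sorted(samples, key=q_fn, reverse=True)[:n_elite]
        # Refit the Gaussian to the elites.
        mean = [statistics.fmean(e[i] for e in elites) for i in range(dim)]
        std = [statistics.pstdev([e[i] for e in elites]) + 1e-6 for i in range(dim)]
    return mean


def toy_q(action):
    """Hypothetical Q-function with its maximum at (0.3, -0.5)."""
    return -((action[0] - 0.3) ** 2 + (action[1] + 0.5) ** 2)


best_action = cem_argmax(toy_q, dim=2, rng=random.Random(0))
print(best_action)
```

CEM only needs forward evaluations of Q, which is why it suits action spaces where an analytic argmax or a gradient-based actor is awkward.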

ugo-nama-kun commented Sep 29, 2021

Lee, Kyowoon, et al. "Deep reinforcement learning in continuous action spaces: a case study in the game of simulated curling." International conference on machine learning. PMLR, 2018.

Learns curling with deep reinforcement learning.

A training method similar to AlphaGo.
Supervised learning, MCTS, and so on, in a similar form.
Actions are discretized: "The policy head pθ outputs p which is the probability distribution of actions for selecting the best shot out of 32x32x2 discretized actions (clockwise or counter-clockwise spin)"
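
To make the 32x32x2 discretization concrete, here is a small sketch that maps a flat policy-output index to a grid cell and a spin direction. The index layout (spin-major, then row, then column) and the function names are assumptions; the paper's actual ordering is not stated in this note.

```python
GRID = 32                                   # 32 x 32 grid of target cells
SPINS = ("clockwise", "counter-clockwise")  # 2 spin directions
NUM_ACTIONS = GRID * GRID * len(SPINS)      # 2048 discrete shots


def decode_action(index: int):
    """Flat index -> (row, col, spin), assuming a spin-major layout."""
    if not 0 <= index < NUM_ACTIONS:
        raise ValueError(f"index must be in [0, {NUM_ACTIONS})")
    spin_id, cell = divmod(index, GRID * GRID)
    row, col = divmod(cell, GRID)
    return row, col, SPINS[spin_id]


def encode_action(row: int, col: int, spin: str) -> int:
    """Inverse of decode_action."""
    return SPINS.index(spin) * GRID * GRID + row * GRID + col


print(NUM_ACTIONS)          # 2048
print(decode_action(1029))  # (0, 5, 'counter-clockwise')
```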

ugo-nama-kun commented Sep 29, 2021

Cimurs, Reinis, Jin Han Lee, and Il Hong Suh. "Goal-oriented obstacle avoidance with deep reinforcement learning in continuous action space." Electronics 9.3 (2020): 411.

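Feeds in a time series of depth images and navigates a two-wheeled vehicle to a goal.

Two-dimensional action (steering + forward speed).
The goal coordinates are also part of the input ( $P_t$ ).
Activations are ReLU.
The reward is negative on a crash, positive at the goal, and positive for driving straight at higher speed.
Note: with an action space of only about two dimensions, does learning work even with a fairly rough setup?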
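
The reward structure described above can be sketched as follows. The constants and the exact shaping term are hypothetical placeholders chosen only to match the qualitative description (negative on crash, positive at goal, positive for fast straight driving); they are not the values from the paper.

```python
def navigation_reward(crashed: bool, reached_goal: bool,
                      forward_speed: float, steering: float,
                      crash_penalty: float = -1.0,   # hypothetical value
                      goal_bonus: float = 1.0,       # hypothetical value
                      shaping_scale: float = 0.1):   # hypothetical value
    """Qualitative sketch of the reward: crash < 0, goal > 0, fast and straight > 0."""
    if crashed:
        return crash_penalty
    if reached_goal:
        return goal_bonus
    # Reward forward speed, discounted by how hard the vehicle is turning.
    straightness = max(0.0, 1.0 - abs(steering))
    return shaping_scale * forward_speed * straightness


print(navigation_reward(False, False, forward_speed=1.0, steering=0.0))  # 0.1
```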

ugo-nama-kun commented Sep 29, 2021

Yarats, Denis, et al. "Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning." arXiv preprint arXiv:2107.09645 (2021).

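Apparently DDPG (TD3) with learning improved through data augmentation.

An encoder network encodes the image, and the encoding is fed to DDPG.
During training, images are randomly shifted before being fed to the encoder (pad=4).
The encoder is 3x3x32(stride2)-relu-3x3x32(stride1)-relu-3x3x32(stride1)-relu-3x3x32(stride1)-relu
The rest is trained with DDPG on the encoded state h. The encoder is trained separately.
Both the DDPG policy and value networks have the form 1024-relu-1024-relu-output.
Empirically, n-step returns apparently work well: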
While some methods [Hafner et al., 2020] employ more sophisticated techniques such as TD(λ) or
Retrace(λ) [Munos et al., 2016], they are often computationally demanding when n is large. We find
that using simple n-step returns, without an importance sampling correction, strikes a good balance
between performance and efficiency
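
Two of the techniques above, the random shift with pad=4 and the uncorrected n-step return, can be sketched in plain Python. This is a simplified illustration under stated assumptions: the image is a 2D nested list, the shift uses whole-pixel offsets with edge replication (the interpolation used in the paper's code is not reproduced), and `bootstrap_value` stands in for the target critic's estimate at the n-th next state.

```python
import random


def random_shift(image, pad, rng):
    """Replicate-pad an H x W image by `pad` pixels, then crop H x W at a random offset."""
    h, w = len(image), len(image[0])
    dy = rng.randint(0, 2 * pad)   # vertical offset into the padded image
    dx = rng.randint(0, 2 * pad)   # horizontal offset into the padded image

    def clamp(i, n):
        # Clamping the index is equivalent to reading from a replicate-padded image.
        return min(max(i, 0), n - 1)

    return [
        [image[clamp(y + dy - pad, h)][clamp(x + dx - pad, w)] for x in range(w)]
        for y in range(h)
    ]


def n_step_target(rewards, bootstrap_value, gamma):
    """sum_i gamma^i * r_i + gamma^n * bootstrap_value, with no importance sampling."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target


image = [[10 * y + x for x in range(6)] for y in range(6)]
shifted = random_shift(image, pad=4, rng=random.Random(0))
print(len(shifted), len(shifted[0]))                 # 6 6
print(n_step_target([1.0, 1.0, 1.0], 10.0, 0.5))     # 3.0
```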

ugo-nama-kun commented Sep 29, 2021

Merel, Josh, et al. "Hierarchical visuomotor control of humanoids." arXiv preprint arXiv:1811.09656 (2018).

Control fragments / the controller policy are pretrained, and the higher-level behavior is learned from scratch.

Uses a ResNet for image input.
Basic locomotion skills are pretrained.

ugo-nama-kun commented Sep 29, 2021

Song, H. Francis, et al. "V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control." arXiv preprint arXiv:1909.12238 (2019).

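The method is packed with MAP estimation and various other techniques, so the details are omitted here.

Uses a ResNet for image input.
After the ResNet comes an LSTM, then a 512-256 MLP for value & policy?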

ugo-nama-kun commented Sep 29, 2021

Merel, Josh, et al. "Deep neuroethology of a virtual rodent." arXiv preprint arXiv:1911.09451 (2019).

Trains a rodent inside a simulator.

The detailed structure is largely unclear.
Seems to be optimized with MPO (maximum a posteriori policy optimization).
Unlike the humanoid work, this one appears to be trained from scratch.

ugo-nama-kun commented Sep 29, 2021

Laskin, Michael, et al. "Reinforcement learning with augmented data." arXiv preprint arXiv:2004.14990 (2020).

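Applies a variety of augmentations to the input images.

PPO is also covered, but the main aim appears to be extending SAC.
A learning boost based on data augmentation.
The CNN is apparently modeled on the one used in IMPALA.
https://mishalaskin.github.io/rad/

Reference: the IMPALA model

The word "ladder" in the IMPALA model figure represents the natural-language input that appears in DMLab-30: Hermann, Karl Moritz, et al. "Grounded language learning in a simulated 3d world." arXiv preprint arXiv:1706.06551 (2017).

Data augmentation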

ugo-nama-kun commented Sep 29, 2021

Reference: DMLab-30

Basically a collection of discrete-action tasks.
https://github.com/deepmind/lab/tree/master/game_scripts/levels/contributed/dmlab30

ugo-nama-kun commented Sep 29, 2021

Yarats, Denis, et al. "Improving sample efficiency in model-free reinforcement learning from images." arXiv preprint arXiv:1910.01741 (2019).

Attaches an auto-encoder (RAE) to SAC and learns from images.

The value function and the RAE contribute to training the convnet, but the policy does not (is the Tassa paper below the basis for this know-how?).
The network structure shows some originality:
- Based mainly on the DeepMind Control Suite paper: Tassa, Yuval, et al. "Deepmind control suite." arXiv preprint arXiv:1801.00690 (2018).
- Adds two more CNN layers.
- 3x3x32(stride2)-elu-3x3x32(stride1)-elu-3x3x32(stride1)-elu-3x3x32(stride1)-layerNorm-50-tanh-(1024-action, 1024-value)
- The decoder is built from a deconvnet.
- 3x3x32(stride1)-relu-3x3x32(stride1)-relu-3x3x32(stride1)-relu-3x3x32(stride2)
  Reference: Regularized Auto Encoder: https://arxiv.org/pdf/1903.12436.pdf
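
As a rough sketch of what the RAE objective looks like: a regularized auto-encoder combines a deterministic reconstruction loss with an L2 penalty on the latent code and a regularizer on the decoder weights. The function below is an illustration on flat Python lists; the function name, the use of mean squared error, and the coefficient defaults are assumptions, not values taken from the paper.

```python
def rae_loss(obs, recon, latent, decoder_weights,
             lambda_z=1e-6,       # hypothetical latent-penalty coefficient
             lambda_theta=1e-7):  # hypothetical weight-penalty coefficient
    """Reconstruction error + L2 penalty on the latent + L2 penalty on decoder weights."""
    recon_term = sum((o - r) ** 2 for o, r in zip(obs, recon)) / len(obs)
    latent_term = 0.5 * sum(z * z for z in latent)
    weight_term = 0.5 * sum(w * w for w in decoder_weights)
    return recon_term + lambda_z * latent_term + lambda_theta * weight_term


loss = rae_loss(obs=[1.0, 2.0], recon=[1.0, 4.0], latent=[2.0],
                decoder_weights=[0.0], lambda_z=0.5)
print(loss)  # 3.0
```

In the setup described above, gradients of this loss (together with the critic loss) would update the convolutional encoder, while the policy loss would not.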

ugo-nama-kun commented Sep 29, 2021

Tassa, Yuval, et al. "Deepmind control suite." arXiv preprint arXiv:1801.00690 (2018).

https://github.com/deepmind/dm_control

The paper on the DeepMind Control Suite.

Includes benchmarks of A3C, DDPG, and D4PG (distributed distributional deep deterministic policy gradient).
The RGB tasks are run with D4PG.
84x84RGB-3x3x32(stride2)-elu-3x3x32(stride1)-elu-50-layerNorm-tanh-(300-relu-200-action, 400-relu-300-value)
During training, the CNN is updated only through the critic; the actor update does not touch the CNN (!!!).
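
To see what the 50-unit layer receives, the spatial sizes of the two conv layers listed above can be worked out with the standard convolution output formula. The sketch assumes zero padding, which the note does not state; the helper names are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of one convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shape(size: int, layers):
    """Apply (kernel, out_channels, stride) layers to a square input."""
    channels = None
    for kernel, out_channels, stride in layers:
        size = conv_out(size, kernel, stride)
        channels = out_channels
    return size, size, channels


# 84x84 RGB -> 3x3x32 (stride 2) -> 3x3x32 (stride 1), as listed in the note above.
LAYERS = [(3, 32, 2), (3, 32, 1)]
height, width, channels = encoder_shape(84, LAYERS)
flat_features = height * width * channels
print(height, width, channels, flat_features)  # 39 39 32 48672
```

Under that assumption the fully connected layer maps 48672 features down to 50, so it holds most of the encoder's parameters.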

ugo-nama-kun commented Sep 29, 2021

Kostrikov, Ilya, Denis Yarats, and Rob Fergus. "Image augmentation is all you need: Regularizing deep reinforcement learning from pixels." arXiv preprint arXiv:2004.13649 (2020).

An extension of the paper above, Yarats, Denis, et al. "Improving sample efficiency in model-free reinforcement learning from images." arXiv preprint arXiv:1910.01741 (2019), except that it needs no decoder network.

https://sites.google.com/view/data-regularized-q

Applies image augmentation to SAC; the predecessor of DrQ-v2.

Applies image data augmentation to SAC and DQN.
3x3x32(stride2)-relu-3x3x32(stride1)-relu-3x3x32(stride1)-relu-3x3x32(stride1)-relu-layerNorm-50-tanh-(1024-relu-1024-relu-action, 1024-relu-1024-relu-value)
All weights use orthogonal initialization; biases are zero.
Here too the actor does not update the CNN; the CNN is updated only through the critic.
The critic is updated with double Q-learning.
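
One common reading of "double Q-learning" in SAC-style critics is the clipped double-Q target, which bootstraps from the smaller of two target-critic estimates. The sketch below assumes that reading; the function name, the scalar inputs, and the optional entropy term (`alpha * log_prob_next`, as in SAC) are illustrative assumptions, and the averaging over augmented images is left out.

```python
def clipped_double_q_target(reward, done, gamma, q1_next, q2_next,
                            alpha=0.0, log_prob_next=0.0):
    """TD target that bootstraps from min(Q1', Q2') to limit overestimation."""
    # Pessimistic next-state value; the entropy bonus vanishes when alpha == 0.
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    # No bootstrapping past a terminal transition.
    return reward + (0.0 if done else gamma * soft_value)


target = clipped_double_q_target(reward=1.0, done=False, gamma=0.9,
                                 q1_next=5.0, q2_next=3.0)
print(round(target, 6))  # 3.7
```

Both critics are then regressed toward this same target.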

ugo-nama-kun commented Sep 29, 2021

Hafner, Danijar, et al. "Dream to control: Learning behaviors by latent imagination." arXiv preprint arXiv:1912.01603 (2019).

Dreamer is complex, so it is skipped this time.

ugo-nama-kun commented Sep 29, 2021

Lee, Alex X., et al. "Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model." arXiv preprint arXiv:1907.00953 (2019).

https://alexlee-gk.github.io/slac/

Formulates SAC so that it includes latent (hidden) states.

The ELBO-based formulation is (probably) elegant.
The decoder p(x|z) has 5 transposed convolutional layers (256 4 × 4, 128 3 × 3, 64 3 × 3, 32 3 × 3, and 3 5 × 5 filters, respectively, stride 2 each, except for the first layer).
q(z|x) has 5 convolutional layers (32 5 × 5, 64 3 × 3, 128 3 × 3, 256 3 × 3, and 256 4 × 4 filters, respectively, stride 2 each, except for the last layer).
q(z'|x,z,a) has 2 fully connected layers (256 units each), and a Gaussian output layer.
The latent variables are z_1 = 32 dim and z_2 = 256 dim.
The critic is 256-256-value.
The actor is 5 convnet layers-256-256-tanh?-action: here too the CNN is updated by other objectives, and the actor merely uses that CNN.
