ugo-nama-kun/visual_rl.md

Last active October 3, 2021 11:34

A summary of deep reinforcement learning with continuous actions + visual input

Deep RL + Continuous Control with Vision

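This note collects deep reinforcement learning work that uses both continuous actions and visual input.
Where a paper describes a particularly important technique, that is written down too.
Multimodal reinforcement learning is also noted where it appears.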

Architecture patterns for image-based agents, after summarizing the papers

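As in SAC, split the actor and critic into completely separate networks and use two CNNs.
Share the CNN between the actor and critic, but update the CNN only through the critic; the actor just uses it.
Use a CNN in both the actor and critic, but update the CNN with a separate loss, such as an autoencoder.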
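The three patterns differ only in which loss is allowed to move the CNN weights. The following is a minimal sketch of that gradient routing on a toy scalar model, not any paper's implementation: the one-parameter "CNN" `z = w * x`, the toy losses, and all names and constants (`c`, `p`, `d`, `target`) are assumptions made for illustration.

```python
def encoder_gradients(w, x, c=0.5, p=2.0, d=1.5, target=1.0):
    """Gradients of three toy losses with respect to the encoder weight w.

    Hypothetical scalar stand-ins: encoder z = w * x, critic q = c * z,
    actor score a = p * z, decoder x_hat = d * z.
    """
    z = w * x
    return {
        # critic loss (c*z - target)^2
        "critic": 2.0 * (c * z - target) * c * x,
        # actor loss -(p*z), i.e. maximise the toy action score
        "actor": -p * x,
        # auxiliary reconstruction loss (d*z - x)^2
        "aux": 2.0 * (d * z - x) * d * x,
    }


def encoder_step(pattern, w, x, lr=0.1):
    """Apply one gradient step to the encoder(s) under patterns 1 to 3."""
    g = encoder_gradients(w, x)
    if pattern == 1:
        # Two separate CNNs, each trained by its own head's loss.
        return {"actor_cnn": w - lr * g["actor"],
                "critic_cnn": w - lr * g["critic"]}
    if pattern == 2:
        # Shared CNN: only the critic gradient is applied; the actor
        # gradient is dropped (the stop-gradient / detach in a real model).
        return {"shared_cnn": w - lr * g["critic"]}
    if pattern == 3:
        # Shared CNN trained only by the auxiliary (e.g. autoencoder) loss.
        return {"shared_cnn": w - lr * g["aux"]}
    raise ValueError("pattern must be 1, 2, or 3")
```

In a real framework, patterns 2 and 3 usually amount to detaching the encoder output before it enters the heads whose gradients should not reach the CNN.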

ugo-nama-kun commented Sep 29, 2021

Kostrikov, Ilya, Denis Yarats, and Rob Fergus. "Image augmentation is all you need: Regularizing deep reinforcement learning from pixels." arXiv preprint arXiv:2004.13649 (2020).

Extends the earlier Yarats, Denis, et al. "Improving sample efficiency in model-free reinforcement learning from images." arXiv preprint arXiv:1910.01741 (2019), but without needing a decoder network.

https://sites.google.com/view/data-regularized-q

Applies image augmentation to SAC. This is the version that preceded DrQ-v2.
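To make "image augmentation" concrete, here is a minimal sketch of one common choice, a random shift (pad the image by repeating edge pixels, then crop back to the original size at a random offset). The notes do not say which augmentation is used, so the choice of random shift, the `pad=4` default, and the function name `random_shift` are assumptions for illustration. The image is a plain 2D list to keep the example dependency-free.

```python
import random


def random_shift(image, pad=4, rng=random):
    """Pad a 2D image by repeating edge pixels, then crop back to the
    original size at a uniformly random offset (hypothetical helper)."""
    h, w = len(image), len(image[0])
    # Replicate-pad each row on the left and right.
    padded = [[row[0]] * pad + list(row) + [row[-1]] * pad for row in image]
    # Replicate-pad the top and bottom using the first and last rows.
    padded = ([padded[0][:] for _ in range(pad)]
              + padded
              + [padded[-1][:] for _ in range(pad)])
    # randint is inclusive, so there are 2*pad + 1 possible offsets per axis.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [r[left:left + w] for r in padded[top:top + h]]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
shifted = random_shift(image, pad=1, rng=random.Random(0))
```

The output always has the same shape as the input, so it can be fed to the same CNN without further changes.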

Applies image data augmentation to SAC and DQN
3x3x32(stride2)-relu-3x3x32(stride1)-relu-3x3x32(stride1)-relu-3x3x32(stride1)-relu-layerNorm-50-tanh-(1024-relu-1024-relu-action, 1024-relu-1024-relu-value)
All weights use orthogonal initialization; biases are zero
Here too, the actor does not update the CNN; the CNN is updated only through the critic
The critic is updated with double Q-learning
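The size of the feature map entering the 50-unit layer follows from the convolution stack listed above. A small sketch of that arithmetic, assuming an 84x84 input and no padding (neither is stated in these notes); the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_output_shape(input_size=84, channels=32):
    """Trace the spatial size through the four 3x3 conv layers in the notes.

    input_size=84 and zero padding are assumptions, not taken from the notes.
    """
    layers = [(3, 2), (3, 1), (3, 1), (3, 1)]  # (kernel, stride) per layer
    size = input_size
    for kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return channels, size, size


c, h, w = encoder_output_shape()
flat_features = c * h * w  # number of inputs to the 50-unit layer
```

Under these assumptions the spatial size goes 84, 41, 39, 37, 35, so the flattened feature vector has 32 * 35 * 35 = 39200 entries.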

ugo-nama-kun commented Sep 29, 2021

Hafner, Danijar, et al. "Dream to control: Learning behaviors by latent imagination." arXiv preprint arXiv:1912.01603 (2019).

Dreamer is complex, so I am skipping it for now

ugo-nama-kun commented Sep 29, 2021

Lee, Alex X., et al. "Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model." arXiv preprint arXiv:1907.00953 (2019).

https://alexlee-gk.github.io/slac/

Formulates SAC so that the latent (hidden) state is part of the model
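For reference, a generic evidence lower bound for a sequential latent-variable model can be written as below. It reuses the notation of these notes, $p(x \mid z)$ and $q(z' \mid x, z, a)$; the prior $p(z_t \mid z_{t-1}, a_{t-1})$, the time indexing, and the convention that the $t = 1$ terms drop their conditioning on $z_0, a_0$ are assumptions for illustration, not a transcription of the paper's equation.

```latex
\log p(x_{1:T} \mid a_{1:T-1})
  \ge \mathbb{E}_{z_{1:T} \sim q}\left[
    \sum_{t=1}^{T} \Big(
      \log p(x_t \mid z_t)
      - D_{\mathrm{KL}}\big(
          q(z_t \mid x_t, z_{t-1}, a_{t-1})
          \,\|\,
          p(z_t \mid z_{t-1}, a_{t-1})
        \big)
    \Big)
  \right]
```

The first term rewards reconstructing each image from its latent, and the KL term keeps the posterior over the next latent close to what the latent dynamics predict.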

The way it is formulated through the ELBO is (I think) clean
decoder p(x|z): 5 transposed convolutional layers (256 4 × 4, 128 3 × 3, 64 3 × 3, 32 3 × 3, and 3 5 × 5 filters, respectively, stride 2 each, except for the first layer)
q(z|x): 5 convolutional layers (32 5 × 5, 64 3 × 3, 128 3 × 3, 256 3 × 3, and 256 4 × 4 filters, respectively, stride 2 each, except for the last layer)
q(z'|x,z,a): 2 fully connected layers (256 units each) and a Gaussian output layer
latent variables: z_1 = 32 dim, z_2 = 256 dim
critic: 256-256-value
actor: 5 convnet layers-256-256-tanh?-action. Here too, the CNN is updated by a different objective, and the actor only uses that CNN
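The encoder and decoder layer lists above mirror each other, which is easiest to see by tracing the spatial sizes. A small sketch of that arithmetic follows. The 64x64 input size and the padding modes ("same" for the stride-2 layers, "valid" for the 4x4 stride-1 layer) are assumptions; only the filter counts, kernel sizes, and strides come from the notes.

```python
import math

# (filters, kernel, stride, padding); the padding modes are assumed.
ENCODER = [(32, 5, 2, "same"), (64, 3, 2, "same"), (128, 3, 2, "same"),
           (256, 3, 2, "same"), (256, 4, 1, "valid")]
DECODER = [(256, 4, 1, "valid"), (128, 3, 2, "same"), (64, 3, 2, "same"),
           (32, 3, 2, "same"), (3, 5, 2, "same")]


def trace(layers, size, transposed=False):
    """Return the (filters, height, width) after each layer."""
    shapes = []
    for filters, kernel, stride, padding in layers:
        if transposed:
            if padding == "same":
                size = size * stride                  # upsample by the stride
            else:
                size = (size - 1) * stride + kernel   # "valid" transposed conv
        else:
            if padding == "same":
                size = math.ceil(size / stride)       # downsample by the stride
            else:
                size = (size - kernel) // stride + 1  # "valid" conv
        shapes.append((filters, size, size))
    return shapes


encoder_shapes = trace(ENCODER, 64)                   # image -> feature
decoder_shapes = trace(DECODER, 1, transposed=True)   # feature -> image
```

Under these assumptions the encoder reduces a 64x64 image to a 256-dimensional feature (256 x 1 x 1), and the decoder expands a 1x1 input back to a 3-channel 64x64 image.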
