{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Understand A2C implementation playing Sonic the Hedgehog 🦔\n", | |
"\n", | |
"This Notebook explains the Advantage Actor Critic implementation.<br>\n", | |
"The [repository link](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/tree/master/A2C%20with%20Sonic%20the%20Hedgehog) \n", | |
"\n", | |
"### Acknowledgements 👏\n", | |
"This implementation was inspired by 2 repositories:\n", | |
"- OpenAI [Baselines A2C](https://github.com/openai/baselines/tree/master/baselines/a2c)\n", | |
"- Alexandre Borghi [retro_contest_agent](https://github.com/aborghi)" | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"assets/sonic.gif\" alt=\"Sonic the hedgehog Deep reinforcement learning course\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)\n", | |
"<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png\" alt=\"Deep Reinforcement Course\"/>\n", | |
"<br>\n", | |
"<p> Deep Reinforcement Learning Course is a free series of articles and videos tutorials 🆕 about Deep Reinforcement Learning, where **we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Gradients…), and how to implement them with Tensorflow.**\n", | |
"<br><br>\n", | |
" \n", | |
"📜The articles explain the architectures from the big picture to the mathematical details behind them.\n", | |
"<br>\n", | |
"📹 The videos explain how to build the agents with Tensorflow </b></p>\n", | |
"<br>\n", | |
"This course will give you a **solid foundation for understanding and implementing the future state of the art algorithms**. And, you'll build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, agents that will be able to navigate in 3D environments with DeepMindLab (Quake) and able to walk with Mujoco. \n", | |
"<br><br>\n", | |
"</p> \n", | |
"\n", | |
"## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)\n", | |
"\n", | |
"\n", | |
"## Any questions 👨💻\n", | |
"<p> If you have any questions, feel free to ask me: </p>\n", | |
"<p> 📧: <a href=\"mailto:[email protected]\">[email protected]</a> </p>\n", | |
"<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>\n", | |
"<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>\n", | |
"<p> Twitter: <a href=\"https://twitter.com/ThomasSimonini\">@ThomasSimonini</a> </p>\n", | |
"<p> Don't forget to <b> follow me on <a href=\"https://twitter.com/ThomasSimonini\">twitter</a>, <a href=\"https://github.com/simoninithomas/Deep_reinforcement_learning_Course\">github</a> and <a href=\"https://medium.com/@thomassimonini\">Medium</a> to be alerted of the new articles that I publish </b></p>\n", | |
" \n", | |
"## How to help 🙌\n", | |
"3 ways:\n", | |
"- **Clap our articles and like our videos a lot**:Clapping in Medium means that you really like our articles. And the more claps we have, the more our article is shared Liking our videos help them to be much more visible to the deep learning community.\n", | |
"- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. \n", | |
"- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.\n", | |
"<br>\n", | |
"\n", | |
"## Important note 🤔\n", | |
"<b> You can run it on your computer but it's better to run it on GPU based services</b>, personally I use Microsoft Azure and their Deep Learning Virtual Machine (they offer 170$)\n", | |
"https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning\n", | |
"<br>\n", | |
"⚠️ I don't have any business relations with them. I just loved their excellent customer service.\n", | |
"\n", | |
"If you have some troubles to use Microsoft Azure follow the explainations of this excellent article here (without last the part fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429" | |
] | |
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How to use it? 📖\n", | |
"⚠️ First you need to follow Step 1 (Download Sonic the Hedgehog)\n", | |
"### Watch our agent playing 👀\n", | |
"- **Modify the environment**: change `env.make_train_3`with the env you want in `env= DummyVecEnv([env.make_train_3]))` in play.py\n", | |
"- **See the agent playing**: run `python play.py`\n", | |
"\n", | |
"### Continue to train 🏃♂️\n", | |
"⚠️ There is a big risk **of overfitting**\n", | |
"- Run `python agent.py`" | |
] | |
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Download Sonic the Hedgehog 💻"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. The first step is to download the game, to make it works on retro **you need to buy it legally on [Steam](https://store.steampowered.com/app/71113/Sonic_The_Hedgehog/)**\n", | |
"<br>\n", | |
"<img src=\"https://steamcdn-a.akamaihd.net/steam/apps/71113/header.jpg?t=1495556871\" alt=\"Sonic the Hedgehog\"/>\n", | |
"<br>\n", | |
"2. Then follow the **Quickstart part** of this [website](https://contest.openai.com/2018-1/details/)\n" | |
] | |
}, | |
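{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Optional sanity check, not part of the original repository.)* Once the ROM is imported, a minimal sketch like the one below should confirm that Gym Retro can load the game. The `game` and `state` names follow Gym Retro's usual naming for the Genesis version of Sonic; adjust them if your import differs.\n",
"```python\n",
"import retro\n",
"\n",
"# Create a raw (unwrapped) Sonic environment just to check the ROM import worked\n",
"env = retro.make(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1')\n",
"obs = env.reset()\n",
"print(obs.shape)  # raw Genesis frames, typically (224, 320, 3), before any preprocessing\n",
"env.close()\n",
"```"
]
},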
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Build all elements we need for our environment in sonic_env.py 🖼️"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `PreprocessFrame(gym.ObservationWrapper)` : in this class we will **preprocess our environment**\n", | |
" - Set frame to gray \n", | |
" - Resize the frame to 96x96x1\n", | |
"<br>\n", | |
"<br>\n", | |
"- `ActionsDiscretizer(gym.ActionWrapper)` : in this class we **limit the possibles actions in our environment** (make it discrete)\n", | |
"\n", | |
"In fact you'll see that for each action in actions:\n", | |
" Create an array of 12 False (12 = nb of buttons)\n", | |
" For each button in action: (for instance ['LEFT']) we need to make that left button index = True\n", | |
" Then the button index = LEFT = True\n", | |
"\n", | |
"--> In fact at the end we will have an array where each array is an action and **each elements True of this array are the buttons clicked.**\n", | |
"For instance LEFT action = [F, F, F, F, F, F, T, F, F, F, F, F]\n", | |
"<br><br>\n", | |
"- `RewardScaler(gym.RewardWrapper)`: We **scale the rewards** to reasonable scale (useful in PPO but also in A2C).\n", | |
"<br><br>\n", | |
"- `AllowBacktracking(gym.Wrapper)`: **Allow the agent to go backward without being discourage** (avoid our agent to be stuck on a wall during the game).\n", | |
"<br><br>\n", | |
"\n", | |
"- `make_env(env_idx)` : **Build an environement** (and stack 4 frames together using `FrameStack`\n", | |
"<br><br>\n", | |
"The idea is that we'll build multiple instances of the environment, different environments each times (different level) **to avoid overfitting and helping our agent to generalize better at playing sonic** \n", | |
"<br><br>\n", | |
"To handle these multiple environements we'll use `SubprocVecEnv` that **creates a vector of n environments to run them simultaneously.**" | |
] | |
}, | |
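{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of the button-array idea behind `ActionsDiscretizer` (not the repository's exact class). The 12-button list follows the usual Gym Retro ordering for the Genesis pad, and the action combos are illustrative assumptions:\n",
"```python\n",
"import numpy as np\n",
"\n",
"# One boolean array per discrete action: True marks a pressed button\n",
"buttons = ['B', 'A', 'MODE', 'START', 'UP', 'DOWN', 'LEFT', 'RIGHT', 'C', 'Y', 'X', 'Z']\n",
"actions = [['LEFT'], ['RIGHT'], ['LEFT', 'DOWN'], ['RIGHT', 'DOWN'], ['DOWN'],\n",
"           ['DOWN', 'B'], ['B']]\n",
"\n",
"action_table = []\n",
"for action in actions:\n",
"    arr = np.array([False] * 12)           # 12 False values, one per button\n",
"    for button in action:\n",
"        arr[buttons.index(button)] = True  # press the buttons of this combo\n",
"    action_table.append(arr)\n",
"\n",
"print(action_table[0].astype(int))  # LEFT -> [0 0 0 0 0 0 1 0 0 0 0 0]\n",
"```"
]
},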
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Build the A2C architecture in architecture.py 🧠"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `from baselines.common.distributions import make_pdtype`: This function selects the **probability distribution over actions**\n", | |
"<br><br>\n", | |
"\n", | |
"- First, we create two functions that will help us to avoid to call conv and fc each time.\n", | |
" - `conv`: function to create a convolutional layer.\n", | |
" - `fc`: function to create a fully connected layer.\n", | |
"<br><br>\n", | |
"- Then, we create `A2CPolicy`, the object that **contains the architecture**<br>\n", | |
"3 CNN for spatial dependencies<br>\n", | |
"Temporal dependencies is handle by stacking frames<br>\n", | |
"(Something funny nobody use LSTM in OpenAI Retro contest)<br>\n", | |
"1 common FC<br>\n", | |
"1 FC for policy<br>\n", | |
"1 FC for value\n", | |
"<br><br>\n", | |
"- `self.pdtype = make_pdtype(action_space)`: Based on the action space, will **select what probability distribution typewe will use to distribute action in our stochastic policy** (in our case DiagGaussianPdType aka Diagonal Gaussian, multivariate normal distribution\n", | |
"<br><br>\n", | |
"- `self.pdtype.pdfromlatent` : return a returns a probability distribution over actions (self.pd) and our pi logits (self.pi).\n", | |
"<br><br>\n", | |
"\n", | |
"- We create also 3 useful functions in A2CPolicy\n", | |
" - `def step(state_in, *_args, **_kwargs)`: Function use to take a step returns **action to take and V(s)**\n", | |
" - `def value(state_in, *_args, **_kwargs)`: Function that calculates **only the V(s)**\n", | |
" - `def select_action(state_in, *_args, **_kwargs)`: Function that output **only the action to take**" | |
] | |
}, | |
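{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the shape of the network concrete, here is a minimal sketch in plain TensorFlow 1.x layers: 3 convolutional layers, one shared fully connected layer, then separate policy and value heads. It is not the repository's architecture.py (which uses the custom `conv`/`fc` helpers and `make_pdtype`); filter sizes, activations and `n_actions` are illustrative assumptions.\n",
"```python\n",
"import tensorflow as tf\n",
"\n",
"def build_a2c_network(states, n_actions):\n",
"    # 3 conv layers extract spatial features from the stacked frames\n",
"    x = tf.layers.conv2d(states, filters=32, kernel_size=8, strides=4, activation=tf.nn.elu)\n",
"    x = tf.layers.conv2d(x, filters=64, kernel_size=4, strides=2, activation=tf.nn.elu)\n",
"    x = tf.layers.conv2d(x, filters=64, kernel_size=3, strides=1, activation=tf.nn.elu)\n",
"    x = tf.layers.flatten(x)\n",
"    # 1 common fully connected layer\n",
"    shared_fc = tf.layers.dense(x, units=512, activation=tf.nn.elu)\n",
"    # separate heads: policy logits and state value V(s)\n",
"    pi_logits = tf.layers.dense(shared_fc, units=n_actions)\n",
"    value = tf.layers.dense(shared_fc, units=1)\n",
"    return pi_logits, value\n",
"\n",
"# 96x96 preprocessed frames, 4 of them stacked; 7 discrete actions is an assumption\n",
"states = tf.placeholder(tf.float32, [None, 96, 96, 4])\n",
"pi_logits, value = build_a2c_network(states, n_actions=7)\n",
"```"
]
},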
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Build the Model in model.py 🏗️"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use Model object to:\n", | |
"__init__:\n", | |
"- `policy(sess, ob_space, action_space, nenvs, 1, reuse=False)`: Creates the step_model (used for sampling)\n", | |
"- `policy(sess, ob_space, action_space, nenvs*nsteps, nsteps, reuse=True)`: Creates the train_model (used for training)\n", | |
"\n", | |
"- `def train(states_in, actions, returns, values, lr)`: Make the training part (calculate advantage and feedforward and retropropagation of gradients)\n", | |
"\n", | |
"save/load():\n", | |
"- `def save(save_path)`:Save the weights.\n", | |
"- `def load(load_path)`:Load the weights\n" | |
] | |
}, | |
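{
"cell_type": "markdown",
"metadata": {},
"source": [
"The heart of `train()` is the A2C loss arithmetic: advantage-weighted policy gradient loss, value (critic) loss and an entropy bonus. Below is a hedged sketch of that arithmetic, not the repository's exact code; `neglogp` and `entropy` would come from the policy's probability distribution, and the coefficients mirror `vf_coef`/`ent_coef` from agent.py.\n",
"```python\n",
"import tensorflow as tf\n",
"\n",
"def a2c_loss(neglogp, entropy, vpred, returns, values, vf_coef=0.5, ent_coef=0.01):\n",
"    advantages = returns - values                         # A(s,a) = R - V(s)\n",
"    pg_loss = tf.reduce_mean(advantages * neglogp)        # policy gradient loss\n",
"    vf_loss = tf.reduce_mean(tf.square(vpred - returns))  # value function loss\n",
"    ent = tf.reduce_mean(entropy)                         # entropy bonus for exploration\n",
"    return pg_loss + vf_coef * vf_loss - ent_coef * ent\n",
"```"
]
},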
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Build the Runner in model.py 🏃\n", | |
"Runner will be used to make a mini batch of experiences\n", | |
"- Each environement send 1 timestep (4 frames stacked) (`self.obs`)\n", | |
"- This goes through `step_model`\n", | |
" - Returns actions, values.\n", | |
"- Append `mb_obs`, `mb_actions`, `mb_values`, `mb_dones`.\n", | |
"- Take actions in environments and watch the consequences\n", | |
" - return `obs`, `rewards`, `dones`\n", | |
"- We need to calculate advantage to do that we use General Advantage Estimation" | |
] | |
}, | |
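{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a hedged, single-environment sketch of Generalized Advantage Estimation, not the Runner's exact code. `gamma` and `lam` match the values passed to `learn` in agent.py; `last_value` is the bootstrap value of the state following the rollout.\n",
"```python\n",
"import numpy as np\n",
"\n",
"def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):\n",
"    nsteps = len(rewards)\n",
"    advantages = np.zeros(nsteps, dtype=np.float32)\n",
"    lastgaelam = 0.0\n",
"    for t in reversed(range(nsteps)):\n",
"        next_value = last_value if t == nsteps - 1 else values[t + 1]\n",
"        nonterminal = 1.0 - dones[t]\n",
"        # TD error: delta_t = r_t + gamma * V(s_t+1) - V(s_t)\n",
"        delta = rewards[t] + gamma * next_value * nonterminal - values[t]\n",
"        # GAE: A_t = delta_t + gamma * lambda * A_t+1\n",
"        lastgaelam = delta + gamma * lam * nonterminal * lastgaelam\n",
"        advantages[t] = lastgaelam\n",
"    returns = advantages + np.asarray(values)  # targets for the value function\n",
"    return advantages, returns\n",
"```"
]
},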
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Build the learn function in model.py\n", | |
"The learn function can be seen as **the gathering of all the logic of our A2C**\n", | |
"- Instantiate the model object (that creates step_model and train_model)\n", | |
"- Instantiate the runner object\n", | |
"- Train always in two phases:\n", | |
" - Run to get a batch of experiences\n", | |
" - Train that batch of experiences\n", | |
"<br><br>\n", | |
"\n", | |
"We use explained_variance which is a **really important parameter**:\n", | |
"<br><br>\n", | |
"`ev = 1 - Variance[y - ypredicted] / Variance [y]`\n", | |
"<br><br>\n", | |
"In fact this calculates **if value function is a good predictor of the returns or if it's just worse than predicting nothing.**\n", | |
"ev=0 => might as well have predicted zero\n", | |
"ev<0 => worse than just predicting zero so you're overfitting (need to tune some hyperparameters)\n", | |
"ev=1 => perfect prediction\n", | |
"\n", | |
"--> The goal is that **ev goes closer and closer to 1.**" | |
] | |
}, | |
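{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small illustration of that metric (baselines ships its own `explained_variance` helper; this version is only for the example):\n",
"```python\n",
"import numpy as np\n",
"\n",
"def explained_variance(ypred, y):\n",
"    vary = np.var(y)\n",
"    return np.nan if vary == 0 else 1 - np.var(y - ypred) / vary\n",
"\n",
"returns = np.array([1.0, 2.0, 3.0, 4.0])\n",
"good_pred = np.array([1.1, 1.9, 3.2, 3.8])\n",
"bad_pred = np.array([4.0, 0.0, 4.0, 0.0])\n",
"print(explained_variance(good_pred, returns))  # ~0.98, close to 1\n",
"print(explained_variance(bad_pred, returns))   # negative: predictions explain none of the variance\n",
"```"
]
},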
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7: Build the play function in model.py\n",
"- This function is used to play an environment with the trained model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 8: Build agent.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `config.gpu_options.allow_growth = True`: this configures the GPU session so TensorFlow allocates GPU memory as needed instead of grabbing it all upfront.\n",
"- `model.learn(policy=policies.A2CPolicy, env=SubprocVecEnv([env.make_train_0, env.make_train_1, env.make_train_2, env.make_train_3, env.make_train_4, env.make_train_5, env.make_train_6, env.make_train_7, env.make_train_8, env.make_train_9, env.make_train_10, env.make_train_11, env.make_train_12]), nsteps=2048, total_timesteps=10000000, gamma=0.99, lam=0.95, vf_coef=0.5, ent_coef=0.01, lr=2e-4, max_grad_norm=0.5, log_interval=10)`: here we just call the learn function, which contains all the elements needed to train our A2C agent (`nsteps` is the number of steps per environment). The same call is laid out as a code block below.\n"
]
},
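{
"cell_type": "markdown",
"metadata": {},
"source": [
"For readability, here is the same call laid out as a code block. This is a sketch: it assumes sonic_env.py is imported as `env` (as the call above suggests) and that the session is created with the `allow_growth` config; the repository's agent.py may differ slightly in structure.\n",
"```python\n",
"import tensorflow as tf\n",
"from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv\n",
"\n",
"import model\n",
"import policies\n",
"import sonic_env as env\n",
"\n",
"config = tf.ConfigProto()\n",
"config.gpu_options.allow_growth = True  # allocate GPU memory on demand\n",
"\n",
"with tf.Session(config=config):\n",
"    model.learn(policy=policies.A2CPolicy,\n",
"                env=SubprocVecEnv([env.make_train_0, env.make_train_1, env.make_train_2,\n",
"                                   env.make_train_3, env.make_train_4, env.make_train_5,\n",
"                                   env.make_train_6, env.make_train_7, env.make_train_8,\n",
"                                   env.make_train_9, env.make_train_10, env.make_train_11,\n",
"                                   env.make_train_12]),\n",
"                nsteps=2048,              # steps per environment per rollout\n",
"                total_timesteps=10000000,\n",
"                gamma=0.99,               # discount factor\n",
"                lam=0.95,                 # GAE lambda\n",
"                vf_coef=0.5,              # value loss coefficient\n",
"                ent_coef=0.01,            # entropy coefficient\n",
"                lr=2e-4,\n",
"                max_grad_norm=0.5,\n",
"                log_interval=10)\n",
"```"
]
}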
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:gameplai]",
"language": "python",
"name": "conda-env-gameplai-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
} |