simoninithomas · April 19, 2019 07:26
diff --git a/Doom REINFORCE Monte Carlo Policy gradients.ipynb b/Doom REINFORCE Monte Carlo Policy gradients.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Doom-Health: REINFORCE Monte Carlo Policy gradients 🕹️"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we'll implement an agent <b>that try to survive in Doom environment by using a Policy Gradient architecture.</b> <br>\n",
    "Our agent playing Doom:\n",
    "\n",
    "<img src=\"https://simoninithomas.github.io/Deep_reinforcement_learning_Course/assets/img/projects/projectw4_reduced.gif\" style=\"max-width: 600px;\" alt=\"Policy Gradient with Doom\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# You can follow this notebook with this video tutorial 📹 that will helps you to understand each step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/wLTQRuizVyE?showinfo=0\" frameborder=\"0\" allow=\"autoplay; encrypted-media\" allowfullscreen></iframe>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import HTML\n",
    "HTML('<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/wLTQRuizVyE?showinfo=0\" frameborder=\"0\" allow=\"autoplay; encrypted-media\" allowfullscreen></iframe>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# This is a notebook from Deep Reinforcement Learning Course with Tensorflow\n",
    "<img src=\"https://simoninithomas.github.io/Deep_reinforcement_learning_Course/assets/img/preview.jpg\" alt=\"Deep Reinforcement Course\" style=\"width: 500px;\"/>\n",
    "\n",
    "<p>  Deep Reinforcement Learning Course is a free series of blog posts and videos 🆕 about Deep Reinforcement Learning, where we'll learn the main algorithms, and how to implement them with Tensorflow.\n",
    "\n",
    "📜The articles explain the concept from the big picture to the mathematical details behind it.\n",
    "\n",
    "📹 The videos explain how to create the agent with Tensorflow </b></p>\n",
    "\n",
    "## <a href=\"https://simoninithomas.github.io/Deep_reinforcement_learning_Course/\">Syllabus</a><br>\n",
    "### 📜 Part 1: Introduction to Reinforcement Learning [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419) \n",
    "\n",
    "### Part 2: Q-learning with FrozenLake \n",
    "#### 📜 [ARTICLE](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe) // [FROZENLAKE IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb)\n",
    "#### 📹 [Implementing a Q-learning agent that plays Taxi-v2 🚕](https://youtu.be/q2ZOEFAaaI0) \n",
    "\n",
    "### Part 3: Deep Q-learning with Doom\n",
    "#### 📜 [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)  //  [DOOM IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/DQN%20Doom/Deep%20Q%20learning%20with%20Doom.ipynb)\n",
    "#### 📹 [Create a DQN Agent that learns to play Atari Space Invaders 👾 ](https://youtu.be/gCJyVX98KJ4)\n",
    "\n",
    "### Part 3+: Improvments in Deep Q-Learning\n",
    "#### 📜 [ARTICLE (📅 JUNE)] \n",
    "#### 📹 [Create an Agent that learns to play Doom Deadly corridor (📅 06/27 )] \n",
    "\n",
    "### Part 4: Policy Gradients with Doom \n",
    "#### 📜 [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f) //  [CARTPOLE IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb) // [DOOM IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Doom/Doom%20REINFORCE%20Monte%20Carlo%20Policy%20gradients.ipynb)\n",
    "#### 📹 [Create an Agent that learns to play Doom deathmatch](https://youtu.be/wLTQRuizVyE) \n",
    "\n",
    "### Part 5: Advantage Advantage Actor Critic (A2C) \n",
    "#### 📜 [ARTICLE (📅 June)] \n",
    "#### 📹 [Create an Agent that learns to play Outrun (📅 07/04)] \n",
    "\n",
    "### Part 6: Asynchronous Advantage Actor Critic (A3C) \n",
    "#### 📜 [ARTICLE (📅 July)] \n",
    "#### 📹 [Create an Agent that learns to play Michael Jackson's Moonwalker (📅 07/11)] \n",
    "\n",
    "### Part 7: Proximal Policy Gradients \n",
    "#### 📜 [ARTICLE (📅 July)]\n",
    "#### 📹 [Create an Agent that learns to play walk with Mujoco (📅 07/18)]\n",
    "\n",
    "### Part 8: TBA \n",
    "\n",
    "## Any questions 👨‍💻\n",
    "<p> If you have any questions, feel free to ask me: </p>\n",
    "<p> 📧: <a href=\"mailto:hello@simoninithomas.com\">hello@simoninithomas.com</a>  </p>\n",
    "<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>\n",
    "<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>\n",
    "<p> Twitter: <a href=\"https://twitter.com/ThomasSimonini\">@ThomasSimonini</a> </p>\n",
    "<p> Don't forget to <b> follow me on <a href=\"https://twitter.com/ThomasSimonini\">twitter</a>, <a href=\"https://github.com/simoninithomas/Deep_reinforcement_learning_Course\">github</a> and <a href=\"https://medium.com/@thomassimonini\">Medium</a> to be alerted of the new articles that I publish </b></p>\n",
    "    \n",
    "## How to help  🙌\n",
    "3 ways:\n",
    "- **Clap our articles and like our videos a lot**:Clapping in Medium means that you really like our articles. And the more claps we have, the more our article is shared Liking our videos help them to be much more visible to the deep learning community.\n",
    "- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. \n",
    "- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.\n",
    "<br>\n",
    "\n",
    "## Important note 🤔\n",
    "<b> You can run it on your computer but it's better to run it on GPU based services</b>, personally I use Microsoft Azure and their Deep Learning Virtual Machine (they offer 170$)\n",
    "https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning\n",
    "<br>\n",
    "⚠️ I don't have any business relations with them. I just loved their excellent customer service.\n",
    "\n",
    "If you have some troubles to use Microsoft Azure follow the explainations of this excellent article here (without last the part fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Import the libraries 📚"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf      # Deep Learning library\n",
    "import numpy as np           # Handle matrices\n",
    "from vizdoom import *        # Doom Environment\n",
    "import random                # Handling random number generation\n",
    "import time                  # Handling time calculation\n",
    "from skimage import transform# Help us to preprocess the frames\n",
    "\n",
    "from collections import deque# Ordered collection with ends\n",
    "import matplotlib.pyplot as plt # Display graphs\n",
    "\n",
    "import warnings # This ignore all the warning messages that are normally printed during the training because of skiimage\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Create our environment 🎮\n",
    "- Now that we imported the libraries/dependencies, we will create our environment.\n",
    "- Doom environment takes:\n",
    "    - A `configuration file` that **handle all the options** (size of the frame, possible actions...)\n",
    "    - A `scenario file`: that **generates the correct scenario** (in our case basic **but you're invited to try other scenarios**).\n",
    "- Note: We have 3 possible actions `[[0,0,1], [1,0,0], [0,1,0]]` so we don't need to do one hot encoding (thanks to < a href=\"https://stackoverflow.com/users/2237916/silgon\">silgon</a> for figuring out. \n",
    "\n",
    "### Our environment\n",
    "<img src=\"assets/health_doom.jpg\" style=\"max-width:500px;\" alt=\"Doom health\"/>\n",
    "\n",
    "The purpose of this scenario is to teach the agent **how to survive without knowing what makes him survive.** Agent know only that life is precious and death is bad so he must learn what prolongs his existence and that his health is connected with it.\n",
    "\n",
    "Map is a rectangle with green, acidic floor which hurts the player periodically. Initially there are some medkits spread uniformly over the map. A new medkit falls from the skies every now and then. **Medkits heal some portions of player's health - to survive agent needs to pick them up. Episode finishes after player's death or on timeout.**\n",
    "\n",
    "Further configuration:\n",
    "\n",
    "- living_reward = 1\n",
    "- 3 available buttons: turn left, turn right, move forward\n",
    "- 1 available game variable: HEALTH\n",
    "- death penalty = 100\n",
    "<br><br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Here we create our environment\n",
    "\"\"\"\n",
    "def create_environment():\n",
    "    game = DoomGame()\n",
    "    \n",
    "    # Load the correct configuration\n",
    "    game.load_config(\"health_gathering.cfg\")\n",
    "    \n",
    "    # Load the correct scenario (in our case defend_the_center scenario)\n",
    "    game.set_doom_scenario_path(\"health_gathering.wad\")\n",
    "    \n",
    "    game.init()\n",
    "    \n",
    "    # Here our possible actions\n",
    "    # [[1,0,0],[0,1,0],[0,0,1]]\n",
    "    possible_actions  = np.identity(3,dtype=int).tolist()\n",
    "    \n",
    "    return game, possible_actions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "game, possible_actions = create_environment()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Define the preprocessing functions ⚙️\n",
    "### preprocess_frame 🖼️\n",
    "Preprocessing is an important step, <b>because we want to reduce the complexity of our states to reduce the computation time needed for training.</b>\n",
    "<br><br>\n",
    "Our steps:\n",
    "- Grayscale each of our frames (because <b> color does not add important information </b>). But this is already done by the config file.\n",
    "- Crop the screen (in our case we remove the roof because it contains no information)\n",
    "- We normalize pixel values\n",
    "- Finally we resize the preprocessed frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "    preprocess_frame:\n",
    "    Take a frame.\n",
    "    Resize it.\n",
    "        __________________\n",
    "        |                 |\n",
    "        |                 |\n",
    "        |                 |\n",
    "        |                 |\n",
    "        |_________________|\n",
    "        \n",
    "        to\n",
    "        _____________\n",
    "        |            |\n",
    "        |            |\n",
    "        |            |\n",
    "        |____________|\n",
    "    Normalize it.\n",
    "    \n",
    "    return preprocessed_frame\n",
    "    \n",
    "    \"\"\"\n",
    "def preprocess_frame(frame):\n",
    "    # Greyscale frame already done in our vizdoom config\n",
    "    # x = np.mean(frame,-1)\n",
    "    \n",
    "    # Crop the screen (remove the roof because it contains no information)\n",
    "    # [Up: Down, Left: right]\n",
    "    cropped_frame = frame[80:,:]\n",
    "    \n",
    "    # Normalize Pixel Values\n",
    "    normalized_frame = cropped_frame/255.0\n",
    "    \n",
    "    # Resize\n",
    "    preprocessed_frame = transform.resize(normalized_frame, [84,84])\n",
    "    \n",
    "    return preprocessed_frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### stack_frames\n",
    "👏 This part was made possible thanks to help of <a href=\"https://github.com/Miffyli\">Anssi</a><br>\n",
    "\n",
    "As explained in this really <a href=\"https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/\">  good article </a> we stack frames.\n",
    "\n",
    "Stacking frames is really important because it helps us to **give have a sense of motion to our Neural Network.**\n",
    "\n",
    "- First we preprocess frame\n",
    "- Then we append the frame to the deque that automatically **removes the oldest frame**\n",
    "- Finally we **build the stacked state**\n",
    "\n",
    "This is how work stack:\n",
    "- For the first frame, we feed 4 frames\n",
    "- At each timestep, **we add the new frame to deque and then we stack them to form a new stacked frame**\n",
    "- And so on\n",
    "<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/DQN/Space%20Invaders/assets/stack_frames.png\" alt=\"stack\">\n",
    "- If we're done, **we create a new stack with 4 new frames (because we are in a new episode)**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "stack_size = 4 # We stack 4 frames\n",
    "\n",
    "# Initialize deque with zero-images one array for each image\n",
    "stacked_frames  =  deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4) \n",
    "\n",
    "def stack_frames(stacked_frames, state, is_new_episode):\n",
    "    # Preprocess frame\n",
    "    frame = preprocess_frame(state)\n",
    "    \n",
    "    if is_new_episode:\n",
    "        # Clear our stacked_frames\n",
    "        stacked_frames = deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4)\n",
    "        \n",
    "        # Because we're in a new episode, copy the same frame 4x\n",
    "        stacked_frames.append(frame)\n",
    "        stacked_frames.append(frame)\n",
    "        stacked_frames.append(frame)\n",
    "        stacked_frames.append(frame)\n",
    "        \n",
    "        # Stack the frames\n",
    "        stacked_state = np.stack(stacked_frames, axis=2)\n",
    "\n",
    "    else:\n",
    "        # Append frame to deque, automatically removes the oldest frame\n",
    "        stacked_frames.append(frame)\n",
    "\n",
    "        # Build the stacked state (first dimension specifies different frames)\n",
    "        stacked_state = np.stack(stacked_frames, axis=2) \n",
    "    \n",
    "    return stacked_state, stacked_frames"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### discount_and_normalize_rewards 💰\n",
    "This function is important, because we are in a Monte Carlo situation. <br>\n",
    "\n",
    "We need to **discount the rewards at the end of the episode**. This function takes, the reward discount it, and **then normalize them** (to avoid a big variability in rewards)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def discount_and_normalize_rewards(episode_rewards):\n",
    "    discounted_episode_rewards = np.zeros_like(episode_rewards)\n",
    "    cumulative = 0.0\n",
    "    for i in reversed(range(len(episode_rewards))):\n",
    "        cumulative = cumulative * gamma + episode_rewards[i]\n",
    "        discounted_episode_rewards[i] = cumulative\n",
    "    \n",
    "    mean = np.mean(discounted_episode_rewards)\n",
    "    std = np.std(discounted_episode_rewards)\n",
    "    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std)\n",
    "\n",
    "    return discounted_episode_rewards"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Set up our hyperparameters ⚗️\n",
    "In this part we'll set up our different hyperparameters. But when you implement a Neural Network by yourself you will **not implement hyperparamaters at once but progressively**.\n",
    "\n",
    "- First, you begin by defining the neural networks hyperparameters when you implement the model.\n",
    "- Then, you'll add the training hyperparameters when you implement the training algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "### ENVIRONMENT HYPERPARAMETERS\n",
    "state_size = [84,84,4] # Our input is a stack of 4 frames hence 84x84x4 (Width, height, channels) \n",
    "action_size = game.get_available_buttons_size() # 3 possible actions: turn left, turn right, move forward\n",
    "stack_size = 4 # Defines how many frames are stacked together\n",
    "\n",
    "## TRAINING HYPERPARAMETERS\n",
    "learning_rate = 0.002\n",
    "num_epochs = 500 # Total epochs for training \n",
    "\n",
    "batch_size = 1000 # Each 1 is a timestep (NOT AN EPISODE) # YOU CAN CHANGE TO 5000 if you have GPU\n",
    "gamma = 0.95 # Discounting rate\n",
    "\n",
    "### MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT\n",
    "training = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Quick note: Policy gradient methods like reinforce **are on-policy method which can not be updated from experience replay.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Create our Policy Gradient Neural Network model 🧠"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/Policy%20Gradients/Doom/assets/doomPG.png\" alt=\"Doom PG\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PGNetwork:\n",
    "    def __init__(self, state_size, action_size, learning_rate, name='PGNetwork'):\n",
    "        self.state_size = state_size\n",
    "        self.action_size = action_size\n",
    "        self.learning_rate = learning_rate\n",
    "        \n",
    "        with tf.variable_scope(name):\n",
    "            with tf.name_scope(\"inputs\"):\n",
    "                # We create the placeholders\n",
    "                # *state_size means that we take each elements of state_size in tuple hence is like if we wrote\n",
    "                # [None, 84, 84, 4]\n",
    "                self.inputs_= tf.placeholder(tf.float32, [None, *state_size], name=\"inputs_\")\n",
    "                self.actions = tf.placeholder(tf.int32, [None, action_size], name=\"actions\")\n",
    "                self.discounted_episode_rewards_ = tf.placeholder(tf.float32, [None, ], name=\"discounted_episode_rewards_\")\n",
    "            \n",
    "                \n",
    "                # Add this placeholder for having this variable in tensorboard\n",
    "                self.mean_reward_ = tf.placeholder(tf.float32, name=\"mean_reward\")\n",
    "                \n",
    "            with tf.name_scope(\"conv1\"):\n",
    "                \"\"\"\n",
    "                First convnet:\n",
    "                CNN\n",
    "                BatchNormalization\n",
    "                ELU\n",
    "                \"\"\"\n",
    "                # Input is 84x84x4\n",
    "                self.conv1 = tf.layers.conv2d(inputs = self.inputs_,\n",
    "                                             filters = 32,\n",
    "                                             kernel_size = [8,8],\n",
    "                                             strides = [4,4],\n",
    "                                             padding = \"VALID\",\n",
    "                                              kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n",
    "                                             name = \"conv1\")\n",
    "\n",
    "                self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,\n",
    "                                                       training = True,\n",
    "                                                       epsilon = 1e-5,\n",
    "                                                         name = 'batch_norm1')\n",
    "\n",
    "                self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name=\"conv1_out\")\n",
    "                ## --> [20, 20, 32]\n",
    "            \n",
    "            with tf.name_scope(\"conv2\"):\n",
    "                \"\"\"\n",
    "                Second convnet:\n",
    "                CNN\n",
    "                BatchNormalization\n",
    "                ELU\n",
    "                \"\"\"\n",
    "                self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,\n",
    "                                     filters = 64,\n",
    "                                     kernel_size = [4,4],\n",
    "                                     strides = [2,2],\n",
    "                                     padding = \"VALID\",\n",
    "                                    kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n",
    "                                     name = \"conv2\")\n",
    "\n",
    "                self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,\n",
    "                                                       training = True,\n",
    "                                                       epsilon = 1e-5,\n",
    "                                                         name = 'batch_norm2')\n",
    "\n",
    "                self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name=\"conv2_out\")\n",
    "                ## --> [9, 9, 64]\n",
    "            \n",
    "            with tf.name_scope(\"conv3\"):\n",
    "                \"\"\"\n",
    "                Third convnet:\n",
    "                CNN\n",
    "                BatchNormalization\n",
    "                ELU\n",
    "                \"\"\"\n",
    "                self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,\n",
    "                                     filters = 128,\n",
    "                                     kernel_size = [4,4],\n",
    "                                     strides = [2,2],\n",
    "                                     padding = \"VALID\",\n",
    "                                    kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n",
    "                                     name = \"conv3\")\n",
    "\n",
    "                self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,\n",
    "                                                       training = True,\n",
    "                                                       epsilon = 1e-5,\n",
    "                                                         name = 'batch_norm3')\n",
    "\n",
    "                self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name=\"conv3_out\")\n",
    "                ## --> [3, 3, 128]\n",
    "            \n",
    "            with tf.name_scope(\"flatten\"):\n",
    "                self.flatten = tf.layers.flatten(self.conv3_out)\n",
    "                ## --> [1152]\n",
    "            \n",
    "            with tf.name_scope(\"fc1\"):\n",
    "                self.fc = tf.layers.dense(inputs = self.flatten,\n",
    "                                      units = 512,\n",
    "                                      activation = tf.nn.elu,\n",
    "                                           kernel_initializer=tf.contrib.layers.xavier_initializer(),\n",
    "                                    name=\"fc1\")\n",
    "            \n",
    "            with tf.name_scope(\"logits\"):\n",
    "                self.logits = tf.layers.dense(inputs = self.fc, \n",
    "                                               kernel_initializer=tf.contrib.layers.xavier_initializer(),\n",
    "                                              units = 3, \n",
    "                                            activation=None)\n",
    "            \n",
    "            with tf.name_scope(\"softmax\"):\n",
    "                self.action_distribution = tf.nn.softmax(self.logits)\n",
    "                \n",
    "\n",
    "            with tf.name_scope(\"loss\"):\n",
    "                # tf.nn.softmax_cross_entropy_with_logits computes the cross entropy of the result after applying the softmax function\n",
    "                # If you have single-class labels, where an object can only belong to one class, you might now consider using \n",
    "                # tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. \n",
    "                self.neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(logits = self.logits, labels = self.actions)\n",
    "                self.loss = tf.reduce_mean(self.neg_log_prob * self.discounted_episode_rewards_) \n",
    "        \n",
    "    \n",
    "            with tf.name_scope(\"train\"):\n",
    "                self.train_opt = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reset the graph\n",
    "tf.reset_default_graph()\n",
    "\n",
    "# Instantiate the PGNetwork\n",
    "PGNetwork = PGNetwork(state_size, action_size, learning_rate)\n",
    "\n",
    "# Initialize Session\n",
    "sess = tf.Session()\n",
    "init = tf.global_variables_initializer()\n",
    "sess.run(init)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Set up Tensorboard 📊\n",
    "For more information about tensorboard, please watch this <a href=\"https://www.youtube.com/embed/eBbEDRsCmv4\">excellent 30min tutorial</a> <br><br>\n",
    "To launch tensorboard : `tensorboard --logdir=/tensorboard/pg/1`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup TensorBoard Writer\n",
    "writer = tf.summary.FileWriter(\"/tensorboard/pg/test\")\n",
    "\n",
    "## Losses\n",
    "tf.summary.scalar(\"Loss\", PGNetwork.loss)\n",
    "\n",
    "## Reward mean\n",
    "tf.summary.scalar(\"Reward_mean\", PGNetwork.mean_reward_ )\n",
    "\n",
    "write_op = tf.summary.merge_all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Train our Agent 🏃‍♂️"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we'll create batches.<br>\n",
    "These batches contains episodes **(their number depends on how many rewards we collect**: for instance if we have episodes with only 10 rewards we can put batch_size/10 episodes\n",
    "<br>\n",
    "* Make a batch\n",
    "    * For each step:\n",
    "        * Choose action a\n",
    "        * Perform action a\n",
    "        * Store s, a, r\n",
    "        * **If** done:\n",
    "            * Calculate sum reward\n",
    "            * Calculate gamma Gt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_batch(batch_size, stacked_frames):\n",
    "    # Initialize lists: states, actions, rewards_of_episode, rewards_of_batch, discounted_rewards\n",
    "    states, actions, rewards_of_episode, rewards_of_batch, discounted_rewards = [], [], [], [], []\n",
    "    \n",
    "    # Reward of batch is also a trick to keep track of how many timestep we made.\n",
    "    # We use to to verify at the end of each episode if > batch_size or not.\n",
    "    \n",
    "    # Keep track of how many episodes in our batch (useful when we'll need to calculate the average reward per episode)\n",
    "    episode_num  = 1\n",
    "    \n",
    "    # Launch a new episode\n",
    "    game.new_episode()\n",
    "        \n",
    "    # Get a new state\n",
    "    state = game.get_state().screen_buffer\n",
    "    state, stacked_frames = stack_frames(stacked_frames, state, True)\n",
    "\n",
    "    while True:\n",
    "        # Run State Through Policy & Calculate Action\n",
    "        action_probability_distribution = sess.run(PGNetwork.action_distribution, \n",
    "                                                   feed_dict={PGNetwork.inputs_: state.reshape(1, *state_size)})\n",
    "        \n",
    "        # REMEMBER THAT WE ARE IN A STOCHASTIC POLICY SO WE DON'T ALWAYS TAKE THE ACTION WITH THE HIGHEST PROBABILITY\n",
    "        # (For instance if the action with the best probability for state S is a1 with 70% chances, there is\n",
    "        #30% chance that we take action a2)\n",
    "        action = np.random.choice(range(action_probability_distribution.shape[1]), \n",
    "                                  p=action_probability_distribution.ravel())  # select action w.r.t the actions prob\n",
    "        action = possible_actions[action]\n",
    "\n",
    "        # Perform action\n",
    "        reward = game.make_action(action)\n",
    "        done = game.is_episode_finished()\n",
    "\n",
    "        # Store results\n",
    "        states.append(state)\n",
    "        actions.append(action)\n",
    "        rewards_of_episode.append(reward)\n",
    "        \n",
    "        if done:\n",
    "            # The episode ends so no next state\n",
    "            next_state = np.zeros((84, 84), dtype=np.int)\n",
    "            next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n",
    "            \n",
    "            # Append the rewards_of_batch to reward_of_episode\n",
    "            rewards_of_batch.append(rewards_of_episode)\n",
    "            \n",
    "            # Calculate gamma Gt\n",
    "            discounted_rewards.append(discount_and_normalize_rewards(rewards_of_episode))\n",
    "           \n",
    "            # If the number of rewards_of_batch > batch_size stop the minibatch creation\n",
    "            # (Because we have sufficient number of episode mb)\n",
    "            # Remember that we put this condition here, because we want entire episode (Monte Carlo)\n",
    "            # so we can't check that condition for each step but only if an episode is finished\n",
    "            if len(np.concatenate(rewards_of_batch)) > batch_size:\n",
    "                break\n",
    "                \n",
    "            # Reset the transition stores\n",
    "            rewards_of_episode = []\n",
    "            \n",
    "            # Add episode\n",
    "            episode_num += 1\n",
    "            \n",
    "            # Start a new episode\n",
    "            game.new_episode()\n",
    "\n",
    "            # First we need a state\n",
    "            state = game.get_state().screen_buffer\n",
    "\n",
    "            # Stack the frames\n",
    "            state, stacked_frames = stack_frames(stacked_frames, state, True)\n",
    "         \n",
    "        else:\n",
    "            # If not done, the next_state become the current state\n",
    "            next_state = game.get_state().screen_buffer\n",
    "            next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n",
    "            state = next_state\n",
    "                         \n",
    "    return np.stack(np.array(states)), np.stack(np.array(actions)), np.concatenate(rewards_of_batch), np.concatenate(discounted_rewards), episode_num"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Create the Neural Network\n",
    "* Initialize the weights\n",
    "* Init the environment\n",
    "* maxReward = 0 # Keep track of maximum reward\n",
    "* **For** epochs in range(num_epochs):\n",
    "    * Get batches\n",
    "    * Optimize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==========================================\n",
      "Epoch:  1 / 500\n",
      "-----------\n",
      "Number of training episodes: 2\n",
      "Total reward: 1016.0\n",
      "Mean Reward of that batch 508.0\n",
      "Average Reward of all training: 508.0\n",
      "Max reward for a batch so far: 1016.0\n",
      "Training Loss: -0.013492405414581299\n",
      "==========================================\n",
      "Epoch:  2 / 500\n",
      "-----------\n",
      "Number of training episodes: 2\n",
      "Total reward: 888.0\n",
      "Mean Reward of that batch 444.0\n",
      "Average Reward of all training: 476.0\n",
      "Max reward for a batch so far: 1016.0\n",
      "Training Loss: -0.004388260655105114\n",
      "==========================================\n",
      "Epoch:  3 / 500\n",
      "-----------\n",
      "Number of training episodes: 2\n",
      "Total reward: 1848.0\n",
      "Mean Reward of that batch 924.0\n",
      "Average Reward of all training: 625.3333333333334\n",
      "Max reward for a batch so far: 1848.0\n",
      "Training Loss: -0.008172298781573772\n",
      "==========================================\n",
      "Epoch:  4 / 500\n",
      "-----------\n",
      "Number of training episodes: 3\n",
      "Total reward: 980.0\n",
      "Mean Reward of that batch 326.6666666666667\n",
      "Average Reward of all training: 550.6666666666666\n",
      "Max reward for a batch so far: 1848.0\n",
      "Training Loss: -0.009380221366882324\n",
      "==========================================\n",
      "Epoch:  5 / 500\n",
      "-----------\n",
      "Number of training episodes: 2\n",
      "Total reward: 1048.0\n",
      "Mean Reward of that batch 524.0\n",
      "Average Reward of all training: 545.3333333333333\n",
      "Max reward for a batch so far: 1848.0\n",
      "Training Loss: -0.009533533826470375\n",
      "==========================================\n",
      "Epoch:  6 / 500\n",
      "-----------\n",
      "Number of training episodes: 2\n",
      "Total reward: 1016.0\n",
      "Mean Reward of that batch 508.0\n",
      "Average Reward of all training: 539.1111111111111\n",
      "Max reward for a batch so far: 1848.0\n"
     ]
    }
   ],
   "source": [
    "# Keep track of all rewards total for each batch\n",
    "allRewards = []\n",
    "\n",
    "total_rewards = 0\n",
    "maximumRewardRecorded = 0\n",
    "mean_reward_total = []\n",
    "epoch = 1\n",
    "average_reward = []\n",
    "\n",
    "# Saver\n",
    "saver = tf.train.Saver()\n",
    "\n",
    "if training:\n",
    "    # Load the model\n",
    "    #saver.restore(sess, \"./models/model.ckpt\")\n",
    "\n",
    "    while epoch < num_epochs + 1:\n",
    "        # Gather training data\n",
    "        states_mb, actions_mb, rewards_of_batch, discounted_rewards_mb, nb_episodes_mb = make_batch(batch_size, stacked_frames)\n",
    "\n",
    "        ### These part is used for analytics\n",
    "        # Calculate the total reward ot the batch\n",
    "        total_reward_of_that_batch = np.sum(rewards_of_batch)\n",
    "        allRewards.append(total_reward_of_that_batch)\n",
    "\n",
    "        # Calculate the mean reward of the batch\n",
    "        # Total rewards of batch / nb episodes in that batch\n",
    "        mean_reward_of_that_batch = np.divide(total_reward_of_that_batch, nb_episodes_mb)\n",
    "        mean_reward_total.append(mean_reward_of_that_batch)\n",
    "\n",
    "        # Calculate the average reward of all training\n",
    "        # mean_reward_of_that_batch / epoch\n",
    "        average_reward_of_all_training = np.divide(np.sum(mean_reward_total), epoch)\n",
    "\n",
    "        # Calculate maximum reward recorded \n",
    "        maximumRewardRecorded = np.amax(allRewards)\n",
    "\n",
    "        print(\"==========================================\")\n",
    "        print(\"Epoch: \", epoch, \"/\", num_epochs)\n",
    "        print(\"-----------\")\n",
    "        print(\"Number of training episodes: {}\".format(nb_episodes_mb))\n",
    "        print(\"Total reward: {}\".format(total_reward_of_that_batch, nb_episodes_mb))\n",
    "        print(\"Mean Reward of that batch {}\".format(mean_reward_of_that_batch))\n",
    "        print(\"Average Reward of all training: {}\".format(average_reward_of_all_training))\n",
    "        print(\"Max reward for a batch so far: {}\".format(maximumRewardRecorded))\n",
    "\n",
    "        # Feedforward, gradient and backpropagation\n",
    "        loss_, _ = sess.run([PGNetwork.loss, PGNetwork.train_opt], feed_dict={PGNetwork.inputs_: states_mb.reshape((len(states_mb), 84,84,4)),\n",
    "                                                            PGNetwork.actions: actions_mb,\n",
    "                                                                     PGNetwork.discounted_episode_rewards_: discounted_rewards_mb \n",
    "                                                                    })\n",
    "\n",
    "        print(\"Training Loss: {}\".format(loss_))\n",
    "\n",
    "        # Write TF Summaries\n",
    "        summary = sess.run(write_op, feed_dict={PGNetwork.inputs_: states_mb.reshape((len(states_mb), 84,84,4)),\n",
    "                                                            PGNetwork.actions: actions_mb,\n",
    "                                                                     PGNetwork.discounted_episode_rewards_: discounted_rewards_mb,\n",
    "                                                                    PGNetwork.mean_reward_: mean_reward_of_that_batch\n",
    "                                                                    })\n",
    "\n",
    "        #summary = sess.run(write_op, feed_dict={x: s_.reshape(len(s_),84,84,1), y:a_, d_r: d_r_, r: r_, n: n_})\n",
    "        writer.add_summary(summary, epoch)\n",
    "        writer.flush()\n",
    "\n",
    "        # Save Model\n",
    "        if epoch % 10 == 0:\n",
    "            saver.save(sess, \"./models/model.ckpt\")\n",
    "            print(\"Model saved\")\n",
    "        epoch += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 8: Watch our Agent play 👀\n",
    "Now that we trained our agent, we can test it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Saver\n",
    "saver = tf.train.Saver()\n",
    "\n",
    "with tf.Session() as sess:\n",
    "    game = DoomGame()\n",
    "\n",
    "    # Load the correct configuration \n",
    "    game.load_config(\"health_gathering.cfg\")\n",
    "    \n",
    "    # Load the correct scenario (in our case basic scenario)\n",
    "    game.set_doom_scenario_path(\"health_gathering.wad\")\n",
    "    \n",
    "    # Load the model\n",
    "    saver.restore(sess, \"./models/model.ckpt\")\n",
    "    game.init()\n",
    "    \n",
    "    for i in range(10):\n",
    "        \n",
    "        # Launch a new episode\n",
    "        game.new_episode()\n",
    "\n",
    "        # Get a new state\n",
    "        state = game.get_state().screen_buffer\n",
    "        state, stacked_frames = stack_frames(stacked_frames, state, True)\n",
    "\n",
    "        while not game.is_episode_finished():\n",
    "        \n",
    "            # Run State Through Policy & Calculate Action\n",
    "            action_probability_distribution = sess.run(PGNetwork.action_distribution, \n",
    "                                                       feed_dict={PGNetwork.inputs_: state.reshape(1, *state_size)})\n",
    "\n",
    "            # REMEMBER THAT WE ARE IN A STOCHASTIC POLICY SO WE DON'T ALWAYS TAKE THE ACTION WITH THE HIGHEST PROBABILITY\n",
    "            # (For instance if the action with the best probability for state S is a1 with 70% chances, there is\n",
    "            #30% chance that we take action a2)\n",
    "            action = np.random.choice(range(action_probability_distribution.shape[1]), \n",
    "                                      p=action_probability_distribution.ravel())  # select action w.r.t the actions prob\n",
    "            action = possible_actions[action]\n",
    "\n",
    "            # Perform action\n",
    "            reward = game.make_action(action)\n",
    "            done = game.is_episode_finished()\n",
    "            \n",
    "            if done:\n",
    "                break\n",
    "            else:\n",
    "                # If not done, the next_state become the current state\n",
    "                next_state = game.get_state().screen_buffer\n",
    "                next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n",
    "                state = next_state\n",
    "        \n",
    "\n",
    "        print(\"Score for episode \", i, \" :\", game.get_total_reward())\n",
    "    game.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:gameplai]",
   "language": "python",
   "name": "conda-env-gameplai-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
No results found