@awjuliani
Last active July 14, 2019 16:24
Implementation of Double Dueling Deep-Q Network
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple Reinforcement Learning with Tensorflow Part 4: Deep Q-Networks and Beyond\n",
"\n",
"In this iPython notebook I implement a Deep Q-Network using both Double DQN and Dueling DQN. The agent learn to solve a navigation task in a basic grid world. To learn more, read here: https://medium.com/p/8438a3e2b8df\n",
"\n",
"For more reinforcment learning tutorials, as well as required gridworld.py file, see:\n",
"https://github.com/awjuliani/DeepRL-Agents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import gym\n",
"import numpy as np\n",
"import random\n",
"import tensorflow as tf\n",
"import matplotlib.pyplot as plt\n",
"import scipy.misc\n",
"import os\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the game environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feel free to adjust the size of the gridworld. Making it smaller provides an easier task for our DQN agent, while making the world larger increases the challenge."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"from gridworld import gameEnv\n",
"\n",
"env = gameEnv(partial=False,size=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above is an example of a starting environment in our simple game. The agent controls the blue square, and can move up, down, left, or right. The goal is to move to the green square (for +1 reward) and avoid the red square (for -1 reward). The position of the three blocks is randomized every episode."
]
},
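{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the training procedure), the short cell below takes a few random actions to illustrate the interface assumed throughout this notebook: `env.reset()` returns an 84x84x3 frame and `env.step(a)` returns a `(state, reward, done)` tuple for one of four discrete actions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Optional, illustrative only: poke the environment with a few random actions.\n",
"s = env.reset()\n",
"print s.shape #Expected to be (84, 84, 3), matching the reshape to 21168 used below.\n",
"for _ in range(5):\n",
"    a = np.random.randint(0,4) #The four actions: up, down, left, right.\n",
"    s1,r,d = env.step(a)\n",
"    print a, r, d"
]
},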
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementing the network itself"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class Qnetwork():\n",
" def __init__(self,h_size):\n",
" #The network recieves a frame from the game, flattened into an array.\n",
" #It then resizes it and processes it through four convolutional layers.\n",
" self.scalarInput = tf.placeholder(shape=[None,21168],dtype=tf.float32)\n",
" self.imageIn = tf.reshape(self.scalarInput,shape=[-1,84,84,3])\n",
" self.conv1 = tf.contrib.layers.convolution2d( \\\n",
" inputs=self.imageIn,num_outputs=32,kernel_size=[8,8],stride=[4,4],padding='VALID', biases_initializer=None)\n",
" self.conv2 = tf.contrib.layers.convolution2d( \\\n",
" inputs=self.conv1,num_outputs=64,kernel_size=[4,4],stride=[2,2],padding='VALID', biases_initializer=None)\n",
" self.conv3 = tf.contrib.layers.convolution2d( \\\n",
" inputs=self.conv2,num_outputs=64,kernel_size=[3,3],stride=[1,1],padding='VALID', biases_initializer=None)\n",
" self.conv4 = tf.contrib.layers.convolution2d( \\\n",
" inputs=self.conv3,num_outputs=512,kernel_size=[7,7],stride=[1,1],padding='VALID', biases_initializer=None)\n",
" \n",
" #We take the output from the final convolutional layer and split it into separate advantage and value streams.\n",
" self.streamAC,self.streamVC = tf.split(3,2,self.conv4)\n",
" self.streamA = tf.contrib.layers.flatten(self.streamAC)\n",
" self.streamV = tf.contrib.layers.flatten(self.streamVC)\n",
" self.AW = tf.Variable(tf.random_normal([h_size/2,env.actions]))\n",
" self.VW = tf.Variable(tf.random_normal([h_size/2,1]))\n",
" self.Advantage = tf.matmul(self.streamA,self.AW)\n",
" self.Value = tf.matmul(self.streamV,self.VW)\n",
" \n",
" #Then combine them together to get our final Q-values.\n",
" self.Qout = self.Value + tf.sub(self.Advantage,tf.reduce_mean(self.Advantage,reduction_indices=1,keep_dims=True))\n",
" self.predict = tf.argmax(self.Qout,1)\n",
" \n",
" #Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.\n",
" self.targetQ = tf.placeholder(shape=[None],dtype=tf.float32)\n",
" self.actions = tf.placeholder(shape=[None],dtype=tf.int32)\n",
" self.actions_onehot = tf.one_hot(self.actions,env.actions,dtype=tf.float32)\n",
" \n",
" self.Q = tf.reduce_sum(tf.mul(self.Qout, self.actions_onehot), reduction_indices=1)\n",
" \n",
" self.td_error = tf.square(self.targetQ - self.Q)\n",
" self.loss = tf.reduce_mean(self.td_error)\n",
" self.trainer = tf.train.AdamOptimizer(learning_rate=0.0001)\n",
" self.updateModel = self.trainer.minimize(self.loss)"
]
},
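{
"cell_type": "markdown",
"metadata": {},
"source": [
"In equation form, the `self.Qout` line above recombines the two streams by subtracting the mean advantage from each action's advantage before adding the state value:\n",
"\n",
"$$Q(s,a) = V(s) + \\left(A(s,a) - \\frac{1}{|\\mathcal{A}|}\\sum_{a'}A(s,a')\\right)$$"
]
},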
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Experience Replay"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This class allows us to store experies and sample then randomly to train the network."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class experience_buffer():\n",
" def __init__(self, buffer_size = 50000):\n",
" self.buffer = []\n",
" self.buffer_size = buffer_size\n",
" \n",
" def add(self,experience):\n",
" if len(self.buffer) + len(experience) >= self.buffer_size:\n",
" self.buffer[0:(len(experience)+len(self.buffer))-self.buffer_size] = []\n",
" self.buffer.extend(experience)\n",
" \n",
" def sample(self,size):\n",
" return np.reshape(np.array(random.sample(self.buffer,size)),[size,5])"
]
},
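{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, here is a small illustrative example of the buffer in use (not needed for training; the loop below fills it with real experiences). Each experience is stored as a row of `[s, a, r, s1, d]`, which is why the sampled batch is later indexed by column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Illustrative only: fill a small buffer with dummy experiences and sample a batch.\n",
"demo_buffer = experience_buffer(buffer_size=100)\n",
"for k in range(8):\n",
"    dummy = np.reshape(np.array([np.zeros(21168),k % 4,0.0,np.zeros(21168),False]),[1,5])\n",
"    demo_buffer.add(dummy)\n",
"print demo_buffer.sample(4).shape #Should be (4, 5)"
]
},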
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a simple function to resize our game frames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def processState(states):\n",
" return np.reshape(states,[21168])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These functions allow us to update the parameters of our target network with those of the primary network."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def updateTargetGraph(tfVars,tau):\n",
" total_vars = len(tfVars)\n",
" op_holder = []\n",
" for idx,var in enumerate(tfVars[0:total_vars/2]):\n",
" op_holder.append(tfVars[idx+total_vars/2].assign((var.value()*tau) + ((1-tau)*tfVars[idx+total_vars/2].value())))\n",
" return op_holder\n",
"\n",
"def updateTarget(op_holder,sess):\n",
" for op in op_holder:\n",
" sess.run(op)"
]
},
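{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each op produced by `updateTargetGraph` performs a soft update of one target-network variable toward its primary-network counterpart, using the `tau` defined below:\n",
"\n",
"$$\\theta^{-} \\leftarrow \\tau\\,\\theta + (1-\\tau)\\,\\theta^{-}$$"
]
},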
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training the network"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting all the training parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"batch_size = 32 #How many experiences to use for each training step.\n",
"update_freq = 4 #How often to perform a training step.\n",
"y = .99 #Discount factor on the target Q-values\n",
"startE = 1 #Starting chance of random action\n",
"endE = 0.1 #Final chance of random action\n",
"anneling_steps = 10000. #How many steps of training to reduce startE to endE.\n",
"num_episodes = 10000 #How many episodes of game environment to train network with.\n",
"pre_train_steps = 10000 #How many steps of random actions before training begins.\n",
"max_epLength = 50 #The max allowed length of our episode.\n",
"load_model = False #Whether to load a saved model.\n",
"path = \"./dqn\" #The path to save our model to.\n",
"h_size = 512 #The size of the final convolutional layer before splitting it into Advantage and Value streams.\n",
"tau = 0.001 #Rate to update target network toward primary network"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"tf.reset_default_graph()\n",
"mainQN = Qnetwork(h_size)\n",
"targetQN = Qnetwork(h_size)\n",
"\n",
"init = tf.initialize_all_variables()\n",
"\n",
"saver = tf.train.Saver()\n",
"\n",
"trainables = tf.trainable_variables()\n",
"\n",
"targetOps = updateTargetGraph(trainables,tau)\n",
"\n",
"myBuffer = experience_buffer()\n",
"\n",
"#Set the rate of random action decrease. \n",
"e = startE\n",
"stepDrop = (startE - endE)/anneling_steps\n",
"\n",
"#create lists to contain total rewards and steps per episode\n",
"jList = []\n",
"rList = []\n",
"total_steps = 0\n",
"\n",
"#Make a path for our model to be saved in.\n",
"if not os.path.exists(path):\n",
" os.makedirs(path)\n",
"\n",
"with tf.Session() as sess:\n",
" if load_model == True:\n",
" print 'Loading Model...'\n",
" ckpt = tf.train.get_checkpoint_state(path)\n",
" saver.restore(sess,ckpt.model_checkpoint_path)\n",
" sess.run(init)\n",
" updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.\n",
" for i in range(num_episodes):\n",
" episodeBuffer = experience_buffer()\n",
" #Reset environment and get first new observation\n",
" s = env.reset()\n",
" s = processState(s)\n",
" d = False\n",
" rAll = 0\n",
" j = 0\n",
" #The Q-Network\n",
" while j < max_epLength: #If the agent takes longer than 200 moves to reach either of the blocks, end the trial.\n",
" j+=1\n",
" #Choose an action by greedily (with e chance of random action) from the Q-network\n",
" if np.random.rand(1) < e or total_steps < pre_train_steps:\n",
" a = np.random.randint(0,4)\n",
" else:\n",
" a = sess.run(mainQN.predict,feed_dict={mainQN.scalarInput:[s]})[0]\n",
" s1,r,d = env.step(a)\n",
" s1 = processState(s1)\n",
" total_steps += 1\n",
" episodeBuffer.add(np.reshape(np.array([s,a,r,s1,d]),[1,5])) #Save the experience to our episode buffer.\n",
" \n",
" if total_steps > pre_train_steps:\n",
" if e > endE:\n",
" e -= stepDrop\n",
" \n",
" if total_steps % (update_freq) == 0:\n",
" trainBatch = myBuffer.sample(batch_size) #Get a random batch of experiences.\n",
" #Below we perform the Double-DQN update to the target Q-values\n",
" Q1 = sess.run(mainQN.predict,feed_dict={mainQN.scalarInput:np.vstack(trainBatch[:,3])})\n",
" Q2 = sess.run(targetQN.Qout,feed_dict={targetQN.scalarInput:np.vstack(trainBatch[:,3])})\n",
" end_multiplier = -(trainBatch[:,4] - 1)\n",
" doubleQ = Q2[range(batch_size),Q1]\n",
" targetQ = trainBatch[:,2] + (y*doubleQ * end_multiplier)\n",
" #Update the network with our target values.\n",
" _ = sess.run(mainQN.updateModel, \\\n",
" feed_dict={mainQN.scalarInput:np.vstack(trainBatch[:,0]),mainQN.targetQ:targetQ, mainQN.actions:trainBatch[:,1]})\n",
" \n",
" updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.\n",
" rAll += r\n",
" s = s1\n",
" \n",
" if d == True:\n",
"\n",
" break\n",
" \n",
" myBuffer.add(episodeBuffer.buffer)\n",
" jList.append(j)\n",
" rList.append(rAll)\n",
" #Periodically save the model. \n",
" if i % 1000 == 0:\n",
" saver.save(sess,path+'/model-'+str(i)+'.cptk')\n",
" print \"Saved Model\"\n",
" if len(rList) % 10 == 0:\n",
" print total_steps,np.mean(rList[-10:]), e\n",
" saver.save(sess,path+'/model-'+str(i)+'.cptk')\n",
"print \"Percent of succesful episodes: \" + str(sum(rList)/num_episodes) + \"%\""
]
},
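{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the target computed in the update above follows the Double DQN rule: the primary network chooses the next action and the target network evaluates it, with the bootstrap term masked out at episode ends by `end_multiplier`:\n",
"\n",
"$$\\mathrm{target} = r + \\gamma\\,Q_{\\theta^{-}}\\big(s',\\,\\arg\\max_{a'} Q_{\\theta}(s',a')\\big)\\,(1-d)$$"
]
},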
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking network learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mean reward over time"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"rMat = np.resize(np.array(rList),[len(rList)/100,100])\n",
"rMean = np.average(rMat,1)\n",
"plt.plot(rMean)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@mphielipp

I'm getting this error message:
----> 2 mainQN = Qnetwork(h_size)
---> 16 self.AW = tf.Variable(tf.random_normal([h_size/2,env.actions]))
---> 77 seed2=seed2)
--> 189 name=name)
--> 582 _Attr(op_def, input_arg.type_attr))
lib\site-packages\tensorflow\python\framework\op_def_library.py in _SatisfiesTypeConstraint(dtype, attr_def)
58 "DataType %s for attr '%s' not in list of allowed values: %s" %
59 (dtypes.as_dtype(dtype).name, attr_def.name,
---> 60 ", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: DataType float32 for attr 'T' not in list of allowed values: int32, int64

@irwenqiang

@mphielipp
I think you should check your version of TF first.
python -c 'import tensorflow as tf; print(tf.__version__)'
The version should be 0.12.x

@Xan-Kun

Xan-Kun commented Feb 13, 2017

@mphielipp Did it work for you? I installed the latest tensorflow 0.12.1 (pip show tensorflow says 0.12.1), but I still get the same error as you.

@tropical32

tropical32 commented Feb 13, 2017

@mphielipp Replace that line with:
self.AW = tf.Variable(tf.random_normal([h_size // 2, env.actions]))
It expects an integer, not a float.

@nathanin

nathanin commented Sep 8, 2017

Hi! First, thanks so much for your detailed write-ups and commented implementations. I have been working through them while developing my own RL environment outside of gym.

I have a few questions regarding the implementation for Double-DQN here:

  • The algorithm in the Double-DQN paper (https://arxiv.org/pdf/1511.06581.pdf) mentions updating \theta at each step t. It looks like the implementation here updates \theta every update_freq steps, and updates \theta- immediately afterwards. Is there something I don't understand? I guess it ends up being a heuristic decision about when to perform these updates; I'm just wondering what your intuition is for the \theta, \theta- update cycle.

  • Second is your nice tensorflow hack to update the targetQ weights. Does it rely on the order of initialization? Might there be a more verbose but explicit way to do it, maybe storing the targetQ ops by name in a dictionary (see the sketch after this list)?

  • Last, is there a reason for not using a nonlinearity/activation in the network?
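
A minimal sketch of one such explicit alternative (hypothetical, not from the gist), assuming the two Qnetwork instances are built inside tf.variable_scope('main') and tf.variable_scope('target') so variables can be paired by name instead of by creation order:

def make_target_update_ops(main_scope='main', target_scope='target', tau=0.001):
    # Collect trainable variables from each scope, keyed by their name suffix.
    main_vars = {v.name.split('/', 1)[1]: v
                 for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=main_scope)}
    target_vars = {v.name.split('/', 1)[1]: v
                   for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=target_scope)}
    # Nudge each target variable toward its main-network counterpart by tau.
    return [t.assign(tau * main_vars[name] + (1 - tau) * t)
            for name, t in target_vars.items()]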

@samsenyang

I would like to ask a question: do we have to split the inputs in order to achieve dueling DQN?
Why can't I just feed the full input into both the value layer and the advantage layer?
