@awjuliani
Last active October 11, 2022 21:27
A Policy-Gradient algorithm that solves Contextual Bandit problems.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple Reinforcement Learning in Tensorflow Part 1.5: \n",
"## The Contextual Bandits\n",
"This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the contextual bandit problem. For more information, see this [Medium post](https://medium.com/p/bff01d1aad9c).\n",
"\n",
"For more Reinforcement Learning algorithms, including DQN and Model-based learning in Tensorflow, see my Github repo, [DeepRL-Agents](https://github.com/awjuliani/DeepRL-Agents). "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import tensorflow.contrib.slim as slim\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Contextual Bandits\n",
"Here we define our contextual bandits. In this example, we are using three four-armed bandit. What this means is that each bandit has four arms that can be pulled. Each bandit has different success probabilities for each arm, and as such requires different actions to obtain the best result. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit-arm that will most often give a positive reward, depending on the Bandit presented."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class contextual_bandit():\n",
" def __init__(self):\n",
" self.state = 0\n",
" #List out our bandits. Currently arms 4, 2, and 1 (respectively) are the most optimal.\n",
" self.bandits = np.array([[0.2,0,-0.0,-5],[0.1,-5,1,0.25],[-5,5,5,5]])\n",
" self.num_bandits = self.bandits.shape[0]\n",
" self.num_actions = self.bandits.shape[1]\n",
" \n",
" def getBandit(self):\n",
" self.state = np.random.randint(0,len(self.bandits)) #Returns a random state for each episode.\n",
" return self.state\n",
" \n",
" def pullArm(self,action):\n",
" #Get a random number.\n",
" bandit = self.bandits[self.state,action]\n",
" result = np.random.randn(1)\n",
" if result > bandit:\n",
" #return a positive reward.\n",
" return 1\n",
" else:\n",
" #return a negative reward.\n",
" return -1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Policy-Based Agent\n",
"The code below established our simple neural agent. It takes as input the current state, and returns an action. This allows the agent to take actions which are conditioned on the state of the environment, a critical step toward being able to solve full RL problems. The agent uses a single set of weights, within which each value is an estimate of the value of the return from choosing a particular arm given a bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the recieved reward."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class agent():\n",
" def __init__(self, lr, s_size,a_size):\n",
" #These lines established the feed-forward part of the network. The agent takes a state and produces an action.\n",
" self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)\n",
" state_in_OH = slim.one_hot_encoding(self.state_in,s_size)\n",
" output = slim.fully_connected(state_in_OH,a_size,\\\n",
" biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())\n",
" self.output = tf.reshape(output,[-1])\n",
" self.chosen_action = tf.argmax(self.output,0)\n",
"\n",
" #The next six lines establish the training proceedure. We feed the reward and chosen action into the network\n",
" #to compute the loss, and use it to update the network.\n",
" self.reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)\n",
" self.action_holder = tf.placeholder(shape=[1],dtype=tf.int32)\n",
" self.responsible_weight = tf.slice(self.output,self.action_holder,[1])\n",
" self.loss = -(tf.log(self.responsible_weight)*self.reward_holder)\n",
" optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)\n",
" self.update = optimizer.minimize(self.loss)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training the Agent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will train our agent by getting a state from the environment, take an action, and recieve a reward. Using these three things, we can know how to properly update our network in order to more often choose actions given states that will yield the highest rewards over time."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean reward for the 3 bandits: [ 0. -0.25 0. ]\n",
"Mean reward for the 3 bandits: [ 9. 42. 33.75]\n",
"Mean reward for the 3 bandits: [ 45.5 80. 67.75]\n",
"Mean reward for the 3 bandits: [ 86.25 116.75 101.25]\n",
"Mean reward for the 3 bandits: [ 122.5 153.25 139.5 ]\n",
"Mean reward for the 3 bandits: [ 161.75 186.25 179.25]\n",
"Mean reward for the 3 bandits: [ 201. 224.75 216. ]\n",
"Mean reward for the 3 bandits: [ 240.25 264. 250. ]\n",
"Mean reward for the 3 bandits: [ 280.25 301.75 285.25]\n",
"Mean reward for the 3 bandits: [ 317.75 340.25 322.25]\n",
"Mean reward for the 3 bandits: [ 356.5 377.5 359.25]\n",
"Mean reward for the 3 bandits: [ 396.25 415.25 394.75]\n",
"Mean reward for the 3 bandits: [ 434.75 451.5 430.5 ]\n",
"Mean reward for the 3 bandits: [ 476.75 490. 461.5 ]\n",
"Mean reward for the 3 bandits: [ 513.75 533.75 491.75]\n",
"Mean reward for the 3 bandits: [ 548.25 572. 527.5 ]\n",
"Mean reward for the 3 bandits: [ 587.5 610.75 562. ]\n",
"Mean reward for the 3 bandits: [ 628.75 644.25 600.25]\n",
"Mean reward for the 3 bandits: [ 665.75 684.75 634.75]\n",
"Mean reward for the 3 bandits: [ 705.75 719.75 668.25]\n",
"The agent thinks action 4 for bandit 1 is the most promising....\n",
"...and it was right!\n",
"The agent thinks action 2 for bandit 2 is the most promising....\n",
"...and it was right!\n",
"The agent thinks action 1 for bandit 3 is the most promising....\n",
"...and it was right!\n"
]
}
],
"source": [
"tf.reset_default_graph() #Clear the Tensorflow graph.\n",
"\n",
"cBandit = contextual_bandit() #Load the bandits.\n",
"myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) #Load the agent.\n",
"weights = tf.trainable_variables()[0] #The weights we will evaluate to look into the network.\n",
"\n",
"total_episodes = 10000 #Set total number of episodes to train agent on.\n",
"total_reward = np.zeros([cBandit.num_bandits,cBandit.num_actions]) #Set scoreboard for bandits to 0.\n",
"e = 0.1 #Set the chance of taking a random action.\n",
"\n",
"init = tf.initialize_all_variables()\n",
"\n",
"# Launch the tensorflow graph\n",
"with tf.Session() as sess:\n",
" sess.run(init)\n",
" i = 0\n",
" while i < total_episodes:\n",
" s = cBandit.getBandit() #Get a state from the environment.\n",
" \n",
" #Choose either a random action or one from our network.\n",
" if np.random.rand(1) < e:\n",
" action = np.random.randint(cBandit.num_actions)\n",
" else:\n",
" action = sess.run(myAgent.chosen_action,feed_dict={myAgent.state_in:[s]})\n",
" \n",
" reward = cBandit.pullArm(action) #Get our reward for taking an action given a bandit.\n",
" \n",
" #Update the network.\n",
" feed_dict={myAgent.reward_holder:[reward],myAgent.action_holder:[action],myAgent.state_in:[s]}\n",
" _,ww = sess.run([myAgent.update,weights], feed_dict=feed_dict)\n",
" \n",
" #Update our running tally of scores.\n",
" total_reward[s,action] += reward\n",
" if i % 500 == 0:\n",
" print \"Mean reward for each of the \" + str(cBandit.num_bandits) + \" bandits: \" + str(np.mean(total_reward,axis=1))\n",
" i+=1\n",
"for a in range(cBandit.num_bandits):\n",
" print \"The agent thinks action \" + str(np.argmax(ww[a])+1) + \" for bandit \" + str(a+1) + \" is the most promising....\"\n",
" if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):\n",
" print \"...and it was right!\"\n",
" else:\n",
" print \"...and it was wrong!\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@muik

muik commented Oct 12, 2016

The line below causes an error. The Tensorflow version on my machine is 0.11.

output = slim.fully_connected(state_in_OH,a_size,\
            biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones)
Traceback (most recent call last):
  File "c_bendits.py", line 50, in <module>
    myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) #Load the agent.
  File "c_bendits.py", line 34, in __init__
    biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones)
...
  File "/Library/Python/2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 666, in <lambda>
    shape.as_list(), dtype=dtype, partition_info=partition_info)
TypeError: ones() got an unexpected keyword argument 'partition_info'

So the line should be changed to the code below.

output = slim.fully_connected(state_in_OH,a_size,\
            biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer)

@alphashuro

These tutorials of yours are quite awesome and I am really loving them. In addition to the work you have done so far, can I suggest that, since this material is designed to be read by beginners, it would be more accommodating to use full variable names that describe the data they contain? For example, the first thing I struggled with was figuring out what lr meant in the parameters to the agent class's init, and the conciseness of the shortened name made it even more difficult to find where it is used in the code. I guessed that s_size and a_size were state size and action size, but I think they were an unnecessary barrier to understanding the actual content, as was state_in_OH (i.e. the OH).

What do you think about this suggestion? I am hoping it can help others learn and understand the content better.

@alphashuro

Also, I didn't see anything on one-hot encoding in the post; is it perhaps covered in one of your other posts?

@easwar1977

This is a great demo. Can you also suggest how I can store the model as an .h5 file (like in Keras) and re-use it?
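A graph built with TF 1.x sessions is normally saved as a checkpoint rather than a Keras .h5 file; a minimal sketch, assuming the sess and myAgent objects from the notebook (the ./contextual_bandit.ckpt path is just an illustrative name):

saver = tf.train.Saver()                                   # create once, after the graph is built
save_path = saver.save(sess, "./contextual_bandit.ckpt")   # write the trained weights to disk

# later, in a fresh session after rebuilding the same graph:
saver.restore(sess, "./contextual_bandit.ckpt")            # reload the weights and keep using myAgent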

@Riotpiaole

@alphashuro agreed. I struggled with the same problem: what does OH mean?

@dargor

dargor commented Oct 18, 2017

Hi, I'm learning RL with your articles, great work 👍

Here is a quick diff to use raw TF (as of 1.3) instead of slim:

-        state_in_OH = slim.one_hot_encoding(self.state_in, s_size)
-        output = slim.fully_connected(state_in_OH,
-                                      a_size,
-                                      biases_initializer=None,
-                                      activation_fn=tf.nn.sigmoid,
-                                      weights_initializer=ones)
+        state_in_OH = tf.one_hot(self.state_in, s_size)
+        output = tf.layers.dense(state_in_OH, a_size, tf.nn.sigmoid,
+                                 use_bias=False, kernel_initializer=ones)

@Riotpiaole OH = one hot [encoding]

@lipixun

lipixun commented Oct 30, 2017

According to my experiments (tensorflow 1.3), I suggest using AdamOptimizer instead of GradientDescentOptimizer, since GradientDescentOptimizer suffers from training stability issues.
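A minimal sketch of that swap inside the notebook's agent.__init__ (TF 1.x API); only the optimizer line changes, and keeping the same learning rate lr is an assumption that may need retuning:

# replace the GradientDescentOptimizer line with Adam
optimizer = tf.train.AdamOptimizer(learning_rate=lr)
self.update = optimizer.minimize(self.loss)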

@lipixun

lipixun commented Oct 30, 2017

@Riotpiaole I've re-implemented the tutorial code here; you may take a look at it.

@pooriaPoorsarvi

Can anyone explain to me why we don't use softmax instead of sigmoid? And also why we don't use a bias? (I tried both and it wouldn't work.)

@pooriaPoorsarvi

@lipixun do you know the answer to my question? It would really help me, thanks.

@JaeDukSeo

@pooriaPoorsarvi As seen above, we already have the responsible_weight variable; now we take its negative log likelihood so that minimizing it (TF optimizers can only minimize) maximizes the likelihood of the chosen action. There is no need to consider every other class.
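Restating the loss the notebook already uses, with $\pi_\theta(a \mid s)$ denoting the network output for the chosen action (responsible_weight) and $R$ the received reward of +1 or -1:

$$L(\theta) = -\log \pi_\theta(a \mid s) \cdot R$$

Minimizing $L$ increases the chosen action's output when $R = +1$ and decreases it when $R = -1$, which is why only the responsible weight appears in the loss.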

@araknadash

Instead of using slim, you can use plain tf as:
state_in_OH = tf.one_hot(self.state_in, s_size)
output = tf.layers.dense(state_in_OH, a_size, tf.nn.sigmoid, use_bias=False, kernel_initializer = tf.ones_initializer())

@xkrishnam

xkrishnam commented Jul 29, 2020

Thanks Arthur! This is a helpful tutorial for beginners like me. Here is a tensorflow 2 implementation that may be helpful for someone.

@daniel-xion

> Thanks Arthur! This is a helpful tutorial for beginners like me. Here is a tensorflow 2 implementation that may be helpful for someone.

Thanks for the implementation. I wonder how the implementation is a policy network? I don't see where a policy gradient is used.
