One of the reasons this example might be confusing is that TF can only minimize when performing auto differentiation. That is why the sign is flipped, with -5 being the best bandit.
@bahriddin that is due to the first selection choice: remember we initialize all of the weights to one, so the argmax is 0. And since e is a small number (0.1), we are not going to explore very much, so the agent will most likely always choose the first bandit and be wrong. If you increase the value of e it will do better.
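A minimal NumPy sketch of that epsilon-greedy selection step (the variable names here are just illustrative, not the gist's): with all weights equal to one, argmax always returns index 0, and with e = 0.1 the agent explores only about 10% of the time, so it keeps pulling bandit 0.

import numpy as np

num_bandits = 4
weights = np.ones(num_bandits)   # all weights start at 1.0, like tf.ones([num_bandits])
e = 0.1                          # chance of picking a random action

def choose_action(weights, e):
    # Epsilon-greedy: explore with probability e, otherwise take the current argmax.
    if np.random.rand() < e:
        return np.random.randint(len(weights))
    return int(np.argmax(weights))  # with all-equal weights the tie goes to index 0

# Roughly 90% of the early picks land on bandit 0 -- the wrong arm in this example.
picks = [choose_action(weights, e) for _ in range(1000)]
print(np.bincount(picks, minlength=len(weights)))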
#List out our bandits. Currently bandit 4 (index#3) is set to most often provide a positive reward.
bandits = [0.2,0,-0.2,-5]
The pullBandit method is defined as:

def pullBandit(bandit):
    #Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        #return a positive reward.
        return 1
    else:
        #return a negative reward.
        return -1
If you look carefully, result is a random draw from a standard normal distribution, so it can be positive or negative. Since bandits[3] = -5 is a far more generous threshold than bandits[1] = 0, bandits[3] gives the best chance of a positive reward.
Try this code and you will see:
for i in range(100): print(np.random.randn(1))
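To make that concrete, here is a small sketch (plain NumPy; the loop and names are mine, only the pullBandit rule comes from the gist) that estimates each bandit's empirical chance of a positive reward. Bandit index 3 wins essentially every pull because a standard-normal draw is almost always greater than -5.

import numpy as np

bandits = [0.2, 0, -0.2, -5]

def pullBandit(bandit):
    # Same rule as the gist: reward 1 if a standard-normal draw beats the threshold.
    return 1 if np.random.randn() > bandit else -1

pulls = 10000
for i, b in enumerate(bandits):
    wins = sum(pullBandit(b) == 1 for _ in range(pulls))
    print("bandit %d (threshold %5.1f): P(reward = 1) ~ %.2f" % (i, b, wins / float(pulls)))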
Hi,
I have trouble understanding how the optimizer can tune the weights variable.
To my understanding the Optimizer will try to minimize the loss function (target loss = 0.0), but in the example above the weights start out at 1.0, so the initial loss value is already 0.0:
loss = -(log(weight) * reward) = -(log(1.0) * reward) = -(0.0 * reward) = 0.0
weights = tf.Variable(tf.ones([num_bandits]))
...
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
What am I missing or getting wrong?
thx,
Manuel
> To my understanding the Optimizer will try to minimize the loss function (target loss = 0.0), but in the example above the weights start out at 1.0, so the initial loss value is already 0.0: loss = -(log(weight) * reward) = -(0.0 * reward) = 0.0
> What am I missing or getting wrong?
f(x) = 0 doesn't mean f'(x) = 0; as long as the gradient of the loss is not 0, the weights will eventually change.
In the example, gradient(loss) = gradient(-log(weight) * reward) = -reward * 1/weight (since d/dx[ln x] = 1/x and reward is a constant),
so gradient(loss) at weight = 1 is -reward * 1/1 = -reward.
If reward == 1 (positive feedback), the gradient is -1, so gradient descent subtracts learning_rate * gradient, which is equivalent to adding learning_rate = 0.001. The new weight becomes 1.001, giving that bandit a slightly higher chance of being selected by argmax(weights). And so on.
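A tiny numeric check of that step (plain Python, the helper name is mine): one manual gradient-descent update on the responsible weight with learning_rate = 0.001 moves it from 1.0 to 1.001 for a positive reward, and down to 0.999 for a negative one.

learning_rate = 0.001

def descent_step(weight, reward):
    # loss = -(log(weight) * reward)  =>  d(loss)/d(weight) = -reward / weight
    grad = -float(reward) / weight
    return weight - learning_rate * grad

print(descent_step(1.0, 1))    # 1.001 -> positive reward nudges the weight up
print(descent_step(1.0, -1))   # 0.999 -> negative reward nudges it down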
I tried it with these details:
But it still can't find the global optimum. Are there any suggestions to improve the algorithm?
Regards!