TensorFlow 2 implementation of a policy gradient method for solving n-armed bandit problems.
Context here simply means that the algorithm also uses information about the state of the environment (the context) when generating actions, so that it earns higher rewards (i.e. it is not just generating random actions and optimizing a loss).
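Below is a minimal, self-contained sketch of that idea, not the gist's exact code: the class and names such as `ContextualBandit`, `weights`, and `train_step` are illustrative assumptions. The context (which bandit you are facing) selects a row of a trainable table, and a policy-gradient-style loss `-log(pi(a|s)) * reward` is differentiated with `tf.GradientTape`.

```python
import numpy as np
import tensorflow as tf

class ContextualBandit:
    """Toy environment: each row is one bandit (a context); lower arm values pay off more often."""
    def __init__(self):
        self.bandits = np.array([[ 0.2,  0.0, -0.2, -5.0],
                                 [ 0.1, -5.0,  1.0,  0.25],
                                 [-5.0,  5.0,  5.0,  5.0]])
        self.num_bandits, self.num_actions = self.bandits.shape
        self.state = 0

    def get_state(self):
        # The "context": which bandit we are facing this step.
        self.state = np.random.randint(self.num_bandits)
        return self.state

    def pull(self, action):
        # Reward +1 if a random draw beats the chosen arm's threshold, else -1.
        return 1.0 if np.random.randn() > self.bandits[self.state, action] else -1.0

env = ContextualBandit()

# One trainable table: rows index contexts (bandits), columns index actions.
weights = tf.Variable(tf.ones([env.num_bandits, env.num_actions]))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)

def train_step(state, epsilon=0.1):
    with tf.GradientTape() as tape:
        action_values = tf.nn.sigmoid(weights[state])      # policy output for this context
        if np.random.rand() < epsilon:
            action = np.random.randint(env.num_actions)    # explore
        else:
            action = int(tf.argmax(action_values))         # exploit
        reward = env.pull(action)
        loss = -tf.math.log(action_values[action]) * reward  # policy-gradient style loss
    grads = tape.gradient(loss, [weights])
    optimizer.apply_gradients(zip(grads, [weights]))
    return reward

for _ in range(1000):
    train_step(env.get_state())
```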
Second, you can add more layers, but the output of the last layer should be number_of_bandits * number_of_possible_actions; that means you can put layers before the current first layer (i.e. layer1).
Or you could make the code more generic so it can be used more flexibly; when I wrote it, my only intention was to convert the existing TF1 solution to TF2.
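As a rough illustration of that advice (assumed layer names and sizes, not the gist's actual architecture), here is one way to stack an extra Dense layer in front of the original output layer, assuming the network takes the one-hot context as input and emits one score per action:

```python
import tensorflow as tf

num_bandits, num_actions = 3, 4

# Assumed original setup: a single Dense layer whose kernel has shape
# (num_bandits, num_actions), so its weights double as a state-action table.
# Deeper variant: insert a hidden layer before it; the final layer still
# outputs one score per action for the one-hot context fed in.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_bandits,)),                        # one-hot context
    tf.keras.layers.Dense(8, activation="relu"),                 # new hidden layer
    tf.keras.layers.Dense(num_actions, activation="sigmoid"),    # original output layer
])
model.summary()  # the last kernel is now 8x4, no longer a per-bandit table
```

Note that once a hidden layer is added, the output layer's kernel no longer maps bandits to actions directly, which is exactly the issue raised in the comment below.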
Thanks for the code. May I know what the context is in this case (since it is called contextual)?
Also, I tried to add several more layers to the neural network (so it works better for a large number of bandits), but was unable to do so correctly.
In the above code, the get_weights() function for ww returns a 3x4 array. If hidden layers of other sizes are added (say, one of size 8), ww.get_weights() for the last layer gives an 8x4 array, and then np.argmax(ww[0][a]) can no longer be used to find the best action from that array.
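One possible workaround, sketched under the assumption that the network takes the one-hot bandit index as input and outputs one sigmoid score per action (hypothetical names, not the gist's code): read the best action per bandit from the model's outputs rather than from ww.get_weights(), since the kernel stops being a state-action table once hidden layers are added.

```python
import tensorflow as tf

num_bandits, num_actions = 3, 4
model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_bandits,)),
    tf.keras.layers.Dense(8, activation="relu"),                 # added hidden layer
    tf.keras.layers.Dense(num_actions, activation="sigmoid"),
])

# Instead of np.argmax(ww[0][a]) on an 8x4 kernel, evaluate the network on the
# one-hot context of every bandit and take argmax over the action scores.
states = tf.eye(num_bandits)                        # row a = one-hot context for bandit a
best_actions = tf.argmax(model(states), axis=1)     # one best-action index per bandit
print(best_actions.numpy())
```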