Last active September 29, 2022 06:17
Gist: xkrishnam/d9a62d52d28eb943c3965c6cf631ad30
TensorFlow 2 implementation of the policy gradient method for solving n-armed bandit problems.
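For reference, one generic form of the policy gradient (REINFORCE) update for an n-armed bandit is given below. It assumes a softmax policy over per-arm preferences $\theta$, a learning rate $\alpha$, and a reward $r$ observed after pulling arm $a$. The gist's own TF1-derived loss may parameterize the policy differently, so treat this as a sketch of the technique rather than the exact code.

$$
\pi_\theta(a) = \frac{e^{\theta_a}}{\sum_{b} e^{\theta_b}}, \qquad
\mathcal{L}(\theta) = -\,r \log \pi_\theta(a), \qquad
\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L} = \theta + \alpha\, r\, \nabla_\theta \log \pi_\theta(a),
$$

$$
\frac{\partial \log \pi_\theta(a)}{\partial \theta_b} = \mathbb{1}[a = b] - \pi_\theta(b).
$$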
Context here simply means that the algorithm also considers information about the state of the environment (the context) when choosing actions, in order to obtain higher rewards (i.e., it does not just generate random actions and optimize the loss).
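A minimal sketch of that idea, written in plain Python rather than TF2 so it runs without any framework. It assumes a hypothetical environment of three bandits with four arms each and made-up payout probabilities (`PAYOUT_PROBS`); none of these names or numbers come from the gist. The policy keeps one row of softmax preferences per context, so the arm it picks depends on which bandit (state) it observes, and it is trained with a generic REINFORCE update rather than the gist's own code.

```python
import math
import random

# Hypothetical environment (assumption, not from the gist):
# 3 bandits (contexts) with 4 arms each. PAYOUT_PROBS[b][a] is the
# probability that arm a of bandit b returns a reward of 1.
PAYOUT_PROBS = [
    [0.2, 0.1, 0.8, 0.3],
    [0.7, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.9],
]


def softmax(scores):
    """Turn a list of action preferences into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=20000, lr=0.1, seed=0):
    """Contextual policy gradient: one preference row per bandit."""
    rng = random.Random(seed)
    n_bandits = len(PAYOUT_PROBS)
    n_actions = len(PAYOUT_PROBS[0])
    theta = [[0.0] * n_actions for _ in range(n_bandits)]
    for _ in range(steps):
        state = rng.randrange(n_bandits)   # observe the context
        probs = softmax(theta[state])      # pi(a | state)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = 1.0 if rng.random() < PAYOUT_PROBS[state][action] else 0.0
        # REINFORCE: d log pi(action|state) / d theta[state][b]
        # equals 1[b == action] - pi(b|state); only this context's row moves.
        for b in range(n_actions):
            grad = (1.0 if b == action else 0.0) - probs[b]
            theta[state][b] += lr * reward * grad
    return theta


def greedy_actions(theta):
    """Best arm per context according to the learned preferences."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


if __name__ == "__main__":
    print(greedy_actions(train()))
```

After training, the greedy arm for each context should match the highest-payout arm of that bandit in `PAYOUT_PROBS`. A non-contextual policy with a single preference row could not do this, because the best arm differs from bandit to bandit.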
Second, you can add more layers, but the output of the last layer must have size number_of_bandits * number_of_possible_actions. That means you can put new layers before the current first layer (i.e. layer1). Alternatively, you can make the code more generic so it can be used more effectively; when I wrote it, my only intention was to convert an existing TF1 solution to TF2.
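A framework-free sketch of one reading of that constraint, so the shapes are easy to follow: a hypothetical hidden layer (`HIDDEN_UNITS`) is inserted in front of the original `layer1`, and `layer1` still emits number_of_bandits * number_of_possible_actions values, which are viewed as one row of action scores per bandit. The sizes, the ReLU activation, and the helper names are illustrative assumptions, not the gist's code; in TF2 the same structure would be stacked `tf.keras.layers.Dense` layers.

```python
import random

# Hypothetical sizes for illustration; the gist's own values may differ.
NUMBER_OF_BANDITS = 3
NUMBER_OF_POSSIBLE_ACTIONS = 4
HIDDEN_UNITS = 8  # width of a hypothetical extra layer placed before layer1


def init_dense(n_in, n_out, rng):
    """Small random weights (n_in x n_out) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def dense(inputs, layer, relu=False):
    """Fully connected layer: out[j] = bias[j] + sum_i inputs[i] * w[i][j]."""
    weights, biases = layer
    out = []
    for j, bias in enumerate(biases):
        total = bias + sum(x * weights[i][j] for i, x in enumerate(inputs))
        out.append(max(0.0, total) if relu else total)
    return out


def build(seed=0):
    rng = random.Random(seed)
    flat_size = NUMBER_OF_BANDITS * NUMBER_OF_POSSIBLE_ACTIONS
    return {
        # New layer inserted in front of the original first layer.
        "hidden": init_dense(NUMBER_OF_BANDITS, HIDDEN_UNITS, rng),
        # Last layer: its output size must stay bandits * actions.
        "layer1": init_dense(HIDDEN_UNITS, flat_size, rng),
    }


def action_scores(state, params):
    """Return the action scores for one bandit (the current context)."""
    one_hot = [1.0 if i == state else 0.0 for i in range(NUMBER_OF_BANDITS)]
    hidden = dense(one_hot, params["hidden"], relu=True)
    flat = dense(hidden, params["layer1"])
    # View the flat output as a (bandits x actions) table; take this bandit's row.
    start = state * NUMBER_OF_POSSIBLE_ACTIONS
    return flat[start:start + NUMBER_OF_POSSIBLE_ACTIONS]


if __name__ == "__main__":
    params = build()
    print([len(action_scores(s, params)) for s in range(NUMBER_OF_BANDITS)])
```

Any number of layers can be added between the one-hot input and `layer1` this way; only the last layer's output width is fixed by the bandit and action counts.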