Last active November 1, 2018 18:03
[OpenAI] CartPole v0 - Simulated Annealing v0
Based on treeform's solution here: https://gym.openai.com/evaluations/eval_LjL2QnWCRlmh58dLCpMUTg
Changes:
- Moved the agent into its own class so it can be reused on other problems
- Each time the best score is matched, the best values are updated towards the mean of all values that have reached that score so far (required because the problem is bounded; see the sketch after this list)
- Alpha decays when the best score is matched (required because the problem is bounded)
- Alpha increases if you neither match nor exceed the best score (helps get around initial 'wrong' choices made by random improvement)
Improvements from changes:
- Repeats are no longer needed (but are included in the class anyway in case they are useful for other problems)
- Time to find the first complete solution is reduced
- Variation after finding the first complete solution is reduced
- Time spent with incorrect values at the start, from randomly moving in the 'wrong' direction, is reduced
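As a standalone illustration of the two points above (merging tied best values with an incremental mean, and the alpha schedule that shrinks on a match or improvement and widens on a miss), here is a minimal sketch. The names best, test, alpha, decay and spread mirror the attributes of the SimulatedAnnealingAgent class further down, but these helper functions are just a restatement for clarity, not part of the gist.

    import random

    def propose(best, alpha, spread):
        # Perturb each of the current best values by a random offset scaled by alpha
        return [b + (random.random() - spread) * alpha for b in best]

    def update(best, best_count, best_score, alpha, decay, test, ep_score):
        if ep_score == best_score:
            # Tied the best score: fold the test values into a running (incremental) mean
            best = [(b * best_count + t) / (best_count + 1) for b, t in zip(best, test)]
            best_count += 1
            alpha *= decay      # narrow the search around the merged values
        elif ep_score > best_score:
            best, best_score, best_count = test, ep_score, 0
            alpha *= decay      # strictly better: adopt the test values and narrow
        else:
            alpha /= decay      # worse: widen the search to escape a bad start
        return best, best_count, best_score, alpha

The full driver script and the SimulatedAnnealingAgent class follow below.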
import logging
import gym

from SimulatedAnnealing import SimulatedAnnealingAgent


def main():
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    env = gym.make('CartPole-v0')
    agent = SimulatedAnnealingAgent(env.action_space, repeats=1, decay=0.9, spread=0.3)  # Initialise agent

    outdir = '/tmp/' + agent.name + '-results'
    env.monitor.start(outdir, force=True)

    episode_count = 200
    max_steps = 200
    reward = 0
    done = False

    for i in xrange(episode_count):
        ob = env.reset()

        for j in xrange(max_steps):
            action = agent.act(ob, reward, done)
            ob, reward, done, _ = env.step(action)
            if done:
                break

    # Dump result info to disk
    env.monitor.close()


if __name__ == '__main__':
    main()
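The driver above targets Python 2 (xrange) and an early gym release where env.monitor still existed. Purely as a sketch, not part of the gist, the same loop under a current Gymnasium-style API (where reset() returns an (observation, info) pair, step() returns five values, and monitoring is handled by separate wrappers) might look like this:

    # Hypothetical port of the driver to a Gymnasium-style API
    import gymnasium as gym
    from SimulatedAnnealing import SimulatedAnnealingAgent

    env = gym.make('CartPole-v0')
    agent = SimulatedAnnealingAgent(env.action_space, repeats=1, decay=0.9, spread=0.3)

    reward = 0
    done = False

    for episode in range(200):
        ob, _ = env.reset()
        for step in range(200):
            action = agent.act(ob, reward, done)
            ob, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            if done:
                break

    env.close()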
SimulatedAnnealing.py
import random
random.seed(0)  # So scores are (slightly) more consistent. Randomness in pole counteracts this a bit


class SimulatedAnnealingAgent(object):
    def __init__(self, action_space, repeats=10, alpha=1, decay=0.9, spread=0.5):
        self.name = 'SimAnn'  # Name to be submitted to OpenAI
        self.action_space = action_space  # Just for consistency with other agents, not used in this case
        self.alpha = alpha  # Learning rate
        self.decay = decay  # Decay in impact of alpha
        self.spread = spread  # Spread of randomness when selecting new values to test
        self.repeats = repeats  # Number of times to repeat testing a value

        self.obs_count = 0  # Number of observations returned (can probably get from the environment somehow)
        self.best = []  # Holds best values (set on first run of action)
        self.test = []  # Holds test values
        self.best_score = 0  # Current max score found
        self.best_count = 0  # Times hit max score (used for bounded problems)
        self.ep_score = 0  # Total score for episode
        self.repeat_count = 0  # Times repeated running test

    # Set the new test values at the start of the episode
    def set_test(self):
        # If fewer than the required repeats, just run again
        if self.repeat_count < self.repeats:
            self.repeat_count += 1
            return self.test

        # Else reset the repeat count and set new values based on current best, spread and alpha
        self.repeat_count = 0
        return [self.best[i] + (random.random() - self.spread) * self.alpha for i in range(self.obs_count)]

    # Choose action based on observed values
    def choose_action(self, observation):
        if sum(observation[i] * self.test[i] for i in range(self.obs_count)) > 0:
            return 1
        return 0

    # If we get the same ep score, update best to the average of all values that have reached the best score
    def update_best(self):
        self.best = [(self.best[i] * self.best_count + self.test[i]) / (self.best_count + 1) for i in range(self.obs_count)]
        self.best_count += 1

    # What gets called
    def act(self, observation, reward, done):
        # Set initial values if this is the first time the agent is seeing observations
        if self.obs_count == 0:
            self.obs_count = len(observation)
            self.best = [0] * self.obs_count
            self.test = self.best

        # Set new test values for a new episode
        if self.ep_score == 0:
            self.test = self.set_test()

        # Select action
        action = self.choose_action(observation)

        # Update episode score
        self.ep_score += reward

        if done:
            # If the score is the same as the best then the amount of variance in future choices goes down
            # Set the new best to be the average of all values that have reached the best score so far (incremental mean)
            if self.ep_score == self.best_score:
                self.alpha *= self.decay
                self.update_best()

            # If the new score is greater then set everything to that
            elif self.ep_score > self.best_score:
                self.best_score = self.ep_score
                self.best = self.test
                self.best_count = 0
                self.alpha *= self.decay

            # If the new score isn't >= then increase the spread when selecting values
            # This helps get around issues from making incorrect starting decisions but can probably be improved
            else:
                self.alpha /= self.decay

            self.ep_score = 0

        return action
Reproduced here as well:
https://gym.openai.com/evaluations/eval_N2crbGpSOFVu3oRBdXvQ
Is there a paper that explains the details of simulated annealing w.r.t. RL? A quick Google search did not yield anything very useful.
Thanks.
Reproduced here as well:
https://gym.openai.com/evaluations/eval_qPRz1hb2SUOJrybCHKO4BA
@gdb: 16 was the first uploaded score. I don't know of a way to determine the score when testing locally without manually coding it in, so I just used the average episode score over the 200 episodes I was training on as my guide; those scores ranged from 170-190.
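A minimal sketch of that manual tracking (hypothetical, not part of the gist): it reuses env, agent, episode_count and max_steps from the driver script above, and compares against CartPole-v0's documented solve criterion of an average of 195+ over 100 consecutive episodes.

    # Track episode totals locally instead of relying on the uploaded evaluation
    scores = []
    reward = 0
    done = False

    for i in range(episode_count):
        ob = env.reset()
        total = 0

        for j in range(max_steps):
            action = agent.act(ob, reward, done)  # agent processes the previous episode's 'done' here, as in the driver
            ob, reward, done, _ = env.step(action)
            total += reward
            if done:
                break

        scores.append(total)

    print(sum(scores) / float(len(scores)))  # mean score across all episodes
    print(sum(scores[-100:]) / 100.0)        # CartPole-v0 counts as solved at 195+ averaged over 100 episodes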
@treeform: I think most of the variance in your original solution came from the fact that the alpha value used when selecting new solutions was only reduced when the new result was greater than the current best. Because the problem is bounded at 200, after the first time you hit it any new values keep being drawn at the same alpha spread. Using the repeats makes it less likely to hit 200 early on, so when you do hit it the alpha level has already decayed enough that you're unlikely to start testing away from that.
Merging the best results and increasing alpha were there to stop it getting stuck early on in the process at values of around 10, but those steps can probably be improved, which would reduce the variance of the solve episodes.
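To make the bounded-at-200 point concrete, here is a tiny, purely illustrative calculation assuming the gist's default decay of 0.9: every episode that matches or beats the best score multiplies alpha by the decay, so the perturbation range shrinks geometrically once the agent starts hitting 200 regularly.

    alpha, decay = 1.0, 0.9

    # Hypothetical run: alpha after n consecutive 200-score episodes
    for n in range(1, 11):
        alpha *= decay
        print(n, round(alpha, 3))  # 1 0.9, 2 0.81, ..., 10 0.349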