Last active
May 3, 2016 06:01
CEM with decreasing noise on CartPole-v0
Implements the Cross-Entropy Method with decreasing noise added to the variance updates as described in [1].

Running cem.py with the default settings should reproduce the results.

[1] Szita, Lorincz 2006
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf
"""The main idea of CE (Cross Entropy) is to maintain a distribution
of possible solutions, and update this distribution accordingly.

Preliminary investigation showed that the applicability of CE to RL problems
is restricted severely by the phenomenon that the distribution concentrates to
a single point too fast.

To prevent this issue, noise is added to the variance update calculation.

This implements CE with decreasing noise as described in [1].

CE is implemented with decreasing variance noise
max(5 - t / 10, 0), where t is the iteration step.

References:
[1] Szita, Lorincz 2006
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
from six.moves import range
import gym
import numpy as np
import logging
import argparse


# only two possible actions, 0 or 1
class BinaryActionLinearPolicy(object):
    def __init__(self, theta):
        self.w = theta[:-1]
        self.b = theta[-1]

    def act(self, ob):
        y = ob.dot(self.w) + self.b
        a = int(y < 0)
        return a


def do_rollout(agent, env, num_steps, render=False):
    """
    Performs actions for up to num_steps on the environment
    based on the agent's current params.
    Returns the total reward and the number of steps taken.
    """
    total_rew = 0
    ob = env.reset()
    for t in range(num_steps):
        a = agent.act(ob)
        (ob, reward, done, _) = env.step(a)
        total_rew += reward
        if render and t % 3 == 0:
            env.render()
        if done:
            break
    return total_rew, t + 1


# mean and var are 1D arrays of size d
def cem(f, mean, var, n_iters, n_samples, top_frac):
    top_n = int(np.round(top_frac * n_samples))
    for i in range(n_iters):
        # generate n_samples each iteration with the new mean and stddev
        samples = np.transpose(np.array([np.random.normal(u, np.sqrt(o), n_samples) for u, o in zip(mean, var)]))
        ys = np.array([f(s) for s in samples])
        # the top samples are the ones which give the highest f evaluation results
        top_idxs = ys.argsort()[::-1][:top_n]
        top_samples = samples[top_idxs]
        # this is taken straight from [1]: additive noise that
        # depends only on the iteration step.
        v = max(5 - i / 10, 0)
        mean = top_samples.mean(axis=0)
        var = top_samples.var(axis=0) + v
        yield {'ys': ys, 'theta_mean': mean, 'y_mean': ys.mean()}


def evaluation_func(policy, env, num_steps):
    def f(theta):
        agent = policy(theta)
        rew, t = do_rollout(agent, env, num_steps, render=False)
        return rew
    return f


if __name__ == '__main__':
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--iters', default=50, type=int, help='number of iterations')
    parser.add_argument('--samples', default=30, type=int, help='number of samples CEM algorithm chooses from on each iter')
    parser.add_argument('--top_frac', default=0.2, type=float, help='fraction of top samples used to calculate mean and variance of next iteration')
    parser.add_argument('--seed', default=0, type=int, help='random seed')
    parser.add_argument('--outdir', default='CartPole-v0-CEM', type=str, help='output directory where results are saved')
    # type=bool would treat any non-empty string (even "False") as True, so parse the text explicitly
    parser.add_argument('--render', default=True, type=lambda s: s.lower() in ('true', '1', 'yes'), help='whether to show rendered results during training')
    args = parser.parse_args()

    np.random.seed(args.seed)
    env = gym.make('CartPole-v0')
    num_steps = 200

    outdir = '/tmp/' + args.outdir
    env.monitor.start(outdir, force=True)

    f = evaluation_func(BinaryActionLinearPolicy, env, num_steps)

    # params for cem
    params = dict(n_iters=args.iters, n_samples=args.samples, top_frac=args.top_frac)
    u = np.random.randn(env.observation_space.shape[0]+1)
    var = np.square(np.ones_like(u) * 0.1)
    for (i, data) in enumerate(cem(f, u, var, **params)):
        print("Iteration {}. Episode mean reward: {}".format(i, data['y_mean']))
        agent = BinaryActionLinearPolicy(data['theta_mean'])
        do_rollout(agent, env, num_steps, render=args.render)

    env.monitor.close()
    # make sure to set up your OPENAI_GYM_API_KEY environment variable
    gym.upload(outdir, algorithm_id='cem')
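For readers without a gym installation, the same update rule can be tried on a toy problem. The following is a minimal sketch, assuming Python 3 and NumPy only; `noise_schedule`, `cem_toy`, and the quadratic objective are hypothetical names made up for illustration and are not part of cem.py.

```python
import numpy as np


def noise_schedule(i):
    """Extra variance added at iteration i, mirroring max(5 - t / 10, 0)."""
    return max(5 - i / 10, 0)


def cem_toy(f, mean, var, n_iters=80, n_samples=30, top_frac=0.2, seed=0):
    """Maximize f with CEM plus decreasing variance noise (illustrative sketch)."""
    rng = np.random.RandomState(seed)
    top_n = int(np.round(top_frac * n_samples))
    for i in range(n_iters):
        # Sample candidates around the current mean, one stddev per dimension.
        samples = mean + np.sqrt(var) * rng.randn(n_samples, mean.size)
        ys = np.array([f(s) for s in samples])
        # Keep the highest-scoring candidates (the elite set).
        elite = samples[ys.argsort()[::-1][:top_n]]
        mean = elite.mean(axis=0)
        # Refit the variance and add the decaying noise term.
        var = elite.var(axis=0) + noise_schedule(i)
    return mean


# Hypothetical objective: a concave bowl that peaks at (1, -2).
target = np.array([1.0, -2.0])
best = cem_toy(lambda x: -np.sum((x - target) ** 2), np.zeros(2), np.ones(2))
print(best)
```

The noise term keeps the sampling variance from collapsing during the first 50 iterations. After that the update is plain CEM and the mean should settle close to the peak of the objective.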
I had to comment out `from six.moves import range`, but got exactly the same results for the provided values.
After playing around, I ended up with these values:
`--iters = 20, --samples = 20, --top_frac = 0.1, v = max(5 - (i ** x / 10), 0)`
With these, the problem could be solved consistently (no variance) in a much shorter time.
I believe the key is that the problem is bounded at 200 steps, so reducing the variance of the distribution more quickly means you spend less time on unnecessary exploration.
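To see how much sooner the noise term disappears under the modified schedule, here is a small sketch. The exponent values 1.0, 1.5, and 2.0 are assumptions picked for illustration (the values of x actually used are not given above), and `noise` and `first_zero_iteration` are hypothetical helper names.

```python
def noise(i, x=1.0):
    """Variance noise at iteration i; x = 1 matches the original schedule."""
    return max(5 - (i ** x) / 10, 0)


def first_zero_iteration(x, max_iters=1000):
    """First iteration at which the noise term reaches zero, or None if it never does."""
    for i in range(max_iters):
        if noise(i, x) == 0:
            return i
    return None


# Assumed exponents, for illustration only.
for x in (1.0, 1.5, 2.0):
    print(x, first_zero_iteration(x))
```

With these assumed exponents the noise vanishes at iterations 50, 14, and 8 respectively, so a run limited to 20 iterations only gets a noise-free phase when x is above 1.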
Episodes to solve with faster variance drop off (i = episode number):
v = max(5 - (i ** x / 10), 0):
I altered the number of iterations for a few because there was no variance between trials (not sure how that's possible when the pole has randomness in its motion).
I also ran into an issue at higher iteration counts on line 64, where `var = top_samples.var(axis=0) + v` was adding 0 + 0 and giving an error. It might be worth having `var = max(top_samples.var(axis=0) + v, 0.0001)` so that if you do reach convergence (where there's no variance in the top results) the program doesn't crash.
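One caveat with the suggested fix: `top_samples.var(axis=0)` is an array, and the builtin `max` raises a `ValueError` when comparing a multi-element array with a scalar. A minimal sketch of an elementwise floor, assuming NumPy; `floored_variance` and `MIN_VAR` are hypothetical names, and the floor value is simply the one proposed above.

```python
import numpy as np

MIN_VAR = 1e-4  # floor proposed in the comment; the exact value is a judgment call


def floored_variance(top_samples, v, min_var=MIN_VAR):
    """Per-dimension variance of the elite samples plus noise v, never below min_var."""
    # np.maximum compares elementwise, unlike the builtin max().
    return np.maximum(top_samples.var(axis=0) + v, min_var)


# Worst case: all elite samples identical and the noise term already at zero.
elite = np.tile(np.array([0.3, -1.2, 0.0, 0.7, 0.1]), (6, 1))
var = floored_variance(elite, v=0)
print(var)
```

In `cem`, this would stand in for the `var = top_samples.var(axis=0) + v` line, so the standard deviation passed to `np.random.normal` stays strictly positive.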