
@joschu
joschu / 1-trpo-gae-v0-writeup-UPDATE.md
Last active August 15, 2016 02:27
Ran trpo-gae-v0 on new environments

Exactly the same code and parameters as https://gist.github.com/joschu/e42a050b1eb5cfbb1fdc667c3450467a, but run on the updated (v1) MuJoCo environments. The new scripts are provided below. Run on commit 987cb5d229027045fd0390533832e173237f81b6, but there shouldn't be any functional differences from the previous writeup.

Also, I (inadvertently) ran everything for 500 iterations instead of 250.

This is a tiny update to https://gist.github.com/joschu/a21ed1259d3f8c7bdff178fb47bc6fc1#file-1-cem-v0-writeup-md:

  • I ran experiments on the v1 MuJoCo environments.
  • I reduced the added-noise parameter extra_std from 0.01 to 0.001.

I used the cross-entropy method (an evolutionary algorithm / derivative-free optimization method) to optimize small two-layer neural networks.

Code used to obtain these results can be found at the URL https://github.com/joschu/modular_rl, commit ba42955b41d7f419470a95d875af1ab7e7ee66fc. The command-line expression used for all the environments can be found in the text file below.

@joschu
joschu / 1-cem-v0-writeup.md

I used the cross-entropy method (an evolutionary algorithm / derivative-free optimization method) to optimize small two-layer neural networks.

Code used to obtain these results can be found at the URL https://github.com/joschu/modular_rl, commit 3324639f82a81288e9d21ddcb6c2a37957cdd361. The command-line expression used for all the environments can be found in the text file below. Note that exactly the same parameters were used for all tasks. The important parameters are:

  • hid_sizes=10,5: hidden layer sizes of the MLP
  • extra_std=0.01: noise added to the variance, see [1]
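
To make the procedure concrete, here is a minimal numpy sketch of the cross-entropy method with extra noise folded into the sampling variance. The function name, defaults, and noise schedule are my own illustrative assumptions, not the modular_rl implementation.

```python
import numpy as np

def cem(f, th_mean, batch_size=25, n_iter=50, elite_frac=0.2, extra_std=0.01):
    """Maximize f over parameter vectors, starting from the mean th_mean.
    Illustrative sketch only; modular_rl's schedule may differ."""
    n_elite = int(np.round(batch_size * elite_frac))
    th_std = np.ones_like(th_mean)
    for _ in range(n_iter):
        # Fold extra_std into the sampling variance so the search
        # distribution never collapses to a point prematurely.
        sample_std = np.sqrt(th_std ** 2 + extra_std ** 2)
        ths = th_mean + sample_std * np.random.randn(batch_size, th_mean.size)
        ys = np.array([f(th) for th in ths])
        elite = ths[ys.argsort()[-n_elite:]]  # keep the top-scoring samples
        th_mean, th_std = elite.mean(axis=0), elite.std(axis=0)
    return th_mean

# Toy usage: maximize a quadratic whose optimum is at [1, 2, 3]
print(cem(lambda th: -np.square(th - np.array([1.0, 2.0, 3.0])).sum(),
          np.zeros(3)))
```

Without the extra_std term, the elite standard deviation can shrink to zero before the mean reaches a good region; adding a little noise to the variance keeps the search alive, and reducing it from 0.01 to 0.001 (as in the update above) weakens that effect.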
@joschu
joschu / 1-trpo-gae-v0-writeup.md
Last active April 20, 2024 17:30
TRPO-GAE (Version 0) Writeup

Code used to obtain these results can be found at the URL https://github.com/joschu/modular_rl, commit 50cdfdf375e69d86e3db6eb2ad0218ea6aebf371. The command-line expression used for all the environments can be found in the text file below. Note that exactly the same parameters and policies were used for all tasks, except for timesteps_per_batch, which was varied based on the difficulty of the task. The important parameters are:

  • gamma=0.995: discount factor
  • lam=0.97: GAE parameter; see the GAE paper for an explanation
  • agent=TrpoAgent: name of the agent class, which specifies the policy and value function architecture. In this case, we used two hidden layers of size 64 with tanh activations
  • cg_damping: multiple of the identity added to the Fisher matrix for numerical stability in conjugate gradient
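
For reference, here is a minimal numpy sketch of how gamma and lam enter generalized advantage estimation; the function and array conventions are my own, and the GAE paper cited above is the authoritative source.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.97):
    """rewards has length T; values has length T+1 (one extra entry for
    the value estimate of the state after the final step)."""
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD(0) residuals
    advantages = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc  # discounted sum of residuals
        advantages[t] = acc
    return advantages

# Toy usage on a three-step trajectory
print(gae_advantages(np.array([1.0, 1.0, 1.0]),
                     np.array([0.5, 0.5, 0.5, 0.0])))
```

Setting lam=1 recovers the discounted return minus a baseline, while lam=0 reduces to the one-step TD residual; 0.97 trades bias against variance between those extremes.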
import theano
import theano.tensor as TT

# Symbolic scalar inputs
x = TT.scalar('x')
y = TT.scalar('y')
z = TT.mod(x**2, y)
# z = x**2 + y**2  (alternative definition, left commented out)

# Compile z and its gradient with respect to x
f = theano.function([x, y], z, allow_input_downcast=True)
dfdx = theano.function([x, y], TT.grad(z, x), allow_input_downcast=True)
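
For illustration, a quick check of the compiled functions above (the expected gradient value is my assumption, based on d/dx (x**2 mod y) = 2x wherever the mod is locally smooth):

```python
print(f(3, 2))     # mod(3**2, 2) = 1.0
print(dfdx(3, 2))  # expected 2*x = 6.0 away from the mod discontinuities
```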