Cart pole balancing solved using the Q learning algorithm.
https://gym.openai.com/envs/CartPole-v0
https://gym.openai.com/evaluations/eval_kWknKOkPQ7izrixdhriurA
To run:
python CartPole-v0.py
import gym
import pandas as pd
import numpy as np
import random

# https://gym.openai.com/envs/CartPole-v0
# Carlos Aguayo - [email protected]


class QLearner(object):
    def __init__(self,
                 num_states=100,
                 num_actions=4,
                 alpha=0.2,
                 gamma=0.9,
                 random_action_rate=0.5,
                 random_action_decay_rate=0.99):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = alpha
        self.gamma = gamma
        self.random_action_rate = random_action_rate
        self.random_action_decay_rate = random_action_decay_rate
        self.state = 0
        self.action = 0
        self.qtable = np.random.uniform(low=-1, high=1, size=(num_states, num_actions))

    def set_initial_state(self, state):
        """
        @summary: Sets the initial state and returns an action
        @param state: The initial state
        @returns: The selected action
        """
        self.state = state
        self.action = self.qtable[state].argsort()[-1]
        return self.action

    def move(self, state_prime, reward):
        """
        @summary: Moves to the given state with given reward and returns action
        @param state_prime: The new state
        @param reward: The reward
        @returns: The selected action
        """
        alpha = self.alpha
        gamma = self.gamma
        state = self.state
        action = self.action
        qtable = self.qtable

        choose_random_action = (1 - self.random_action_rate) <= np.random.uniform(0, 1)

        if choose_random_action:
            action_prime = random.randint(0, self.num_actions - 1)
        else:
            action_prime = self.qtable[state_prime].argsort()[-1]

        self.random_action_rate *= self.random_action_decay_rate

        qtable[state, action] = (1 - alpha) * qtable[state, action] + alpha * (reward + gamma * qtable[state_prime, action_prime])

        self.state = state_prime
        self.action = action_prime

        return self.action


def cart_pole_with_qlearning():
    env = gym.make('CartPole-v0')
    experiment_filename = './cartpole-experiment-1'
    env.monitor.start(experiment_filename, force=True)

    goal_average_steps = 195
    max_number_of_steps = 200
    number_of_iterations_to_average = 100
    number_of_features = env.observation_space.shape[0]
    last_time_steps = np.ndarray(0)

    cart_position_bins = pd.cut([-2.4, 2.4], bins=10, retbins=True)[1][1:-1]
    pole_angle_bins = pd.cut([-2, 2], bins=10, retbins=True)[1][1:-1]
    cart_velocity_bins = pd.cut([-1, 1], bins=10, retbins=True)[1][1:-1]
    angle_rate_bins = pd.cut([-3.5, 3.5], bins=10, retbins=True)[1][1:-1]

    def build_state(features):
        return int("".join(map(lambda feature: str(int(feature)), features)))

    def to_bin(value, bins):
        return np.digitize(x=[value], bins=bins)[0]

    learner = QLearner(num_states=10 ** number_of_features,
                       num_actions=env.action_space.n,
                       alpha=0.2,
                       gamma=1,
                       random_action_rate=0.5,
                       random_action_decay_rate=0.99)

    for episode in xrange(50000):
        observation = env.reset()
        cart_position, pole_angle, cart_velocity, angle_rate_of_change = observation

        state = build_state([to_bin(cart_position, cart_position_bins),
                             to_bin(pole_angle, pole_angle_bins),
                             to_bin(cart_velocity, cart_velocity_bins),
                             to_bin(angle_rate_of_change, angle_rate_bins)])

        action = learner.set_initial_state(state)

        for step in xrange(max_number_of_steps - 1):
            observation, reward, done, info = env.step(action)

            cart_position, pole_angle, cart_velocity, angle_rate_of_change = observation

            state_prime = build_state([to_bin(cart_position, cart_position_bins),
                                       to_bin(pole_angle, pole_angle_bins),
                                       to_bin(cart_velocity, cart_velocity_bins),
                                       to_bin(angle_rate_of_change, angle_rate_bins)])

            if done:
                reward = -200

            action = learner.move(state_prime, reward)

            if done:
                last_time_steps = np.append(last_time_steps, [int(step + 1)])
                if len(last_time_steps) > number_of_iterations_to_average:
                    last_time_steps = np.delete(last_time_steps, 0)
                break

        if last_time_steps.mean() > goal_average_steps:
            print "Goal reached!"
            print "Episodes before solve: ", episode + 1
            print u"Best 100-episode performance {} {} {}".format(last_time_steps.max(),
                                                                  unichr(177),  # plus minus sign
                                                                  last_time_steps.std())
            break

    env.monitor.close()


if __name__ == "__main__":
    random.seed(0)
    cart_pole_with_qlearning()
On line 100 you mixed up the order of the variables. CartPole-v0 returns the observation in this order:
[cart_position, cart_velocity, pole_angle, angle_rate_of_change].
The value of pole_angle is bounded by -0.2 and 0.2, so with the current binning only two of the pole_angle intervals can ever be reached.
I tried doubling the number of reachable pole_angle intervals (from two to four), and it still doesn't learn:
https://gist.github.com/anonymous/70afe80acc3810cc6df50747b63b9203
Am I missing something?
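For reference, a minimal sketch of the corrected unpacking plus a tighter pole-angle discretization. The ±0.4 bin range is an assumption chosen for illustration, not a value from the gist, and it assumes the gist-era gym API where `reset()` returns the observation array directly:

```python
import gym
import pandas as pd

env = gym.make('CartPole-v0')
observation = env.reset()

# CartPole-v0 observations come back in this order:
# [cart_position, cart_velocity, pole_angle, pole_velocity_at_tip]
cart_position, cart_velocity, pole_angle, angle_rate_of_change = observation

# Bins sized closer to the pole angles actually visited (roughly +/-0.2 rad),
# so all ten intervals are reachable; the +/-0.4 range is illustrative.
pole_angle_bins = pd.cut([-0.4, 0.4], bins=10, retbins=True)[1][1:-1]
print(pole_angle_bins)
```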
I would recommend removing the `if done: reward = -200` check from line 117. Altering the rewards is normally discouraged because it makes the performance less indicative of the algorithm's ability; it also uses knowledge that is unavailable to the agent.
Since this problem is also capped at 200 steps, the final state will receive a very strong negative reward even if you run a perfect episode and reach 200 steps. Replacing the check with `if done and steps < 200:` would be better, but because it still adds knowledge the agent doesn't have about its environment, I would recommend removing it altogether.
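As a concrete sketch of that middle-ground option, adapted to the gist's variable names (the helper name `shaped_reward` and the `step + 1` comparison are mine; the gist itself just overwrites `reward` inline):

```python
def shaped_reward(reward, done, step, max_number_of_steps=200):
    # Only penalise terminations that happen before the step cap, so a
    # perfect 200-step episode keeps its normal reward. The -200 value
    # mirrors the gist's penalty; dropping the shaping entirely is the
    # other option discussed above.
    if done and step + 1 < max_number_of_steps:
        return -200
    return reward
```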
Another useful change is to reduce the random action rate per episode rather than per move. Even with only a 1% decay, after 400 moves (two complete runs) the random action rate goes from a 50% chance to below a 1% chance. You could move `self.random_action_rate *= self.random_action_decay_rate` from `QLearner.move()` into the `if done:` check inside the episode loop, and just have it as `learner.random_action_rate *= learner.random_action_decay_rate`.
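To see why the per-move decay collapses so quickly, here is a small illustration using the gist's values (0.5 starting rate, 0.99 decay):

```python
# Per-move vs per-episode decay of the exploration rate.
random_action_rate = 0.5
random_action_decay_rate = 0.99

after_two_runs_per_move = random_action_rate * random_action_decay_rate ** 400
after_two_runs_per_episode = random_action_rate * random_action_decay_rate ** 2

print(after_two_runs_per_move)     # ~0.009: exploration nearly gone after two full 200-step runs
print(after_two_runs_per_episode)  # ~0.49: exploration still close to 50%
```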
Thanks! I really appreciate the feedback! I'll update the gist based on it and repost.
@JKCooper Removing the check makes no sense; if anything, it is a shortcoming of the environment. Since the `done` variable is not directly visible to the agent and the rewards are always +1, there is no reason for the agent to do anything. As in, the environment itself does not encode any kind of goal whatsoever.
Unless you include some type of signal for which states/actions are undesirable, there is no reason for the agent to learn what to do. Right now everybody is doing this by looking at the rendered images and knowing implicitly what the problem is asking them to do.
Imagine you were the agent, though. You receive a bunch of numbers. The environment says: good for you, here's +1! You then do something else. Good for you again. You never have any incentive to either do or not do something.
You could say the reward is implicitly encoded by the length of the episode. But that would pretty much mean that whenever you lose, you enter a terminal state which gives you zero reward forever. Since this is not included in the gym, you'd have to code it yourself, and at that point you might as well add the -200 reward check.
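For reference, the usual way to encode "a terminal state is worth zero forever" without touching the reward is to drop the bootstrap term on terminal transitions. Against the gist's update rule, that would look roughly like the sketch below (the standalone `q_update` helper and the explicit `terminal` flag are mine, not the gist's):

```python
def q_update(qtable, state, action, reward, state_prime, action_prime,
             alpha, gamma, terminal):
    # On a terminal transition the future return is zero, so the
    # gamma * Q(s', a') bootstrap term is dropped; otherwise this mirrors
    # the gist's update.
    if terminal:
        target = reward
    else:
        target = reward + gamma * qtable[state_prime, action_prime]
    qtable[state, action] = (1 - alpha) * qtable[state, action] + alpha * target
```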
@blabby Good catch, I did indeed mix them up. However, changing them shouldn't affect the result much.
@JKCooper2 Great catch on the decay rate; I hadn't noticed it was decaying so quickly. Moving the decay into `set_initial_state` does what you suggest.
@JKCooper2 and @Svalorzen I was wondering about the rewards as well. Initially I was confused by the fact that the environment doesn't return a negative reward when the cart fails to balance the pole. That's definitely a state we don't want to be in, and the agent should learn not to get near it.
At the same time, I realize that even without a bad reward, the agent should learn to collect as many positive rewards as possible, which implies keeping the pole balanced for as long as possible.
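To make that implicit incentive concrete: with a reward of +1 per step and no discounting (the gist constructs the learner with gamma=1), the return from the start state is simply the episode length, so longer balancing is already worth more even without a penalty. A trivial illustration:

```python
# +1 reward per step, undiscounted: the return is just the episode length.
rewards_short = [1] * 20    # an episode that falls over after 20 steps
rewards_long = [1] * 200    # an episode that survives the full 200 steps

print(sum(rewards_short))   # 20
print(sum(rewards_long))    # 200
```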
Hi, I spent some time this weekend trying your CartPole-v0 solution, but for some reason it doesn't converge. Do you use different parameter values than the ones above? The only thing I changed to make the code work on Python 3.5 was `xrange` to `range`; everything else is intact. Like @blabby said: am I missing something?
Hi, I always get this error when I try to run the script:
Error: Tried to reset environment which is not done. While the monitor is active for CartPole-v0, you cannot call reset() unless the episode is over.
This is strange, since `reset()` is only called at the beginning of each episode, and a new episode should only start after the previous one reached the 'done' state from the step function. Any clue as to why this happens?
Thanks in advance!
@mitbal I'm getting the same error after ~35 episodes, having changed the monitor setup to:
env = gym.wrappers.Monitor(env, experiment_filename, force=True)
Did you manage to fix the issue? I don't understand why it stops working.
@alexmcnulty I fixed the error by adding `env.close()` before the `env.reset()` call.
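For anyone hitting this on the newer gym releases that ship `gym.wrappers.Monitor`, a minimal monitored loop looks roughly like the sketch below (this is not the gist's code; the point is simply to keep stepping until the wrapper reports `done`, so the next `reset()` is legal for the monitor):

```python
import gym

env = gym.make('CartPole-v0')
env = gym.wrappers.Monitor(env, './cartpole-experiment-1', force=True)

for episode in range(10):
    observation = env.reset()
    done = False
    while not done:
        # Random actions just to drive the episode to completion; the monitor
        # only allows reset() once the wrapped episode has finished.
        observation, reward, done, info = env.step(env.action_space.sample())

env.close()
```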
Reproduced here: https://gym.openai.com/evaluations/eval_aHf1Kmc4QIKm5oPcJJToBA