Used a 2-layer fully connected network with H1=100, H2=60, and ReLU activations.
He initialization of weights (network sketch below).
Adam optimizer. Initial learning rate = 0.001, reduced by a factor (gamma) of 0.50 every 350 episodes (scheduler sketch below).
Discount factor gamma = 0.99.
Eps = 1.00.
Eps decay = 0.98.
Eps decays once per new episode, not at each step (epsilon-greedy sketch below).
Uniform sampling from a replay buffer of 150,000 memories. No learning for the first 1,500 steps, to fill the replay buffer (buffer sketch below).
I tried to implement prioritized experience replay, but couldn't get it to work (yet).
States normalized using the empirical mean and standard deviation of historical observations (normalizer sketch below).
No reward or gradient clipping.
Target network updated every 600 steps.
Smooth L1 loss (Huber loss) rather than MSE (a training-step sketch covering these last two items is below).
Implemented in PyTorch and Python 3.5. Trained on a GTX-1070.
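A minimal sketch of the network described above, in PyTorch. The input and output sizes (8 state dimensions, 4 actions) are assumptions, since the gist does not name the environment, and it uses current PyTorch initializer names rather than the 2017-era ones.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """2-layer fully connected Q-network: H1=100, H2=60, ReLU."""
    def __init__(self, state_dim=8, n_actions=4):  # sizes are assumptions
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 100)
        self.fc2 = nn.Linear(100, 60)
        self.out = nn.Linear(60, n_actions)
        # He (Kaiming) initialization, matched to the ReLU activations
        for layer in (self.fc1, self.fc2, self.out):
            nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.out(x)  # one Q-value per action
```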
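The optimizer and learning-rate schedule map directly onto `torch.optim`. The only assumption here is that the scheduler is stepped once per episode, so that `step_size=350` counts episodes as the notes describe.

```python
import torch.optim as optim

q_net = QNetwork()
optimizer = optim.Adam(q_net.parameters(), lr=0.001)
# Halves the learning rate every 350 scheduler steps; call
# scheduler.step() once per episode so the interval counts episodes.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=350, gamma=0.50)
```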
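A sketch of epsilon-greedy action selection with the per-episode decay described above. The helper name and the assumption that `state` is a 1-D tensor are illustrative, not from the gist.

```python
import random
import torch

EPS_START = 1.00
EPS_DECAY = 0.98  # applied once per episode, not per step

def select_action(q_net, state, eps, n_actions=4):
    """Random action with probability eps, otherwise the greedy action."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()

# At the end of each episode:
#     eps *= EPS_DECAY
```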
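A minimal uniform replay buffer matching the description: 150,000-transition capacity, uniform sampling, and no learning until 1,500 transitions have been collected. The transition tuple layout is an assumption.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=150_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Transition layout is an assumption; the gist doesn't specify it.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling, no priorities
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# No learning until the buffer holds at least 1,500 transitions:
#     if len(buffer) >= 1500: run a training step
```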
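A sketch of the state normalization, assuming the empirical statistics are computed from a batch of previously collected observations; the gist does not say exactly how or when the statistics were gathered.

```python
import numpy as np

class StateNormalizer:
    """Standardize states with empirical statistics of past observations."""
    def __init__(self, observations):
        # observations: array of shape (N, state_dim) collected so far
        self.mean = observations.mean(axis=0)
        self.std = observations.std(axis=0) + 1e-8  # guard against zero std

    def __call__(self, state):
        return (state - self.mean) / self.std
```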
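Finally, a sketch of a single training step tying the last two items together: Smooth L1 (Huber) loss, no reward or gradient clipping, and a hard copy into the target network every 600 steps. Batch handling details are assumptions.

```python
import copy
import torch
import torch.nn.functional as F

GAMMA = 0.99           # discount factor
TARGET_UPDATE = 600    # hard target-network copy interval, in steps

target_net = copy.deepcopy(q_net)

def train_step(states, actions, rewards, next_states, dones, step_count):
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_q * (1 - dones)
    loss = F.smooth_l1_loss(q, targets)  # Huber loss rather than MSE
    optimizer.zero_grad()
    loss.backward()                      # no gradient clipping applied
    optimizer.step()
    if step_count % TARGET_UPDATE == 0:
        target_net.load_state_dict(q_net.state_dict())
```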