Deep Q-Learning (Space Invaders)
09 MAR 2016
Ever since I learned about neural networks playing Atari games I have wanted to reimplement it and learn how it works. Below you can see an AI playing Space Invaders. I trained it during my batch at the Recurse Center on a little over 50M frames.
*Video: the trained AI playing Space Invaders (0:08).*
It is even more awesome if you realize that the AI was trained in a way similar to how a human would learn: the only inputs are the screen and the number of points gained (or lost) after each action taken by the AI.
DQN does much better than the best-action strategy (do nothing but shoot) and the random strategy.
| Algorithm | Points |
|-------------|--------|
| DQN | 550 |
| Best action | 240 |
| Random | 150 |
You can play with my implementation here: Deep Q-Learning. This is the first post on the topic, stay tuned for the next ones!
*Figure: average game reward (over 600 games) after N games played. The blue line is the random-strategy baseline, the red line is the best-action-strategy baseline.*
# Algorithm

So what is Deep Q-Learning (DQN)? Below you will find a gentle introduction. I omit certain details for the sake of simplicity and I encourage you to read the original paper.
The task for the neural network in DQN is to learn which action to take based on the screen and previous experience. Obviously the neural network should choose the best action, but how can it learn which one is best?
It turns out that the neural network can be pretty simple: the input is the game screen, and the hidden layers consist of 3 convolutional layers and a single fully connected layer. The number of neurons in the last layer corresponds to the number of actions that can be taken. In the case of Space Invaders there were 4 actions (do nothing, shoot, go left, go right), therefore there were 4 neurons in the output layer.
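For concreteness, here is a minimal sketch of such a network written with Lasagne, a library built on top of Theano. Whether the original implementation used Lasagne is an assumption on my part, and the filter sizes, strides, and layer widths below are the ones from DeepMind's Nature paper, not necessarily the ones used here.

```python
# Minimal sketch of a DQN network (Lasagne, which runs on Theano).
# Architecture hyperparameters follow DeepMind's Nature paper and are
# assumptions -- the implementation described in this post may differ.
from lasagne.layers import InputLayer, Conv2DLayer, DenseLayer
from lasagne.nonlinearities import rectify, linear

def build_dqn(n_actions=4, history=4, screen=(84, 84)):
    # Input: a stack of `history` preprocessed game screens.
    net = InputLayer(shape=(None, history) + screen)
    net = Conv2DLayer(net, num_filters=32, filter_size=8, stride=4,
                      nonlinearity=rectify)
    net = Conv2DLayer(net, num_filters=64, filter_size=4, stride=2,
                      nonlinearity=rectify)
    net = Conv2DLayer(net, num_filters=64, filter_size=3, stride=1,
                      nonlinearity=rectify)
    net = DenseLayer(net, num_units=512, nonlinearity=rectify)
    # One output neuron per action: its value estimates Q(s)(a).
    return DenseLayer(net, num_units=n_actions, nonlinearity=linear)
```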
The tricky and crucial part is the loss function. In 'normal' neural networks the loss function is straightforward, as for each training example $x$ there is a known expected outcome $y$. Nothing like that is available in our case, but we can deal with it thanks to some insights from Q-Learning!
## Loss function
The best measure of how good an action is, is the accumulated future reward
$$\sum_{i=t_0} r_i$$
where the sum is taken over time from $t_0$ until the end of the game and $r_i$ is the reward gained at time $i$. There are a couple of problems with that simplified definition and we'll deal with them one by one.
For one, there is no way to calculate that sum, as we don't know the future. We'll deal with this at the end, though.
Intuitively speaking, the immediate reward $r_{t_0}$ should be more valuable than a very distant one. After all, the future is uncertain and we might never get that distant reward at all.
Let's introduce the discounted accumulated future reward:
$$\sum_{i=t_0} \gamma^{i} r_i$$
where $\gamma$ is between 0 and 1. We'll set $\gamma$ to $0.99$, though, as distant rewards are very important.
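To make the effect of $\gamma$ concrete, here is a tiny plain-Python illustration with a made-up reward sequence (the numbers are not from the game):

```python
# Illustration only: discounted accumulated reward for a hypothetical
# reward sequence, summed from t0 = 0.
gamma = 0.99
rewards = [0, 0, 10, 0, 0, 0, 30]  # hypothetical r_i
discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
plain = sum(rewards)
print(plain, discounted)  # distant rewards count slightly less than immediate ones
```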
Finally, our game is stochastic (we don't know when an enemy will shoot a laser beam), therefore we should rather think in terms of expected values. Ideally, what we want the neural network to learn is the function $Q$ defined as:
$$Q(s)(a) = \mathbb{E}\left(\sum_{i=t_0} \gamma^{i} r_i\right) \qquad \text{(expected discounted accumulated future reward)}$$
where $s$ is the input game screen at time $t_0$, $a$ indicates the neuron corresponding to action $a$, and $r_i$ is the reward obtained after the action taken at time $i$. That is what we want each neuron of the output layer to learn.
To do that efficiently, we need to realise that $Q(s)(a) = r + \gamma \max_{a'} Q(s')(a')$, where $s'$ is the game screen experienced after taking action $a$ having seen game screen $s$.
Now if $Q^*$ is our neural network, we can treat $Q^*(s)(a) - \left(r + \gamma \max_{a'} Q^*(s')(a')\right)$ as a measure of surprise and therefore as a loss function (after squaring).
Note that the loss depends on the neural network itself in an atypical way.
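In code, computing this loss for a mini-batch might look roughly like the NumPy sketch below; the names (`q`, `q_next`, `actions`, ...) are placeholders of my own, not names from the actual implementation, and both terms are evaluated with the same network values here (the stale-network refinement comes later).

```python
import numpy as np

# Sketch of the squared "surprise" loss for one mini-batch. `q` holds the
# network's outputs Q*(s)(a) for every action, `q_next` the outputs for the
# following screens s'; `terminal` is 1.0 where the game ended, so that no
# future reward is added there. All names are illustrative.
def dqn_loss(q, actions, rewards, q_next, terminal, gamma=0.99):
    # target: r + gamma * max_a' Q*(s')(a')
    target = rewards + gamma * q_next.max(axis=1) * (1.0 - terminal)
    # prediction: Q*(s)(a) for the action that was actually taken
    predicted = q[np.arange(len(actions)), actions]
    surprise = predicted - target
    return np.mean(surprise ** 2)
```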
## Correlation
Since we play the game online, it is tempting to simply update the network after each action taken, or in mini-batches of, say, the last 32 actions taken. Such mini-batches would be highly correlated, however, and any stochastic gradient update algorithm would fail on them.
To deal with that issue, we keep previous experiences in memory and, after each action taken, we draw a mini-batch of experiences from that memory to perform the update step.
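A minimal sketch of such an experience-replay memory could look like this (class and parameter names are illustrative; the real memory presumably stores preprocessed screen stacks and is bounded mostly by available RAM):

```python
import random
from collections import deque

# Minimal experience-replay memory (illustrative only).
class ReplayMemory(object):
    def __init__(self, capacity=100000):
        self.memory = deque(maxlen=capacity)

    def add(self, screen, action, reward, next_screen, terminal):
        self.memory.append((screen, action, reward, next_screen, terminal))

    def sample(self, batch_size=32):
        # Random draws break the correlation between consecutive frames.
        return random.sample(self.memory, batch_size)
```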
## Other technicalities
- provide only every 4th frame to the neural network
- stack 4 frames one on top of the other to make the neural network aware of time (this could be avoided if you used an LSTM)
- keep a stale copy of the network at hand and calculate the loss with regard to that stale network
- gradient clipping (to avoid exploding gradients)
- DeepMind RMSProp (instead of the normal one), which improved performance by 40% in my case; a sketch of this update rule follows the list
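For reference, the "DeepMind" RMSProp variant keeps running averages of both the gradient and its square and normalises by the resulting variance estimate. Here is a NumPy sketch; the hyperparameter values are typical defaults, not necessarily the ones used in this implementation.

```python
import numpy as np

# Sketch of the RMSProp variant used in DeepMind's DQN code: it tracks
# running means of the gradient and of its square, and divides by the
# resulting variance estimate. Hyperparameters are typical values only.
def deepmind_rmsprop_step(w, grad, g, g2, lr=0.00025, rho=0.95, eps=0.01):
    g = rho * g + (1 - rho) * grad            # running mean of gradient
    g2 = rho * g2 + (1 - rho) * grad ** 2     # running mean of squared gradient
    w = w - lr * grad / np.sqrt(g2 - g ** 2 + eps)
    return w, g, g2
```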
I used the Arcade Learning Environment to play Space Invaders.
# Lessons Learned
That was my first exposure to training non-trivial neural networks, so there are plenty of things that I learned.
NaNs as weights are no good. That seems obvious, but it does not mean that it's easy to track down such problems. Theano provides means of doing that efficiently. Of course, a NaN usually means that you divided $\infty$ by $0$.
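One such means is Theano's `NanGuardMode`, which makes a compiled function raise as soon as a NaN or Inf shows up during computation; a toy example (not from my actual code):

```python
import theano
import theano.tensor as T
from theano.compile.nanguardmode import NanGuardMode

# Toy example: compile a function in NanGuardMode so that any NaN or Inf
# produced during computation raises an error immediately instead of
# silently corrupting the weights.
x = T.vector('x')
f = theano.function([x], x / x.sum(),
                    mode=NanGuardMode(nan_is_error=True, inf_is_error=True,
                                      big_is_error=True))
```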
Test your Theano code. It's not so hard! And here is the relevant documentation.
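For example, a Theano expression can be checked against a NumPy reference with an ordinary assertion (a toy example, not from my actual test suite):

```python
import numpy as np
import theano
import theano.tensor as T

# Toy example: compare a compiled Theano softmax against a NumPy reference.
x = T.matrix('x')
softmax = theano.function([x], T.nnet.softmax(x))

data = np.random.rand(2, 4).astype(theano.config.floatX)
expected = np.exp(data) / np.exp(data).sum(axis=1, keepdims=True)
assert np.allclose(softmax(data), expected)
```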
If your network converges or diverges to $\infty$ very quickly, it's probably caused by a suboptimal learning rate in your update function. Little is known about how to correctly choose a network's hyperparameters, so trial, error, and verification are what's left.
The update method might play a gigantic role in the performance of your neural network. In my case, the learning curve of my DQN implementation flattened out around 400 points after 10M frames with traditional RMSProp. "DeepMind" RMSProp learned more slowly, but it boosted performance to over 550 points on average after 50M frames, and one can clearly see that it kept learning all the time.