Notes from Tambet Matiisen's Demystifying Deep Reinforcement Learning article.
- Reinforcement learning is somewhat in-between supervised and unsupervised learning.
- In reinforcement learning one has sparse and time-delayed labels – the rewards.
- Credit assignment problem – i.e., which of the preceding actions was responsible for getting the reward and to what extent.
- Explore-exploit dilemma – should you exploit the known working strategy or explore other, possibly better strategies.
- One of the most common ways to formalize a reinforcement learning problem (so that it can be reasoned about) is as a Markov decision process.
- Reinforcement learning uses the agent-environment feedback loop: the environment presents the agent with a state and a reward, the agent responds with an action chosen as a function of the state and reward (often also of past states and rewards), and the environment updates and presents a new state and reward, closing the loop.
- The rules for how you choose those actions are called the policy.
- The set of states and actions, together with rules for transitioning from one state to another, make up a Markov decision process.
- One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:
s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, ..., s_{n-1}, a_{n-1}, r_n, s_n
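
As a concrete illustration, one episode can be stored as a plain list of (state, action, reward) steps; the states, actions, and reward values below are made-up placeholders, not from the article.

```python
# A minimal sketch of one episode stored as (state, action, reward) steps.
# All values here are made-up placeholders.
episode = [
    ("s0", "a0", 0.0),  # reward r_1, received after taking a0 in s0
    ("s1", "a1", 0.0),  # reward r_2
    ("s2", "a2", 1.0),  # reward r_3 – the only non-zero reward in this episode
]
total_reward = sum(r for _, _, r in episode)
print(total_reward)  # 1.0
```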
- A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and action a_i, and not on preceding states or actions.
- To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get.
- Given one run of the Markov decision process, we can calculate the total reward for the episode as the sum of the rewards at each step:
R = r_1 + r_2 + r_3 + ... + r_n
- Given this, we can express the total future reward from time point t onwards as:
R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n
- Because many environments are stochastic, we can never be certain of receiving the same rewards the further we look into the future, so we discount the reward as t increases. This is called discounted future reward.
- For this we use a discount factor γ between 0 and 1 and multiply each successive reward by an increasing power of γ, so that as t increases the corresponding reward r_t gets taken less into consideration:
R_t = r_t + γ(r_{t+1} + γ(r_{t+2} + ...)) = r_t + γR_{t+1}
- If we set the discount factor γ = 0, our strategy will be short-sighted and rely only on immediate rewards. If we want to balance immediate and future rewards, we should set the discount factor to something like γ = 0.9. If our environment is deterministic and the same actions always result in the same rewards, we can set the discount factor to γ = 1.
- A good strategy for an agent is to always choose the action that maximizes the (discounted) future reward.
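
A small sketch of how the discounted future reward R_t could be computed from a list of rewards, using the recursive form R_t = r_t + γR_{t+1}; the reward values and γ choices below are illustrative.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    by working backwards through the rewards (R_t = r_t + gamma * R_{t+1})."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]            # made-up rewards r_t, r_{t+1}, ...
print(discounted_return(rewards, gamma=0.0))   # 0.0  – only the immediate reward counts
print(discounted_return(rewards, gamma=0.9))   # ~4.09 – balances immediate and future rewards
print(discounted_return(rewards, gamma=1.0))   # 6.0  – no discounting (deterministic case)
```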
- In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on.
- It is called the Q-function because it represents the "quality" of a certain action in a given state.
- The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation.
- The Bellman equation states that the maximum future reward for a state and action is the immediate reward plus the discounted maximum future reward for the next state: Q(s, a) = r + γ max_{a'} Q(s', a').
- α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. With α = 1 the update is exactly the Bellman equation.
- The estimate Q(s', a') (where ' denotes the next state and action) used to update Q(s, a) is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function converges and represents the true Q-values. See the sketch below.
- Q-learning attempts to solve the credit assignment problem – it propagates rewards back in time until it reaches the crucial decision point that was the actual cause of the obtained reward.
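
A minimal tabular sketch of the update described above; the table, α, γ, and the example transition are illustrative choices, and for simplicity the table is zero-initialized rather than random.

```python
from collections import defaultdict

q_table = defaultdict(float)  # Q(s, a); zero-initialized here for simplicity
alpha, gamma = 0.1, 0.9       # illustrative learning rate and discount factor

def q_update(s, a, r, s_next, actions):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q_table[(s_next, a_next)] for a_next in actions)
    target = r + gamma * best_next                        # the Bellman target
    q_table[(s, a)] += alpha * (target - q_table[(s, a)])
    # with alpha = 1 this reduces to Q(s, a) = r + gamma * max_a' Q(s', a')

# one made-up transition <s, a, r, s'>
q_update("s0", "left", 1.0, "s1", actions=["left", "right"])
print(q_table[("s0", "left")])  # 0.1
```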
- DeepMind uses a convolutional neural network to extract features and create domain-invariant input (i.e., raw pixels only) as the game state s, instead of domain-specific input such as paddle and ball position or dead/alive blocks.
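
A minimal PyTorch sketch of the general idea: a convolutional network that maps raw pixel frames directly to one Q-value per action. The input shape (4 stacked 84x84 grayscale frames) and the layer sizes are illustrative assumptions, not an exact reproduction of DeepMind's network.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps stacked pixel frames (the state s) to one Q-value per action.
    Layer sizes and input shape are illustrative assumptions."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per action
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, 4, 84, 84) -> Q-values: (batch, n_actions)
        return self.head(self.features(pixels))

q_net = QNetwork(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))
print(q_values.shape)  # torch.Size([1, 4])
```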
- During gameplay all experiences <s, a, r, s'> are stored in a replay memory. When training the network, random minibatches from the replay memory are used instead of the most recent transition.
- This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum.
- One could actually collect all those experiences from human gameplay and then train the network on these.
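
A small sketch of a replay memory that stores <s, a, r, s'> transitions and samples random minibatches; the capacity, batch size, and made-up transitions are illustrative choices.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores <s, a, r, s'> transitions and samples random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old experiences are dropped when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # random minibatch instead of the most recent transitions,
        # which breaks the correlation between subsequent training samples
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory()
for t in range(1000):                      # made-up transitions
    memory.add(f"s{t}", "noop", 0.0, f"s{t+1}")
batch = memory.sample(batch_size=32)
print(len(batch))  # 32
```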
- By default, Q-learning's Q-table/Q-network is initialized randomly, and so exploration is built into the algorithm. However, this exploration is "greedy", meaning that it settles on the first effective strategy it finds and tries to optimize it.
- A simple and effective fix for the above problem is ε-greedy exploration – with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value.
- In their system DeepMind actually decreases ε over time from 1 to 0.1 – in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
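
A short sketch of ε-greedy action selection with ε linearly annealed from 1 to 0.1; the length of the annealing schedule and the list-of-Q-values interface are illustrative assumptions.

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly decrease epsilon from eps_start to eps_end over anneal_steps steps.
    The schedule length is an illustrative assumption."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_end + (1.0 - fraction) * (eps_start - eps_end)

def epsilon_greedy(q_values, step):
    """With probability epsilon pick a random action, otherwise the greedy (highest-Q) one."""
    if random.random() < epsilon_by_step(step):
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

print(epsilon_by_step(0))           # 1.0 – completely random moves at the start
print(epsilon_by_step(2_000_000))   # 0.1 – settles down to a fixed exploration rate
print(epsilon_greedy([0.2, 1.5, -0.3], step=2_000_000))  # usually 1, the highest-Q action
```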