RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.
The main characters of RL are the agent and the environment. The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.
The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called the return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.
RL terminology:
- states and observations,
- action spaces,
- policies,
- trajectories,
- different formulations of return,
- the RL optimization problem,
- value functions.
A state s is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation o is a partial description of a state, which may omit information.
In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor. For instance, a visual observation could be represented by the RGB matrix of its pixel values; the state of a robot might be represented by its joint angles and velocities.
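As a small illustration (a sketch assuming PyTorch and NumPy; the frame below is a randomly generated stand-in for a real image observation), an RGB observation can be converted to the float tensor that most deep RL code expects:
import numpy as np
import torch

# A hypothetical 84x84 RGB observation with uint8 pixel values in [0, 255].
frame = np.random.randint(0, 256, size=(84, 84, 3), dtype=np.uint8)

# Convert to a float tensor and scale pixel values to [0, 1].
obs = torch.as_tensor(frame, dtype=torch.float32) / 255.0
print(obs.shape)   # torch.Size([84, 84, 3])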
Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, such as those where the agent controls a robot in the physical world, have continuous action spaces. In continuous spaces, actions are real-valued vectors.
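As a quick sketch of the distinction (assuming the Gym library is available; the space sizes below are arbitrary):
from gym.spaces import Discrete, Box
import numpy as np

# A discrete action space with 4 moves (e.g., up/down/left/right).
discrete_space = Discrete(4)
print(discrete_space.sample())        # an integer in {0, 1, 2, 3}

# A continuous action space: 3-dimensional real-valued actions in [-1, 1].
continuous_space = Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
print(continuous_space.sample())      # a length-3 float vector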
A policy is the agent's behavior function: it tells us which action to take in state s. It is a mapping from state s to action a, and can be either deterministic, usually written $a = \mu(s)$, or stochastic, usually written $a \sim \pi(\cdot|s)$.

Example: Deterministic Policies. Here is a code snippet for building a simple deterministic policy for a continuous action space in PyTorch, using the torch.nn package:
import torch
import torch.nn as nn

# obs_dim and act_dim are the observation and action dimensions of the environment.
pi_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, act_dim),
)
This builds a multi-layer perceptron (MLP) network with two hidden layers of size 64 and $\tanh$ activation functions. If obs is a NumPy array containing a batch of observations, pi_net can be used to obtain a batch of actions as follows:
obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
actions = pi_net(obs_tensor)
The two most common kinds of stochastic policies in deep RL are categorical policies and diagonal Gaussian policies. Categorical policies can be used in discrete action spaces, while diagonal Gaussian policies are used in continuous action spaces.
Two key computations are centrally important for using and training stochastic policies:
- sampling actions from the policy,
- and computing log likelihoods of particular actions, $\log \pi_{\theta}(a|s)$.
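Here is a minimal sketch of both computations for a categorical policy, assuming PyTorch; the logits tensor below stands in for the output of a policy network that maps observations to one logit per discrete action:
import torch
from torch.distributions import Categorical

# Hypothetical batch of logits: 5 observations, 4 discrete actions each.
logits = torch.randn(5, 4)

# Build the action distribution from the logits.
dist = Categorical(logits=logits)

# Sampling actions from the policy.
actions = dist.sample()                # shape: (5,)

# Computing log likelihoods of the sampled actions.
log_probs = dist.log_prob(actions)     # shape: (5,)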
A multivariate Gaussian distribution (or multivariate normal distribution, if you prefer) is described by a mean vector, $\mu$, and a covariance matrix, $\Sigma$. A diagonal Gaussian distribution is the special case where the covariance matrix only has entries on the diagonal, so it can be represented by a vector of standard deviations.
A diagonal Gaussian policy always has a neural network that maps from observations to mean actions, $\mu_{\theta}(s)$. There are two common ways the standard deviations are represented:
- The first way: there is a single vector of log standard deviations, $\log \sigma$, which is not a function of state; the $\log \sigma$ are standalone parameters. (You Should Know: our implementations of VPG, TRPO, and PPO do it this way.)
- The second way: there is a neural network that maps from states to log standard deviations, $\log \sigma_{\theta}(s)$. It may optionally share some layers with the mean network.
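A minimal sketch of sampling and log-likelihoods for a diagonal Gaussian policy, assuming PyTorch and the first parameterization above (a state-independent $\log \sigma$; mu is a stand-in for the output of the mean network):
import torch
from torch.distributions import Normal

act_dim = 3

# Hypothetical batch of mean actions: 5 observations, 3 action dimensions.
mu = torch.randn(5, act_dim)

# State-independent log standard deviations, stored as standalone parameters.
log_std = torch.nn.Parameter(-0.5 * torch.ones(act_dim))
std = torch.exp(log_std)

# Build the diagonal Gaussian and sample actions from the policy.
dist = Normal(mu, std)
actions = dist.sample()                           # shape: (5, 3)

# The log likelihood of an action is the sum over its components.
log_probs = dist.log_prob(actions).sum(dim=-1)    # shape: (5,)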
A trajectory $\tau$ is a sequence of states and actions in the world, $\tau = (s_0, a_0, s_1, a_1, \ldots)$.
The very first state of the world, $s_0$, is randomly sampled from the start-state distribution, sometimes denoted by $\rho_0$: $s_0 \sim \rho_0(\cdot)$.
The reward function R is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world: $r_t = R(s_t, a_t, s_{t+1})$, although frequently this is simplified to a dependence on just the current state, $r_t = R(s_t)$, or the current state-action pair, $r_t = R(s_t, a_t)$.
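To make the interaction loop concrete, here is a minimal rollout sketch under an assumed interface (a hypothetical env object whose reset() returns the first state and whose step(a) returns the next state, reward, and a done flag; real libraries such as Gym differ slightly):
def collect_trajectory(env, policy, max_steps=1000):
    """Roll out one trajectory (s_0, a_0, s_1, a_1, ...) and its rewards."""
    states, actions, rewards = [], [], []
    s = env.reset()                  # s_0 sampled from the start-state distribution
    for _ in range(max_steps):
        a = policy(s)                # a_t chosen by the policy
        next_s, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)            # r_t = R(s_t, a_t, s_{t+1})
        s = next_s
        if done:
            break
    return states, actions, rewards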
The goal of the agent is to maximize some notion of cumulative reward over a trajectory, but this can actually mean a few things. We'll notate all of these cases with the return $R(\tau)$.
- One kind of return is the finite-horizon undiscounted return, $R(\tau) = \sum_{t=0}^{T} r_t$, which is just the sum of rewards obtained in a fixed window of steps.
- Another kind of return is the infinite-horizon discounted return, $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, which is the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they're obtained; here $\gamma \in (0, 1)$ is the discount factor.
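As a small numerical sketch (plain Python; the reward list is arbitrary), the two notions of return differ only in the discounting:
rewards = [1.0, 0.0, 2.0, 3.0]   # r_0, r_1, r_2, r_3 from some trajectory
gamma = 0.99                     # discount factor

# Finite-horizon undiscounted return: plain sum over the window.
undiscounted_return = sum(rewards)

# Discounted return: each reward is weighted by gamma^t.
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

print(undiscounted_return)   # 6.0
print(discounted_return)     # 1.0 + 0.0 + 0.99^2 * 2.0 + 0.99^3 * 3.0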
While the line between these two formulations of return is quite stark in RL formalism, deep RL practice tends to blur it a fair bit. For instance, we frequently set up algorithms to optimize the undiscounted return, but use discount factors in estimating value functions.
Whatever the choice of return measure (whether infinite-horizon discounted, or finite-horizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes expected return when the agent acts according to it.
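In symbols (a standard formulation), the expected return of a policy and the RL optimization problem can be written as:
$J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \big], \qquad \pi^* = \arg\max_{\pi} J(\pi)$
where the expectation is over trajectories generated by running $\pi$ in the environment, and $\pi^*$ is the optimal policy.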

The value function measures how good a state, or a state-action pair, is via a prediction of future reward. The future reward, also known as the return, is the total sum of discounted rewards going forward.
It’s often useful to know the value of a state, or state-action pair. By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.
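Concretely (standard definitions, with the expectation over trajectories generated by $\pi$), the on-policy value function and action-value function are:
$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \,|\, s_0 = s \big], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \,|\, s_0 = s, a_0 = a \big]$
The optimal versions, $V^{*}(s)$ and $Q^{*}(s, a)$, take the same expectations under the optimal policy.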

The optimal value function, $V^{*}(s)$, gives the expected return if you start in state $s$ and always act according to the optimal policy. The Bellman equations are a set of equations that decompose the value function into the immediate reward plus the discounted future values.

- On-Policy Backup: combines results using a weighted sum (the policy distribution).
- Optimal Backup: picks the single best result (the maximum).
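As a sketch of the two backups in equation form (standard Bellman equations, writing $s' \sim P(\cdot|s, a)$ for the next state and $\gamma$ for the discount factor):
$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s),\, s' \sim P}\big[ r(s, a) + \gamma V^{\pi}(s') \big]$
$V^{*}(s) = \max_{a} \mathbb{E}_{s' \sim P}\big[ r(s, a) + \gamma V^{*}(s') \big]$
The first averages over actions with the policy's probabilities (on-policy backup); the second takes the maximum over actions (optimal backup).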
Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.
The advantage function is defined as $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$: it describes how much better it is to take action $a$ in state $s$ than to select an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after.
In more formal terms, almost all RL problems can be framed as Markov Decision Processes (MDPs). All states in an MDP have the "Markov" property, referring to the fact that the future only depends on the current state, not the history: $\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \ldots, s_t]$.
Or in other words, the future and the past are conditionally independent given the present, as the current state encapsulates all the statistics we need to decide the future.
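For reference, an MDP is commonly written as a 5-tuple (conventions vary slightly between texts):
$\langle S, A, R, P, \rho_0 \rangle$
where $S$ is the set of states, $A$ the set of actions, $R$ the reward function, $P$ the transition probability function, and $\rho_0$ the start-state distribution; some texts include a discount factor $\gamma$ in the tuple instead.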