RL
RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

The main characters of RL are the agent and the environment. The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.

RL terminology:

  • states and observations,
  • action spaces,
  • policies,
  • trajectories,
  • different formulations of return,
  • the RL optimization problem,
  • value functions.

States and Observations

A state s is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation o is a partial description of a state, which may omit information.

In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor. For instance, a visual observation could be represented by the RGB matrix of its pixel values; the state of a robot might be represented by its joint angles and velocities.

Action Spaces

Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent. Other environments, like where the agent controls a robot in a physical world, have continuous action spaces. In continuous spaces, actions are real-valued vectors.

Policies

Policy, as the agent’s behavior function, tells us which action to take in state s. It is a mapping from state s to action a and can be either deterministic or stochastic.

Deterministic Policies

Example: Deterministic Policies. Here is a code snippet for building a simple deterministic policy for a continuous action space in PyTorch, using the torch.nn package:

import torch
import torch.nn as nn

# obs_dim and act_dim are the observation and action dimensions of your environment
pi_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, act_dim)
)

This builds a multi-layer perceptron (MLP) network with two hidden layers of size 64 and tanh activation functions. If obs is a NumPy array containing a batch of observations, pi_net can be used to obtain a batch of actions as follows:

obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
actions = pi_net(obs_tensor)
Stochastic Policies

The two most common kinds of stochastic policies in deep RL are categorical policies and diagonal Gaussian policies.

Categorical policies can be used in discrete action spaces, while diagonal Gaussian policies are used in continuous action spaces.

Two key computations are centrally important for using and training stochastic policies:

  • sampling actions from the policy,
  • and computing log likelihoods of particular actions, $\log \pi_{\theta}(a|s)$.
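
For the categorical case, both of these computations can be done with torch.distributions. A minimal sketch, assuming logits_net is any network that maps an observation to one logit per discrete action:

import torch
from torch.distributions import Categorical

# logits_net is assumed to be an nn.Module mapping obs -> one logit per action
def sample_and_logprob(logits_net, obs):
    logits = logits_net(obs)            # unnormalized log-probabilities
    dist = Categorical(logits=logits)   # categorical distribution over actions
    action = dist.sample()              # sample an action from the policy
    log_prob = dist.log_prob(action)    # log pi_theta(a|s), needed for the gradient
    return action, log_prob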
Diagonal Gaussian policies

A multivariate Gaussian distribution (or multivariate normal distribution, if you prefer) is described by a mean vector, $\mu$, and a covariance matrix, $\Sigma$. A diagonal Gaussian distribution is a special case where the covariance matrix only has entries on the diagonal. As a result, we can represent it by a vector.

A diagonal Gaussian policy always has a neural network that maps from observations to mean actions, $\mu_{\theta}(s)$. There are two different ways that the covariance matrix is typically represented.

  • The first way: There is a single vector of log standard deviations, $\log \sigma$, which is not a function of state: the $\log \sigma$ are standalone parameters. (You Should Know: our implementations of VPG, TRPO, and PPO do it this way.)
  • The second way: There is a neural network that maps from states to log standard deviations, $\log \sigma_{\theta}(s)$. It may optionally share some layers with the mean network.
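
A minimal sketch of the first representation, assuming mu_net maps observations to mean actions and act_dim is the action dimension (the log stds are standalone parameters, not a function of state):

import torch
import torch.nn as nn
from torch.distributions import Normal

# state-independent log standard deviations (the "first way")
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

def gaussian_sample_and_logprob(mu_net, obs):
    mu = mu_net(obs)                     # mean action mu_theta(s)
    std = torch.exp(log_std)
    dist = Normal(mu, std)               # independent Gaussians = diagonal covariance
    action = dist.sample()
    log_prob = dist.log_prob(action).sum(dim=-1)   # sum over action dims for the joint log-likelihood
    return action, log_prob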

Trajectories

A trajectory $\tau$ is a sequence of states and actions in the world,

$\tau = (s_0, a_0, s_1, a_1, ...).$

The very first state of the world, $s_0$, is randomly sampled from the start-state distribution, sometimes denoted by $\rho_0$:

$s_0 \sim \rho_0(\cdot).$

Reward and Return

The reward function R is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world:

$r_t = R(s_t, a_t, s_{t+1})$

although frequently this is simplified to just a dependence on the current state, $r_t$ = $R(s_t)$, or state-action pair $r_t$ = $R(s_t,a_t)$.

The goal of the agent is to maximize some notion of cumulative reward over a trajectory, but this actually can mean a few things. We’ll notate all of these cases with $R(\tau)$, and it will either be clear from context which case we mean, or it won’t matter (because the same equations will apply to all cases).

Return

  • One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of steps.
  • Another kind of return is the infinite-horizon discounted return, which is the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they’re obtained.

While the line between these two formulations of return is quite stark in RL formalism, deep RL practice tends to blur the line a fair bit: for instance, we frequently set up algorithms to optimize the undiscounted return, but use discount factors in estimating value functions.
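
As a quick illustration of the second case, here is a small helper (plain Python, illustrative only) that computes a discounted return from a finite list of rewards:

def discounted_return(rewards, gamma=0.99):
    # sum over t of gamma^t * r_t for a finite list of rewards
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# e.g. discounted_return([1.0, 1.0, 1.0], gamma=0.9) == 1 + 0.9 + 0.81 = 2.71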


The RL Problem

Whatever the choice of return measure (whether infinite-horizon discounted, or finite-horizon undiscounted), and whatever the choice of policy, the goal in RL is to select a policy which maximizes expected return when the agent acts according to it.


Value Functions

A value function measures how good a state (or state-action pair) is, via a prediction of future reward. The future reward, also known as the return, is the total sum of discounted rewards going forward.

It’s often useful to know the value of a state, or state-action pair. By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. Value functions are used, one way or another, in almost every RL algorithm.


The optimal value function, $V^*(s)$, gives the expected return if you start in state s and always act according to the optimal policy in the environment.

Bellman Equations

These are a set of equations that decompose the value function into the immediate reward plus the discounted future values.


On-Policy Backup: Combines results using a weighted sum (the policy distribution). Optimal Backup: Picks the single best result (the maximum).

Advantage Functions

Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.

The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action a in state s, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after. Mathematically, the advantage function is defined by

$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$.


Markov Decision Processes

In more formal terms, almost all RL problems can be framed as Markov Decision Processes (MDPs). All states in an MDP have the Markov property: the future depends only on the current state, not on the history:

$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_1]$

Or in other words, the future and the past are conditionally independent given the present, as the current state encapsulates all the statistics we need to decide the future.

A Taxonomy of RL Algorithms

Model-Free vs Model-Based RL

One of the most important branching points in an RL algorithm is the question of whether the agent has access to (or learns) a model of the environment. By a model of the environment, we mean a function which predicts state transitions and rewards.

The main upside to having a model is that it allows the agent to plan by thinking ahead, seeing what would happen for a range of possible choices, and explicitly deciding between its options. Agents can then distill the results from planning ahead into a learned policy. A particularly famous example of this approach is AlphaZero. When this works, it can result in a substantial improvement in sample efficiency over methods that don’t have a model.

The main downside is that a ground-truth model of the environment is usually not available to the agent. If an agent wants to use a model in this case, it has to learn the model purely from experience, which creates several challenges. The biggest challenge is that bias in the model can be exploited by the agent, resulting in an agent which performs well with respect to the learned model, but behaves sub-optimally (or super terribly) in the real environment. Model-learning is fundamentally hard, so even intense effort—being willing to throw lots of time and compute at it—can fail to pay off.

Algorithms which use a model are called model-based methods, and those that don’t are called model-free. While model-free methods forego the potential gains in sample efficiency from using a model, they tend to be easier to implement and tune.

What to Learn

Another critical branching point in an RL algorithm is the question of what to learn. The list of usual suspects includes

  • policies, either stochastic or deterministic,
  • action-value functions (Q-functions),
  • value functions,
  • and/or environment models.

What to Learn in Model-Free RL

There are two main approaches to representing and training agents with model-free RL:

  • Policy Optimization
  • Q-Learning

Policy Optimization

  • Definition: Directly optimizes a parameterized policy $\pi_{\theta}(a|s)$ by adjusting θ to maximize the expected return.
  • Examples: REINFORCE (Vanilla Policy Gradient), PPO, TRPO, A2C/A3C (actor-critic methods).
  • Key Feature: Learns the policy explicitly—the probability distribution over actions.

Q-Learning

  • Definition: Learns an action-value function Q(s,a) that estimates the expected return for taking action a in state s and following the optimal policy thereafter.
  • Examples: DQN, Double DQN, Rainbow DQN.
  • Key Feature: Learns the value of each action, then derives the policy by picking the argmax over Q(s,a).

Core Distinction

  • Policy Optimization = Optimize the policy directly.
  • Q-Learning = Optimize a value function and derive the policy implicitly by taking the best action according to Q.

What to Learn in Model-Based RL

Unlike model-free RL, there aren’t a small number of easy-to-define clusters of methods for model-based RL: there are many orthogonal ways of using models.

Policy Optimization

The “loss” function in policy gradient methods differs fundamentally from typical supervised learning losses.

Data Distribution Depends on the Policy: $\pi_{\theta}$

  • In supervised learning, the data (input–label pairs) are fixed. You don’t change the dataset each time you update your model’s parameters.
  • In policy gradient RL, every time you update 𝜃, you typically need to re-collect data under that new policy. The distribution of states and actions you see changes as the policy changes. Thus, the “loss” is not defined w.r.t. a constant dataset; it is always entwined with 𝜃.

Doesn’t Measure the True Performance

  • In supervised learning, the loss (e.g., cross‐entropy) is usually a direct measure of how well the model is doing. When the cross‐entropy goes down, your model’s predictive accuracy typically goes up.
  • In policy gradient RL, this “loss” is not an estimate of the expected return. Even if you drive it to −∞ on a fixed batch, you might wreck the policy’s true performance.

Imagine a Robot in a Maze

  • Your policy instructs the robot how to move. Initially, it might stumble around near the start. The data (trajectories) you gather show the robot mostly exploring the left side of the maze. You do a policy update based on that data. Now the policy improves slightly, so it’s more likely to move right when it senses it’s near a path toward the goal. Next time you run the robot, it might quickly navigate to the right side of the maze (closer to the goal), collecting data from a new region of the maze it has never visited before.

Interview Tip If asked, “Why is the policy gradient loss not a real performance metric?” you can answer: (1) “The data distribution depends on the policy parameters.” (2) “This function is only chosen to produce the correct gradient at each step, not to measure how good the policy is overall.”

Reward-to-go policy gradient

$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)\right]$

Problem with the Simple Form

  • In the expression above, each action’s probability is reinforced (increased) by the entire return R(τ) of the episode, even the rewards that happened before that action was taken.
  • Why that’s counterintuitive: a big reward at the beginning of the episode is unrelated to later actions, yet the naive formula would “credit” those later actions for earlier rewards.

Agents should really only reinforce actions on the basis of their consequences. Rewards obtained before taking an action have no bearing on how good that action was: only rewards that come after.

It turns out that this intuition shows up in the math, and we can show that the policy gradient can also be expressed by

$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1})\right]$

$\sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})$ picks out only the rewards that come after action $a_t$ is taken.

We’ll call this form the reward-to-go policy gradient, because the sum of rewards after a point in a trajectory is called the reward-to-go from that point.
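
A quick sketch of how reward-to-go weights can be computed from a list of per-step rewards (the function name and NumPy usage here are illustrative, not from the original):

import numpy as np

def reward_to_go(rews):
    # rtgs[t] = sum of rewards from step t to the end of the episode
    n = len(rews)
    rtgs = np.zeros_like(rews, dtype=np.float32)
    running = 0.0
    for t in reversed(range(n)):
        running += rews[t]
        rtgs[t] = running
    return rtgs

# e.g. reward_to_go([1, 0, 2]) -> [3., 2., 2.]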

Running One Epoch of Training

# for training policy
def train_one_epoch():
    # make some empty lists for logging.
    batch_obs = []          # for observations
    batch_acts = []         # for actions
    batch_weights = []      # for R(tau) weighting in policy gradient
    batch_rets = []         # for measuring episode returns
    batch_lens = []         # for measuring episode lengths

    # reset episode-specific variables
    obs = env.reset()       # first obs comes from starting distribution
    done = False            # signal from environment that episode is over
    ep_rews = []            # list for rewards accrued throughout ep

    # render first episode of each epoch
    finished_rendering_this_epoch = False

    # collect experience by acting in the environment with current policy
    while True:

        # rendering
        if (not finished_rendering_this_epoch) and render:
            env.render()

        # save obs
        batch_obs.append(obs.copy())

        # act in the environment
        act = get_action(torch.as_tensor(obs, dtype=torch.float32))
        obs, rew, done, _ = env.step(act)

        # save action, reward
        batch_acts.append(act)
        ep_rews.append(rew)

        if done:
            # if episode is over, record info about episode
            ep_ret, ep_len = sum(ep_rews), len(ep_rews)
            batch_rets.append(ep_ret)
            batch_lens.append(ep_len)

            # the weight for each logprob(a|s) is R(tau)
            batch_weights += [ep_ret] * ep_len

            # reset episode-specific variables
            obs, done, ep_rews = env.reset(), False, []

            # won't render again this epoch
            finished_rendering_this_epoch = True

            # end experience loop if we have enough of it
            if len(batch_obs) > batch_size:
                break

    # take a single policy gradient update step
    optimizer.zero_grad()
    batch_loss = compute_loss(obs=torch.as_tensor(batch_obs, dtype=torch.float32),
                              act=torch.as_tensor(batch_acts, dtype=torch.int32),
                              weights=torch.as_tensor(batch_weights, dtype=torch.float32)
                              )
    batch_loss.backward()
    optimizer.step()
    return batch_loss, batch_rets, batch_lens
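
The loop above relies on pieces defined elsewhere in the full script: the policy network, the optimizer, batch_size, and the helpers get_action and compute_loss. A minimal sketch of those two helpers for a categorical (discrete-action) policy, assuming logits_net is the policy network:

from torch.distributions import Categorical

# policy: obs -> action distribution (logits_net assumed to be an nn.Module)
def get_policy(obs):
    logits = logits_net(obs)
    return Categorical(logits=logits)

# sample an action from the current policy
def get_action(obs):
    return get_policy(obs).sample().item()

# "loss" whose gradient equals the policy gradient estimate on this batch;
# it is not a performance measure (see the discussion above)
def compute_loss(obs, act, weights):
    logp = get_policy(obs).log_prob(act)
    return -(logp * weights).mean()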

Explanation

act = get_action(torch.as_tensor(obs, dtype=torch.float32))
obs, rew, done, _ = env.step(act)

Example

Imagine you have a robot learning to walk in a simple 2D environment:

  • Current observation: obs might be something like [0.5, -0.2, 1.3], representing the robot’s joint angles, velocities, etc.
  • Policy network: get_action(torch.tensor([0.5, -0.2, 1.3])) runs these numbers through a feed-forward neural network that outputs a 2D action, say [torque_hip, torque_knee].
  • Action: suppose it outputs act = [0.7, -0.1], meaning “apply +0.7 torque to the hip joint, -0.1 torque to the knee joint.”
  • Environment step: env.step([0.7, -0.1]) updates the robot’s position and returns the new observation (e.g., [0.6, -0.15, 1.4]), the reward (e.g., +1 if it’s still upright), and whether it fell over (done=False if not).
  • Continue: this loop repeats, with the robot taking actions and the environment returning new states and rewards, until done=True (e.g., the robot fell over or the time limit expired).

Gymnasium

Environment Interface

The core methods are:

  • gymnasium.make(env_id)

    • Creates an environment instance for the specified environment ID (e.g., "CartPole-v1").
  • env.reset()

    • Resets the environment to an initial state; in Gymnasium, it returns a tuple (obs, info).
  • env.step(action)

    • Applies the chosen action to the environment.
    • Returns a tuple (obs, reward, terminated, truncated, info):
      • obs: Next observation (state).
      • reward: Scalar reward from the last action.
      • terminated: Whether the episode reached a terminal state (success/failure).
      • truncated: Whether the episode was cut off due to a time limit or other constraint.
      • info: Extra diagnostic data (e.g., state transitions, debugging info).
  • env.render() (optional)

    • Renders a visual representation of the environment state (e.g., a 2D or 3D view).
  • env.close()

    • Shuts down any external resources (windows, rendering processes).

3. Typical RL Loop

Here’s a simple pseudocode:

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

for _ in range(1000):
    # Choose an action (e.g., random or from a policy)
    action = env.action_space.sample()
    
    # Step the environment
    obs, reward, terminated, truncated, info = env.step(action)
    
    # (Optional) Render (requires render_mode="human" in gym.make)
    env.render()

    # If the episode terminated or was truncated, reset the environment
    if terminated or truncated:
        obs, info = env.reset()
        
env.close()

4. Common Environments

  1. Classic Control: CartPole-v1, MountainCar-v0, Pendulum-v1

    • Low-dimensional state and discrete/continuous actions.
    • Great for getting started.
  2. Box2D: LunarLander-v2, BipedalWalker-v3 (requires gymnasium[box2d])

    • Physics-based 2D tasks with continuous or discrete actions.
  3. Atari: Pong-v5, Breakout-v5, etc. (requires gymnasium[atari])

    • High-dimensional image observations, typically used for Deep Q-Learning and other advanced methods.
  4. MuJoCo: HalfCheetah-v4, Hopper-v4, Ant-v4 (requires gymnasium[mujoco])

    • Continuous control tasks in 3D physics simulation.
    • Popular for benchmarking advanced RL algorithms like PPO, SAC, TD3.

Pass 3: Practical Tips & Best Practices

  1. Check Action/Observation Spaces

    • Each Gymnasium environment has env.action_space and env.observation_space.
    • These define the allowed range or discrete set of actions and the shape/limit of observations.
    • Always verify that your RL agent’s network architecture matches the dimensionality of env.observation_space.
  2. Seeding for Reproducibility

    • Use env.reset(seed=some_seed) to seed the environment for deterministic behavior when debugging.
    • Also set random seeds for other libraries (NumPy, PyTorch, etc.) if you need completely reproducible runs.
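
A quick way to inspect these before wiring up a network (a small sketch using CartPole as an example):

import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)          # a Box space for CartPole
print(env.observation_space.shape)    # (4,) -> 4-dimensional observation
print(env.action_space)               # Discrete(2)
print(env.action_space.n)             # 2 available actions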

Q Learning

Quick explanation, we assume Q function to be a table for ease of understanding:

  • Goal: The agent wants to find the best path to the "+10 reward" in a grid. It needs to learn the best way to navigate the grid to maximize its reward.

  • Environment: Imagine a simple 3x3 game board. Some squares give positive points (+10), some give negative points (-10), and most give a small penalty (-1).

    Screenshot 2025-03-11 at 12 42 06 PM
  • Q-Table: The agent uses a table (called a Q-table) to remember what it has learned. Each cell in the table represents a state-action pair and contains a "Q-value". This value represents how good it is to take a certain action in a specific state. The goal of Q-learning is to learn these Q values such that the total reward is maximized.

    Screenshot 2025-03-11 at 12 43 49 PM
  • Exploration: Initially, the agent explores the board randomly. It doesn’t yet know which actions are good or bad, so it tries different things; it won’t simply choose the action with the highest Q-value. This random exploration is guided by a “behavior policy”.

  • Learning: After each action, the agent updates its Q-table using the Bellman equation, which defines a recursive relationship between Q values. The update considers the reward received and the potential future rewards. This update is influenced by the "learning rate" (alpha) and "discount factor" (gamma).

    For example, with gamma = 0.1, a step reward of -1, and the most profitable action from S2 worth 1.5 (down):
    
    TD target for (S1, right) = -1 + (0.1)(1.5) = -0.85
    
    
  • Temporal Difference Error: The agent calculates the difference between its expected reward and the actual reward it receives. This difference, called the "temporal difference error," helps the agent adjust its Q-values to be more accurate.

    TD error = TD target - current estimate = -0.85 - Q(S1, right) = -0.85 - 1 = -1.85
    
    With learning rate alpha = 0.1, the updated estimate is:
    
    Q(S1, right) = 1 + (0.1)(-1.85) = 0.815
    
    
  • Episodes: The agent repeats the exploration and learning process many times. Each sequence of steps, until the agent reaches a terminal state (+10 or -10), is called an "episode".

  • Exploitation: As the agent explores, the Q-table values become more accurate. Eventually, the agent starts to use its Q-table to make decisions. It chooses the action with the highest Q-value in a given state, following what is known as the "target policy".

  • Off-Policy: Q-learning is an "off-policy" method because the agent can explore randomly (behavior policy) while learning an optimal strategy (target policy).

reference: https://www.youtube.com/watch?v=TiAXhVAZQl8
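
Putting the walkthrough above into code, a minimal tabular Q-learning update might look like this (the grid size, hyperparameters, and function names are assumptions for illustration):

import numpy as np

n_states, n_actions = 9, 4          # e.g. a 3x3 grid with up/down/left/right
Q = np.zeros((n_states, n_actions)) # the Q-table
alpha, gamma = 0.1, 0.9

def q_update(s, a, r, s_next, terminal):
    # Bellman-style target: immediate reward plus discounted best future value
    target = r if terminal else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error     # move the estimate toward the target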


Q-Learning Advanced

Replay buffer

  1. What Is It?

    • A replay memory buffer is a data structure—often a queue—that stores past transitions (state, action, reward, next_state, done) encountered by the agent.
  2. Why Use It?

    • Break Correlations: By randomly sampling past experiences, you reduce the correlation present in sequential samples.
    • Data Efficiency: You can reuse transitions multiple times, rather than throwing them away right after one update.
  3. Benefits

    • Stabilizes training: Particularly important for deep RL, where correlated inputs can cause the Q-network to oscillate wildly.
    • Improves performance: Empirically shown to help the agent converge faster and more reliably.

Epsilon Greediness Exploration

Motivation: If you always pick the highest Q‐valued action, you might miss out on discovering better actions. Random actions (ϵ fraction of the time) let you gather new information.

Implementation:

  • Generate a random number 0≤p≤1
  • If p<ϵ, pick an action at random.
  • Otherwise, pick $argmax_a Q(s,a)$.
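
A minimal sketch of this selection rule, assuming q_values_for_state holds one Q-value per action for the current state:

import numpy as np

def epsilon_greedy_action(q_values_for_state, epsilon):
    # q_values_for_state: 1-D array of Q(s, a) for every action a
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values_for_state))   # explore
    return int(np.argmax(q_values_for_state))                # exploit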

Fixed Q‐Targets / Target Networks

What Problem Does Fixed Q‐Targets Solve?

In Q‐learning, we want to update our estimate of the action‐value function:

$Q_\theta(s, a) \leftarrow r + \gamma \max_{a'} Q_\theta(s', a').$

However, if $(\theta)$ is the same set of parameters the network uses to compute both:

  1. The left side, $(Q_\theta(s,a))$ we’re updating, and
  2. The right side, $(\max_{a'}Q_\theta(s',a'))$ – the target,

then each time we change $(\theta)$ (by gradient descent), the target also changes, causing rapid oscillations and an unstable learning process.

Fixed Q‐Targets (or “target network”) addresses this by using a separate, periodically updated network to compute the target. This means the target you’re trying to match doesn’t bounce around every training step.

How Fixed Q‐Targets Work

  1. Two Networks

    • Main (online) Q‐Network: with parameters $(\theta)$. This is the network you train every step or batch using gradient descent.
    • Target Q‐Network: with parameters $(\theta^{-})$. This network is copied from the main network every so often (e.g., every N steps or episodes), but otherwise remains frozen in between.
  2. Target Computation

    • When computing the target for a transition $(s,a,r,s')$: $y = r + \gamma \max_{a'} Q_{\theta^{-}}(s', a')$, you use the target network $(Q_{\theta^{-}})$ to get $(\max_{a'}Q_{\theta^{-}}(s', a'))$.
    • The main network $(Q_\theta(s,a))$ is then updated to minimize the difference $((y - Q_\theta(s,a))^2)$.
  3. Syncing (Copying) the Networks

    • Periodically (e.g., after a certain number of steps or episodes), we do $(\theta^{-} \leftarrow \theta)$.
    • This updates the target network to the main network’s parameters, freezing them again for the next interval.
  4. Result

    • Because the target network is only updated occasionally, it doesn’t change at every optimization step. That means the target $(y)$ stays relatively stable while the main network “catches up.”
    • This dramatically stabilizes training in practice.

Why It Stabilizes Learning

  • If you used the same network to compute both $(Q_{\theta}(s,a))$ and the target $(\max_a Q_{\theta}(s',a))$ at every step, the target would be immediately influenced by any small tweak in parameters. This can cause feedback loops and blow up the estimates.
  • Fixed Q‐Targets decouple the two roles:
    • Estimation role: Q_main is updated from data.
    • Target role: Q_target provides a stationary reference for a while, then is updated periodically in one big “catch‐up.”
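
In code, this is just two copies of the same network plus an occasional parameter copy. A PyTorch-flavored sketch, where q_net, the batch tensors, and the sync interval are assumptions:

import copy
import torch

# q_net is assumed to be an nn.Module mapping states to one Q-value per action
target_net = copy.deepcopy(q_net)        # frozen copy used only for targets
target_net.eval()

def td_targets(rewards, next_states, dones, gamma=0.99):
    # rewards and dones are float tensors; dones masks out the bootstrap term
    with torch.no_grad():                # targets must not backprop into the target net
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def sync_target_net():
    # called every N steps: theta_minus <- theta
    target_net.load_state_dict(q_net.state_dict())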

Double Q‐Learning

An algorithm designed to address overestimation bias in Q‐learning.

Pass 1: High‐Level Overview

  1. What Is Double Q‐Learning?

    • A modification to standard Q‐learning that uses two separate Q‐value estimates to reduce the overestimation of action values.
  2. Why Overestimation Happens

    • In vanilla Q‐learning, the update target is $r + \gamma \max_{a'} Q(s', a')$.
    • Because you take the max over an imperfect Q function, noise or errors in Q can lead to systematically overestimated values.
  3. Key Idea

    • Split the Q function into two estimates (e.g., $(Q^A)$ and $(Q^B))$, and alternate which one you use for the action selection vs. action evaluation parts of the Bellman target.
    • This decoupling reduces the bias that arises from using the same Q to both pick and evaluate the best action.

Pass 2: More Detail on the Algorithm

  1. Overestimation Bias in Standard Q‐Learning

    • If (Q) is optimistic for certain actions (even by accident), the update target can become inflated.
    • Over many updates, the Q values may drift too high, hurting learning stability and policy quality.
  2. Double Q‐Learning Updates

    • Keep two Q‐tables (or in deep RL, two Q‐networks):
      • $(Q^A(s,a))$
      • $(Q^B(s,a))$
    • For each transition $(s,a,r,s')$, you update one of the two estimates in a split manner:
      1. Update $Q^A$ by selecting the best next action with $Q^A$ but evaluating it with $Q^B$: $Q^A(s,a) \leftarrow Q^A(s,a) + \alpha \Big(r + \gamma\, Q^B\bigl(s', \arg\max_{a'} Q^A(s',a')\bigr) - Q^A(s,a)\Big)$.
      2. Or update $Q^B$ symmetrically, selecting the best action with $Q^B$ and evaluating it with $Q^A$.
    • In practice, you often alternate updates or randomly choose which Q to update each step.
  3. Motivation

    • Because one Q chooses the action (the $\arg\max$) and the other Q evaluates it, you’re less prone to repeatedly inflating the same Q’s estimates.
  4. Double DQN (Deep Q‐Network)

    • In the deep version, you have two networks:
      1. Online network $(Q_\theta)$.
      2. Target network $(Q_{\theta^-})$.
    • Double DQN modifies the target calculation to $y = r + \gamma\, Q_{\theta^-}\bigl(s', \arg\max_{a} Q_{\theta}(s', a)\bigr)$.
    • That’s conceptually similar to the two‐table approach, reducing the overestimation.
  • Standard Q‐learning can overestimate action values because it uses the same Q to select and evaluate actions in the max.
  • Double Q‐learning uses two Q estimates, or a variant with one online network plus a target network, to decouple action selection and evaluation.
  • This yields less bias and often better performance in both tabular and deep RL settings.

Double DQN

  • Original Double Q-learning introduced a second Q-table
  • Alternated between both tables to decouple action selection from value estimation

Double DQN Approach

  • Uses same core concepts as vanilla DQN with fixed Q-targets
  • Key difference: separates networks for action selection and value estimation
  • Instead of adding another network, DDQN:
    • Uses online network for next action selection
    • Uses target network only for value estimation
  • Not identical to Double Q-learning (doesn't alternate between networks)
  • Designed to get most benefit with minimal change to original DQN algorithm

Implementation Differences

  • Initialization of online and target networks: identical to DQN
  • Target network update procedure: identical to DQN
  • Only difference is in target Q-value calculation:
    • DQN: Target network used for both action selection and value estimation
    • DDQN: Online network selects action, target network evaluates action value
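
The single difference is easiest to see side by side. A PyTorch-flavored sketch, assuming online_net and target_net map batches of states to per-action Q-values and dones is a float mask:

import torch

def dqn_target(rewards, next_states, dones, gamma=0.99):
    # DQN: the target network both selects and evaluates the next action
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def ddqn_target(rewards, next_states, dones, gamma=0.99):
    # DDQN: the online network selects the action, the target network evaluates it
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q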

Performance

  • DDQN shows substantially higher median and average scores across Atari environments
  • Performance is task-dependent
  • In some cases, vanilla DQN may perform as well or better
  • Best practice is to try both approaches

Prioritized Experience Replay

Pass 1: High‐Level Overview

  1. Experience Replay (Standard)

    • In standard replay, we store transitions (state, action, reward, next state, done) in a buffer, and sample uniformly from this buffer to form training batches.
  2. Why Prioritization?

    • Some transitions matter more than others—for example, those with large TD‐error (the difference between predicted Q‐value and actual return) may have more to teach the agent.
    • If we sample them more frequently, the agent can learn faster from these high‐error or “surprising” experiences.
  3. Prioritized Experience Replay

    • Instead of sampling uniformly, store a priority $(p_i)$ for each transition $(i)$ in the buffer.
    • Transitions with higher priority are sampled more often.

Pass 2: More Detail

  1. Defining Priorities

    • Typically, priority $(p_i)$ is proportional to the TD‐error:
      $p_i \propto \bigl|\delta_i\bigr| + \epsilon$, where $\delta_i = r + \gamma \max_a Q(s',a) - Q(s,a)$ is the TD‐error, and $\epsilon$ is a small constant to avoid zero probability.
  2. Sampling from a Priority Distribution

    • The probability of sampling transition $i$ is $P(i) = \frac{p_i^\alpha}{\sum_j p_j^\alpha}$, where $\alpha$ adjusts how much prioritization is used (0 = uniform, 1 = fully proportional).
  3. Importance Sampling Weights

    • Because we’re sampling non‐uniformly, we introduce a bias in the update. Transitions with higher TD‐error are sampled more.
    • To correct for this, we apply importance sampling (IS) weights $w_i = \Bigl(\frac{1}{N} \cdot \frac{1}{P(i)}\Bigr)^\beta$, where $N$ is the buffer size and $\beta$ is an annealing parameter (increasing over time) to fully compensate for the bias at the end of training.
  4. Updating Priorities

    • After you compute a transition’s new TD‐error, you update its priority in the replay buffer.
    • This means transitions that remain “surprising” keep getting sampled frequently until the network learns them.

Importance sampling (IS) weights and the role of $(\beta)$ in Prioritized Experience Replay

In standard experience replay, we sample transitions uniformly from the buffer. If we switch to prioritized sampling (where some transitions have higher probability of being chosen), we introduce a sampling bias:

  • High‐error transitions get sampled more often, while low‐error transitions get sampled less often.
  • But in Q‐learning theory, the updates are assumed to come from an unbiased sample of the state‐action space or replay buffer.

Hence, we need to correct for this bias to ensure our gradient updates don’t skew toward frequently sampled transitions. That’s where importance sampling (IS) weights come in.

Importance Sampling Weights $(w_i)$

Each transition (i) in the replay buffer has some sampling probability $(P(i))$ proportional to its priority. Suppose:

$P(i) = \frac{p_i^\alpha}{\sum_j p_j^\alpha}$

where $(p_i)$ is the transition’s priority and $(\alpha)$ controls how much prioritization is used.

  • If $(\alpha = 0)$, it’s just uniform sampling.
  • If $(\alpha = 1)$, it’s fully proportional to the priorities.

Since we’re not sampling uniformly, each sample (i) must be reweighted to restore unbiasedness. This weight is:

$w_i = \Bigl(\frac{1}{N} \cdot \frac{1}{P(i)}\Bigr)^\beta$,

where:

  • $(N)$ is the replay buffer size.
  • $(\beta)$ is an exponent (between 0 and 1, typically) that determines how much we compensate for the bias.
  • At $(\beta=0)$, there is no correction (like we’re ignoring the bias).
  • At $(\beta=1)$, we fully compensate for the difference in sampling probability.

Then, when we do our Q‐learning update, we multiply the TD error $(\delta_i)$ by $(w_i)$ to reduce the update for transitions that were sampled more often than they would be under a uniform distribution.

3. Role of $(\beta)$

$(\beta)$ dictates how strongly we apply this correction:

  1. Early in Training

    • We might not fully correct for the bias, so $(\beta)$ is small (e.g., 0.4).
    • This means we’re partly capitalizing on the benefit of seeing interesting transitions more often, with only partial bias correction.
  2. Later in Training

    • We gradually anneal $(\beta)$ toward 1.0.
    • This ensures that by the end of training, the updates are almost unbiased, preserving theoretical convergence guarantees.

Why Anneal $(\beta)$?

  • Early on, we want to focus on high‐error transitions (less worried about theoretical bias).
  • As we approach convergence, we want to ensure we’re doing unbiased updates so we converge correctly.
  • Hence, we gradually crank $(\beta)$ up to 1.0.

Summary of the Process

  1. Compute Priority & Probability: Each transition has a TD error $(\delta_i)$. The priority is $p_i = |\delta_i| + \epsilon$. The probability is $P(i) \propto p_i^\alpha$.
  2. Sample Batch: Randomly pick transitions from the buffer according to $(P(i))$.
  3. Compute IS Weights: For each sampled transition $(i)$,
    $w_i = \Bigl(\frac{1}{N} \cdot \frac{1}{P(i)}\Bigr)^\beta$.
  4. Update: The Q‐learning (or policy) update is scaled by $(w_i)$. This lowers the update for transitions that have been oversampled.
  5. Anneal $(\beta)$: Over time, $(\beta)$ moves from ~0.4 or 0.5 up to 1.0.
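
A compact sketch of proportional sampling with IS weights (array-based rather than a sum-tree, purely illustrative):

import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    # priorities: 1-D array with p_i = |td_error_i| + eps for each stored transition
    scaled = priorities ** alpha
    probs = scaled / scaled.sum()                      # P(i) = p_i^alpha / sum_j p_j^alpha
    idx = np.random.choice(len(priorities), size=batch_size, p=probs)
    N = len(priorities)
    weights = (1.0 / (N * probs[idx])) ** beta         # importance-sampling weights
    weights /= weights.max()                           # normalize for stability (common practice)
    return idx, weights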

Key Takeaways

  • Prioritized Replay biases sampling toward high‐error transitions.
  • Sampling Bias arises because we’re no longer sampling uniformly.
  • Importance Sampling Weights $(w_i)$ correct that bias, ensuring each transition’s contribution to the update is appropriately scaled.
  • $(\beta)$ controls how strongly we apply this correction. By annealing $(\beta)$ from a smaller value to 1.0, we gain the benefits of prioritization early on while still maintaining convergence guarantees later.

Policy Learning

Advantages:

  • can be stochastic
  • handle continuous action spaces
  • directly optimize for the objective

Disadvantages:

  • High variance
  • Less sample efficient

Policy gradient and REINFORCE

Differences between REINFORCE and DQN

  • REINFORCE is a Monte-Carlo method, whereas DQN was based on Temporal Difference.
  • Monte-Carlo means that we update the network at the end of the episode, once the entire trajectory and episode return have been observed, rather than at every step as in DQN.

REINFORCE

Key Idea

  • Collect full episodes (trajectories) from an on‐policy agent.
  • For each time step in the episode, update the policy parameters θ in the direction that increases the likelihood of actions that led to higher returns.

Main Limitation

  • REINFORCE can have high variance because it scales the policy gradient by the entire return from a trajectory, which might be very noisy.
  • Often improved by using baselines (like a learned value function) or by using the advantage function.

Policy Representation

  • We have a stochastic policy $(\pi_\theta(a \mid s))$. For example, a neural network that outputs probabilities over discrete actions (or the parameters of a continuous distribution).

Sampling Trajectories

  • From state $(s_0)$, sample actions $(a_0, a_1, \ldots, a_T)$ using the current policy $(\pi_\theta)$.
  • Collect rewards $(r_0, r_1, \ldots, r_T)$.
  • One episode ends either at time $(T)$ or when we hit a terminal state.

Return Computation

  • The return from time step $(t)$ is $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ (finite, or truncated at termination). For a Monte Carlo approach, we only compute this at episode end.

Gradient Update

  • For each time step $(t)$, REINFORCE updates $\theta \leftarrow \theta + \alpha \, \underbrace{\nabla_\theta \log \pi_\theta(a_t \mid s_t)}_{\text{log-derivative trick}} \Bigl(G_t - b\Bigr)$, where $(b)$ is an optional baseline (a constant or learned value).
  • The factor $(\nabla_\theta \log \pi_\theta(a_t \mid s_t))$ tells us how to adjust the probability of the actions taken, weighted by how good the outcome was $(G_t)$.
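
A compact PyTorch-style sketch of this update over one collected episode, assuming a discrete-action policy network logits_net and an optimizer already exist (no baseline, for simplicity):

import numpy as np
import torch
from torch.distributions import Categorical

def reinforce_update(states, actions, rewards, gamma=0.99):
    # returns-to-go G_t for every step of the episode
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    obs = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    acts = torch.as_tensor(np.asarray(actions))
    logp = Categorical(logits=logits_net(obs)).log_prob(acts)

    loss = -(logp * returns).mean()   # gradient ascent on E[log pi(a|s) * G_t]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()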

Advantage Actor-Critic (A2C) algorithm

WHY A2C algo?

  • REINFORCE is a good introduction to policy-gradient methods but has a few limitations. REINFORCE has high variance.
  • In the simplest case, we use one trajectory to evaluate the policy gradient. This makes training very unstable.
  • As a Monte Carlo method, REINFORCE only learns at the end of each episode.
  • In contrast, temporal difference methods learn at each step, allowing for more efficient learning.
  • Actor critic methods address these issues with the Critic network, unlocking Temporal Difference learning.

Intuition behind A2C

  • Imagine a student preparing for an exam. She could study alone and only get feedback when the exam grades are in or join a study group and regularly have peers test her knowledge.
  • The student is like the actor network, deciding what to study and answering practice questions, but poorly placed to evaluate her own progress.
  • The study group is like the critic network, providing regular feedback to help the student learn better and faster.
  • More formally, the Critic network is a value network. Its role is to evaluate the value function at every step to judge the quality of the latest action and provide this feedback to the Actor.

Critic Network

  • Like the Q-networks we encountered in Deep Q-Learning, the critic network is a value function approximator;
  • However, it approximates the state-value function (V) rather than the action-value function (Q).
  • The Critic's role is to judge the Actor's latest action $a_t$ using its advantage or TD-error.
  • The architecture of the Critic network is similar to that of Q-networks, but it has a single output node for the state value.

Actor-critic dynamics

  • The Actor network resembles the policy network in REINFORCE. It represents the agent's stochastic policy, sampling an action at each step.

  • The critic network observes the resulting reward and updated state from the environment.

  • The critic evaluates the TD Error.

  • Actor and Critic use the TD Error to update weights.

  • The updated Actor network also observes the new state.

  • The process starts over, with the actor selecting its action for the next step.


A2C losses

  • We need both the critic and actor networks to learn over time.
  • The critic needs to approximate the state value better, and the actor needs to improve its policy.
  • The critic loss is the squared TD error. The TD error is the critic's rating of the action. Actions with a good outcome have a positive TD error.
  • The actor loss is calculated by taking the negative of the action's log probability times the TD error. This means actions leading to good outcomes (positive TD error) become more likely.
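A sketch of both losses at a single step; actor_net, critic_net, and the tensor shapes of the transition are assumptions, and the TD error is detached in the actor loss so that only the critic loss trains the critic:

import torch
from torch.distributions import Categorical

def a2c_losses(state, action, reward, next_state, done, gamma=0.99):
    value = critic_net(state)                       # V(s_t)
    with torch.no_grad():
        next_value = critic_net(next_state)         # V(s_{t+1}), no gradient through the target
    td_target = reward + gamma * (1.0 - done) * next_value
    td_error = td_target - value                    # the critic's rating of the action

    critic_loss = td_error.pow(2).mean()            # squared TD error

    dist = Categorical(logits=actor_net(state))
    logp = dist.log_prob(action)
    actor_loss = -(logp * td_error.detach()).mean() # raise log-prob of positively rated actions

    return actor_loss, critic_loss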

Vanilla Policy Gradient (VPG)

(https://www.youtube.com/watch?v=k5AnU_zFkac&list=PLvcbYUQ5t0UEu7ajODJiMv9KXJrZTROHR&index=2) Remember the example of inventory management (more or less demand/ pre-order)

Definition:

  • Vanilla Policy Gradient methods directly optimize the parameters of a policy by ascending the gradient of some performance measure (e.g., the expected return).

Core Idea:

  • Instead of learning a value function first and deriving a policy from it, you learn a parameterized policy 𝜋𝜃(a∣s) directly.
  • You gather trajectories by sampling actions from the current policy. Then, you adjust the policy parameters to increase the probability of actions that lead to higher returns.


Proximal policy optimization

Why PPO?

  • PPO is designed to improve on naive policy gradient methods by constraining the size of policy updates, avoiding destructive large parameter shifts.
  • It is often preferred over more complex algorithms like TRPO (Trust Region Policy Optimization) because it’s relatively simple to implement and tune, yet achieves excellent results on many benchmarks.

Key Idea

  • PPO clips the policy update in a way that penalizes changes to the policy that move the new policy probability ratio too far from 1.0 (i.e., from the old policy).
  • This keeps updates to the policy proximal (close) to the old policy, preventing large jumps that can destabilize learning.
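
A sketch of the clipped surrogate loss; the log-probability tensors and advantages are assumed to be precomputed, and the clip range epsilon is commonly around 0.2:

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # probability ratio pi_new(a|s) / pi_old(a|s), computed in log space for stability
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # take the pessimistic (minimum) objective, then negate for gradient descent
    return -torch.min(unclipped, clipped).mean()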

On‐Policy Method

  • PPO is on‐policy: it gathers data from the current policy and then discards or de‐values old data after an update.
  • Typically uses many parallel environments to quickly gather rollout data each iteration.

Entropy bonus

  • Policy gradient algorithms such as A2C may collapse too early into deterministic behaviors, assigning a 0 probability to some actions.
  • It is as if your Mars rover, having made most of its recent progress by going forward, reduced over time to zero the probability of going backwards, assuming wrongly that it is never a good action.
  • Adding an entropy bonus to the objective function helps avoid this, and encourages exploration. The entropy of a probability distribution is a concept from information theory. It measures the uncertainty of its outcome.

As an example, a policy with probability uniformly spread across 4 actions has an entropy of 2 bits. If spread only between 2 actions, such as a coin flip, the entropy is one bit. Finally, if the policy is fully deterministic, its entropy is zero.
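
To make those numbers concrete, a tiny helper (illustrative) that computes the entropy of a discrete action distribution in bits:

import numpy as np

def entropy_bits(probs):
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs[probs > 0]                  # 0 * log(0) is taken as 0
    return float(-(probs * np.log2(probs)).sum())

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (uniform over 4 actions)
print(entropy_bits([0.5, 0.5]))                # 1.0 bit (coin flip)
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits (deterministic policy)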

In practice, updating the policy over a batch of steps rather than at every step is typically preferable, much like how experience replay enhances DQN performance. It stabilizes the training process and leverages the parallel processing capabilities of modern hardware.


on-policy algo

  • On-policy algorithms require fresh data collected from the latest version of the policy for each update.
  • This makes them less sample-efficient compared to off-policy methods (which can reuse past experiences).
  • However, they have a major advantage: They optimize the policy directly rather than indirectly through value function estimates, leading to more stable learning.
  • The progression from VPG → TRPO → PPO represents efforts to reduce the inefficiencies of basic policy gradient methods while maintaining their advantages.

1. What is On-Policy Learning?

  • On-policy methods update the policy using only the most recent data collected under the current policy $\pi_\theta$.
  • Example: In Vanilla Policy Gradient (VPG), we:
    1. Collect a batch of episodes using the current policy.
    2. Compute policy gradients based on those episodes.
    3. Discard the old data and repeat the process.
  • The problem? Inefficiency. Since we throw away past data, we require a lot of interactions with the environment to train effectively.

2. Why Can’t We Use Old Data?

  • The policy is changing at every step.
  • If we used old data from an older policy $\pi_{\theta_{\text{old}}}$, the gradients would not accurately reflect how the current policy $\pi_{\theta_{\text{new}}}$ should be improved.
  • Mathematically, the expected gradient estimates we rely on are only valid under the distribution of actions and states that our current policy generates.

3. What is Off-Policy Learning?

  • Off-policy methods (like DDPG, SAC, and Q-learning) store experiences in a replay buffer and reuse old data.
  • This makes them much more sample-efficient since they learn from past experiences multiple times.

4. Why Stick to On-Policy Despite Lower Sample Efficiency?

  • Stability and Direct Optimization: On-policy methods optimize exactly the objective we care about (expected return under the policy we are training).
  • Avoids Policy Divergence Issues: In off-policy methods, the data comes from an older policy, which can lead to instability.