@thehunmonkgroup
Created May 1, 2025 13:15
Summary: Cracking The Code Of Action: A Generative Approach To Affordances For Reinforcement Learning

URL: https://arxiv.org/pdf/2504.17282



QUESTION 1:

Can you give me a very clear explanation of the core assertions, implications, and mechanics elucidated in this paper?

ANSWER 1:

Okay, let's break down the core assertions, implications, and mechanics of the paper "Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning".

Based solely on the provided paper:

Core Problem Addressed:

  • Reinforcement Learning (RL) agents struggle with sample efficiency, especially in environments with large action spaces and sparse rewards (like web GUIs).
  • Learning often requires vast amounts of domain-specific expert demonstrations (used in Behavior Cloning - BC), which are costly and burdensome to collect. BC also suffers from limitations like the imitation gap.
  • The paper focuses specifically on the low-data regime, where there is limited or no access to expert demonstrations.

Proposed Solution: CoGA (Code as Generative Affordances)

  • CoGA aims to improve RL sample efficiency by constraining the action space using intent-based affordances.
  • Affordances (in this context): Actions that can bring about intended consequences (intents). They represent the subset of actions relevant in a given state.
  • Intents: Abstract, desired outcomes related to the type of task (e.g., "click on a tab"), distinct from specific goals/instructions (e.g., "click on Tab 1").
  • CoGA uses pre-trained Vision-Language Models (VLMs) to automatically generate Python code (affordance scripts).
  • This generated code takes a pixel observation (screenshot) as input and outputs the set of affordable actions (action types and their pixel coordinates) for that specific situation.
  • This code is then used within the RL training and inference loop to mask the action space, forcing the agent to only consider relevant actions.

Mechanics of CoGA (How it Works):

  1. Code Generation Pipeline:

    • A VLM (GPT-4o is mentioned) is prompted with task descriptions and example observations.
    • Intent Identification: The VLM identifies relevant high-level intents for the task type.
    • Object Identification & Template Extraction: For each intent, the VLM identifies relevant visual objects (e.g., buttons, tabs). It specifies their bounding boxes on example images (using a grid overlay). These bounding boxes are used to automatically crop and save template images of the objects.
    • Strategy & Code Writing: Using chain-of-thought prompting, the VLM develops strategies for identifying relevant intents and affordable actions in any given state. It then writes Python code implementing this strategy. This code uses the saved template images and a pre-written template matching script (using OpenCV, specifically cv2.matchTemplate on grayscaled images) to dynamically detect the locations of affordable objects (and thus actions) in new observations.
    • Output: The generated function (determine_affordable_actions(image)) returns a list of dictionaries, each specifying an affordable action type and its coordinates (e.g., {'action': 'CLICK_COORDS', 'coords': [x_left, y_upper, x_right, y_down]}). A minimal sketch of such a script appears after this list.
  2. Verification Pipeline:

    • The generated script is automatically tested on 5 manually annotated ground truth observations.
    • Error Checking: Checks for execution errors.
    • Critique VLM: Another VLM reviews the code (and its output on 2 test observations) for correctness and provides feedback. If execution errors occur or the critique VLM finds issues, the pipeline regenerates improved code based on that feedback; this process repeats up to 3 times.
    • Quality Metrics: Precision and recall are calculated against the ground truth annotations.
    • Selection: The script that performs best (highest F1-score, computed from precision and recall on the test cases) across runs and critique iterations is selected; a small scoring sketch also appears after this list.
  3. Integration with RL:

    • The selected affordance script is called during the RL agent's training and inference steps.
    • Given the current observation, the script provides the set of affordable actions.
    • This set is used to create a hard mask on the RL agent's action space (probabilities of unaffordable actions are set to 0). The agent can only sample from the predicted affordable actions.
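
To make step 1 concrete, here is a minimal sketch of what a generated affordance script might look like, assuming one saved template image per detectable object. The file paths, the TEMPLATES structure, and the match threshold are hypothetical illustrations, not details taken from the paper; only the use of cv2.matchTemplate on grayscaled images and the output format of determine_affordable_actions follow the description above.

```python
import cv2

# Template images cropped earlier from example observations (paths are hypothetical).
TEMPLATES = {
    "CLICK_COORDS": ["templates/button_one.png", "templates/button_two.png"],
}
MATCH_THRESHOLD = 0.8  # assumed value; the paper does not prescribe one here


def determine_affordable_actions(image):
    """Return affordable actions for a pixel observation (H x W x 3 array)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    actions = []
    for action_type, template_paths in TEMPLATES.items():
        for path in template_paths:
            template = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            if template is None:
                continue  # skip missing template files
            h, w = template.shape
            # Normalized cross-correlation between the template and the observation.
            result = cv2.matchTemplate(gray, template, cv2.TM_CCOEFF_NORMED)
            _, max_val, _, max_loc = cv2.minMaxLoc(result)
            if max_val >= MATCH_THRESHOLD:
                x, y = max_loc
                actions.append({
                    "action": action_type,
                    # [x_left, y_upper, x_right, y_down], matching the format above
                    "coords": [x, y, x + w, y + h],
                })
    return actions
```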
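
The verification step (step 2) can be pictured with an equally small sketch: a candidate script's predictions are compared against the manually annotated ground truth and scored by precision, recall, and F1. Matching predicted and annotated actions by exact coordinates is a simplification for illustration; the paper's exact matching criterion is not reproduced here.

```python
def score_script(predicted, ground_truth):
    """Score one script's predicted affordances against ground-truth annotations."""
    pred = {(a["action"], tuple(a["coords"])) for a in predicted}
    truth = {(a["action"], tuple(a["coords"])) for a in ground_truth}
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# The script with the highest F1 across runs and critique iterations is kept.
```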
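
Finally, a minimal sketch of the hard mask from step 3, assuming the agent produces a probability per candidate action and that candidate actions use the same dictionary format as the script's output. The function name and the zero-mask fallback are illustrative assumptions.

```python
import numpy as np


def mask_action_distribution(action_probs, candidate_actions, affordable_actions):
    """Zero out unaffordable actions and renormalize the remaining probabilities.

    action_probs: 1-D numpy array aligned with candidate_actions.
    """
    affordable = {(a["action"], tuple(a["coords"])) for a in affordable_actions}
    mask = np.array([
        1.0 if (a["action"], tuple(a["coords"])) in affordable else 0.0
        for a in candidate_actions
    ])
    masked = action_probs * mask
    if masked.sum() == 0:
        # Fallback (an assumption): if the script predicts nothing usable,
        # keep the original distribution rather than sampling from all zeros.
        return action_probs
    return masked / masked.sum()

# The agent then samples only from this affordance-masked distribution.
```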

Core Assertions (Claims made by the authors):

  1. Sample Efficiency: CoGA is orders of magnitude more sample efficient than the baseline RL agent (DQN in this case) without affordances. (Claim 1, Sec 1 & 4.3)
  2. Generalization: CoGA's generated affordance scripts can generalize within the same family of tasks (tasks with identical GUIs but different specific goals/instructions). (Claim 2, Sec 1 & 4.3)
  3. Low-Data Performance: CoGA performs better or on par compared to Behavior Cloning (BC) when only a limited number of expert demonstrations (e.g., up to 200 trajectories in their experiments) are available. (Claim 3, Sec 1 & 4.3)

Implications of the Work:

  • Provides a method to significantly boost RL sample efficiency in complex, large-action-space environments like web navigation, particularly when expert data is scarce or unavailable.
  • Demonstrates a way to leverage the reasoning capabilities of VLMs without incurring the high computational/financial cost of querying them at every step during RL training/inference, by distilling their knowledge into efficient, executable code.
  • Offers a potential alternative or supplement to Behavior Cloning, especially in low-data regimes, reducing the dependency on costly expert demonstrations.
  • The generated affordances focus on the possibilities within an interface (intents), allowing the RL agent to learn the policy (which specific affordable action to take for the current goal) more efficiently.

QUESTION 2:

Can you explain the value of this in basic terms? Like you're talking to a CEO. So what? What's the bottom line here?

ANSWER 2:

Alright, let's cut to the chase. Imagine you want to train an AI to perform tasks on a computer, like navigating a website or using software – basically, interacting with a screen like a human would.

The Problem: Training these AIs the traditional way is incredibly slow and expensive. They often need to see millions of examples of a human doing the task perfectly, which costs a lot of time and money to create. Even then, they might not learn very efficiently, wasting time trying completely irrelevant actions, like clicking blank space on a webpage.

Our Solution (CoGA): We've developed a smarter way. We use a powerful, pre-existing AI (think of it like a knowledgeable supervisor, a VLM) to quickly figure out what actions are actually possible and relevant on the screen at any given moment (like "these buttons can be clicked", "this slider can be moved"). This supervisor AI then writes simple instructions (code) that tell the trainee AI: "Hey, in this situation, only focus on these specific actions."

So What? (The Bottom Line):

  1. Massive Speed-Up: Our trainee AI learns the task orders of magnitude faster because it's not wasting time exploring useless options. It focuses only on relevant actions from the start.
  2. Drastic Cost Reduction: We need far fewer, or even zero, expensive human demonstrations. The supervisor AI provides the necessary guidance, saving significant data collection costs.
  3. Better Results with Less: Even when you only have a small amount of human example data, our method performs better than or equal to just trying to copy those limited examples.
  4. Efficiency: The "instructions" created by the supervisor AI can often be reused for similar tasks, making the whole process more efficient.

In short: We make training AI to use computer interfaces significantly faster, cheaper, and more data-efficient by intelligently guiding the learning process. This means you can deploy capable AI agents for automation tasks more quickly and with less upfront investment.


QUESTION 3:

Can you give me an analogy or metaphor that will help explain this to a broad audience?

ANSWER 3:

Okay, here's an analogy to explain CoGA:

Imagine teaching a robot to cook in a huge, professional kitchen with every possible ingredient and gadget available.

The Old Way (Standard RL): You tell the robot, "Make spaghetti." The robot starts randomly. It might try mixing flour with orange juice, putting a whisk in the oven, or tasting raw onions. It could theoretically do anything with anything in the kitchen. It will take ages (and waste a lot of ingredients!) trying countless useless combinations before it stumbles upon the right steps for spaghetti. To speed it up, you might have to show it millions of videos of chefs making spaghetti perfectly.

The CoGA Way: Before the robot even starts, an experienced Head Chef (the VLM) takes a quick look at the "Make Spaghetti" recipe and the current state of the kitchen counter. The Head Chef quickly writes down a very short, specific list of Relevant Tools & Actions for this step (this is the generated code/affordances).

For example, if the robot needs to boil water, the Head Chef's note might say:

  • "Relevant items right now: Pot, Water Tap, Stove Knob."
  • "Possible actions: Pick up Pot, Put Pot under Tap, Turn on Tap, Turn off Tap, Put Pot on Stove, Turn Stove Knob."

The robot (the RL agent) gets this short list. Now, instead of considering the thousands of useless things it could do (like using the ice cream scoop or the blender), it only focuses on the 6 relevant actions on the list. It still needs to learn the correct sequence (fill pot before putting on stove), but its learning is drastically faster because it's only considering sensible options.

The Bottom Line: CoGA is like giving the learning robot a constantly updated, super-focused checklist from an expert chef, telling it exactly what tools and actions are relevant right now. This stops the robot from wasting time on nonsensical actions and helps it learn the recipe (the task) much, much faster, without needing to watch millions of cooking videos first.


QUESTION 4:

Can you provide a specific example that best illustrates the key points elucidated in this paper?

ANSWER 4:

Okay, let's use the click-test-2 task mentioned in the paper (Figure 2) as a specific example to illustrate the key points:

The Scenario:

  • Imagine a simple webpage showing just two identical-looking buttons, labeled "Button ONE" and "Button TWO".
  • The task instruction given to the AI agent changes randomly each time: sometimes it's "Click button ONE", other times it's "Click button TWO".
  • The agent only gets a reward (+1) if it clicks the correct button specified in the instruction for that specific instance. Clicking the wrong button or anywhere else yields no reward (or a negative one).

1. Without CoGA (Standard RL Agent):

  • The Challenge: The agent sees the screen pixels and the instruction. Its action space is huge – it could potentially click any of the thousands of pixels on the 160x210 screen (or perform other actions like dragging, if available).
  • Learning Process: The agent starts by randomly clicking pixels. Most clicks do nothing. Occasionally, it might accidentally hit Button ONE or Button TWO. Only if it happens to hit the correct button for that specific instruction does it get a reward.
  • Problem: Learning is incredibly slow. It takes a vast number of tries (samples) for the agent to figure out that only the pixels corresponding to the two buttons are ever useful, and then it still has to learn which of the two buttons corresponds to which instruction. This is highly sample inefficient.

2. With CoGA:

  • Step 1: Generating Affordances (Offline, using the VLM):
    • The VLM is shown the task description ("Click button ONE or TWO") and example screenshots.
    • It identifies the high-level intent: "The agent needs to be able to click a button". (Crucially, not "click Button ONE" or "click Button TWO", but the general capability).
    • It identifies the relevant objects for this intent: the visual elements corresponding to "Button ONE" and "Button TWO".
    • It generates code (the affordance script). This script uses template matching (based on the visual appearance of the buttons extracted earlier) to find the locations (bounding boxes) of both Button ONE and Button TWO on any given screenshot.
  • Step 2: Using Affordances in RL (Online):
    • The RL agent starts a new task instance (e.g., instruction is "Click button ONE"). It gets the screenshot.
    • Instead of considering all possible pixel clicks, it first runs the CoGA affordance script on the screenshot.
    • The script returns a very small list of affordable actions:
      • {'action': 'CLICK_COORDS', 'coords': [coords for Button ONE]}
      • {'action': 'CLICK_COORDS', 'coords': [coords for Button TWO]}
    • Action Space Pruning: The RL agent's action space is now drastically reduced from thousands of possibilities to just two: click Button ONE or click Button TWO.
    • Learning Process: The agent now only needs to learn the policy mapping: "If instruction says 'Click button ONE', choose the first affordable action; if instruction says 'Click button TWO', choose the second affordable action." This is a much simpler learning problem.
  • Benefit: The agent learns orders of magnitude faster because it's guided to only consider the relevant actions from the start. It doesn't waste time clicking irrelevant parts of the screen. CoGA figured out what can be clicked without needing any demonstrations, significantly improving sample efficiency in this low-data scenario. The affordance script ("find buttons") also generalizes – it works whether the goal is Button ONE or Button TWO.

This click-test-2 example clearly shows how CoGA uses the VLM to identify general intents ("click a button"), generates code to find relevant objects (both buttons), and uses that code to drastically prune the action space, allowing the RL agent to learn the specific task policy much more efficiently.
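
As a purely illustrative sketch of this example: the bounding boxes below are invented, and only the two-action structure and the resulting instruction-to-action mapping reflect the description above.

```python
# Hypothetical output of the affordance script on a click-test-2 screenshot.
affordable = [
    {"action": "CLICK_COORDS", "coords": [40, 80, 120, 110]},   # Button ONE (invented box)
    {"action": "CLICK_COORDS", "coords": [40, 130, 120, 160]},  # Button TWO (invented box)
]

# With the hard mask applied, the agent's effective action space has exactly two
# entries, so the policy it must learn reduces to mapping the instruction text
# to one of these two actions.
instruction = "Click button ONE"
chosen = affordable[0] if "ONE" in instruction else affordable[1]
print(chosen)
```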
