[arxiv] Reflexion: Language Agents with Verbal Reinforcement Learning
Source
Authors: Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao
Abstract: Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.
Reasoning (HotPotQA): Reflexion agents achieve significant improvements over baselines.
Programming (HumanEval, MBPP, LeetcodeHardGym): Reflexion agents achieve state-of-the-art results on various code generation benchmarks.
Contributions:
Proposes Reflexion, a new paradigm for "verbal" reinforcement.
Demonstrates the effectiveness of self-reflection for learning complex tasks.
Introduces LeetcodeHardGym, a new code-generation RL environment.
Limitations:
Relies on the LLM's self-evaluation capabilities.
No formal guarantee of success.
Memory component is limited to a sliding window.
Broader Impact:
Potential for greater automation and work efficiency.
Amplifies risks of misuse.
May improve interpretability and alignment of autonomous agents.
Conclusion: Reflexion is a promising approach for teaching language agents to learn from past mistakes, achieving significant improvements over existing methods.
▲ Could you elaborate with an example?
Let's imagine a Reflexion agent tasked with solving a simple programming problem: writing a function to calculate the sum of all even numbers in a list.
Scenario:
Initial Attempt: The agent generates a function that only sums all numbers in the list, ignoring the even number requirement.
Feedback: The agent receives a "failure" signal from the environment (a unit test fails).
Self-Reflection: The agent analyzes the feedback and generates a verbal reflection: "I need to check if each number is even before adding it to the sum."
Memory Update: This reflection is stored in the agent's memory.
Subsequent Attempt: The agent now has access to its previous reflection. It uses this information to modify its function, adding a check for even numbers.
Success: The agent successfully generates a function that correctly calculates the sum of even numbers.
Key Points:
Verbal Feedback: Instead of just receiving a "failure" signal, the agent receives a more informative verbal reflection that helps it understand the error.
Episodic Memory: The agent remembers its past reflections, allowing it to learn from its mistakes and improve its performance over time.
Self-Improvement: The agent uses the verbal feedback to modify its code and generate a better solution.
In this example, Reflexion helps the agent learn and improve its programming skills through a process of trial, error, and self-reflection, similar to how humans learn from their mistakes.
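Here is a minimal, hypothetical Python sketch of what those two attempts, the failing unit test, and the stored reflection could look like. The function and test names are invented for illustration and are not taken from the paper or its code release.

```python
def sum_even_v1(nums):
    # Trial 1: the agent forgets the "even" requirement and sums everything.
    return sum(nums)

def sum_even_v2(nums):
    # Trial 2: after reflecting, the agent filters for even numbers first.
    return sum(n for n in nums if n % 2 == 0)

# Environment feedback: a unit test that fails on the first attempt.
def unit_test(fn):
    return fn([1, 2, 3, 4]) == 6  # 2 + 4

assert not unit_test(sum_even_v1)  # failure signal from the environment

# Verbal self-reflection stored in episodic memory after the failed trial.
memory = ["I summed every number; I need to check that each number is even before adding it to the sum."]

assert unit_test(sum_even_v2)      # the revised attempt passes
```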
▲ Tell me more about how the memory management works
The memory management in Reflexion is designed to capture and leverage the agent's past experiences, specifically its self-reflections, to improve future performance. Here's a breakdown of how it works:
1. Memory Components:
Short-Term Memory: This is the current trajectory of actions and observations the agent is taking. It's like remembering the steps you took in a specific task.
Long-Term Memory: This stores the agent's self-reflections, which are verbal summaries of its past experiences and insights gained from them. It's like remembering the lessons you learned from past mistakes.
2. Memory Update:
After each trial, the agent's self-reflection model analyzes the trial's outcome (success/failure) and generates a verbal reflection.
This reflection is appended to the long-term memory.
3. Memory Access:
The agent's actor model (the part that generates actions) has access to both short-term and long-term memory.
This allows the agent to consider both the current context of the task and its past experiences when making decisions.
4. Memory Limitation:
To manage the size of the long-term memory and avoid exceeding the context limits of LLMs, Reflexion uses a sliding window approach.
This means that only the most recent self-reflections are kept in memory, while older ones are discarded. The size of this window is typically set to 1-3 experiences.
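As a rough illustration, a sliding-window long-term memory can be sketched in a few lines of Python. This is an assumption-level sketch of the idea described above, not the paper's implementation, and all names are invented.

```python
from collections import deque

class ReflexionMemory:
    """Sketch: short-term trajectory memory plus a bounded long-term reflection buffer."""

    def __init__(self, max_reflections: int = 3):
        self.trajectory = []                              # short-term: current trial's steps
        self.reflections = deque(maxlen=max_reflections)  # long-term: sliding window of reflections

    def record_step(self, step: str) -> None:
        self.trajectory.append(step)

    def end_trial(self, reflection: str) -> None:
        # Appending past the window size silently drops the oldest reflection.
        self.reflections.append(reflection)
        self.trajectory.clear()                           # a new trial starts with fresh short-term memory

    def context(self) -> str:
        # Text handed back to the actor LLM on the next trial.
        return "Past lessons:\n" + "\n".join(self.reflections)
```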
Example:
Imagine an agent trying to solve a puzzle. It might have the following memory:
Short-Term Memory: "I moved the red block to the left, then the blue block to the right."
Long-Term Memory:
"I learned that moving the red block first is usually a good strategy."
"I should avoid moving the green block to the bottom, as it often leads to a dead end."
The agent can use this memory to make better decisions in future attempts. For example, it might prioritize moving the red block first based on its past experience, and avoid moving the green block to the bottom.
Key Points:
Selective Memory: Reflexion focuses on storing self-reflections, which are considered more valuable than simply remembering the actions taken.
Contextual Learning: The agent uses its memory to learn from past experiences and adapt its behavior to the current context.
Bounded Memory: The sliding window approach ensures that the memory remains manageable and relevant to the current task.
This memory management system allows Reflexion agents to learn and improve their performance over time by leveraging their past experiences and insights.
▲ What if the agent learned new memory that is conflicting with existing memory, how would the agent handle that?
That's a great question! You've hit upon a key challenge in any system that relies on memory, especially one that's learning and evolving. Here's how Reflexion might handle conflicting memories, along with some potential limitations:
1. Prioritization Based on Recency:
Reflexion's sliding window memory management inherently prioritizes recent experiences. Newer memories are more likely to be relevant to the current task, so they are given more weight.
This means that if a new memory contradicts an older one, the newer memory is more likely to influence the agent's decisions.
2. Contextual Relevance:
The self-reflection model is designed to generate contextually relevant reflections. This means that the reflections are tailored to the specific task and environment the agent is facing.
If a new memory is highly relevant to the current task, it might override conflicting memories, even if they are more recent.
3. Implicit Reconciliation:
LLMs are known for their ability to handle ambiguity and conflicting information. The agent might implicitly reconcile conflicting memories by considering the context of the current task and the relative strength of the conflicting memories.
For example, if the agent has learned that "moving the red block first is usually a good strategy," but then encounters a situation where it's better to move the blue block first, it might adjust its strategy based on the specific context.
4. Potential Limitations:
Bias: The agent might become biased towards recent experiences, potentially overlooking valuable insights from older memories.
Overfitting: The agent might overfit to specific situations, making it less adaptable to new and unfamiliar scenarios.
Lack of Explicit Reasoning: The agent might not be able to explicitly reason about conflicting memories and explain its decision-making process.
Addressing the Limitations:
Memory Exploration: Future research could explore techniques for the agent to actively explore its memory and identify potential conflicts.
Explicit Reasoning: Developing methods for the agent to explicitly reason about conflicting memories and explain its decision-making process would improve transparency and trust.
Adaptive Memory Management: Dynamically adjusting the size of the memory window or using more sophisticated memory structures could help the agent better manage conflicting memories.
In summary, Reflexion's memory management system prioritizes recent experiences and relies on the LLM's ability to handle ambiguity. While this approach has potential limitations, future research can address these challenges to improve the agent's ability to learn and adapt in the face of conflicting information.
Abstract
Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance. We release all code, demos, and datasets at https://github.com/noahshinn024/reflexion.
1 Introduction
Recent works such as ReAct Yao et al., (2023), SayCan Ahn et al., (2022), Toolformer Schick et al., (2023), HuggingGPT (Shen et al.,, 2023), generative agents (Park et al.,, 2023), and WebGPT (Nakano et al.,, 2021) have demonstrated the feasibility of autonomous decision-making agents that are built on top of a large language model (LLM) core. These methods use LLMs to generate text and ‘actions‘ that can be used in API calls and executed in an environment. Since they rely on massive models with an enormous number of parameters, such approaches have been so far limited to using in-context examples as a way of teaching the agents, since more traditional optimization schemes like reinforcement learning with gradient descent require substantial amounts of compute and time.
In this paper, we propose an alternative approach called Reflexion that uses verbal reinforcement to help agents learn from prior failings. Reflexion converts binary or scalar feedback from the environment into verbal feedback in the form of a textual summary, which is then added as additional context for the LLM agent in the next episode. This self-reflective feedback acts as a ‘semantic’ gradient signal by providing the agent with a concrete direction to improve upon, helping it learn from prior mistakes to perform better on the task. This is akin to how humans iteratively learn to accomplish complex tasks in a few-shot manner – by reflecting on their previous failures in order to form an improved plan of attack for the next attempt. For example, in figure 1, a Reflexion agent learns to optimize its own behavior to solve decision-making, programming, and reasoning tasks through trial, error, and self-reflection.
Generating useful reflective feedback is challenging since it requires a good understanding of where the model made mistakes (i.e. the credit assignment problem (Sutton and Barto,, 2018)) as well as the ability to generate a summary containing actionable insights for improvement. We explore three ways for doing this – simple binary environment feedback, pre-defined heuristics for common failure cases, and self-evaluation such as binary classification using LLMs (decision-making) or self-written unit tests (programming). In all implementations, the evaluation signal is amplified to natural language experience summaries which can be stored in long-term memory.
Reflexion has several advantages compared to more traditional RL approaches like policy or value-based learning: 1) it is lightweight and doesn’t require finetuning the LLM, 2) it allows for more nuanced forms of feedback (e.g. targeted changes in actions), compared to scalar or vector rewards that are challenging to perform accurate credit assignment with, 3) it allows for a more explicit and interpretable form of episodic memory over prior experiences, and 4) it provides more explicit hints for actions in future episodes. At the same time, it does have the disadvantages of relying on the power of the LLM’s self-evaluation capabilities (or heuristics) and not having a formal guarantee for success. However, as LLM capabilities improve, we only expect this paradigm to get better over time.
We perform experiments on (1) decision-making tasks to test sequential action choices over long trajectories, (2) reasoning tasks to test knowledge-intensive, single-step generation improvement, and (3) programming tasks to teach the agent to effectively use external tools such as compilers and interpreters. Across all three types of tasks, we observe Reflexion agents are better decision-makers, reasoners, and programmers. More concretely, Reflexion agents improve on decision-making AlfWorld (Shridhar et al.,, 2021) tasks over strong baseline approaches by an absolute 22% in 12 iterative learning steps, and on reasoning questions in HotPotQA (Yang et al.,, 2018) by 20%, and Python programming tasks on HumanEval (Chen et al.,, 2021) by as much as 11%.
To summarize, our contributions are the following:
• We propose Reflexion, a new paradigm for 'verbal' reinforcement that parameterizes a policy as an agent's memory encoding paired with a choice of LLM parameters.
• We explore this emergent property of self-reflection in LLMs and empirically show that self-reflection is extremely useful to learn complex tasks over a handful of trials.
• We introduce LeetcodeHardGym, a code-generation RL gym environment consisting of 40 challenging Leetcode questions ('hard-level') in 19 programming languages.
• We show that Reflexion achieves improvements over strong baselines across several tasks, and achieves state-of-the-art results on various code generation benchmarks.
2 Related work
Reasoning and decision-making
Self-Refine (Madaan et al.,, 2023) employs an iterative framework for self-refinement to autonomously improve generation through self-evaluation. These self-evaluation and self-improvement steps are conditioned on given task constraints, such as "How can this generation be written in a more positive way". Self-Refine is effective but is limited to single-generation reasoning tasks. Pryzant et al., (2023) performs a similar semantic prompt-writing optimization, but is also limited to single-generation tasks. Paul et al., (2023) fine-tune critic models to provide intermediate feedback within trajectories to improve reasoning responses. Xie et al., (2023) use stochastic beam search over actions to perform a more efficient decision-making search strategy which allows the agent to use foresight advantage due to its self-evaluation component. Yoran et al., (2023) and Nair et al., (2023) use decider models to reason over several generations. Kim et al., (2023) use a retry pattern over a fixed number of steps without an evaluation step. Goodman, (2023) perform a qualitative evaluation step that proposes optimizations to the previous generation. In this paper, we show that several of these concepts can be enhanced with self-reflection to build a persisting memory of self-reflective experiences which allows an agent to identify its own errors and self-suggest lessons to learn from its mistakes over time.
Programming
Several past and recent works employ variations of test-driven development or code debugging practices. AlphaCode (Li et al.,, 2022) evaluates a set of generations on hidden test cases. CodeT (Chen et al.,, 2022) uses self-generated unit tests that are used to score generated function implementations. Self-Debugging (Chen et al.,, 2023) employs a debugging component that is used to improve existing implementations given feedback from a code execution environment. CodeRL (Le et al.,, 2022) sets the problem in an RL framework using an actor-critic setup to debug programs given feedback from an execution environment. AlphaCode, Self-Debugging and CodeRL are effective in fixing less-complex program bugs, but they rely upon ground truth test cases that invalidate pass@1 eligibility, and do not use self-reflection to bridge the gap between error identification and implementation improvement. CodeT does not access hidden test cases but does not implement a self-learning step to improve code writing.
3 Reflexion: reinforcement via verbal reflection
We develop a modular formulation for Reflexion, utilizing three distinct models: an Actor, denoted as $M_a$, which generates text and actions; an Evaluator model, represented by $M_e$, that scores the outputs produced by $M_a$; and a Self-Reflection model, denoted as $M_{sr}$, which generates verbal reinforcement cues to assist the Actor in self-improvement. We provide a detailed description of each of these models and subsequently elucidate their collaborative functioning within the Reflexion framework.
Actor
The Actor is built upon a large language model (LLM) that is specifically prompted to generate the necessary text and actions conditioned on the state observations. Analogous to traditional policy-based RL setups, we sample an action or generation, $a_t$, from the current policy $\pi_\theta$ at time $t$ and receive an observation from the environment, $o_t$. We explore various Actor models, including Chain of Thought (Wei et al., 2022) and ReAct (Yao et al., 2023). These diverse generation models allow us to explore different aspects of text and action generation within the Reflexion framework, providing valuable insights into their performance and effectiveness. In addition, we also add a memory component mem that provides additional context to this agent. This adaptation was inspired by Brooks et al. (2022), who suggest a policy iteration approach using in-context learning. Details on how this is populated are provided below.
Evaluator
The Evaluator component of the Reflexion framework plays a crucial role in assessing the quality of the generated outputs produced by the Actor. It takes as input a generated trajectory and computes a reward score that reflects its performance within the given task context. Defining effective value and reward functions that apply to semantic spaces is difficult, so we investigate several variants of the Evaluator model. For reasoning tasks, we explore reward functions based on exact match (EM) grading, ensuring that the generated output aligns closely with the expected solution. In decision-making tasks, we employ pre-defined heuristic functions that are tailored to specific evaluation criteria. Additionally, we experiment with using a different instantiation of an LLM itself as an Evaluator, generating rewards for decision-making and programming tasks. This multi-faceted approach to Evaluator design allows us to examine different strategies for scoring generated outputs, offering insights into their effectiveness and suitability across a range of tasks.
Self-reflection
The Self-Reflection model, instantiated as an LLM, plays a crucial role in the Reflexion framework by generating verbal self-reflections to provide valuable feedback for future trials. Given a sparse reward signal, such as a binary success status (success/fail), the current trajectory, and its persistent memory mem, the self-reflection model generates nuanced and specific feedback. This feedback, which is more informative than scalar rewards, is then stored in the agent's memory (mem). For instance, in a multi-step decision-making task, when the agent receives a failure signal, it can infer that a specific action $a_i$ led to subsequent incorrect actions $a_{i+1}$ and $a_{i+2}$. The agent can then verbally state that it should have taken a different action, $a_i'$, which would have resulted in $a_{i+1}'$ and $a_{i+2}'$, and store this experience in its memory. In subsequent trials, the agent can leverage its past experiences to adapt its decision-making approach at time $t$ by choosing action $a_i'$. This iterative process of trial, error, self-reflection, and persisting memory enables the agent to rapidly improve its decision-making ability in various environments by utilizing informative feedback signals.
Memory
Core components of the Reflexion process are the notions of short-term and long-term memory. At inference time, the Actor conditions its decisions on short and long-term memory, similar to the way that humans remember fine-grained recent details while also recalling distilled important experiences from long-term memory. In the RL setup, the trajectory history serves as the short-term memory while outputs from the Self-Reflection model are stored in long-term memory. These two memory components work together to provide context that is specific but also influenced by lessons learned over several trials, which is a key advantage of Reflexion agents over other LLM action-choice works.
The Reflexion process
Reflexion is formalized as an iterative optimization process in Algorithm 1. In the first trial, the Actor produces a trajectory $\tau_0$ by interacting with the environment. The Evaluator then produces a score $r_0$, computed as $r_t = M_e(\tau_0)$. $r_t$ is only a scalar reward for trial $t$ that improves as task-specific performance increases. After the first trial, to amplify $r_0$ to a feedback form that can be used for improvement by an LLM, the Self-Reflection model analyzes the set $\{\tau_0, r_0\}$ to produce a summary $sr_0$, which is stored in the memory mem. $sr_t$ is a verbal experience feedback for trial $t$. The Actor, Evaluator, and Self-Reflection models work together through trials in a loop until the Evaluator deems $\tau_t$ to be correct. As mentioned in Section 3, the memory component of Reflexion is crucial to its effectiveness. After each trial $t$, $sr_t$ is appended to mem. In practice, we bound mem by a maximum number of stored experiences, $\Omega$ (usually set to 1-3), to adhere to max context LLM limitations.
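For illustration, this trial loop can be condensed into a short Python sketch, with the three models stubbed out as callables. The names, signatures, and stopping criterion here are our own simplifications, not the authors' released implementation.

```python
# Hypothetical sketch of the Reflexion trial loop described above.
from typing import Callable, List, Tuple

def reflexion_loop(
    actor: Callable[[List[str]], str],          # M_a: memory -> trajectory
    evaluator: Callable[[str], float],          # M_e: trajectory -> scalar reward r_t
    self_reflect: Callable[[str, float], str],  # M_sr: (trajectory, reward) -> verbal reflection sr_t
    max_trials: int = 10,
    omega: int = 3,                             # Omega: max stored experiences
    success_threshold: float = 1.0,
) -> Tuple[str, bool]:
    mem: List[str] = []
    for _ in range(max_trials):
        trajectory = actor(mem)                       # produce tau_t conditioned on long-term memory
        reward = evaluator(trajectory)                # r_t = M_e(tau_t)
        if reward >= success_threshold:
            return trajectory, True                   # Evaluator deems tau_t correct; stop
        mem.append(self_reflect(trajectory, reward))  # amplify r_t into verbal feedback sr_t
        mem = mem[-omega:]                            # bound memory to the Omega most recent reflections
    return trajectory, False
```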
4 Experiments
We evaluate various natural language RL setups on decision-making, reasoning, and code generation tasks. Specifically, we challenge an agent to perform search-based question answering on HotPotQA (Yang et al.,, 2018), multi-step tasks in common household environments in AlfWorld (Shridhar et al.,, 2021), and code writing tasks in competition-like environments with interpreters and compilers in HumanEval (Chen et al.,, 2021), MBPP (Austin et al.,, 2021), and LeetcodeHard, a new benchmark. Most notably, Reflexion improves performance over strong baselines by 22% in AlfWorld, 20% in HotPotQA, and 11% on HumanEval.
4.1 Sequential decision making: ALFWorld
AlfWorld is a suite of text-based environments that challenge an agent to solve multi-step tasks in a variety of interactive environments based on TextWorld (Côté et al.,, 2019). Following Yao et al., (2023), we run the agent in 134 AlfWorld environments across six different tasks, including finding hidden objects (e.g., finding a spatula in a drawer), moving objects (e.g., moving a knife to the cutting board), and manipulating objects with other objects (e.g., chilling a tomato in the fridge). We use ReAct (Yao et al.,, 2023) as the action generator as Yao et al., (2023) has shown success in long trajectory decision-making using explicit intermediate thoughts. AlfWorld tasks naturally require a self-evaluation step as the environment can only signal if a task is complete. To achieve fully autonomous behavior, we implement two self-evaluation techniques: natural language classification using an LLM and a hand-written heuristic. The heuristic is simple: if the agent executes the same action and receives the same response for more than 3 cycles, or if the number of actions taken in the current environment exceeds 30 (inefficient planning), we self-reflect. In the baseline runs, if self-reflection is suggested, we skip the self-reflection process, reset the environment, and start a new trial. In the Reflexion runs, the agent uses self-reflection to find its mistake, update its memory, reset the environment, and start a new trial. To avoid very long prompt windows that may exceed the maximum limit, we truncate the agent’s memory to the last 3 self-reflections (experiences).
To avoid syntactic errors, we provide two domain-specific few-shot trajectories to the agent. We use the same few-shot trajectory examples as Yao et al., (2023) with GPT-3 for the LLM. AlfWorld tasks, ReAct few-shot prompts, and Reflexion examples are included in the appendix.
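A rough sketch of the trigger heuristic described above is shown below; the function name and data structure are illustrative assumptions, not the released code.

```python
# Self-reflect when the agent repeats the same (action, observation) pair for more than
# 3 cycles, or after more than 30 actions in the current environment.
def should_self_reflect(action_obs_history, max_repeats: int = 3, max_actions: int = 30) -> bool:
    if len(action_obs_history) > max_actions:
        return True                                    # inefficient planning
    if len(action_obs_history) > max_repeats:
        last = action_obs_history[-1]
        tail = action_obs_history[-(max_repeats + 1):]
        if all(step == last for step in tail):         # same action, same response, repeated
            return True                                # likely hallucination loop
    return False
```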
Results
ReAct + Reflexion significantly outperforms ReAct by completing 130 out of 134 tasks using the simple heuristic to detect hallucinations and inefficient planning. Further, ReAct + Reflexion learns to solve additional tasks by learning over 12 consecutive trials. In the ReAct-only approach, we see that the performance improvement halts between trials 6 and 7.
Analysis
A common error in baseline failed AlfWorld trajectories is when an agent thinks that it has possession of an item but does not actually have the item. The agent proceeds to execute several actions in a long trajectory and is not able to backtrack its actions to find the mistake. Reflexion eliminates almost all of these cases by using self-reflection to distill long, failed trajectories into relevant experiences that can be used as "self-hints" in the future. There are two main cases in which long-term memory helps an agent in AlfWorld: 1) An early mistake in a long trajectory can be easily identified. The agent can suggest a new action choice or even a new long-term plan. 2) There are too many surfaces/containers to check for an item. The agent can exploit its experience memory over several trials to thoroughly search a room. In Figure 3, the learning curve suggests that the learning process occurs over several experiences, meaning that the agent is successfully balancing cases 1 and 2: there is an immediate spike in improvement between the first two trials, followed by a steady increase over the next 11 trials to near-perfect performance. On the other hand, Figure 3 shows a ReAct-only agent converging at a hallucination rate of 22% with no signs of long-term recovery.
4.2 Reasoning: HotpotQA
HotPotQA (Yang et al., 2018) is a Wikipedia-based dataset with 113k question-and-answer pairs that challenge agents to parse content and reason over several supporting documents. To test improvement in reasoning-only ability, we implement Reflexion + Chain-of-Thought (CoT) (Wei et al., 2022) for step-by-step $Q \rightarrow A$ and $Q, C_{gt} \rightarrow A$ implementations, where $Q$ is the question, $C_{gt}$ is the ground truth context from the dataset, and $A$ is the final answer. Since CoT is not a multi-step decision-making technique, we give $C_{gt}$ to the agent so that we can isolate the reasoning behavior over large sections of the provided text. To test holistic question-answering ability, which requires reasoning and action choice, we implement a Reflexion + ReAct (Yao et al., 2023) agent that can retrieve relevant context using a Wikipedia API and infer answers using step-by-step explicit thinking. For CoT implementations, we use 6-shot prompting; for ReAct, we use 2-shot prompting; and for self-reflection, we use 2-shot prompting. All examples can be found in the appendix.
Robustly evaluating natural language answers is a long-standing problem in NLP. Therefore, between trials, we use exact match answer grading in the environment to give a binary success signal to the agent. After each trial, the self-reflection loop is employed to amplify the binary signal, similar to the decision-making setup in Section 4.1 (AlfWorld), with a memory size of 3 experiences.
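A minimal sketch of such exact match grading is given below; it follows the common HotpotQA-style normalization convention and is our own illustration rather than the paper's exact grading code.

```python
import re
import string

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Binary EM reward: True if the normalized answer strings are identical, else False."""
    def normalize(text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
        return " ".join(text.split())                # collapse whitespace
    return normalize(prediction) == normalize(ground_truth)
```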
Results
Reflexion outperforms all baseline approaches by significant margins over several learning steps. Furthermore, the ReAct-only, CoT-only, and CoT (GT)-only implementations fail to probabilistically improve on any tasks, meaning that no failed tasks from the first trial of any of the baseline approaches could be solved in subsequent trials using a temperature of 0.7. In the Reflexion runs, we allowed the agent to gather experience and retry on failed tasks until it produced 3 consecutive failed attempts on the particular task. Naturally, the CoT (GT) agent achieved higher accuracy scores as it was given access to the ground truth context of the question. Still, the CoT (GT) agent is unable to infer the correct answer for 39% of the questions, but Reflexion helps the agent correct its mistakes without access to the ground truth answer, improving its accuracy by 14%.
Analysis
We perform an ablation experiment to isolate the advantage of the self-reflective step for reasoning, using CoT (GT) as the baseline approach (Figure 4). Recall that CoT (GT) uses Chain-of-Thought reasoning with provided ground truth context, which tests reasoning ability over long contexts. Next, we add an element of episodic memory (EPM) by including the most recent trajectory. For the Reflexion agent, we implement the standard self-reflection step as a final pass. Intuitively, we test whether the agent is iteratively learning more effectively by using verbal explanations written in the first person. Figure 4 shows that self-reflection improves learning by an 8% absolute boost over the episodic memory learning advantage. This result supports the argument that refinement-only approaches are not as effective as self-reflection-guided refinement approaches.
4.3 Programming
We evaluate the baseline and Reflexion approaches on Python and Rust code writing on MBPP (Austin et al.,, 2021), HumanEval (Chen et al.,, 2021), and LeetcodeHardGym, our new dataset. MBPP and HumanEval measure function body generation accuracy given natural language descriptions. We use a benchmark language compiler, MultiPL-E (Cassano et al.,, 2022), to translate subsets of HumanEval and MBPP to the Rust language. MultiPL-E is a collection of small compilers that can be used to translate Python benchmark questions to 18 other languages. We include experiments for Rust code generation to demonstrate that Reflexion implementations for code generation are language-agnostic and can be used for interpreted and compiled languages. Lastly, we introduce a new benchmark, LeetcodeHardGym, which is an interactive programming gym that contains 40 Leetcode hard-rated questions that have been released after October 8, 2022, which is the pre-training cutoff date of GPT-4 (OpenAI,, 2023).
The task of programming presents a unique opportunity to use more grounded self-evaluation practices such as self-generated unit test suites. Thus, our Reflexion-based programming task implementation is eligible for pass@1 accuracy reporting. To generate a test suite, we use Chain-of-Thought prompting (Wei et al., 2022) to produce diverse, extensive tests with corresponding natural language descriptions. Then, we filter for syntactically valid test statements by attempting to construct a valid abstract syntax tree (AST) for each proposed test. Finally, we sample $n$ tests from the collection of generated unit tests to produce a test suite $T$, denoted as $\{t_0, t_1, \dots, t_n\}$. We set $n$ to a maximum of 6 unit tests. Aside from the unit test suite component, the setup for the learning loop for a Reflexion programming agent is identical to the reasoning and decision-making agents with a max memory limit of 1 experience.
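A minimal sketch of the syntactic filtering and sampling step is shown below; it assumes a Python-side implementation with invented names and only loosely follows the released code.

```python
import ast
import random

def build_test_suite(candidate_tests, n_max: int = 6):
    """Keep only syntactically valid generated tests, then sample at most n_max of them."""
    valid = []
    for test in candidate_tests:
        try:
            ast.parse(test)          # attempt to construct a valid AST for the test statement
            valid.append(test)
        except SyntaxError:
            continue                 # discard tests that do not parse
    k = min(n_max, len(valid))
    return random.sample(valid, k)   # the sampled tests form the suite T
```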
Results
Reflexion outperforms all baseline accuracies and sets new state-of-the-art standards on all benchmarks for Python and Rust except for MBPP Python (Table 1). We further investigate the inferior performance of Reflexion on MBPP Python below.
Analysis
We acknowledge that self-reflecting code-generation agents are bound to their ability to write diverse, comprehensive tests. Therefore, in the case in which the model generates a flaky test suite, it is possible that all tests pass on an incorrect solution and lead to a false positive label on a code completion (Lam et al., 2020). On the other hand, if the model produces an incorrectly written test suite, it is possible for some of the tests to fail on a correct solution, leading to a self-reflection generation that is conditioned on a false negative code completion. Given the implementation of Reflexion, false negatives are preferred over false positives, as the agent may be able to use self-reflection to identify the incorrect test(s) and prompt itself to keep the original code completion intact. On the other hand, if an invalid test suite returns a false positive completion (all internal test cases pass but the implementation is incorrect), the agent will prematurely report an invalid submission. In Table 2, various conditions are measured to analyze performance beyond pass@1 accuracy. Previously, we noted the inferior performance of Reflexion relative to the baseline GPT-4 on MBPP Python. In Table 2, we observe a notable discrepancy in the false positive labels produced by internal test execution, P(not pass@1 generation correct | tests pass), that is, the probability that a submission will fail given that it passes all unit tests. For HumanEval and MBPP Python, the baseline pass@1 accuracies are relatively similar, 82% and 80%, respectively. However, the false positive test execution rate for MBPP Python is 16.3% while the rate for HumanEval Python is a mere 1.4%, leading to 91% overall accuracy on HumanEval (Table 1).
Ablation study
We test the composite approach of Reflexion for test generation and self-reflection cooperation on a subset of the 50 hardest HumanEval Rust problems. Our Rust compiler environment provides verbose error logs and helpful debugging hints, therefore serving as a good playground for compromised approaches. First, we omit the internal test generation and execution steps, which requires the agent to self-reflect without guidance from current implementations. Table 3 shows an inferior 52% vs 60% (baseline) accuracy, which suggests that the agent is unable to determine if the current implementation is correct without unit tests. Therefore, the agent must participate in all iterations of the run without the option to return early, performing harmful edits to the implementation.
Next, we test self-reflection contribution by omitting the natural language explanation step following failed unit test suite evaluations. Intuitively, this challenges the agent to combine the tasks of error identification and implementation improvement across all failed unit tests. Interestingly, the compromised agent does not improve performance over the baseline run. We observe that the test generation and code compilation steps are able to catch syntax and logic errors, but the implementation fixes do not reflect these indications. These empirical results suggest that several recent works that propose blind trial and error debugging techniques without self-reflection are ineffective on harder tasks such as writing complex programs in Rust.
5 Limitations
At its core, Reflexion is an optimization technique that uses natural language to do policy optimization. Policy optimization is a powerful approach to improve action choice through experience, but it may still succumb to non-optimal local minima solutions. In this study, we limit long-term memory to a sliding window with maximum capacity, but we encourage future work to extend the memory component of Reflexion with more advanced structures such as vector embedding databases or traditional SQL databases. Specific to code generation, there are many practical limitations to test-driven development in specifying accurate input-output mappings such as non-deterministic generator functions, impure functions that interact with APIs, functions that vary output according to hardware specifications, or functions that invoke parallel or concurrent behavior that may be difficult to predict.
6 Broader impact
Large language models are increasingly used to interact with external environments (e.g. the Internet, software, robotics, etc.) and humans. Our work has the potential of reinforcing and empowering these agents toward greater automation and work efficiency, but it also amplifies the risks when these agents are misused. We believe that this direction of research will need more effort in safety and ethical considerations.
On the other hand, reinforcement learning has suffered from black-box policy and optimization setups in which interpretability and alignment have been challenging. Our proposed "verbal" reinforcement learning might address some of these issues and make autonomous agents more interpretable and diagnosable. For example, in the case of tool usage that may be too hard for humans to understand, self-reflections could be monitored to ensure proper intent before the tool is used.
7 Conclusion
In this work, we present Reflexion, an approach that leverages verbal reinforcement to teach agents to learn from past mistakes. We empirically show that Reflexion agents significantly outperform currently widely-used decision-making approaches by utilizing self-reflection. In future work, Reflexion could be used to employ more advanced techniques that have been thoroughly studied in traditional RL settings, such as value learning in natural language or off-policy exploration techniques.
8 Reproducibility
We highly advise others to use isolated execution environments when running autonomous code writing experiments as the generated code is not validated before execution.
References
Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
Brooks, E., Walls, L., Lewis, R. L., and Singh, S. (2022). In-context policy iteration. arXiv preprint arXiv:2210.03821.
Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., and Jangda, A. (2022). MultiPL-E: A scalable and extensible approach to benchmarking neural code generation.
Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., and Chen, W. (2022). CodeT: Code generation with generated tests. arXiv preprint arXiv:2207.10397.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.
Côté, M.-A., Kádár, A., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M., El Asri, L., Adada, M., et al. (2019). TextWorld: A learning environment for text-based games. In Computer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers 7, pages 41-75. Springer.
Goodman, N. (2023). Meta-prompt: A simple self-improving language agent. noahgoodman.substack.com.
Kim, G., Baldi, P., and McAleer, S. (2023). Language models can solve computer tasks. arXiv preprint arXiv:2303.17491.
Lam, W., Winter, S., Wei, A., Xie, T., Marinov, D., and Bell, J. (2020). A large-scale longitudinal study of flaky tests. Proc. ACM Program. Lang., 4(OOPSLA).
Le, H., Wang, Y., Gotmare, A. D., Savarese, S., and Hoi, S. C. H. (2022). CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314-21328.
Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161.
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with AlphaCode. Science, 378(6624):1092-1097.
Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. (2023). Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
Nair, V., Schumacher, E., Tso, G., and Kannan, A. (2023). DERA: Enhancing large language model completions with dialog-enabled resolving agents. arXiv preprint arXiv:2303.17071.
Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442.
Paul, D., Ismayilzada, M., Peyrard, M., Borges, B., Bosselut, A., West, R., and Faltings, B. (2023). REFINER: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904.
Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. (2023). Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495.
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. arXiv preprint arXiv:2303.17580.
Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR).
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press, second edition.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q. (2023). Decomposition enhances reasoning via self-evaluation guided decoding. arXiv preprint arXiv:2305.00633.
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Yao, S., Chen, H., Yang, J., and Narasimhan, K. (preprint). WebShop: Towards scalable real-world web interaction with grounded language agents. arXiv preprint.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).
Yoran, O., Wolfson, T., Bogin, B., Katz, U., Deutch, D., and Berant, J. (2023). Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007.
Appendix A Evaluation with additional models
We further investigated the applicability of trial-and-error problem-solving with models of various strengths. We found that the ability to specify self-corrections is an emergent quality of stronger, larger models.
Appendix B Decision-making
B.1 WebShop Limitation
In Section 5, we briefly state that Reflexion struggles to overcome local minima choices that require extremely creative behavior to escape. We observe this shortcoming in an experiment on WebShop (Yao et al., preprint). WebShop is a web-based problem-solving benchmark that challenges agents to navigate an e-commerce website to locate and purchase products given requests from clients. We test a two-shot ReAct + Reflexion agent in 100 environments. However, after only four trials, we terminate the runs as the agent does not show signs of improvement (Figure 6). Further, the agent does not generate helpful, intuitive self-reflections after failed attempts. We conclude that Reflexion is unable to solve tasks that require a significant amount of diversity and exploration. In AlfWorld, the agent is able to adequately explore new environments because the permissible actions can be seen in the observations. In HotPotQA, the agent faces a search query task similar to WebShop's, but is more successful as the search space for Wikipedia articles is more diverse and requires less precise search queries. A common problem for e-commerce search engines is properly handling ambiguity in natural language search interpretations. Thus, WebShop presents a task that requires very diverse and unique behavior from a Reflexion agent.
Appendix C Programming
Programming LLM calls require strict instructions to produce function bodies only, due to the extensive dialogue training of the LLMs. A few programming examples are reported below, showing the instructions and templates used. See the full implementation at https://github.com/noahshinn024/reflexion.
C.1 Programming function implementation example (HumanEval Python)
You are a Python writing assistant. You will be given your previous implementation of a function, a series of unit tests results, and your self-reflection on your previous implementation. Apply the necessary changes below by responding only with the improved body of the function. Do not include the signature in your response. The first line of your response should have 4 spaces of indentation so that it fits syntactically with the user provided signature. You will be given a few examples by the user.
Reflexion Actor generations follow the form:
(Instruction)
(Function implementation)
(Unit test feedback)
(Self-reflection)
(Instruction for next function implementation)
C.3 Reflexion Self-reflection instruction and example
You are a Python writing assistant. You will be given your previous implementation of a function, a series of unit tests results, and your self-reflection on your previous implementation. Apply the necessary changes below by responding only with the improved body of the function. Do not include the signature in your response. The first line of your response should have 4 spaces of indentation so that it fits syntactically with the user provided signature. You will be given a few examples by the user.
Reflexion Self-Reflection generations follow the form:
(Instruction)
(Function implementation)
(Unit test feedback)
C.4 Reflexion programming no Self-Reflection ablation example
Reflexion no Self-Reflection ablation Actor generations follow the form:
(Instruction)
(Function implementation)
(Unit test feedback)
(Self-reflection)
(Instruction for next function implementation)
C.5 Reflexion programming no test generation ablation example
Reflexion no test generation ablation Actor generations follow the form:
(Instruction)
(Function implementation)
(Unit test feedback)
(Self-reflection)
(Instruction for next function implementation)