An LLM Compiler for Parallel Function Calling (text version for LLMs)
Abstract
The reasoning capabilities of recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs to select and coordinate multiple functions based on the context to tackle more complex problems. However, current methods for function calling often require sequential reasoning and acting for each function, which can result in high latency, cost, and sometimes inaccurate behavior. To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function calls. Drawing inspiration from the principles of classical compilers, LLMCompiler enables parallel function calling with three components: (i) a Function Calling Planner, formulating execution plans for function calling; (ii) a Task Fetching Unit, dispatching function calling tasks; and (iii) an Executor, executing these tasks in parallel. LLMCompiler automatically generates an optimized orchestration for the function calls and can be used with both open-source and closed-source models. We have benchmarked LLMCompiler on a range of tasks with different patterns of function calling. We observe consistent latency speedup of up to 3.7×, cost savings of up to 6.7×, and accuracy improvement of up to ∼9% compared to ReAct. Our code is available at https://github.com/SqueezeAILab/LLMCompiler.
1. Introduction | |
Recent advances in the reasoning capability of Large Language Models (LLMs) have expanded the applicability of | |
LLMs beyond content generation to solving complex problems (Besta et al., 2023; Chen et al., 2023b; Gao et al., 2022; Kojima et al., 2023; Wang et al., 2023b; Wei et al., 2022;
Yang et al., 2022; Yao et al., 2023b; Zhou et al., 2023b); | |
and recent works have also shown how this reasoning capability can be helpful in improving accuracy for solving | |
complex and logical tasks. The reasoning capability has also | |
allowed function (i.e., tool) calling capability, where LLMs | |
can invoke provided functions and use the function outputs | |
to help complete their tasks. These functions range from a | |
simple calculator that can invoke arithmetic operations to | |
more complex LLM-based functions. The ability of LLMs to integrate various tools and function | |
calls could enable a fundamental shift in how we develop | |
LLM-based software. However, this brings up an important | |
challenge: what is the most effective approach to incorporate multiple function calls? A notable approach has been | |
introduced in ReAct (Yao et al., 2022), where the LLM calls | |
a function, analyzes the outcomes, and then reasons about | |
the next action, which involves a subsequent function call. | |
For a simple example illustrated in Fig. 1 (Left), where the | |
LLM is asked if Scott Derrickson and Ed Wood have the | |
same nationality, ReAct initially analyzes the query and decides to use a search tool to search for Scott Derrickson. The | |
result of this search (i.e., observation) is then concatenated | |
back to the original prompt for the LLM to reason about | |
the next action, which invokes another search tool to gather | |
information about Ed Wood. | |
ReAct has been a pioneering work in enabling function | |
calling, and it has been integrated into several frameworks (Langchain; Liu, 2022). However, scaling this approach for more complex applications requires considerable | |
optimizations. This is due to the sequential nature of ReAct, where it executes function calls and reasons about their | |
observations one after the other. This approach, along with | |
the agent systems that extend ReAct (Khot et al., 2023; Qin | |
et al., 2023; Ruan et al., 2023b; Sumers et al., 2023; Yao | |
et al., 2023b), may lead to inefficiencies in latency and cost, | |
due to the sequential function calling and repetitive LLM | |
invocations for each reasoning and action step. Furthermore, | |
while dynamic reasoning about the observations has benefits | |
in certain cases, concatenating the outcomes of intermediate function calls could disrupt the LLM’s execution flow, | |
potentially reducing accuracy (Xu et al., 2023). Common | |
failure cases include repetitive invocation of the same function, which is also highlighted in the original paper (Yao | |
et al., 2022), and early stopping based on the partial intermediate results, as will be further discussed in Sec. 5.1 and | |
Appendix A. | |
To address this challenge, we draw inspiration from classical compilers, where optimizing instruction executions | |
in traditional programming languages has been extensively | |
explored. A key optimization technique in compilers involves identifying instructions that can be executed in parallel and effectively managing their dependencies. Similarly, | |
one can envision a compiler, tailored for LLM function | |
calling, which can efficiently orchestrate various function | |
calls and their dependencies. This shares a similar philosophy with the recent studies that align LLMs with computer systems (Karpathy, 2023; Packer et al., 2023). To | |
this end, we introduce LLMCompiler, a novel framework | |
that enables parallel multi-tool execution of LLMs across | |
different models and workloads. To the best of our knowledge, LLMCompiler is the first framework to optimize | |
the orchestration of LLM function calling that can not only | |
improve latency and cost, but also accuracy, by minimizing | |
interference from the outputs of intermediate function calls. | |
In more detail, we make the following contributions: | |
• We introduce LLMCompiler, an LLM compiler that | |
optimizes the parallel function calling performance of | |
LLMs. At a high level, this is achieved by introducing | |
three key components: (i) a Function Calling Planner | |
(Sec. 3.1) that identifies an execution flow; (ii) a Task Fetching Unit (Sec. 3.2) that dispatches the function calls | |
in parallel; (iii) an Executor (Sec. 3.3) that executes the | |
dispatched tasks using the associated functions. | |
• We evaluate LLMCompiler on embarrassingly parallel | |
patterns using HotpotQA (Yang et al., 2018) and Movie | |
Recommendation (Srivastava et al., 2022), where we observe 1.80×/3.74× speedup and 3.37×/6.73× cost reduction compared to ReAct (Sec. 5.1). | |
• To test the performance on more complex patterns, we | |
introduce a new benchmark called ParallelQA which includes various non-trival function calling patterns. We | |
show up to 2.27× speedup, 4.65× cost reduction, and 9% | |
improved accuracy compared to ReAct (Sec. 5.2). | |
• We evaluate LLMCompiler’s capability in dynamic replanning, which is achieved through a feedback loop | |
from the Executor back to our Function Calling Planner. | |
For the Game of 24 (Yao et al., 2023b), which requires | |
repeated replanning based on the intermediate results, | |
LLMCompiler demonstrates a 2× speedup compared | |
to Tree-of-Thoughts (Sec. 5.3). | |
• We show that LLMCompiler can explore the interactive | |
decision-making environment effectively and efficiently. | |
On WebShop, LLMCompiler achieves up to 101.7× | |
speedup and 25.7% improved success rate compared to | |
the baselines (Sec. 5.4). | |
2. Related Work | |
2.1. Latency Optimization in LLMs | |
Various studies have focused on optimizing model design (Chen et al., 2023a; Dettmers et al., 2023; Frantar & | |
Alistarh, 2023; Frantar et al., 2022; Kim et al., 2023; 2024; | |
Kwon et al., 2022; Leviathan et al., 2023; Lin et al., 2023) | |
and systems (tgi; trt; Kwon et al., 2023; Yu et al., 2022) for | |
efficient LLM inference. Optimizations at the application | |
level, however, are less explored. This is critical from a practical point of view for situations involving black-box LLM | |
models and services where modifications to the models and | |
the underlying inference pipeline are highly restricted. | |
Skeleton-of-Thought (Ning et al., 2023) recently proposed | |
to reduce latency through application-level parallel decoding. This method involves a two-step process of an initial | |
skeleton generation phase, followed by parallel execution | |
of skeleton items. However, it is primarily designed for | |
embarrassingly parallel workloads and does not support | |
problems that have inherently interdependent tasks, as it | |
assumes no dependencies between skeleton tasks. This | |
limits its applicability in complex scenarios such as coding (Austin et al., 2021; Chen et al., 2021; Hendrycks et al., | |
2021a; Madaan et al., 2023) or math (Hendrycks et al., | |
2021b;c) problems, as also stated in the paper (Ning et al., | |
2023). LLMCompiler addresses this by translating an | |
input query into a series of tasks with inter-dependencies, | |
thereby expanding the spectrum of problems it can handle. | |
Concurrently to our work, OpenAI has recently introduced | |
a parallel function calling feature in their 1106 release, enhancing user query processing through the simultaneous | |
generation of multiple function calls (OpenAI, 2023). Despite its potential for reducing LLM execution time, this | |
feature has certain limitations, as it is exclusively available for OpenAI’s proprietary models. However, there is | |
a growing demand for using open-source models driven | |
by the increasing number of open-source LLMs as well | |
as parameter-efficient training techniques (Houlsby et al., | |
2019; Hu et al., 2022; Lester et al., 2021) for finetuning and | |
customization. LLMCompiler enables efficient parallel | |
function calling for open-source models, and also, as we | |
will show later in Sec. 5, it can potentially achieve better | |
latency and cost. | |
2.2. Plan and Solve Strategy | |
Several studies (Hao et al., 2023; Patel et al., 2022; Press | |
et al., 2023; Wolfson et al., 2020; Zhou et al., 2023b) have | |
explored prompting methods of breaking down complex | |
queries into various levels of detail to solve them, thereby | |
improving LLM’s performance in reasoning tasks. Specifically, Decomposed Prompting (Khot et al., 2023) tackles | |
complex tasks by decomposing them into simpler sub-tasks, each optimized through LLMs with dedicated prompts. | |
Step-Back Prompting (Zheng et al., 2023) enables LLMs to | |
abstract high-level concepts from details to enhance reasoning abilities across various tasks. Plan-and-Solve Prompting (Wang et al., 2023a) segments multi-step reasoning | |
tasks into subtasks to minimize errors and improve task | |
accuracy without manual prompting. However, these methods primarily focus on improving the accuracy of reasoning | |
benchmarks. In contrast, LLMCompiler uses a planner | |
to identify parallelizable patterns within queries, aiming to | |
reduce latency while maintaining accuracy. | |
In addition to the aforementioned works, TPTU (Ruan | |
et al., 2023a), HuggingGPT (Shen et al., 2023), and | |
ViperGPT (Surís et al., 2023) have introduced end-to-end
plan-and-solve frameworks. LLMCompiler sets itself | |
apart by providing a general framework that enables efficient and accurate function calling in a broader range of | |
problems. This stems from LLMCompiler’s capabilities | |
in (i) planning and replanning; (ii) parallel execution; and | |
(iii) addressing a wider range of problem domains, which | |
will be discussed in more detail in Appendix F. | |
Another notable work is ReWOO (Xu et al., 2023) which | |
employs a planner to separate the reasoning process from the | |
execution and observation phases to decrease token usage | |
and cost as compared to ReAct. Our approach is different | |
from ReWOO in multiple aspects. First, LLMCompiler | |
allows parallel function calling which can reduce latency | |
as well as cost. Second, LLMCompiler supports dynamic | |
replanning which is important for problems whose execution flow cannot be determined statically in the beginning | |
(Sec. 5.3). | |
2.3. Tool-Augmented LLMs | |
The enhanced reasoning capability of LLMs has enabled | |
them to invoke user-provided functions and use their outputs | |
to effectively complete tasks. Detailed exploration of this | |
subject is provided in Appendix C.1.
3. Methodology | |
To illustrate the components of LLMCompiler, we use a | |
simple 2-way parallel example in Fig. 2. To answer “How | |
much does Microsoft’s market cap need to increase to exceed Apple’s market cap?,” the LLM first needs to conduct | |
web searches for both companies’ market caps, followed by | |
a division operation. While the existing frameworks, including ReAct, perform these tasks sequentially, it is evident that | |
they can be executed in parallel. The key question is how to | |
automatically determine which tasks are parallelizable and | |
which are interdependent, so we can orchestrate the execution of the different tasks accordingly. LLMCompiler accomplishes this through a system that consists of the following three components: a Function Calling Planner (Sec. 3.1)
that generates a sequence of tasks and their dependencies; a | |
Task Fetching Unit (Sec. 3.2) that replaces arguments based | |
on intermediate results and fetches the tasks; and an Executor (Sec. 3.3) that executes the tasks with associated tools. | |
To use LLMCompiler, users are only required to provide | |
tool definitions, and optional in-context examples for the | |
Planner, as will be further discussed in Sec. 4.1. | |
3.1. Function Calling Planner | |
The Function Calling Planner is responsible for generating a | |
sequence of tasks to be executed along with any dependency | |
among them. For instance, Tasks $1 and $2 in Fig. 2 are | |
two independent searches that can be performed in parallel. However, Task $3 has a dependency on the outcomes | |
of the first and second searches. Therefore, the Planner’s | |
role is to automatically identify the necessary tasks, their | |
input arguments, as well as their inter-dependencies using | |
the sophisticated reasoning capability of LLMs, essentially | |
forming a directed acyclic graph of task dependencies. If | |
a task is dependent on a preceding task, it incorporates a | |
placeholder variable, such as $1 in Task 3 of Fig. 2, which | |
will later be substituted with the actual output from the | |
preceding task (Sec. 3.2). | |
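For illustration, a plan for this query could be written in the Planner's numbered-task format (cf. Appendix G) roughly as follows; the specific tool names and arguments below are our own sketch, assuming the search and math tools used later in Sec. 5.2:
1. search("Microsoft market cap")
2. search("Apple market cap")
3. math("$2 / $1")
Thought: I can answer the question now.
4. finish()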
The Planner in LLMCompiler leverages LLMs’ reasoning | |
capability to decompose tasks from natural language inputs. | |
To achieve this, the Planner LLM incorporates a pre-defined | |
prompt that guides it on how to create dependency graphs | |
and to ensure correct syntax (see Appendix H for details). | |
Besides this, users also need to supply tool definitions and | |
optional in-context examples for the Planner. These examples provide detailed demonstrations of task decomposition | |
specific to a problem, helping the Planner to better understand the rules. Further details on user-supplied information | |
for LLMCompiler are elaborated in Sec. 4.1. In Sec. 4.2, | |
we introduce an additional optimization for the Planner that streams tasks as soon as they are created, instead of waiting | |
to complete the entire planning process. | |
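To make the plan representation concrete, the sketch below shows one way such Planner output could be parsed into a task graph. The Task class and the regular expression are illustrative assumptions, not the paper's released implementation:

import re
from dataclasses import dataclass, field

@dataclass
class Task:
    idx: int                                  # numerical identifier assigned by the Planner
    tool: str                                 # name of the tool to invoke
    args: str                                 # raw argument string, may contain $k placeholders
    deps: set = field(default_factory=set)    # indices of tasks this task depends on

def parse_plan(plan_text: str) -> dict:
    """Parse numbered Planner lines like `3. math("$2 / $1")` into a task DAG."""
    tasks = {}
    for line in plan_text.splitlines():
        m = re.match(r"\s*(\d+)\.\s*(\w+)\((.*)\)\s*$", line)
        if not m:
            continue  # skip "Thought: ..." lines and other non-task output
        idx, tool, args = int(m.group(1)), m.group(2), m.group(3)
        deps = {int(d) for d in re.findall(r"\$(\d+)", args)}  # $k placeholders define edges
        tasks[idx] = Task(idx, tool, args, deps)
    return tasks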
3.2. Task Fetching Unit | |
The Task Fetching Unit, inspired by the instruction fetching | |
units in modern computer architectures, fetches tasks to the | |
Executor as soon as they are ready for (parallel) execution | |
based on a greedy policy. Another key functionality is | |
to replace variables with the actual outputs from preceding | |
tasks, which were initially set as placeholders by the Planner. | |
For the example in Fig. 2, the variables $1 and $2 in Task $3
would be replaced with the actual market cap of Microsoft | |
and Apple. This can be implemented with a simple fetching | |
and queuing mechanism without a dedicated LLM. | |
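A rough sketch of this greedy fetching-and-substitution logic is given below, building on the hypothetical Task class from the previous sketch; the actual Task Fetching Unit is more elaborate, but, as noted above, it needs no dedicated LLM:

def ready_tasks(tasks: dict, done: dict) -> list:
    """Return tasks that have not run yet and whose dependencies have all completed."""
    return [t for t in tasks.values()
            if t.idx not in done and t.deps.issubset(done)]

def substitute(args: str, done: dict) -> str:
    """Replace $k placeholders with the observations of completed tasks."""
    # Replace higher indices first so that $12 is not partially clobbered by $1.
    for idx in sorted(done, reverse=True):
        args = args.replace(f"${idx}", str(done[idx]))
    return args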
3.3. Executor | |
The Executor asynchronously executes tasks fetched from | |
the Task Fetching Unit. As the Task Fetching Unit guarantees that all the tasks dispatched to the Executor are independent, it can simply execute them concurrently. The | |
Executor is equipped with user-provided tools, and it delegates the task to the associated tool. These tools can be | |
simple functions like a calculator, Wikipedia search, or API | |
calls, or they can even be LLM agents that are tailored for a | |
specific task. As depicted in the Executor block of Fig. 2, | |
each task has dedicated memory to store its intermediate | |
outcomes, similar to what typical sequential frameworks | |
do when aggregating observations as a single prompt (Yao | |
et al., 2022). Upon completion of the task, the final results | |
are forwarded as input to the tasks dependent on them. | |
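Conceptually, the Executor can be approximated with standard asynchronous execution. The sketch below uses asyncio together with the hypothetical helpers from the previous sketches and assumes the user-provided tools are async callables; it conveys the control flow rather than the released implementation:

import asyncio

async def run_task(task, tools: dict, done: dict) -> None:
    """Execute one task with its associated tool and store the result in its own slot."""
    fn = tools[task.tool]                  # user-provided tool, e.g. search or math
    args = substitute(task.args, done)     # fill in $k placeholders (Sec. 3.2)
    done[task.idx] = await fn(args)        # each task keeps its own intermediate outcome

async def execute_all(tasks: dict, tools: dict) -> dict:
    """Greedily dispatch all ready tasks in parallel until every task has run."""
    done: dict = {}
    while len(done) < len(tasks):
        batch = ready_tasks(tasks, done)
        assert batch, "no runnable task; the plan is cyclic or has a dangling dependency"
        await asyncio.gather(*(run_task(t, tools, done) for t in batch))
    return done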
3.4. Dynamic Replanning | |
In various applications, the execution graph may need to | |
adapt based on intermediate results that are a priori unknown. | |
An analogy in programming is branching, where the path of execution is determined only during runtime, depending | |
on which branch conditions are satisfied. Such dynamic execution patterns can also appear with LLM function calling. | |
For simple branching (e.g., if-else statements) one could | |
statically compile the execution flow and choose the right branch dynamically based on the intermediate results. However, for
more complex branching it may be better to do a recompilation or replanning based on the intermediate results. | |
When replanning, the intermediate results are sent back | |
from the Executor to the Function Calling Planner which | |
then generates a new set of tasks with their associated dependencies. These tasks are then sent to the Task Fetching | |
Unit and subsequently to the Executor. This cycle continues | |
until the desired final result is achieved and can be delivered | |
to the user. We show an example use case of this in Sec. 5.3 | |
for solving the Game of 24 using the Tree-of-Thoughts | |
approach. | |
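In pseudocode form, the replanning loop could look roughly like the following; planner_llm, joiner_llm, and the convention that the joiner returns None when replanning is needed are assumptions made for illustration, not the framework's exact interface:

async def run_with_replanning(query: str, planner_llm, joiner_llm, tools: dict,
                              max_rounds: int = 5) -> str:
    """Plan, execute, and replan until a final answer can be produced."""
    context = query
    for _ in range(max_rounds):
        plan_text = await planner_llm(context)            # Function Calling Planner
        tasks = parse_plan(plan_text)
        observations = await execute_all(tasks, tools)    # Task Fetching Unit + Executor
        answer = await joiner_llm(query, observations)    # None means replanning is needed
        if answer is not None:
            return answer
        # Feed the intermediate results back to the Planner and generate a new plan.
        context = f"{query}\nPrevious observations: {observations}"
    raise RuntimeError("replanning budget exhausted")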
4. LLMCompiler Details | |
4.1. User-Supplied Information | |
LLMCompiler requires two inputs from the user: | |
1. Tool Definitions: Users need to specify the tools that | |
LLMs can use, including their descriptions and argument | |
specifications. This is essentially the same requirement | |
as other frameworks like ReAct and OpenAI function | |
calling (a minimal sketch of such a definition is shown below).
2. In-context Examples for the Planner: Optionally, | |
users can provide LLMCompiler with examples of | |
how the Planner should behave. For instance, in the case | |
of Fig. 2, users may provide examples illustrating expected inter-task dependencies for certain queries. These | |
examples can help the Planner LLM understand how
to use various tools and generate the appropriate dependency graph for incoming inputs in the correct format. | |
In Appendix G, we include the examples that we used | |
in our evaluations. | |
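As a concrete illustration of the first input, a tool definition can be as small as a name, a description, and a callable. The schema below is a plausible sketch under that assumption, not the framework's exact interface, and the search body is a stub:

from typing import Callable

def make_tool(name: str, description: str, func: Callable) -> dict:
    """Bundle a user-provided function with the metadata the Planner prompt needs."""
    return {"name": name, "description": description, "func": func}

search_tool = make_tool(
    name="search",
    description="search(query: str) -> str: returns the first Wikipedia paragraph "
                "for the queried entity.",
    func=lambda query: "stubbed search result for " + query,  # replace with a real search
)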
4.2. Streamed Planner | |
The Planner may incur a non-trivial overhead for user | |
queries that involve many tasks, since it blocks the Task Fetching Unit and the Executor, which must wait for the Planner
output before initiating their processes. However, analogous | |
to instruction pipelining in modern computer systems, this | |
can be mitigated by enabling the Planner to asynchronously | |
stream the dependency graph, thereby allowing each task | |
to be immediately processed by the Executor as soon as its | |
dependencies are all resolved. In Table C.1, we present a | |
latency comparison of LLMCompiler with and without | |
the streaming mechanism across different benchmarks. The | |
results demonstrate consistent latency improvements with | |
streaming. Particularly, in the ParallelQA benchmark, the streaming feature leads to a latency gain of up to 1.3×. This | |
is attributed to the math tool’s longer execution time for | |
ParallelQA, which can effectively hide the Planner’s latency | |
in generating subsequent tasks, unlike the shorter execution | |
times of the search tool used in HotpotQA and Movie | |
Recommendation. | |
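A minimal sketch of this streaming behavior is shown below, assuming the planner LLM exposes an async stream of output chunks; each task is handed to a queue the moment its plan line is complete, reusing the hypothetical Task class from Sec. 3.1:

import asyncio
import re

async def stream_tasks(planner_stream, queue: asyncio.Queue) -> None:
    """Enqueue each task as soon as its plan line arrives, without waiting for the full plan."""
    buffer = ""
    async for chunk in planner_stream:          # chunks of streamed Planner output
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            m = re.match(r"\s*(\d+)\.\s*(\w+)\((.*)\)\s*$", line)
            if m:                               # a complete task line has arrived
                idx, tool, args = int(m.group(1)), m.group(2), m.group(3)
                deps = {int(d) for d in re.findall(r"\$(\d+)", args)}
                await queue.put(Task(idx, tool, args, deps))
    await queue.put(None)                       # sentinel: planning has finished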
5. Results | |
In this section, we evaluate LLMCompiler using a variety | |
of models and problem types. We use both the proprietary | |
GPT models and the open-source LLaMA-2 model, with | |
the latter demonstrating LLMCompiler’s capability in enabling parallel function calling in open-source models. Furthermore, there are various types of parallel function calling | |
patterns that can be addressed with LLMs. This ranges | |
from embarrassingly parallel patterns, where all tasks can | |
be executed in parallel without any dependencies between | |
them, to more complex dependency patterns, as illustrated | |
in Fig. 3. Importantly, we also assess LLMCompiler | |
on the Game of 24 benchmark, which involves dynamic | |
replanning based on intermediate results, highlighting its | |
adaptability to dynamic dependency graphs. Finally, we | |
apply LLMCompiler to the WebShop benchmark to showcase its potential in decision-making tasks. Overall, we start by presenting results for simple execution patterns and then move to more complex ones.
5.1. Embarrassingly Parallel Function Calling | |
The simplest scenario involves an LLM using a tool repeatedly for independent tasks such as conducting parallel | |
searches or analyses to gather information on different topics, like the pattern depicted in Fig. 3 (a). While these tasks | |
are independent of each other and can be executed in parallel, ReAct, along with other LLM solutions as they stand, | |
would need to run them sequentially. This leads to increased
latency and token consumption due to its frequent LLM invocations for each tool usage, as also illustrated in Fig. 1. In | |
this section, we demonstrate how LLMCompiler can identify parallelizable patterns and execute independent tasks | |
concurrently to resolve this issue. To do so, we use the | |
following two benchmarks: | |
• HotpotQA: A dataset that evaluates multi-hop reasoning (Yang et al., 2018). We only use the comparison dev | |
set. This contains 1.5k questions comparing two different | |
entities, thus exhibiting a 2-way embarrassingly parallel | |
execution pattern. An example question is shown in Fig. 1. | |
• Movie Recommendation: A dataset with 500 examples that ask the model to identify which of four candidate movies is most similar to another given set of four movies, exhibiting an 8-way
embarrassingly parallel pattern (Srivastava et al., 2022). | |
Experimental Setups. As a baseline method, we compare LLMCompiler with ReAct. We follow the ReAct | |
setup (Yao et al., 2022) using the same Wikipedia search | |
tool that LLMs can use to search for information. We did | |
not include the lookup tool since it is not relevant to our | |
problem setting. We have optimized the prompt and in-context examples for both ReAct and LLMCompiler to the best of our abilities. For all experiments across these datasets, we use gpt-3.5-turbo (1106 release). For the experiments using GPT, we additionally report the results using
OpenAI’s parallel function calling capability, which was | |
announced concurrently with our work. We also show how | |
LLMCompiler can be effectively combined with the open-source LLaMA-2 70B model to provide the model with
parallel function calling capabilities. For all experiments, | |
we have measured accuracy, end-to-end latency, as well as | |
input and output token usage. See Appendix D for details | |
on experimental setups. | |
Accuracy and Latency. We report the accuracy, end-to-end latency, and relative speedup of LLMCompiler compared to ReAct in Tab. 1. First, we observe that ReAct
consistently achieves lower accuracy compared to OpenAI | |
parallel function calling and LLMCompiler. We identify | |
two main failure modes in ReAct: (1) the tendency for redundant generation of prior function calls, a point also noted | |
in the original ReAct paper (Yao et al., 2022); and (2) premature early stopping based on the incomplete intermediate | |
results. In Appendix A, we offer a detailed analysis demonstrating how these two prevalent failure cases significantly hurt ReAct’s accuracy, and how they can be resolved with | |
LLMCompiler, leading to an accuracy enhancement of | |
up to 7 – 8%. Furthermore, we have conducted interventional experiments in which we incorporated ReAct-specific | |
prompts to avoid repetitive function calls and early stopping. | |
ReAct† in Tab. 1 refers to ReAct with this ReAct-specific prompt. This prompt yields a general accuracy improvement for ReAct† as compared to the original ReAct. Nevertheless, LLMCompiler still demonstrates on-par or better accuracy than ReAct†, as such prompting does not completely prevent ReAct's erroneous behavior.
Additionally, when compared to ReAct†, LLMCompiler demonstrates a noticeable speedup of 1.80× and 1.40× on HotpotQA with GPT and LLaMA, respectively. Similarly, LLMCompiler demonstrates 3.74× and 2.82× speedup on Movie Recommendation with each model. Note that we benchmark the latency of LLMCompiler against that of ReAct†, since the repeating and early stopping behavior of the original ReAct, as discussed above, makes its latency unpredictable and unsuitable for a fair comparison. LLMCompiler demonstrates a speedup of up to 35% compared to OpenAI parallel function calling, whose latency gain over ReAct is 1.61× and 2.76× on each benchmark.
Costs. Another important consideration of using LLMs | |
is cost, which depends on the input and output token usage. The costs for GPT experiments are provided in Tab. 2. | |
LLMCompiler is more cost-efficient than ReAct,
as it involves less frequent LLM invocations. Interestingly, | |
LLMCompiler also outperforms the recent OpenAI parallel function calling in cost efficiency. This is because | |
LLMCompiler’s planning phase is more prompt length | |
efficient than that of OpenAI parallel function calling since | |
our Planner’s in-context examples are rather short and only | |
include plans, not observations (see Appendix H). | |
5.2. Parallel Function Calling with Dependencies | |
The cases considered above are rather simple, as only one | |
tool is used and all tasks can be executed independently | |
of one another. However, similar to code execution in traditional code blocks, we may encounter function calling | |
scenarios that involve more complex dependencies. To systematically evaluate the capability to plan out function calling in scenarios that involve complex task dependencies, we | |
have designed a custom benchmark called ParallelQA. This | |
benchmark is designed to incorporate non-trivial function | |
calling patterns, including three different types of patterns in Fig. 3 (b) and (c). Inspired by the IfQA benchmark (Yu | |
et al., 2023), ParallelQA contains 113 examples that involve | |
mathematical questions on factual attributes of various entities. In particular, completing the task requires using two | |
tools (i.e., search and math tools), with the second tool’s argument depending on the result of the first tool’s output. We | |
have meticulously included questions that are answerable | |
only with information from Wikipedia’s first paragraph, effectively factoring out the failure cases due to unsuccessful | |
searches. See Appendix I for more details on ParallelQA.
Experimental Setups. Similar to Sec. 5.1, we use ReAct (Yao et al., 2022) as the main baseline. Here, both ReAct and LLMCompiler are equipped with two tools: (1) | |
the search tool, identical to the one mentioned in Sec. 5.1;
and (2) the math tool, which solves mathematical problems. | |
The math tool is inspired by Langchain's LLMMathChain (Langchain), which uses an LLM agent that interprets input queries and invokes the numexpr function
with the appropriate formula. This enables the math chain to | |
address a broad spectrum of math problems that are written | |
both in mathematical and verbal form. See Appendix D for | |
more details on experimental setups. | |
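A stripped-down version of such a math tool could look like the sketch below, where a hypothetical llm() helper rewrites the (possibly verbal) problem into a formula that numexpr evaluates; this mirrors the LLMMathChain idea rather than reproducing it exactly:

import numexpr

def math_tool(problem: str, llm) -> str:
    """Rewrite a math question into a numexpr-compatible expression and evaluate it."""
    # llm() is a hypothetical helper that returns only the bare expression, e.g. "3.1e12 / 2.8e12".
    expression = llm(
        "Rewrite the following question as a single arithmetic expression that "
        "numexpr can evaluate. Return only the expression.\n" + problem
    )
    return str(numexpr.evaluate(expression.strip()))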
Accuracy and Latency. As shown in the ParallelQA row | |
of Tab. 1, LLMCompiler arrives at the final answer with | |
an average speedup of 2.15× with gpt-4-turbo and 2.27× | |
with LLaMA-2 70B, by avoiding sequential execution of | |
the dependency graphs. Beyond the latency speedup, we observe higher accuracy of LLMCompiler with the LLaMA-2 model as compared to that of ReAct, due to the reasons discussed in Sec. 5.1. Particularly in the LLaMA-2 experiment, where LLMCompiler achieves around a
9% increase in accuracy, we note that ∼20% of the examples experienced repetitive function calls with ReAct, | |
aligning with our observations from the accuracy analysis detailed in Appendix A. Additionally, a comprehensive | |
analysis of LLMCompiler’s failure cases is provided in | |
Appendix B, where we note minimal Planner failures, highlighting LLMCompiler’s effectiveness in breaking down | |
problems into complex multi-task dependencies. | |
Cost. Similar to Sec. 5.1, LLMCompiler demonstrates | |
substantial cost reductions of 4.65× and 2.57× compared | |
to ReAct and OpenAI’s parallel function calling, respectively, as indicated in Tab. 2. This efficiency stems from | |
LLMCompiler’s reduced frequency of LLM invocations, | |
which also holds relative to OpenAI's parallel function calling, since the latter only plans out immediately parallelizable tasks rather than the entire dependency graph. For example, in
Fig. 3 (c), OpenAI’s method would necessitate three distinct | |
LLM calls for initial search tasks, following math tasks, and | |
the final math task. In contrast, LLMCompiler achieves | |
this with a single LLM call, planning all tasks concurrently. | |
5.3. Parallel Function Calling with Replanning | |
In the previous sections, we have discussed cases in which | |
dependency graphs can be determined statically. However, | |
there are cases where dependency graphs need to be constructed dynamically depending on intermediate observations. Here, we consider one such dynamic approach in the | |
context of the Game of 24 with the Tree-of-Thoughts (ToT) | |
strategy proposed in (Yao et al., 2023b). The Game of 24 is | |
a task to generate 24 using a set of four numbers and basic | |
arithmetic operations. For example, from the numbers 2, 4, | |
4, and 7, a solution could be 4 × (7 − 4) × 2 = 24. ToT | |
approaches this task through two iterative LLM processes: | |
(i) the thought proposer generates candidate partial solutions | |
by selecting two numbers and applying an operation (e.g. | |
2, 3, 4 from 2, 4, 4, 7 by calculating 7 - 4); (ii) the state
evaluator assesses the potential of each candidate. Only | |
the promising candidates are then processed in subsequent | |
iterations of the thought proposer and state evaluator until | |
24 is reached. Details about the Game of 24 benchmark and | |
the ToT strategy can be found in Appendix J. | |
While ToT achieves significant improvement at solving | |
the Game of 24, its sequential, breadth-first search approach through the state tree can be time-consuming. | |
LLMCompiler offers a faster alternative by enabling parallel execution of the thought proposer and the subsequent | |
state evaluator, akin to a parallel beam search method.
Experimental Setups. Although LLMCompiler offers | |
latency advantages, solving this problem with a single | |
static graph is not feasible, as the Planner cannot plan | |
out the thought proposing stage before identifying the | |
selected candidates from the state evaluator of the previous iteration. Consequently, the Planner is limited to | |
planning only within one iteration at a time. To address | |
this, we resort to LLMCompiler’s replanning capability. In particular, LLMCompiler is equipped with three tools: thought_proposer and state_evaluator, which are both LLMs adapted from the original ToT framework, and top_k_select, which chooses the top k candidates from the thought_proposer based on the state_evaluator’s assessment. After all these tools
are executed, LLMCompiler can decide to “replan” if no | |
proposal reaches 24, triggering the Planner to devise new | |
plans using the shortlisted states from top k select of | |
the previous iteration. In this way, LLMCompiler can dynamically regenerate plans of each iteration, being able to | |
tackle highly complex tasks that require iterative replanning | |
based on the outcomes of previous plans. | |
To evaluate LLMCompiler’s performance on the Game of | |
24, we use 100 different instances of the game. For each | |
problem, we consider the output as successful if its operations are valid and yield 24 while also using the provided numbers exactly once each. Further details on experiment setups are outlined in Appendix D. | |
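This success criterion can be checked mechanically. The small checker below is our own illustrative sketch (using Python's ast module to read the numbers out of a proposed expression); it is not part of the benchmark itself:

import ast
from collections import Counter

def is_valid_24(expression: str, numbers: list) -> bool:
    """Check that the expression uses the given numbers exactly once each and evaluates to 24."""
    tree = ast.parse(expression, mode="eval")
    used = [n.value for n in ast.walk(tree)
            if isinstance(n, ast.Constant) and isinstance(n.value, (int, float))]
    if Counter(used) != Counter(numbers):
        return False                    # a number is missing, repeated, or extraneous
    try:
        value = eval(compile(tree, "<expr>", "eval"))   # arithmetic-only input in this setting
    except ZeroDivisionError:
        return False
    return abs(value - 24) < 1e-6

# Example: is_valid_24("4 * (7 - 4) * 2", [2, 4, 4, 7]) returns True.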
Success Rate and Latency. In the last two rows of Tab. 1, | |
we explore the latency and success rate of LLMCompiler | |
in comparison to the baseline described in (Yao et al., 2023b) | |
on the Game of 24 benchmark. With the gpt-4 model, | |
LLMCompiler demonstrates a 2.89× latency speedup while slightly improving the success rate compared
to the baseline. Similarly, when applied with the LLaMA-2 | |
model, LLMCompiler shows a 2.01× improvement in latency, again without compromising on success rate. These | |
results demonstrate not only a significant latency reduction | |
without quality degradation, but also the replanning capability of LLMCompiler for solving complex problems. | |
5.4. Application: LLMCompiler in Interactive Decision-Making Tasks
In this section, we demonstrate that LLMCompiler can | |
explore language-based interactive environments effectively | |
by benchmarking LLMCompiler on WebShop (Yao et al., | |
2023a). As highlighted in (Shinn et al., 2023; Yao et al., | |
2022; 2023a), WebShop exhibits considerable diversity, | |
which requires extensive exploration to purchase the most | |
appropriate item. While recent works feature advanced exploration strategies and show promising results (Ma et al., 2023;
Zhou et al., 2023a), their approaches are largely based on a | |
sequential and extensive tree search that incurs significant | |
latency penalties. Here, LLMCompiler showcases an exploration strategy that is both effective and efficient with the | |
use of parallel function calling. Our method enables broader | |
exploration of items in the environment, which improves | |
success rate compared to ReAct. At the same time, this exploration can be parallelized, yielding up to 101.7× speedup | |
against baselines that perform sequential exploration. | |
Experimental Setups. We evaluate LLMCompiler | |
against three baselines on this benchmark, ReAct (Yao | |
et al., 2022), LATS (Zhou et al., 2023a), and LASER (Ma | |
et al., 2023), using 500 WebShop instructions. The evaluation metrics are success rate, average score, and latency. | |
More details of the WebShop environment and the baseline | |
methods are provided in Appendix K. For this experiment, | |
LLMCompiler is equipped with two tools: search and | |
explore. The search function triggers the model to | |
generate and dispatch a query that returns a list of typically | |
ten items from the WebShop environment. The explore
function then clicks through links for each of the found | |
items and retrieves information about options, prices, attributes, and features that are available. Finally, based on | |
the gathered information, LLMCompiler decides on the | |
item that best matches the input instruction for purchasing. | |
Further details on experiments can be found in Appendix D. | |
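The parallel part of this pipeline amounts to fanning the explore calls out over the items returned by search. The rough asyncio sketch below assumes hypothetical search, explore, and decide callables and is only meant to convey the structure:

import asyncio

async def browse_and_buy(instruction: str, search, explore, decide):
    """Search once, explore every returned item concurrently, then pick the best match."""
    items = await search(instruction)                        # typically around ten items
    details = await asyncio.gather(*(explore(item) for item in items))
    return decide(instruction, list(zip(items, details)))    # final purchase decision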
Performance and Latency. Our approach significantly | |
outperforms all baseline models as shown in Table 3. When | |
using gpt-3.5-turbo, LLMCompiler achieves a 28.4% | |
and 6% improvement in success rate against ReAct and | |
LATS; with gpt-4, our method improves upon ReAct and | |
LASER by 20.4% and 5.6%, respectively. In terms of | |
latency, LLMCompiler exhibits a 101.7× and 2.69× | |
speedup against LATS and LASER. While we note that | |
LLMCompiler execution is slightly slower than ReAct on | |
this benchmark, mainly due to the Planner overhead, we | |
also highlight that the gains in success rate far outweigh the | |
minor latency penalty. | |
We further delve into why LLMCompiler attains such an | |
improved success rate and score compared to ReAct. Based | |
on our observations, we discover that the ReAct agent tends | |
to commit to a decision with imperfect information, a scenario that can arise when the agent has not gathered sufficient details about the features and options available for | |
items. This observation was also noted in (Shinn et al., | |
2023) – without exploring more items in the environment, | |
the agent struggles to differentiate between seemingly similar choices, ultimately failing to make the correct decision. | |
In contrast, LLMCompiler undergoes further exploration | |
by visiting all ten items found by search and retrieving | |
relevant information about each item. We find that employing an effective search strategy is critical to decision-making | |
tasks such as the WebShop benchmark. | |
The relatively high performance of LATS can also be explained in terms of its exploration scheme. In this framework, the agent executes a brute-force search through the | |
state and action space of WebShop, exploring as many as
30 trajectories before making the final purchase. While this | |
approach provides richer information for decision-making, | |
the end-to-end execution becomes prohibitively slow. | |
We report that our method, LLMCompiler, outperforms | |
LASER by an average score of 1.5. When compared to | |
LATS, this score is within the standard deviation range of | |
our method. The average score for LLMCompiler, along | |
with its standard deviation, is 72.8 ± 4.01 for gpt-3.5-turbo. | |
Further note that while the performance differences are | |
marginal, our method exhibits significant execution speedup, | |
101.7× over LATS and 2.69× over LASER. | |
6. Conclusions | |
Existing methods for invoking multiple functions with | |
LLMs resort to sequential and dynamic reasoning. As a | |
result, they suffer from inefficiencies in latency, cost, and | |
accuracy. As a solution, we introduced LLMCompiler, | |
a compiler-inspired framework that enables efficient parallel function calling across various LLMs, including open-source models like LLaMA-2 and OpenAI’s GPT series. By decomposing user inputs into tasks with defined
inter-dependencies and executing these tasks concurrently | |
through its Planner, Task Fetching Unit, and Executor components, LLMCompiler demonstrates substantial improvements in latency (up to 3.7×), cost efficiency (up to 6.7×), | |
and accuracy (up to ∼9%), even outperforming OpenAI’s | |
parallel function calling feature in latency gains. We look | |
forward to future work building upon LLMCompiler that | |
will improve both the capabilities and efficiencies of LLMs | |
in executing complex, large-scale tasks, thus transforming | |
the future development of LLM-based applications. | |
Impact Statement | |
This paper presents research towards advancing the field of | |
Machine Learning. While there are many potential societal | |
consequences of our work, we do not find any one to be | |
particularly noteworthy. | |
Acknowledgements | |
We appreciate the valuable feedback from Minwoo Kang. | |
We acknowledge gracious support from the Furiosa team. We
also appreciate the support from Microsoft through their | |
Accelerating Foundation Model Research, including great | |
support from Sean Kuno. Furthermore, we appreciate support from Google Cloud, the Google TRC team, and specifically Jonathan Caton, and Prof. David Patterson. Prof. | |
Keutzer’s lab is sponsored by the Intel corporation, Intel | |
One-API, Intel VLAB team, the Intel One-API center of | |
excellence, as well as funding through BDD and BAIR. We | |
also appreciate support from Samsung including Dongkyun | |
Kim, and David Thorsley. We appreciate the great support | |
from Ellick Chan, Saurabh Tangri, Andres Rodriguez, and | |
Kittur Ganesh. Sehoon Kim and Suhong Moon would like | |
to acknowledge the support from the Korea Foundation for | |
Advanced Studies. Amir Gholami was supported through | |
funding from Samsung SAIT. Michael W. Mahoney would | |
also like to acknowledge a J. P. Morgan Chase Faculty Research Award as well as the DOE, NSF, and IARPA. Our | |
conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred. | |
G. User-Supplied Examples for LLMCompiler Configuration | |
LLMCompiler provides a simple interface that allows for tailoring the framework to different use cases by providing tool | |
definitions as well as optional in-context examples for the Planner. Below, we provide the Planner example prompts that are | |
used to set up the framework for the Movie Recommendation and Game of 24 benchmarks with only a few lines of prompts. | |
G.1. Movie Recommendation Example Prompts | |
### | |
Question: Find a movie similar to Mission Impossible, The Silence of the | |
Lambs, American Beauty, Star Wars Episode IV - A New Hope | |
Options: | |
Austin Powers International Man of Mystery | |
Alesha Popvich and Tugarin the Dragon | |
In Cold Blood | |
Rosetta | |
1. search("Mission Impossible") | |
2. search("The Silence of the Lambs") | |
3. search("American Beauty") | |
4. search("Star Wars Episode IV - A New Hope") | |
5. search("Austin Powers International Man of Mystery") | |
6. search("Alesha Popvich and Tugarin the Dragon") | |
7. search("In Cold Blood") | |
8. search("Rosetta") | |
Thought: I can answer the question now. | |
9. finish() | |
### | |
H. Pre-defined LLMCompiler Planner Prompts | |
### | |
The pre-defined LLMCompiler Planner prompt provides it with specific instructions on how to break down tasks and | |
generate dependency graphs while ensuring that the associated syntax is formatted correctly. This prompt contains specific | |
rules such as assigning each task to a new line, beginning each task with a numerical identifier, and using the $ sign to | |
denote intermediate variables. | |
### | |
In addition to user-provided functions, the Planner includes a special, hard-coded finish function. The Planner uses this | |
function either when the plan is sufficient to address the user query or when it can no longer proceed with planning before | |
executing the current plan, i.e., when it deems replanning necessary. When the Planner outputs the finish function, its | |
plan generation stops. Refer to Appendix G for examples of the Planner’s usage of the finish function in planning. The | |
definition of the finish function is as below and is included as a prompt to the Planner along with the definitions of other | |
user-provided functions. | |
### | |
finish(): | |
- Collects and combines results from prior actions. | |
- An LLM agent is called upon invoking join to either finalize the user
query or wait until the plans are executed. | |
- join should always be the last action in the plan, and will be called in | |
two scenarios: | |
(a) if the answer can be determined by gathering the outputs from tasks to | |
generate the final response. | |
(b) if the answer cannot be determined in the planning phase before you | |
execute the plans. | |
### |