The standard LLM tool-calling pattern is an anti-pattern for production. Every tool call costs 200-500ms of LLM latency plus tokens for the entire conversation history. Let me show you what actually works.
Here's what happens in the naive implementation:
# What the tutorials show you
while not complete:
    response = llm.complete(entire_conversation_history + tool_schemas)
    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        conversation_history.append(result)  # History grows every call
Why this sucks:
- Each round-trip: 200-500ms LLM latency
- Token cost grows linearly with conversation length
- Tool schemas sent every single time (often 1000+ tokens)
- Sequential blocking - can't parallelize
Five tool calls = 1 to 2.5 seconds of pure LLM latency. That's before any actual execution time.
Don't loop. Make the LLM output all tool calls upfront:
import json

def get_execution_plan(task):
    prompt = f"""
    Task: {task}
    Output a complete execution plan as JSON.
    Include all API calls needed, with dependencies marked.
    """
    plan = llm.complete(prompt, response_format={"type": "json"})
    return json.loads(plan)
# Example output:
{
    "parallel_groups": [
        {
            "group": 1,
            "calls": [
                {"tool": "get_weather", "args": {"city": "Boston", "date": "2024-01-20"}},
                {"tool": "get_weather", "args": {"city": "NYC", "date": "2024-01-20"}}
            ]
        },
        {
            "group": 2,  # Depends on group 1
            "calls": [
                {"tool": "compare_temps", "args": {"results": "$group1.results"}}
            ]
        }
    ]
}
Now execute the entire plan locally. One LLM call instead of five.
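The plan executor itself never touches the LLM. Here's a minimal sketch, assuming a hypothetical TOOLS registry of async tool implementations and treating a "$groupN.results" placeholder as "all results from group N" (both are assumptions, not part of the plan format above):

import asyncio

# Hypothetical registry mapping tool names to async implementations (assumption)
TOOLS = {"get_weather": get_weather, "compare_temps": compare_temps}

async def execute_plan(plan):
    group_results = {}
    for group in plan["parallel_groups"]:
        calls = group["calls"]
        # Resolve "$groupN.results" placeholders against already-finished groups
        for call in calls:
            for key, value in call["args"].items():
                if isinstance(value, str) and value.startswith("$group"):
                    ref = int(value[len("$group"):].split(".")[0])
                    call["args"][key] = group_results[ref]
        # Calls inside a group are independent, so run them concurrently
        group_results[group["group"]] = await asyncio.gather(*[
            TOOLS[call["tool"]](**call["args"]) for call in calls
        ])
    return group_results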
Common sequences should never hit the LLM:
COMPILED_CHAINS = {
    "user_context": [
        ("get_user", lambda: {"id": "$current_user"}),
        ("get_preferences", lambda prev: {"user_id": prev["id"]}),
        ("get_recent_orders", lambda prev: {"user_id": prev["user_id"]}),
        ("aggregate_context", lambda prev: prev)
    ]
}
def execute_request(query):
    # Try to match against compiled patterns first
    if pattern := detect_pattern(query):
        return execute_compiled_chain(COMPILED_CHAINS[pattern])
    # Only use LLM for novel requests
    return llm_tool_loop(query)
80% of your tool calls are repetitive. Compile them.
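execute_compiled_chain doesn't need to be clever. A minimal sketch, assuming a hypothetical TOOLS dict that maps tool names to plain functions, where each step's lambda builds arguments from the previous step's result:

def execute_compiled_chain(chain):
    prev = None
    for tool_name, build_args in chain:
        # First step builds its args from nothing; later steps use the previous result
        args = build_args() if prev is None else build_args(prev)
        prev = TOOLS[tool_name](**args)
    return prev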
Start executing before the LLM finishes responding:
import asyncio

async def stream_execute(prompt):
    results = {}
    pending = set()
    async for chunk in llm.stream(prompt):
        # Try to parse partial JSON for tool calls
        if tool_call := try_parse_streaming_json(chunk):
            if tool_call.id not in pending:
                pending.add(tool_call.id)
                # Execute immediately, don't wait for full response
                task = asyncio.create_task(execute_tool(tool_call))
                results[tool_call.id] = task
    # Gather all results
    return await asyncio.gather(*results.values())
Saves 100-200ms per request by overlapping LLM generation with execution.
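try_parse_streaming_json is doing the interesting work above. One way to sketch it, assuming the model streams a JSON array of tool-call objects and ignoring the edge case of braces inside string values: buffer the chunks and emit each object as soon as its closing brace arrives.

import json

class StreamingToolCallParser:
    def __init__(self):
        self.buffer = ""
        self.scanned = 0  # everything before this index belongs to already-emitted objects

    def feed(self, chunk):
        # Returns every tool-call object completed by this chunk (possibly none, possibly several)
        self.buffer += chunk
        calls, depth, start = [], 0, None
        for i in range(self.scanned, len(self.buffer)):
            ch = self.buffer[i]
            if ch == "{":
                if depth == 0:
                    start = i
                depth += 1
            elif ch == "}" and depth > 0:
                depth -= 1
                if depth == 0:
                    calls.append(json.loads(self.buffer[start:i + 1]))
                    self.scanned = i + 1
        return calls

Since one chunk can close more than one object, the streaming loop would iterate over feed(chunk) rather than checking a single return value.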
Never send full conversation history. Send deltas:
class CompressedContext:
    def __init__(self):
        self.task_summary = None
        self.last_result = None
        self.completed_tools = set()

    def get_prompt(self):
        # Instead of full history, send only:
        return {
            "task": self.task_summary,                        # 50 tokens vs 500
            "last_result": self.compress(self.last_result),   # Key fields only
            "completed": list(self.completed_tools)           # Tool names, not results
        }

    def compress(self, result):
        # Extract only fields needed for reasoning
        if result["type"] == "weather":
            return {"temp": result["temp"], "summary": result["condition"]}
        # Full result stored locally, LLM never sees it
        return {"id": result["id"], "success": True}
Reduces token usage by 85% after 5+ tool calls.
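In use, the context object replaces conversation_history entirely; only the compressed view ever goes back to the model. A sketch, where next_call, execute_tool, and llm are the same hypothetical pieces used throughout this post:

context = CompressedContext()
context.task_summary = "Compare weather in Boston and NYC on 2024-01-20"

result = execute_tool(next_call)              # full result stays local
context.last_result = result
context.completed_tools.add(next_call.tool)

# The model sees a few dozen tokens of state, not the whole transcript
next_step = llm.complete(json.dumps(context.get_prompt()))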
Design your tools to accept multiple operations:
# Instead of:
get_weather(city="Boston", date="2024-01-20")
get_weather(city="NYC", date="2024-01-20")

# Design tools that batch:
get_weather_batch(requests=[
    {"city": "Boston", "date": "2024-01-20"},
    {"city": "NYC", "date": "2024-01-20"}
])
One tool call, parallel execution internally.
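On the tool side, the batch version just fans out internally. A sketch, assuming an async per-city fetcher named fetch_weather (a hypothetical name, not defined above):

import asyncio

async def get_weather_batch(requests):
    # Run every lookup concurrently inside the tool itself
    results = await asyncio.gather(*[
        fetch_weather(r["city"], r["date"]) for r in requests
    ])
    # Key results by city so downstream steps can reference them
    return {r["city"]: result for r, result in zip(requests, results)}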
Execute likely tools before the LLM asks:
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()

def predictive_execute(query):
    # Start executing probable tools immediately
    futures = {}
    if "weather" in query.lower():
        cities = extract_cities(query)  # Simple NER, not LLM
        for city in cities:
            futures[city] = executor.submit(get_weather, city)
    # LLM runs in parallel with the predictions
    llm_response = llm.complete(query)
    # If the LLM wanted weather, we already have it
    if llm_response.tool == "get_weather":
        city = llm_response.args["city"]
        if city in futures:
            return futures[city].result()  # Already done!
    # Otherwise fall back to executing whatever the LLM actually asked for
    return execute_tool(llm_response)
Putting it all together:

class OptimizedToolExecutor:
    def __init__(self):
        self.compiled_chains = load_common_patterns()
        self.predictor = ToolPredictor()
        self.context = CompressedContext()

    async def execute(self, query):
        # Fast path: compiled chains (0 LLM calls)
        if chain := self.match_compiled(query):
            return await self.execute_chain(chain)
        # Start predictive execution
        predictions = self.predictor.start_predictions(query)
        # Get execution plan (1 LLM call)
        plan = await self.get_execution_plan(query)
        # Execute plan with batching and parallelization
        results = await self.execute_plan(plan, predictions)
        # Only return to the LLM if the plan failed
        if results.needs_reasoning:
            # Send compressed context, not full history
            return await self.llm_complete(self.context.compress(results))
        return results
Standard tool loop (5 sequential weather checks):
- Latency: 2,847ms
- Tokens: 4,832
- Cost: $0.07
Optimized approach:
- Latency: 312ms (single LLM call + parallel execution)
- Tokens: 234 (just the execution plan)
- Cost: $0.003
- Profile your tool patterns - Log every tool sequence for a week (see the logging sketch after this list)
- Compile the top 80% - Turn repeated sequences into templates
- Batch similar operations - Redesign tools to accept arrays
- Compress context aggressively - LLM only needs deltas
- Parallelize everything - No sequential tool calls, ever
- Cache tool schemas - Send once per session, not per call
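For the first item, the profiling hook can be as simple as one JSON line per tool call plus an offline pass that counts tool sequences. A sketch (field names and the sequence length are arbitrary choices, not a prescribed format):

import json, time
from collections import Counter

def log_tool_call(log_file, session_id, tool_name, args):
    # One JSON line per call; cheap enough to leave on in production
    log_file.write(json.dumps({
        "ts": time.time(), "session": session_id,
        "tool": tool_name, "arg_keys": sorted(args)
    }) + "\n")

def top_sequences(calls_by_session, n=3):
    # Count length-n tool sequences per session; the frequent ones become compiled chains
    counts = Counter()
    for calls in calls_by_session.values():
        for i in range(len(calls) - n + 1):
            counts[tuple(calls[i:i + n])] += 1
    return counts.most_common(20)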
LLM tool calling is an interpreter pattern when you need a compiler pattern:
- Interpreter (slow): Each step returns to LLM for next instruction
- Compiler (fast): LLM generates program, runtime executes it
Stop using the LLM as a for-loop controller. Use it as a query planner.
import asyncio

# 1. Parallel execution (easiest win)
async def execute_parallel(tool_calls):
    return await asyncio.gather(*[
        execute_tool(call) for call in tool_calls
    ])

# 2. Context caching (huge token savings)
def get_context(full_history):
    if len(full_history) > 5:
        return summarize(full_history[:-2]) + full_history[-2:]
    return full_history

# 3. Tool result compression
def compress_for_llm(tool_result):
    # Only keep fields that affect reasoning
    return {k: v for k, v in tool_result.items()
            if k in REASONING_FIELDS[tool_result["type"]]}
The standard tool loop is a teaching example, not a production pattern. Ship something faster.