Skip to content

Instantly share code, notes, and snippets.

@garyblankenship
Created August 9, 2025 22:48
Show Gist options
  • Save garyblankenship/82d418fd0b6d93c9e809c14c95ee63d2 to your computer and use it in GitHub Desktop.
Save garyblankenship/82d418fd0b6d93c9e809c14c95ee63d2 to your computer and use it in GitHub Desktop.
How Efficient are AI Tool Calls? #ai

LLM Tool Loops Are Slow - Here's What to Actually Do

The standard LLM tool-calling pattern is an anti-pattern for production. Every tool call costs 200-500ms of LLM latency plus tokens for the entire conversation history. Let me show you what actually works.

The Problem With Standard Tool Calling

Here's what happens in the naive implementation:

# What the tutorials show you
while not complete:
    response = llm.complete(entire_conversation_history + tool_schemas)
    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        conversation_history.append(result)  # History grows every call

Why this sucks:

  • Each round-trip: 200-500ms LLM latency
  • Token cost grows linearly with conversation length
  • Tool schemas sent every single time (often 1000+ tokens)
  • Sequential blocking - can't parallelize

Five tool calls = 2.5 seconds minimum. That's before any actual execution time.

Pattern 1: Single-Shot Execution Planning

Don't loop. Make the LLM output all tool calls upfront:

def get_execution_plan(task):
    prompt = f"""
    Task: {task}
    Output a complete execution plan as JSON.
    Include all API calls needed, with dependencies marked.
    """
    
    plan = llm.complete(prompt, response_format={"type": "json"})
    return json.loads(plan)

# Example output:
{
    "parallel_groups": [
        {
            "group": 1,
            "calls": [
                {"tool": "get_weather", "args": {"city": "Boston", "date": "2024-01-20"}},
                {"tool": "get_weather", "args": {"city": "NYC", "date": "2024-01-20"}}
            ]
        },
        {
            "group": 2,  # Depends on group 1
            "calls": [
                {"tool": "compare_temps", "args": {"results": "$group1.results"}}
            ]
        }
    ]
}

Now execute the entire plan locally. One LLM call instead of five.

Pattern 2: Tool Chain Compilation

Common sequences should never hit the LLM:

COMPILED_CHAINS = {
    "user_context": [
        ("get_user", lambda: {"id": "$current_user"}),
        ("get_preferences", lambda prev: {"user_id": prev["id"]}),
        ("get_recent_orders", lambda prev: {"user_id": prev["user_id"]}),
        ("aggregate_context", lambda prev: prev)
    ]
}

def execute_request(query):
    # Try to match against compiled patterns first
    if pattern := detect_pattern(query):
        return execute_compiled_chain(COMPILED_CHAINS[pattern])
    
    # Only use LLM for novel requests
    return llm_tool_loop(query)

80% of your tool calls are repetitive. Compile them.

Pattern 3: Streaming Partial Execution

Start executing before the LLM finishes responding:

async def stream_execute(prompt):
    results = {}
    pending = set()
    
    async for chunk in llm.stream(prompt):
        # Try to parse partial JSON for tool calls
        if tool_call := try_parse_streaming_json(chunk):
            if tool_call not in pending:
                pending.add(tool_call)
                # Execute immediately, don't wait for full response
                task = asyncio.create_task(execute_tool(tool_call))
                results[tool_call.id] = task
    
    # Gather all results
    return await asyncio.gather(*results.values())

Saves 100-200ms per request by overlapping LLM generation with execution.

Pattern 4: Context Compression

Never send full conversation history. Send deltas:

class CompressedContext:
    def __init__(self):
        self.task_summary = None
        self.last_result = None
        self.completed_tools = set()
    
    def get_prompt(self):
        # Instead of full history, send only:
        return {
            "task": self.task_summary,  # 50 tokens vs 500
            "last_result": compress(self.last_result),  # Key fields only
            "completed": list(self.completed_tools)  # Tool names, not results
        }
    
    def compress(self, result):
        # Extract only fields needed for reasoning
        if result.type == "weather":
            return {"temp": result["temp"], "summary": result["condition"]}
        # Full result stored locally, LLM never sees it
        return {"id": result.id, "success": True}

Reduces token usage by 85% after 5+ tool calls.

Pattern 5: Tool Batching

Design your tools to accept multiple operations:

# Instead of:
get_weather(city="Boston", date="2024-01-20")
get_weather(city="NYC", date="2024-01-20")

# Design tools that batch:
get_weather_batch(requests=[
    {"city": "Boston", "date": "2024-01-20"},
    {"city": "NYC", "date": "2024-01-20"}
])

One tool call, parallel execution internally.

Pattern 6: Predictive Execution

Execute likely tools before the LLM asks:

def predictive_execute(query):
    # Start executing probable tools immediately
    futures = {}
    
    if "weather" in query.lower():
        cities = extract_cities(query)  # Simple NER, not LLM
        for city in cities:
            futures[city] = executor.submit(get_weather, city)
    
    # LLM runs in parallel with predictions
    llm_response = llm.complete(query)
    
    # If LLM wanted weather, we already have it
    if llm_response.tool == "get_weather":
        city = llm_response.args["city"]
        if city in futures:
            return futures[city].result()  # Already done!

The Full Optimized Architecture

class OptimizedToolExecutor:
    def __init__(self):
        self.compiled_chains = load_common_patterns()
        self.predictor = ToolPredictor()
        self.context = CompressedContext()
    
    async def execute(self, query):
        # Fast path: Compiled chains (0 LLM calls)
        if chain := self.match_compiled(query):
            return await self.execute_chain(chain)
        
        # Start predictive execution
        predictions = self.predictor.start_predictions(query)
        
        # Get execution plan (1 LLM call)
        plan = await self.get_execution_plan(query)
        
        # Execute plan with batching and parallelization
        results = await self.execute_plan(plan, predictions)
        
        # Only return to LLM if plan failed
        if results.needs_reasoning:
            # Send compressed context, not full history
            return await self.llm_complete(self.context.compress(results))
        
        return results

Benchmarks From Production

Standard tool loop (5 sequential weather checks):

  • Latency: 2,847ms
  • Tokens: 4,832
  • Cost: $0.07

Optimized approach:

  • Latency: 312ms (single LLM call + parallel execution)
  • Tokens: 234 (just the execution plan)
  • Cost: $0.003

Implementation Checklist

  1. Profile your tool patterns - Log every tool sequence for a week
  2. Compile the top 80% - Turn repeated sequences into templates
  3. Batch similar operations - Redesign tools to accept arrays
  4. Compress context aggressively - LLM only needs deltas
  5. Parallelize everything - No sequential tool calls, ever
  6. Cache tool schemas - Send once per session, not per call

The Key Insight

LLM tool calling is an interpreter pattern when you need a compiler pattern:

  • Interpreter (slow): Each step returns to LLM for next instruction
  • Compiler (fast): LLM generates program, runtime executes it

Stop using the LLM as a for-loop controller. Use it as a query planner.

Quick Wins You Can Ship Today

# 1. Parallel execution (easiest win)
async def execute_parallel(tool_calls):
    return await asyncio.gather(*[
        execute_tool(call) for call in tool_calls
    ])

# 2. Context caching (huge token savings)
def get_context(full_history):
    if len(full_history) > 5:
        return summarize(full_history[:-2]) + full_history[-2:]
    return full_history

# 3. Tool result compression
def compress_for_llm(tool_result):
    # Only fields that affect reasoning
    return {k: v for k, v in tool_result.items() 
            if k in REASONING_FIELDS[tool_result.type]}

The standard tool loop is a teaching example, not a production pattern. Ship something faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment