
@daviddahl
Last active June 24, 2025 06:20
LLM.swift configuration analysis

Compute Graph Memory Allocation Fix

Problem Analysis

The 422MB allocation failure is happening because LLM.swift sets:

contextParams.n_ctx = UInt32(maxTokenCount)      // e.g., 2048
contextParams.n_batch = contextParams.n_ctx      // ALSO 2048!

This means the compute graph is sized for processing ALL tokens at once, which is wasteful.

Root Cause

From llama.cpp logs:

llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
ggml_aligned_malloc: insufficient memory (attempted to allocate 422.02 MB)

The compute graph size is proportional to n_batch. With n_batch = n_ctx, it's allocating memory for the worst case of processing the entire context at once.

Detailed Explanation

1. What is the Compute Graph?

The compute graph is a pre-allocated memory buffer that llama.cpp uses to store all the intermediate calculations during inference. Think of it as workspace memory where the model does its math.

2. Why 422MB?

When LLM.swift creates a context, it sets two critical parameters:

contextParams.n_ctx = 2048      // Maximum context size
contextParams.n_batch = 2048     // Batch processing size (SAME as context!)

The compute graph must be large enough to handle the worst-case scenario: processing n_batch tokens at once. With n_batch = 2048, it needs to allocate space for:

  • All layer activations for 2048 tokens
  • Attention matrices
  • Intermediate calculations
  • Gradient buffers (even though we're not training)

3. The Inefficiency

Setting n_batch = n_ctx means the system prepares to process the ENTIRE conversation at once. But in reality:

  • During generation, we process 1 token at a time
  • During prompt processing, we rarely need to process 2048 tokens at once; longer prompts can be fed in n_batch-sized chunks (see the sketch below)
  • Most inputs are much smaller than the max context
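
To make the chunking point concrete, here is a minimal sketch of feeding a long prompt in pieces no larger than n_batch; the `decode` closure is a stand-in for whatever evaluation call the wrapper exposes, not a real LLM.swift API:

// Sketch: feed a long prompt to the model in chunks of at most nBatch tokens.
// `decode` is a placeholder for the wrapper's actual evaluation call.
func feedPrompt(_ tokens: [Int32], nBatch: Int, decode: ([Int32]) -> Void) {
    var start = 0
    while start < tokens.count {
        let end = min(start + nBatch, tokens.count)
        decode(Array(tokens[start..<end]))   // peak workspace is bounded by nBatch tokens
        start = end
    }
}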

4. How Batch Size Affects Memory

Typical usage patterns:
- Token generation: n_batch = 1 (processing one token)
- Prompt processing: n_batch = prompt_length (usually < 512)
- Optimal batch: n_batch = 512 (good balance)

Memory comparison:
- n_batch = 2048: ~422MB compute graph
- n_batch = 512:  ~105MB compute graph (75% reduction!)
- n_batch = 256:  ~52MB compute graph (88% reduction!)

5. Why Can't We Just Change It?

LLM.swift hardcodes n_batch = n_ctx in its initialization. We can't change this without:

  • Forking LLM.swift
  • Creating our own llama.cpp wrapper
  • Waiting for LLM.swift to add this feature

6. The Relationship to Token Limits

Since we can't change n_batch, our only option is to reduce n_ctx (maxTokenCount):

If maxTokenCount = 512:
- n_ctx = 512
- n_batch = 512 (because LLM.swift sets them equal)
- Compute graph ≈ 105MB (fits in memory!)
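
In app code, applying that limit is just a matter of passing a smaller maxTokenCount when constructing the model. A minimal sketch, assuming an LLM.swift initializer along these lines (the exact signature, template value, and model file name are illustrative, not confirmed):

import Foundation
import LLM

// Sketch only: initializer shape, template, and file name are assumptions;
// check them against the LLM.swift version you actually use.
let url = Bundle.main.url(forResource: "gemma-2b", withExtension: "gguf")!   // hypothetical bundled model
let bot = LLM(from: url, template: .chatML("You are a helpful assistant."), maxTokenCount: 512)
// Internally this becomes n_ctx = 512 and n_batch = 512, so the compute graph
// shrinks to roughly 105MB instead of 422MB.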

7. Current Workaround

That's why reducing token limits helps:

  • Lower maxTokenCount → Lower n_ctx → Lower n_batch → Smaller compute graph

But this is inefficient because it limits conversation length unnecessarily. Ideally, we'd have:

  • n_ctx = 2048 (long conversations)
  • n_batch = 512 (reasonable processing chunks)

8. The KV Cache vs Compute Graph

Don't confuse these two:

  • KV Cache (208MB): Stores attention keys/values for past tokens
  • Compute Graph (422MB): Workspace for calculations
  • Both are needed, but serve different purposes

9. Why This Matters for iPhone

Your iPhone 16 Pro Max has plenty of RAM, but:

  • iOS is aggressive about memory management
  • Other apps are using memory
  • The 422MB allocation is a single, contiguous block
  • Large allocations are more likely to fail

10. Long-term Solution

The proper fix would be to modify LLM.swift to accept a separate batch size parameter:

// Ideal configuration
contextParams.n_ctx = 2048    // Full context available
contextParams.n_batch = 512   // Process in reasonable chunks

This would give us the best of both worlds: long conversations with reasonable memory usage.
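
As a hypothetical sketch of that change, a patched LLM.swift could build its context parameters roughly like this; maxBatchTokenCount is an invented name, and llama_context_params is the struct LLM.swift already fills in today:

import llama   // the C module re-exported by the LLM.swift package (an assumption about the build setup)

// Hypothetical patched helper: decouple n_batch from n_ctx instead of forcing them equal.
func makeContextParams(maxTokenCount: UInt32, maxBatchTokenCount: UInt32 = 512) -> llama_context_params {
    var params = llama_context_default_params()
    params.n_ctx = maxTokenCount                               // full context window (e.g., 2048)
    params.n_batch = min(maxBatchTokenCount, maxTokenCount)    // smaller compute graph (e.g., 512)
    return params
}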

Solution Options

1. Reduce n_batch (Recommended)

Most efficient - set n_batch to a reasonable value like 512 or 256:

contextParams.n_batch = 512  // Process 512 tokens at a time

This would reduce compute graph memory significantly while still allowing full context.

2. Reduce n_ctx

Already implemented via maxTokenCount, but this limits conversation length.

3. Fork/Patch LLM.swift

Create a custom initialization that allows separate batch size control.

Implementation Approaches

A. Monkey-patch LLM.swift

Create an extension that reinitializes with better parameters.

B. Use Lower maxTokenCount

Current workaround - use smaller context windows.

C. Custom LLM Wrapper

Create our own wrapper around llama.cpp with proper parameters.
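
A minimal sketch of approach C, calling the llama.cpp C API directly so n_ctx and n_batch can differ. Function names follow the long-standing C API (llama_context_default_params, llama_new_context_with_model); newer llama.cpp releases have renamed some of these, so adjust to the version you link against:

import llama

// Sketch: build a context from an already-loaded model with n_ctx and n_batch decoupled.
// `model` is the pointer returned by whichever model-loading call your llama.cpp version exposes.
func makeDecoupledContext(model: OpaquePointer, contextLength: UInt32, batchSize: UInt32) -> OpaquePointer? {
    var params = llama_context_default_params()
    params.n_ctx = contextLength   // e.g., 2048: long conversations stay possible
    params.n_batch = batchSize     // e.g., 512: graph reserved for 512 tokens, not 2048
    return llama_new_context_with_model(model, params)
}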

Memory Calculation

The compute graph memory is roughly:

graph_size ≈ n_batch × n_layers × hidden_size × precision

For Gemma-2B with n_batch=2048:

  • ~422MB for compute graph

With n_batch=512:

  • ~105MB for compute graph (75% reduction!)
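
As a back-of-the-envelope check, assuming the graph scales roughly linearly with n_batch and using the measured 422MB at n_batch = 2048 as the baseline:

// Rough linear-scaling estimate anchored to the measured 422MB at n_batch = 2048.
// Real graph sizes also depend on model architecture, so treat this as a ballpark only.
func estimatedGraphMB(nBatch: Double, baselineMB: Double = 422, baselineBatch: Double = 2048) -> Double {
    baselineMB * nBatch / baselineBatch
}

estimatedGraphMB(nBatch: 512)   // ≈ 105 MB
estimatedGraphMB(nBatch: 256)   // ≈ 53 MB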

Immediate Workaround

Until we can modify LLM.swift, use very aggressive token limits:

  • Critical memory: 256 tokens
  • High memory: 512 tokens
  • Normal: 1024 tokens

This reduces both KV cache AND compute graph requirements.
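
One way to wire those tiers to actual memory pressure is to check the process's headroom with os_proc_available_memory (available on iOS 13+) before creating the context. The byte thresholds below are illustrative, not measured:

import os   // os_proc_available_memory (iOS 13+)

// Sketch: pick a maxTokenCount tier from the headroom iOS currently grants this process.
// The thresholds are illustrative; tune them against real allocation failures.
func recommendedMaxTokenCount() -> Int32 {
    let headroom = os_proc_available_memory()   // bytes this process can still allocate
    switch headroom {
    case ..<(700 << 20):  return 256    // critical: under ~700MB of headroom
    case ..<(1200 << 20): return 512    // high memory pressure
    default:              return 1024   // normal
    }
}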

@shashankb-cc

Nice explanation of this concept, I didn't realize it went this deep. I was trying to clear the cache as a last resort, but that didn't work either. You could raise an issue about this; I would love to see it fixed. Thanks for this detailed explanation, and I'd love to hear more from you.
