The 422MB allocation failure is happening because LLM.swift sets:

```swift
contextParams.n_ctx = UInt32(maxTokenCount) // e.g., 2048
contextParams.n_batch = contextParams.n_ctx // ALSO 2048!
```
This means the compute graph is sized for processing ALL tokens at once, which is wasteful.
From llama.cpp logs:

```
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
ggml_aligned_malloc: insufficient memory (attempted to allocate 422.02 MB)
```
The compute graph size is proportional to `n_batch`. With `n_batch = n_ctx`, it's allocating memory for the worst case of processing the entire context at once.
The compute graph is a pre-allocated memory buffer that llama.cpp uses to store all the intermediate calculations during inference. Think of it as workspace memory where the model does its math.
When LLM.swift creates a context, it sets two critical parameters:

```swift
contextParams.n_ctx = 2048   // Maximum context size
contextParams.n_batch = 2048 // Batch processing size (SAME as context!)
```
The compute graph must be large enough to handle the worst-case scenario: processing `n_batch` tokens at once. With `n_batch = 2048`, it needs to allocate space for:
- All layer activations for 2048 tokens
- Attention matrices
- Intermediate calculations
- Temporary scratch buffers for other intermediate results
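As a rough illustration of how this scales with `n_batch`, consider just the attention-score tensor for a single layer. The numbers below are assumptions for a small model (8 heads, f32 scores), not values taken from the failing run, and real ggml graphs reuse buffers, so this is only a sketch of the scaling, not an exact figure:

```swift
// Rough, hypothetical estimate of ONE intermediate tensor (the attention
// scores) for a single layer. Real ggml graphs reuse buffers, so this only
// illustrates how the requirement scales with n_batch, not an exact figure.
let nCtx = 2_048        // maximum context size
let nHeads = 8          // assumed head count for a small model
let bytesPerFloat = 4   // f32 scores

func attentionScoreBytes(nBatch: Int) -> Int {
    // scores shape: [n_heads, n_batch, n_ctx] in f32
    nHeads * nBatch * nCtx * bytesPerFloat
}

print(attentionScoreBytes(nBatch: 2_048) / (1 << 20)) // ~128 MB
print(attentionScoreBytes(nBatch: 512) / (1 << 20))   // ~32 MB
```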
Setting `n_batch = n_ctx` means the system prepares to process the ENTIRE conversation at once. But in reality:
- During generation, we process 1 token at a time
- During prompt processing, we rarely need to process 2048 tokens at once
- Most inputs are much smaller than the max context
Typical usage patterns:
- Token generation: n_batch = 1 (processing one token)
- Prompt processing: n_batch = prompt_length (usually < 512)
- Optimal batch: n_batch = 512 (good balance)
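To make the prompt-processing pattern above concrete, here is a minimal sketch of feeding a long prompt through in `n_batch`-sized chunks. `decodeChunk` is a hypothetical stand-in for whatever actually evaluates tokens in your wrapper:

```swift
// Sketch: a prompt longer than n_batch is fed to the model in chunks, so the
// compute graph never needs room for more than n_batch tokens at once.
// `decodeChunk` is a hypothetical stand-in for the real evaluation call.
let nBatch = 512

func processPrompt(_ tokens: [Int32], decodeChunk: (ArraySlice<Int32>) -> Void) {
    var start = 0
    while start < tokens.count {
        let end = min(start + nBatch, tokens.count)
        decodeChunk(tokens[start..<end]) // at most n_batch tokens per call
        start = end
    }
}
```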
Memory comparison:
- n_batch = 2048: ~422MB compute graph
- n_batch = 512: ~105MB compute graph (75% reduction!)
- n_batch = 256: ~52MB compute graph (88% reduction!)
LLM.swift hardcodes `n_batch = n_ctx` in its initialization. We can't change this without:
- Forking LLM.swift
- Creating our own llama.cpp wrapper
- Waiting for LLM.swift to add this feature
Since we can't change `n_batch`, our only option is to reduce `n_ctx` (maxTokenCount):
If maxTokenCount = 512:
- n_ctx = 512
- n_batch = 512 (because LLM.swift sets them equal)
- Compute graph ≈ 105MB (fits in memory!)
That's why reducing token limits helps:
- Lower maxTokenCount → Lower n_ctx → Lower n_batch → Smaller compute graph
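In app code the workaround looks roughly like this. The initializer shape is an assumption (parameter names and whether it is failable depend on the LLM.swift version you're on), and the model file name is just a placeholder:

```swift
import Foundation
import LLM

// Workaround: shrink maxTokenCount so the n_batch that LLM.swift derives
// from it shrinks too. Initializer shown is indicative only; check the
// signature in your LLM.swift version.
let modelURL = Bundle.main.url(forResource: "gemma-2b-it-q4_k_m", withExtension: "gguf")!
let bot = LLM(from: modelURL, maxTokenCount: 512) // n_ctx = 512, n_batch = 512, graph ≈ 105MB
```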
But this is inefficient because it limits conversation length unnecessarily. Ideally, we'd have:
- n_ctx = 2048 (long conversations)
- n_batch = 512 (reasonable processing chunks)
Don't confuse these two:
- KV Cache (208MB): Stores attention keys/values for past tokens
- Compute Graph (422MB): Workspace for calculations
Both are needed, but they serve different purposes.
Your iPhone 16 Pro Max has plenty of RAM, but:
- iOS is aggressive about memory management
- Other apps are using memory
- The 422MB allocation is a single, contiguous block
- Large allocations are more likely to fail
The proper fix would be to modify LLM.swift to accept a separate batch size parameter:

```swift
// Ideal configuration
contextParams.n_ctx = 2048   // Full context available
contextParams.n_batch = 512  // Process in reasonable chunks
```
This would give us the best of both worlds: long conversations with reasonable memory usage.
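Sketched below is what that split could look like. `InferenceConfig` and `configure` are hypothetical names, and the module name for the bundled llama.cpp may differ in your project, but `n_ctx` and `n_batch` are real fields on llama.cpp's `llama_context_params`:

```swift
import llama // module name for the bundled llama.cpp may differ in your project

// Hypothetical shape of the fix: context size and batch size configured
// independently instead of n_batch being forced to n_ctx.
struct InferenceConfig {
    var maxTokenCount: UInt32 = 2_048 // n_ctx: how much history the model can attend to
    var maxBatchSize: UInt32 = 512    // n_batch: how many tokens are evaluated per step
}

func configure(_ params: inout llama_context_params, with config: InferenceConfig) {
    params.n_ctx = config.maxTokenCount
    params.n_batch = min(config.maxBatchSize, config.maxTokenCount)
}
```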
Most efficient - set `n_batch` to a reasonable value like 512 or 256:

```swift
contextParams.n_batch = 512 // Process 512 tokens at a time
```
This would reduce compute graph memory significantly while still allowing full context.
Already implemented via `maxTokenCount`, but this limits conversation length.
- Create a custom initialization that allows separate batch size control.
- Create an extension that reinitializes with better parameters.
- Current workaround - use smaller context windows.
- Create our own wrapper around llama.cpp with proper parameters (see the sketch below).
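For the wrapper route, a minimal sketch of the llama.cpp C API calls involved. The function names follow the C API as it has shipped for a while, but some have been renamed in recent releases, so match them to the llama.cpp revision you actually build against; the module name is also an assumption:

```swift
import llama // module name depends on how llama.cpp is integrated

// Minimal sketch of a custom wrapper that sets n_ctx and n_batch independently.
func makeContext(modelPath: String) -> OpaquePointer? {
    let modelParams = llama_model_default_params()
    guard let model = llama_load_model_from_file(modelPath, modelParams) else { return nil }

    var ctxParams = llama_context_default_params()
    ctxParams.n_ctx = 2_048  // full context for long conversations
    ctxParams.n_batch = 512  // smaller compute graph per evaluation step
    return llama_new_context_with_model(model, ctxParams)
}
```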
The compute graph memory is roughly:

```
graph_size ≈ n_batch × n_layers × hidden_size × precision
```
For Gemma-2B:
- n_batch = 2048: ~422MB compute graph
- n_batch = 512: ~105MB compute graph (75% reduction!)
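A sketch of that estimate as code. The layer count and hidden size below are illustrative inputs, and llama.cpp's actual reservation logic is more involved; the point is the linear dependence on `n_batch`:

```swift
// Rough estimate following graph_size ≈ n_batch × n_layers × hidden_size × precision.
// Absolute numbers are ballpark only; what matters is the linear scaling with nBatch.
func estimatedGraphBytes(nBatch: Int, nLayers: Int, hiddenSize: Int, bytesPerElement: Int = 4) -> Int {
    nBatch * nLayers * hiddenSize * bytesPerElement
}

// Whatever the exact layer count and width, cutting n_batch from 2048 to 512
// cuts the estimate by 4x, matching the ~422MB -> ~105MB drop quoted above.
let big = estimatedGraphBytes(nBatch: 2_048, nLayers: 18, hiddenSize: 2_048)
let small = estimatedGraphBytes(nBatch: 512, nLayers: 18, hiddenSize: 2_048)
print(big / small) // 4
```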
Until we can modify LLM.swift, use very aggressive token limits:
- Critical memory: 256 tokens
- High memory: 512 tokens
- Normal: 1024 tokens
This reduces both KV cache AND compute graph requirements.
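A sketch of that interim policy as a helper. The thresholds are illustrative, and how you measure "available memory" is up to you (os_proc_available_memory() on iOS 13+ is one option):

```swift
// Pick maxTokenCount from a memory-pressure estimate. Thresholds are
// illustrative; feed in whatever "available memory" signal your app trusts.
func recommendedMaxTokenCount(availableMemoryMB: Int) -> Int {
    if availableMemoryMB < 300 { return 256 } // critical memory pressure
    if availableMemoryMB < 700 { return 512 } // high memory pressure
    return 1_024                              // normal
}
```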
Nice explanation of that concept, I didn't know it went this deep. I was trying to clear the cache as a last resort, but that didn't work either. You can raise an issue regarding this; I would love to get this fixed. Thanks for this detailed explanation, I would love to hear more from you.