The 422MB allocation failure is happening because LLM.swift sets:

```swift
contextParams.n_ctx = UInt32(maxTokenCount) // e.g., 2048
contextParams.n_batch = contextParams.n_ctx // ALSO 2048!
```
This means the compute graph is sized for processing ALL tokens at once, which is wasteful.
From llama.cpp logs:

```
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
ggml_aligned_malloc: insufficient memory (attempted to allocate 422.02 MB)
```
The compute graph size is proportional to `n_batch`. With `n_batch = n_ctx`, it's allocating memory for the worst case of processing the entire context at once.
The compute graph is a pre-allocated memory buffer that llama.cpp uses to store all the intermediate calculations during inference. Think of it as workspace memory where the model does its math.
When LLM.swift creates a context, it sets two critical parameters:

```swift
contextParams.n_ctx = 2048   // Maximum context size
contextParams.n_batch = 2048 // Batch processing size (SAME as context!)
```
The compute graph must be large enough to handle the worst-case scenario: processing `n_batch` tokens at once. With `n_batch = 2048`, it needs to allocate space for:
- All layer activations for 2048 tokens
- Attention matrices
- Intermediate calculations
- Temporary scratch buffers for other intermediate results
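As a rough illustration of how this scales with `n_batch`, consider just the attention-score tensor for a single layer. The numbers below are assumptions for a small model (8 heads, f32 scores), not values taken from the failing run, and real ggml graphs reuse buffers, so this is only a sketch of the scaling, not an exact figure:

```swift
// Rough, hypothetical estimate of ONE intermediate tensor (the attention
// scores) for a single layer. Real ggml graphs reuse buffers, so this only
// illustrates how the requirement scales with n_batch, not an exact figure.
let nCtx = 2_048        // maximum context size
let nHeads = 8          // assumed head count for a small model
let bytesPerFloat = 4   // f32 scores

func attentionScoreBytes(nBatch: Int) -> Int {
    // scores shape: [n_heads, n_batch, n_ctx] in f32
    nHeads * nBatch * nCtx * bytesPerFloat
}

print(attentionScoreBytes(nBatch: 2_048) / (1 << 20)) // ~128 MB
print(attentionScoreBytes(nBatch: 512) / (1 << 20))   // ~32 MB
```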
Setting `n_batch = n_ctx` means the system prepares to process the ENTIRE conversation at once. But in reality:
- During generation, we process 1 token at a time
- During prompt processing, we rarely need to process 2048 tokens at once
- Most inputs are much smaller than the max context
Typical usage patterns:
- Token generation: n_batch = 1 (processing one token)
- Prompt processing: n_batch = prompt_length (usually < 512)
- Optimal batch: n_batch = 512 (good balance)
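To make the prompt-processing pattern above concrete, here is a minimal sketch of feeding a long prompt through in `n_batch`-sized chunks. `decodeChunk` is a hypothetical stand-in for whatever actually evaluates tokens in your wrapper:

```swift
// Sketch: a prompt longer than n_batch is fed to the model in chunks, so the
// compute graph never needs room for more than n_batch tokens at once.
// `decodeChunk` is a hypothetical stand-in for the real evaluation call.
let nBatch = 512

func processPrompt(_ tokens: [Int32], decodeChunk: (ArraySlice<Int32>) -> Void) {
    var start = 0
    while start < tokens.count {
        let end = min(start + nBatch, tokens.count)
        decodeChunk(tokens[start..<end]) // at most n_batch tokens per call
        start = end
    }
}
```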
Memory comparison:
- n_batch = 2048: ~422MB compute graph
- n_batch = 512: ~105MB compute graph (75% reduction!)
- n_batch = 256: ~52MB compute graph (88% reduction!)
LLM.swift hardcodes `n_batch = n_ctx` in its initialization. We can't change this without:
- Forking LLM.swift
- Creating our own llama.cpp wrapper
- Waiting for LLM.swift to add this feature
Since we can't change `n_batch`, our only option is to reduce `n_ctx` (maxTokenCount):
If maxTokenCount = 512:
- n_ctx = 512
- n_batch = 512 (because LLM.swift sets them equal)
- Compute graph ≈ 105MB (fits in memory!)
That's why reducing token limits helps:
- Lower maxTokenCount → Lower n_ctx → Lower n_batch → Smaller compute graph
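In app code the workaround looks roughly like this. The initializer shape is an assumption (parameter names and whether it is failable depend on the LLM.swift version you're on), and the model file name is just a placeholder:

```swift
import Foundation
import LLM

// Workaround: shrink maxTokenCount so the n_batch that LLM.swift derives
// from it shrinks too. Initializer shown is indicative only; check the
// signature in your LLM.swift version.
let modelURL = Bundle.main.url(forResource: "gemma-2b-it-q4_k_m", withExtension: "gguf")!
let bot = LLM(from: modelURL, maxTokenCount: 512) // n_ctx = 512, n_batch = 512, graph ≈ 105MB
```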
But this is inefficient because it limits conversation length unnecessarily. Ideally, we'd have:
- n_ctx = 2048 (long conversations)
- n_batch = 512 (reasonable processing chunks)
Don't confuse these two:
- KV Cache (208MB): Stores attention keys/values for past tokens
- Compute Graph (422MB): Workspace for calculations
Both are needed, but they serve different purposes.
Your iPhone 16 Pro Max has plenty of RAM, but:
- iOS is aggressive about memory management
- Other apps are using memory
- The 422MB allocation is a single, contiguous block
- Large allocations are more likely to fail
The proper fix would be to modify LLM.swift to accept a separate batch size parameter:

```swift
// Ideal configuration
contextParams.n_ctx = 2048   // Full context available
contextParams.n_batch = 512  // Process in reasonable chunks
```
This would give us the best of both worlds: long conversations with reasonable memory usage.
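Sketched below is what that split could look like. `InferenceConfig` and `configure` are hypothetical names, and the module name for the bundled llama.cpp may differ in your project, but `n_ctx` and `n_batch` are real fields on llama.cpp's `llama_context_params`:

```swift
import llama // module name for the bundled llama.cpp may differ in your project

// Hypothetical shape of the fix: context size and batch size configured
// independently instead of n_batch being forced to n_ctx.
struct InferenceConfig {
    var maxTokenCount: UInt32 = 2_048 // n_ctx: how much history the model can attend to
    var maxBatchSize: UInt32 = 512    // n_batch: how many tokens are evaluated per step
}

func configure(_ params: inout llama_context_params, with config: InferenceConfig) {
    params.n_ctx = config.maxTokenCount
    params.n_batch = min(config.maxBatchSize, config.maxTokenCount)
}
```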
Most efficient - set `n_batch` to a reasonable value like 512 or 256:

```swift
contextParams.n_batch = 512 // Process 512 tokens at a time
```
This would reduce compute graph memory significantly while still allowing full context.
Already implemented via `maxTokenCount`, but this limits conversation length.
- Create a custom initialization that allows separate batch size control.
- Create an extension that reinitializes with better parameters.
- Current workaround - use smaller context windows.
- Create our own wrapper around llama.cpp with proper parameters (see the sketch below).
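For the wrapper route, a minimal sketch of the llama.cpp C API calls involved. The function names follow the C API as it has shipped for a while, but some have been renamed in recent releases, so match them to the llama.cpp revision you actually build against; the module name is also an assumption:

```swift
import llama // module name depends on how llama.cpp is integrated

// Minimal sketch of a custom wrapper that sets n_ctx and n_batch independently.
func makeContext(modelPath: String) -> OpaquePointer? {
    let modelParams = llama_model_default_params()
    guard let model = llama_load_model_from_file(modelPath, modelParams) else { return nil }

    var ctxParams = llama_context_default_params()
    ctxParams.n_ctx = 2_048  // full context for long conversations
    ctxParams.n_batch = 512  // smaller compute graph per evaluation step
    return llama_new_context_with_model(model, ctxParams)
}
```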
The compute graph memory is roughly:

```
graph_size ≈ n_batch × n_layers × hidden_size × precision
```
For Gemma-2B:
- n_batch = 2048: ~422MB compute graph
- n_batch = 512: ~105MB compute graph (75% reduction!)
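A sketch of that estimate as code. The layer count and hidden size below are illustrative inputs, and llama.cpp's actual reservation logic is more involved; the point is the linear dependence on `n_batch`:

```swift
// Rough estimate following graph_size ≈ n_batch × n_layers × hidden_size × precision.
// Absolute numbers are ballpark only; what matters is the linear scaling with nBatch.
func estimatedGraphBytes(nBatch: Int, nLayers: Int, hiddenSize: Int, bytesPerElement: Int = 4) -> Int {
    nBatch * nLayers * hiddenSize * bytesPerElement
}

// Whatever the exact layer count and width, cutting n_batch from 2048 to 512
// cuts the estimate by 4x, matching the ~422MB -> ~105MB drop quoted above.
let big = estimatedGraphBytes(nBatch: 2_048, nLayers: 18, hiddenSize: 2_048)
let small = estimatedGraphBytes(nBatch: 512, nLayers: 18, hiddenSize: 2_048)
print(big / small) // 4
```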
Until we can modify LLM.swift, use very aggressive token limits:
- Critical memory: 256 tokens
- High memory: 512 tokens
- Normal: 1024 tokens
This reduces both KV cache AND compute graph requirements.
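A sketch of that interim policy as a helper. The thresholds are illustrative, and how you measure "available memory" is up to you (os_proc_available_memory() on iOS 13+ is one option):

```swift
// Pick maxTokenCount from a memory-pressure estimate. Thresholds are
// illustrative; feed in whatever "available memory" signal your app trusts.
func recommendedMaxTokenCount(availableMemoryMB: Int) -> Int {
    if availableMemoryMB < 300 { return 256 } // critical memory pressure
    if availableMemoryMB < 700 { return 512 } // high memory pressure
    return 1_024                              // normal
}
```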
Nice explanation of that concept, I didn't know it went this deep. I was trying to clear the cache as a last resort, but that didn't work either. You can raise an issue regarding this; I would love to get this fixed. Thanks for this detailed explanation, I would love to hear more from you.