In complex, localized AI architectures (like Project Apollo's multi-agent swarm), burning massive GPU VRAM on simple intent routing is computationally wasteful. This proof-of-concept demonstrates an air-gapped, zero-VRAM "Gatekeeper" node by pinning a hyper-quantized 135M-parameter LLM strictly to a CPU's L3 cache (3D V-Cache).
By combining native Linux CPU pinning (taskset) with rigorous grammar constraints (GBNF), we achieve deterministic, schema-conformant JSON output at GPU-like throughput (~136 tokens per second) while leaving the primary accelerator (RX 9070 XT) completely untouched.
- Model: SmolLM2-135M-Instruct-Q4_K_M (~60 MB working footprint)
- Hardware: AMD Ryzen 7 5700X3D (96 MB L3 cache)
- Constraint: Because the model's working footprint is smaller than the CPU's L3 cache, the weights can stay cache-resident, largely bypassing DDR4 RAM latency and keeping inference in the fastest tier of CPU memory.
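The cache-residency argument above can be sketched as a quick sanity check. The helper name and the 80% headroom factor (reserving L3 space for the KV cache and scratch buffers) are our own assumptions; the sizes are the figures quoted above.

```python
# Illustrative check: does the quantized model plausibly fit in L3?
# `fits_in_l3` and the 0.8 headroom factor are assumptions, not part of
# any tooling; the 60 MB / 96 MB figures come from the write-up above.

def fits_in_l3(model_bytes: int, l3_bytes: int, headroom: float = 0.8) -> bool:
    """True if the model working set fits within a fraction of L3,
    leaving room for KV cache and scratch buffers."""
    return model_bytes <= l3_bytes * headroom

MODEL_MB = 60   # ~60 MB Q4_K_M working footprint
L3_MB = 96      # Ryzen 7 5700X3D L3 (3D V-Cache)

print(fits_in_l3(MODEL_MB * 2**20, L3_MB * 2**20))  # True: 60 MB <= 76.8 MB
```

A model that barely fits the raw cache size would still thrash once the KV cache grows, which is why the headroom margin matters for a long-running router.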
We bind the inference engine (llama.cpp) to exactly two logical cores (core 0 and core 2) using taskset. This prevents OS scheduler thrashing, keeps cache hits dense, and reserves the remaining 14 logical cores for the host OS and other swarm sub-agents.
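Before pinning, it is worth confirming which logical CPUs actually share the L3 domain on your machine, since core numbering varies across kernels and SMT layouts. A minimal sketch, assuming a Linux sysfs layout where `index3` is the L3 cache; `parse_cpulist` and `l3_siblings` are our own helper names:

```python
# Sketch: discover which logical CPUs share an L3 slice, so the cores
# passed to `taskset -c` are known to sit in the same cache domain.
# Sysfs paths are Linux-specific; helper names are ours.
from pathlib import Path

def parse_cpulist(s: str) -> set[int]:
    """Expand a kernel cpulist string like '0-3,8,10-11' into CPU ids."""
    cpus: set[int] = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def l3_siblings(cpu: int = 0) -> set[int]:
    """Logical CPUs sharing the L3 cache with `cpu` (index3 is usually L3)."""
    p = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache/index3/shared_cpu_list")
    return parse_cpulist(p.read_text())

if __name__ == "__main__":
    print(sorted(parse_cpulist("0-3,8")))  # [0, 1, 2, 3, 8]
```

On a single-CCD part like the 5700X3D all 16 logical CPUs share one L3, so the main concern is simply avoiding putting both pinned threads on the SMT siblings of a single physical core.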
```bash
taskset -c 0,2 llama-cli -m models/smollm2-135m-q4_k_m.gguf \
  -t 2 \
  --prompt "User: Turn off the shop lights. Intent:" \
  --grammar-file intent.gbnf
```

Small models (<1B parameters) are prone to hallucinating conversational filler ("I am an AI, here is your intent..."). We reject non-compliant output at the sampler level by constraining generation with a GBNF grammar, so tokens outside the schema can never be emitted.
```
# intent.gbnf
root ::= "{" ws "\"intent\"" ws ":" ws "\"" intenttype "\"" ws "}"
intenttype ::= "THINK" | "COMMIT" | "REACT" | "REJECT"
ws ::= [ \t\n]*
```
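On the consuming side, a belt-and-braces validator can mirror the grammar: even though GBNF-constrained sampling guarantees structural conformance, a downstream parser should still treat the gatekeeper's output defensively. A minimal sketch; `parse_intent` and `ALLOWED_INTENTS` are our own names:

```python
# Illustrative consumer-side check mirroring intent.gbnf.
# Names here are assumptions, not part of llama.cpp.
import json

ALLOWED_INTENTS = {"THINK", "COMMIT", "REACT", "REJECT"}  # mirrors intent.gbnf

def parse_intent(raw: str) -> str:
    """Parse the gatekeeper's JSON and return the intent, or raise ValueError."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"non-JSON output: {e}") from e
    intent = obj.get("intent")
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"intent {intent!r} not in schema")
    return intent

print(parse_intent('{ "intent": "REACT" }'))  # REACT
```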
The node bypassed the model's conversational tuning entirely and produced perfectly structured JSON at high speed, strictly bound to CPU cache.
- VRAM Usage: 0.0 GB (No GPU utilization)
- System RAM Bandwidth: Negligible (Cache-resident)
- Throughput: ~136 Generation Tokens Per Second (on 2 Cores)
- Output Consistency: 100% adherence to defined JSON schema
```json
{
  "intent": "REACT"
}
```

This "L3 Spine" serves as an ultra-fast, zero-cost triage router. It intercepts all incoming voice/text payloads, classifies the intent, and only awakens the heavy 35B/14B specialist models (and spins up the GPU) if complex reasoning is actually required.
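The triage flow can be sketched as a small dispatcher. Everything here is hypothetical scaffolding: the callable names, and the assumption that only the THINK intent warrants waking a heavy specialist, are ours, not part of the original system.

```python
# Routing sketch (hypothetical names): the L3-resident gatekeeper triages
# every payload; GPU specialists are only invoked for complex intents.
# The choice of which intents count as "heavy" is an assumption.

HEAVY_INTENTS = {"THINK"}

def route(payload: str, classify, heavy_model, fast_handler):
    """classify: gatekeeper -> intent string;
    heavy_model / fast_handler: callables for the two paths."""
    intent = classify(payload)
    if intent in HEAVY_INTENTS:
        return heavy_model(payload)       # spin up the GPU specialist
    return fast_handler(payload, intent)  # stay on CPU, zero VRAM

# Toy usage with stubbed callables:
result = route(
    "Turn off the shop lights.",
    classify=lambda p: "REACT",
    heavy_model=lambda p: "GPU",
    fast_handler=lambda p, i: f"CPU:{i}",
)
print(result)  # CPU:REACT
```

The key property is that the expensive path is unreachable unless the cheap classifier explicitly demands it, which is what keeps VRAM at zero for routine traffic.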
This bare-metal, physics-first approach to multi-agent deployment is essential for running robust, autonomous AI swarms on edge or consumer hardware without risking catastrophic VRAM Out-of-Memory (OOM) failures.