In complex, localized AI architectures (like Project Apollo's multi-agent swarm), spending massive GPU VRAM on simple intent routing is computationally wasteful. This proof-of-concept demonstrates an air-gapped, zero-VRAM "Gatekeeper" node that pins a hyper-quantized 135M-parameter LLM entirely within a CPU's 3D V-Cache-backed L3 cache.
By combining native Linux CPU pinning (`taskset`) with strict grammar constraints (GBNF), we achieve deterministic, schema-valid JSON output at GPU-like throughput (~136 tokens per second) while leaving the primary accelerator (RX 9070 XT) completely idle.
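The pinning setup can be sketched as a shell snippet. The `llama-cli` flags and filenames below are assumptions about a typical llama.cpp build, not the exact launch command used here; the runnable part only verifies that `taskset` pinning takes effect on the host.

```shell
# Sketch of the launch command (assumed llama.cpp flags/filenames):
#   taskset -c 0-7 ./llama-cli \
#       -m SmolLM2-135M-Instruct-Q4_K_M.gguf \
#       -t 8 -ngl 0 \
#       --grammar-file json.gbnf \
#       -p "<routing prompt>"
# -t 8 matches one worker thread per pinned core; -ngl 0 offloads
# zero layers to the GPU, so VRAM usage stays at zero.

# Runnable sanity check: pin a child shell to core 0 and read back
# its affinity list to confirm the pin took effect.
taskset -c 0 sh -c 'taskset -cp $$'
```

Pinning to the cores on the V-Cache CCD keeps the ~60MB model resident in the 96MB L3, which is what makes the CPU path competitive on throughput.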
- Model: SmolLM2-135M-Instruct-Q4_K_M (~60MB working footprint).
- Hardware: AMD Ryzen 7 5700X3D (96MB L3 cache).
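A grammar along these lines (a hypothetical sketch, not the project's exact grammar) is enough to force the router's output into a closed JSON schema; the `intent` values are placeholders:

```gbnf
# Illustrative GBNF: the model may only emit a JSON object whose
# single "intent" key takes one of three fixed string values.
root   ::= "{" ws "\"intent\"" ws ":" ws intent ws "}"
intent ::= "\"lights_on\"" | "\"lights_off\"" | "\"unknown\""
ws     ::= [ \t\n]*
```

Because decoding is constrained token-by-token to this grammar, malformed JSON is impossible by construction, which is what "deterministic" means in this context.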