In complex, localized AI architectures (like Project Apollo's multi-agent swarm), spending massive GPU VRAM on simple intent routing is computationally wasteful. This proof-of-concept demonstrates an air-gapped, zero-VRAM "Gatekeeper" node that pins a hyper-quantized 135M-parameter LLM entirely within a CPU's 3D V-Cache-backed L3 cache.
By combining native Linux CPU pinning (`taskset`) with strict grammar constraints (GBNF), we achieve deterministic, schema-valid JSON output at GPU-like throughput (~136 tokens per second) while leaving the primary accelerator (RX 9070 XT) completely idle.
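The pinning setup can be sketched as a shell snippet. The `llama-cli` flags and filenames below are assumptions about a typical llama.cpp build, not the exact launch command used here; the runnable part only verifies that `taskset` pinning takes effect on the host.

```shell
# Sketch of the launch command (assumed llama.cpp flags/filenames):
#   taskset -c 0-7 ./llama-cli \
#       -m SmolLM2-135M-Instruct-Q4_K_M.gguf \
#       -t 8 -ngl 0 \
#       --grammar-file json.gbnf \
#       -p "<routing prompt>"
# -t 8 matches one worker thread per pinned core; -ngl 0 offloads
# zero layers to the GPU, so VRAM usage stays at zero.

# Runnable sanity check: pin a child shell to core 0 and read back
# its affinity list to confirm the pin took effect.
taskset -c 0 sh -c 'taskset -cp $$'
```

Pinning to the cores on the V-Cache CCD keeps the ~60MB model resident in the 96MB L3, which is what makes the CPU path competitive on throughput.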
- Model: SmolLM2-135M-Instruct-Q4_K_M (~60MB working footprint).
- Hardware: AMD Ryzen 7 5700X3D (96MB L3 cache).
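A grammar along these lines (a hypothetical sketch, not the project's exact grammar) is enough to force the router's output into a closed JSON schema; the `intent` values are placeholders:

```gbnf
# Illustrative GBNF: the model may only emit a JSON object whose
# single "intent" key takes one of three fixed string values.
root   ::= "{" ws "\"intent\"" ws ":" ws intent ws "}"
intent ::= "\"lights_on\"" | "\"lights_off\"" | "\"unknown\""
ws     ::= [ \t\n]*
```

Because decoding is constrained token-by-token to this grammar, malformed JSON is impossible by construction, which is what "deterministic" means in this context.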