Mark Galyan apollo-mg

@apollo-mg
apollo-mg / L3_Spine_Gatekeeper_PoC.md
Created March 22, 2026 03:28
Sovereign AI Architecture: The 'L3 Spine' Gatekeeper

Sovereign AI Architecture: The "L3 Spine" Gatekeeper

Executive Summary

In complex, localized AI architectures (like Project Apollo's multi-agent swarm), utilizing massive GPU VRAM for simple intent routing is computationally inefficient. This proof-of-concept demonstrates an air-gapped, zero-VRAM "Gatekeeper" node by pinning a hyper-quantized 135M parameter LLM strictly to a CPU's L3 V-Cache.

By leveraging native Linux CPU pinning (taskset) and rigorous grammar constraints (GBNF), we achieve deterministic, zero-hallucination JSON output at GPU-like throughput (~136 tokens per second) while leaving the primary accelerator (RX 9070 XT) completely untouched.

Core Technical Concept

  1. Model: SmolLM2-135M-Instruct-Q4_K_M (~60MB working footprint).
  2. Hardware: AMD Ryzen 7 5700X3D (96MB L3 Cache).
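A minimal sketch of how such a pinned, grammar-constrained gatekeeper might be launched with llama.cpp (core list, file names, and prompt are illustrative assumptions, not the PoC's exact configuration):

```bash
# Hypothetical launch sketch. taskset pins the process to a fixed core set
# so the ~60MB working set stays hot in that CCD's L3 cache; the GBNF
# grammar forces the model to emit only tokens that parse as routing JSON.
taskset -c 0-7 ./llama-cli \
  -m SmolLM2-135M-Instruct-Q4_K_M.gguf \
  -t 8 \
  --grammar-file intent.gbnf \
  -p 'Route this request: "turn off the kitchen lights"'
```

Grammar constraints work by masking invalid tokens at sampling time, which is what makes the output deterministic in shape even from a 135M-parameter model.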
@apollo-mg
apollo-mg / MODEL_TEST_LAB.md
Created March 10, 2026 03:36
Project Apollo: RDNA 4 Model Test Lab & Failure Modes

🧪 SOVEREIGN MODEL TEST LAB

Objective: Document empirical test data, failure modes, and VRAM footprints to determine the optimal model for specific agent workflows on the RX 9070 XT.

๐Ÿ† Current Resident Kings

1. The Logic Core: qwen3.5-9b-heretic:4bit

  • Architecture: Qwen 3.5 9B (Dense)
  • Format: GGUF Q4_K_M (Ollama)
  • VRAM Footprint: ~5.5 GB (Leaves room for ~100k context)
  • Speed (GFX1201): ~50-60 tokens/sec
@apollo-mg
apollo-mg / RDNA4_MASTER_LIST.md
Last active March 10, 2026 03:09
RDNA 4 (GFX1201) Poachers Reproduction Guide: Native High-Speed Vision

🛸 RDNA 4 (GFX1201) AI MASTER LIST

Last Updated: March 9, 2026 | Environment: ROCm 7.2 / Poachers Special Ed (PyTorch 2.9.1 / Triton 3.5.1)

🟢 1. THE "GREEN ZONE" (Verified Working Bare-Metal)

  • Flash Linear Attention (FLA): ASCENDED. Liberated from Docker; running bare-metal via Triton kernels.
  • 4-bit Resident Vision: CONFIRMED. Qwen 3.5 4B running in 4.7GB VRAM with ~27-40s prefill.
  • Dual-Core Residency: VERIFIED. Logic (DeepSeek-R1 14B @ 51 tok/s) and Vision (Qwen 3.5 4B) running simultaneously in 16GB VRAM.
  • Triton 3.5.1 + PyTorch 2.9.1: Stable native pairing for GFX1201.
  • Unsloth 4-bit Native: Works perfectly once vLLM/CUDA dependency checks are bypassed.
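The "dependency checks are bypassed" item usually comes down to import-time stubbing. A generic illustration of the pattern (the module name `cuda_only_dep` is purely hypothetical, not Unsloth's or vLLM's actual check):

```python
import sys
import types

# Some libraries hard-fail at import time if an optional CUDA-only
# dependency is missing. Pre-registering a stub module in sys.modules
# satisfies both the import and any __version__ check without
# installing the real package.
stub = types.ModuleType("cuda_only_dep")  # hypothetical name
stub.__version__ = "0.0.0"
sys.modules["cuda_only_dep"] = stub

import cuda_only_dep  # resolves to the stub; no ImportError

print(cuda_only_dep.__version__)
```

Whether this is safe depends on the library: it only works when the gated code paths are never actually exercised on the AMD side.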
@apollo-mg
apollo-mg / causal_conv1d_postmortem.md
Created March 9, 2026 20:07
Technical post-mortem: Causal-Conv1d installer failure on native RDNA 4 (GFX1201)

The NVCC Trap: Why Causal-Conv1d Fails on Native RDNA 4 (and how to bypass it)

Date: March 9, 2026
Hardware: AMD Radeon RX 9070 XT (gfx1201)
Software: ROCm 7.2.0, PyTorch 2.12.0 (Nightly)

The Problem

As of early 2026, many frontier models (like Qwen 3.5 Unified Vision and Mamba-2) rely on `causal-conv1d`. On AMD hardware, attempting to install this package results in immediate failure, forcing the model into a "slow-path" fallback that pulls up to 320W and incurs heavy CPU overhead for simple vision tasks.
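For context on what the fused kernel actually computes: causal 1-D convolution is just a convolution in which output t may only see inputs at positions ≤ t. A pure-Python reference sketch of those semantics (illustrative; not the package's fallback code):

```python
def causal_conv1d(x, w):
    """y[t] = sum_i w[i] * x[t - i], with x[t < 0] treated as 0,
    so each output depends only on current and past inputs."""
    k = len(w)
    return [
        sum(w[i] * x[t - i] for i in range(k) if t - i >= 0)
        for t in range(len(x))
    ]

print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))
# -> [0.5, 1.5, 2.5, 3.5]
```

The fused CUDA/HIP kernel exists because running this naively per channel over long sequences is memory-bound, which is why the slow-path fallback is so expensive.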

The Forensic Breakdown

During a live engineering session on an RDNA 4 rig, we identified three fatal layers of hardcoding in the `dao-ailab/causal-conv1d` (v1.6.0) installer:

@apollo-mg
apollo-mg / rdna4_tilelang_postmortem.md
Created March 9, 2026 16:05
Technical post-mortem: TileLang kernel forging failures on RDNA 4 (GFX1201)

RDNA 4 (GFX1201) Technical Post-Mortem: TileLang Kernel Forging & The Wave32 Barrier

Executive Summary

This gist documents the first known attempt to use TileLang for custom kernel forging on RDNA 4 hardware (specifically the AMD Radeon RX 9070 XT, gfx1201). While TileLang is a powerful "Blacksmith's Kit" for AMD Instinct (CDNA) hardware, our research reveals critical architectural barriers when targeting consumer RDNA 4 cards.

๐Ÿ The Success: General Purpose Compute

We successfully compiled and executed a custom "Buffer Copy" smoke test kernel on the RX 9070 XT using TileLang's JIT backend and ROCm 7.2.

Key Finding: The core TileLang compiler and ROCm JIT pipeline are functional for standard memory operations and non-matrix compute on RDNA 4.

@apollo-mg
apollo-mg / RDNA4_ROCm7.2_Build_Guide.md
Created March 8, 2026 07:23
The RDNA 4 (RX 9070 XT) PyTorch & vLLM Build Guide

🚀 The RDNA 4 (RX 9070 XT) PyTorch & vLLM Build Guide

โš ๏ธ ALPHA / EXPERIMENTAL RELEASE This guide outlines a "bleeding-edge" bare-metal compilation process for the AMD Radeon RX 9070 XT (GFX1201) using ROCm 7.2. These patches bypass undocumented compiler strictness changes and API mismatches between PyTorch, vLLM, and HuggingFace. It is provided "as-is" for the community. Use at your own risk.

If you own an AMD Radeon RX 9070 XT and want to run native local AI, you cannot use standard PyTorch binaries or Docker containers. You must compile from source against ROCm 7.2 using the gfx1201 architecture flag.
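In outline, that source build is driven by a couple of environment variables plus PyTorch's HIPify step. A rough sketch, run inside a PyTorch checkout (versions and flags here are an assumption; the exact patched procedure is what the phases below document):

```bash
# Illustrative outline of a gfx1201-targeted source build.
export PYTORCH_ROCM_ARCH=gfx1201      # build kernels for RDNA 4 only
export USE_ROCM=1                     # select the ROCm/HIP backend
python tools/amd_build/build_amd.py   # HIPify the CUDA sources in-tree
python setup.py develop               # compile and install the build
```

Restricting `PYTORCH_ROCM_ARCH` to a single architecture also keeps compile times and binary size down compared with a multi-target build.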

This guide contains the exact surgical patches required to bypass the bleeding-edge compiler errors.

Phase 1: PyTorch 2.4.0 Compilation

@apollo-mg
apollo-mg / RDNA4_Voice_Cloning_Guide.md
Created November 28, 2025 02:30
RDNA 4 Voice Cloning Guide (GPT-SoVITS & F5-TTS)

AI Voice Cloning on AMD RDNA 4 (RX 9070 XT): The "Masochist's Guide" to Success

Date: November 2025
Hardware: AMD Radeon RX 9070 XT (16GB VRAM)
OS: Ubuntu 22.04 LTS (ROCm 7.1 Preview)
Objective: Train commercial-grade voice clones (GPT-SoVITS & F5-TTS) without CUDA.


1. The Core Stability Fixes (The "Secret Sauce")

@apollo-mg
apollo-mg / RDNA4_WAN2.1_GUIDE.md
Created November 24, 2025 19:44
Stability Guide for Wan 2.1 on AMD RDNA 4 (RX 9070 XT)

Running Wan 2.1 (14B) on AMD RDNA 4 (RX 9070 XT) - Stability Guide

Hardware: AMD Radeon RX 9070 XT (16GB VRAM)
OS: Ubuntu 22.04 / Linux
ROCm: 7.0 / 7.1 Preview
Goal: Stable text-to-video generation with Wan 2.1 (14B) without crashing or OOM.

The Problem

Running Wan 2.1 on RDNA 4 currently causes frequent `HIP error: illegal memory access` crashes or immediate OOMs during VAE decoding. This is due to kernel conflicts with PyTorch TunableOp and memory fragmentation.
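A hedged sketch of the kind of environment mitigation this diagnosis points at (verify against the guide's actual fixes; support for these variables varies by PyTorch/ROCm version):

```bash
# Take TunableOp kernel selection out of the equation:
export PYTORCH_TUNABLEOP_ENABLED=0
# Reduce allocator fragmentation during VAE decode:
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
```

If crashes stop with TunableOp disabled, that isolates the kernel-conflict half of the problem from the fragmentation half.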

@apollo-mg
apollo-mg / RDNA4_ROCm7_Guide.md
Last active March 31, 2026 21:28
Guide for running AI on AMD RDNA 4 GPUs with ROCm 7.1 on Linux (Updated with Performance Tuning)

Running AI on AMD RDNA 4 (RX 9000 Series): The ROCm 7.1 & Flash Attention Survival Guide

Hardware: AMD Radeon RX 9070 XT (gfx1201)
OS: Ubuntu 22.04 LTS
Software: ROCm 7.1.0, PyTorch 2.8 (Nightly/Custom)

Getting bleeding-edge AMD hardware to play nice with AI workflows often feels like solving a puzzle. If you picked up an RDNA 4 card (like the RX 9070 XT) and tried to install standard AI libraries, you likely hit walls of C++ and assembly errors.

Here is the breakdown of why it fails, and the specific strategies to get Flash Attention 2, ComfyUI, and Flux.1 running natively.
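Before debugging any of that, a quick sanity check that your PyTorch build actually sees the card can save hours (this invocation is an addition for illustration and assumes a ROCm-enabled torch install):

```bash
python3 -c "import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```

`torch.version.hip` prints `None` on CPU-only or CUDA wheels, which immediately tells you the wrong build is installed before you chase kernel errors.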