Best Single GPU to Beat the 6x Mac Mini M4 Pro Cluster

To "beat" your 6-node Mac Mini M4 Pro cluster (384GB total unified memory, ~4 tokens/second for 4-bit quantized DeepSeek-V3 671B inference, based on scaling from 8-node benchmarks), we're targeting a single GPU that delivers higher inference speed (e.g., >4 t/s for the full 671B model) while keeping costs reasonable for a local setup. DeepSeek-V3's MoE architecture (only 37B active params/token) helps, but the model's ~386GB 4-bit footprint means no consumer GPU can load it fully in VRAM alone—you'll need CPU offloading (e.g., via llama.cpp or vLLM with 128GB+ system RAM). This hybrid approach is common and works well for interactive use.

The NVIDIA RTX 4090 (24GB VRAM) is the best single-GPU alternative. It outperforms the cluster in raw speed for DeepSeek-V3 (5-15 t/s with optimizations vs. your ~4 t/s) and costs far less ($1,600 vs. $12-15K for the cluster), though its ~450W peak draw is in the same range as, or above, the ~300-400W total for the 6 Minis. It's consumer-grade, widely supported (CUDA ecosystem), and excels in benchmarks for quantized LLMs. Here's why it edges out the cluster and the other options:

Key Performance Comparison

All figures assume 4-bit quantization and batch=1 inference (typical for interactive chat use). Speeds vary by framework (e.g., TensorRT-LLM or SGLang for maximum gains) and by host system (pair the GPU with a high-RAM CPU platform such as a Ryzen 9 7950X with 128GB+ DDR5 for offloading).

| Setup | Est. Tokens/Second (DeepSeek-V3 671B, 4-bit) | VRAM/Memory Used | Power Draw | Cost (Approx., 2025) | Notes |
| --- | --- | --- | --- | --- | --- |
| 6x Mac Mini M4 Pro (64GB each) | ~4 t/s | 384GB unified (distributed) | 300-400W total | $12-15K | Efficient and quiet; great for privacy/eco setups, but setup complexity and interconnect bandwidth limit scaling. MoE shines on unified memory. |
| NVIDIA RTX 4090 (single) | 5-15 t/s | 24GB VRAM + ~362GB system RAM (offload) | 450W peak | $1,600 (GPU) + $1-2K PC | Beats the cluster: 1.25-3.75x faster decode and lower TTFT (~1-2s). Use with 128GB+ RAM for smooth offload. Excels with TensorRT-LLM/SGLang. |
| NVIDIA RTX 5090 (32GB) | 7-20 t/s (est.) | 32GB VRAM + ~354GB system RAM | 600W peak | $2,000+ (GPU) | Newer-generation upgrade if you can find one; ~30-50% faster than the 4090 but pricier and hotter. |
| NVIDIA RTX 6000 Ada (48GB) | 6-12 t/s | 48GB VRAM + ~338GB system RAM | 300W | $6,000+ (GPU) | Workstation alternative; better for pro use but overkill/expensive vs. the 4090. |
| AMD RX 7900 XTX (24GB) | 3-8 t/s | 24GB VRAM + offload | 355W | $1,000 | Close contender, but ROCm support lags CUDA for LLMs; ~20-30% slower than the 4090. |
  • Why RTX 4090 Wins: Benchmarks show it handling large quantized MoE models like DeepSeek-V3 at usable speeds with offload (e.g., ~14 t/s decode in hybrid CPU/GPU setups). Its ~1 TB/s memory bandwidth far exceeds each Mini's per-node limit, reducing bottlenecks in the matrix ops; a rough roofline estimate of this appears after this list. For smaller DeepSeek variants (e.g., 33B), it hits 40-60 t/s (overkill, but future-proof). Versus the cluster, it's also simpler (no multi-node sync) and fits in a single desktop.
  • Limitations: Offload adds ~10-20% overhead vs. full VRAM load (impossible here). For >15 t/s, you'd need datacenter GPUs like H100 (80GB, $25K+, 20-30 t/s but not "single consumer"). Power/heat is higher, so good cooling matters.
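For the bandwidth point above, a back-of-envelope roofline estimate (every figure here is a ballpark assumption, not a measurement) shows why offload caps decode speed well below an all-VRAM run:

```python
# Rough roofline estimate for memory-bound MoE decode at batch=1.
# Every constant here is an approximate, assumed value.
ACTIVE_PARAMS = 37e9      # DeepSeek-V3 active parameters per token (MoE)
BYTES_PER_PARAM = 0.5     # 4-bit quantization ~= 0.5 bytes per parameter
GPU_BW = 1.0e12           # RTX 4090 memory bandwidth, ~1 TB/s
RAM_BW = 90e9             # dual-channel DDR5 system RAM, ~90 GB/s (assumed)

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM  # ~18.5 GB read per token

print(f"All-VRAM bound:       {GPU_BW / bytes_per_token:.0f} t/s")   # ~54 t/s
print(f"All-system-RAM bound: {RAM_BW / bytes_per_token:.1f} t/s")   # ~4.9 t/s
# A hybrid offload run lands between these bounds, weighted by how much of
# each token's active weights happen to sit in VRAM vs. system RAM, which
# is consistent with the 5-15 t/s range quoted above.
```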

Setup Recommendations

  • Full Build: RTX 4090 + AMD Ryzen 9 (16+ cores) + 128-256GB DDR5 RAM + 2TB NVMe SSD. Total: ~$3-4K. Use Ubuntu or Windows with CUDA 12.4+. Note that the full 4-bit 671B quant wants roughly 362GB of system RAM beyond VRAM, so with 128-256GB you'd either run a lower-bit quant or let llama.cpp mmap-stream the remainder from the NVMe SSD (slower).
  • Software:
    • llama.cpp for easy hybrid inference (seamless CPU offload).
    • vLLM or SGLang for optimized serving (FP8/INT4 support boosts to 10+ t/s).
    • Quantization: grab a GGUF build of DeepSeek-V3 from Hugging Face (or convert and quantize one yourself with llama.cpp's tooling).
  • Test It: Start with a smaller model (e.g., DeepSeek-Coder 7B) to benchmark the box; expect 100+ t/s on the 4090. A minimal benchmark sketch follows this list.
  • If Budget Is Tighter (~$1K): Go with the AMD RX 7900 XTX for roughly 80% of the 4090's speed at lower cost, but expect ROCm ecosystem friction compared with CUDA.
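Here is a minimal vLLM sketch for that smaller-model sanity benchmark. The Hugging Face model ID, prompt, and batch size are illustrative choices, not part of the original recipe:

```python
# Quick throughput sanity check with vLLM on a small DeepSeek model
# that fits entirely in 24GB of VRAM (model ID and prompts are examples).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-instruct")
params = SamplingParams(max_tokens=256, temperature=0.0)

prompts = ["Write a Python function that reverses a linked list."] * 8
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"~{generated / elapsed:.0f} tokens/s aggregate")
```

If those numbers look healthy, scale up to the full 671B GGUF with llama.cpp and the offload settings sketched earlier.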

This swaps cluster hassle for single-GPU simplicity while boosting speed—ideal for local DeepSeek runs.
