To "beat" your 6-node Mac Mini M4 Pro cluster (384GB total unified memory, ~4 tokens/second for 4-bit quantized DeepSeek-V3 671B inference, based on scaling from 8-node benchmarks), we're targeting a single GPU that delivers higher inference speed (e.g., >4 t/s for the full 671B model) while keeping costs reasonable for a local setup. DeepSeek-V3's MoE architecture (only 37B active params/token) helps, but the model's ~386GB 4-bit footprint means no consumer GPU can load it fully in VRAM alone—you'll need CPU offloading (e.g., via llama.cpp or vLLM with 128GB+ system RAM). This hybrid approach is common and works well for interactive use.
The NVIDIA RTX 4090 (24GB VRAM) is the clear winner as the best single-GPU alternative. It outperforms the cluster in raw speed for DeepSeek-V3 (5-15 t/s with MoE-aware offloading, vs. your ~4 t/s), costs far less ($1,600 vs. $12-15K for the cluster), and draws comparable power (~450W peak for the card alone vs. ~300-400W total for the 6 Minis).
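To see why those throughput numbers are plausible, here's a rough bandwidth-bound estimate. Every figure below (DDR5 bandwidth, VRAM-resident fraction) is an illustrative assumption, not a measurement; only the 37B active-parameter count and the 4090's ~1 TB/s memory bandwidth come from public specs:

```python
# Back-of-envelope throughput estimate for MoE hybrid inference.
# Bandwidth and split figures are illustrative assumptions, not measurements.
active_params = 37e9       # DeepSeek-V3 active parameters per token
bytes_per_param = 0.5      # 4-bit quantization
bytes_per_token = active_params * bytes_per_param  # ~18.5 GB read per token

ddr5_bw = 80e9             # ~80 GB/s dual-channel DDR5 (assumed)
gpu_bw = 1008e9            # RTX 4090 memory bandwidth, ~1 TB/s
vram_fraction = 0.25       # assumed share of active weights resident in VRAM

# Per-token latency is dominated by streaming active weights from each pool.
t_cpu = bytes_per_token * (1 - vram_fraction) / ddr5_bw
t_gpu = bytes_per_token * vram_fraction / gpu_bw
print(f"~{1 / (t_cpu + t_gpu):.1f} tokens/s (bandwidth-bound estimate)")
```

Under these assumptions the estimate lands around 5-6 t/s, the low end of the quoted range; faster system RAM or keeping more of the hot experts in VRAM pushes it higher.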