An idle 80GB A100 is a terrible thing to waste. So I spent an afternoon measuring what actually makes local inference fast. The answer surprised me less for what it was than for how big the gap turned out to be.
One A100-SXM4-80GB, sitting at 0% utilization and 36°C. The plan: run a clean throughput sweep across four open-weight GGUF models at several quantization levels, and see where the bottlenecks really are.
Tooling was deliberately boring, because boring is reproducible: