NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, serving them becomes an orchestration problem. Dynamo addresses it by coordinating model shards, routing requests, and transferring KV cache data between workers in a distributed system.
Key capabilities:
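- Disaggregated serving: Runs the prefill and decode phases on separate GPU workers so each phase can be scaled and tuned independently, improving GPU utilization
- KV-aware routing: Routes each request to the worker most likely to already hold its prefix in KV cache (the highest expected cache hit rate), reducing redundant prefill computation
- KV Block Manager: Offloads KV cache from GPU memory to CPU memory (G2), local SSD (G3), or remote storage (G4) for higher throughput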
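
To make the KV-aware routing idea concrete, here is a minimal sketch in plain Python. It is not Dynamo's API: `Worker`, `block_hashes`, `pick_worker`, the block size, and the `load_weight` penalty are all hypothetical names and values chosen for illustration. It assumes a router that scores each worker by how many leading KV blocks of the prompt it already caches, minus a simple load penalty.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block; illustrative value, not a Dynamo default


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block chained with the previous hash,
    so one hash identifies the entire prefix up to that block."""
    hashes, prev = [], 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prev = hash((prev, tuple(tokens[start:start + block_size])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:  # hypothetical stand-in for a decode/prefill worker
    name: str
    cached: set = field(default_factory=set)  # prefix-block hashes in KV cache
    active_requests: int = 0


def overlap_blocks(worker, hashes):
    """Count leading prompt blocks the worker already holds."""
    count = 0
    for h in hashes:
        if h not in worker.cached:
            break
        count += 1
    return count


def pick_worker(workers, tokens, load_weight=0.5):
    """Prefer high cache overlap, penalized by current load (assumed cost model)."""
    hashes = block_hashes(tokens)
    return max(
        workers,
        key=lambda w: overlap_blocks(w, hashes) - load_weight * w.active_requests,
    )


prompt = list(range(12))                                 # three full blocks
warm = Worker("warm", cached=set(block_hashes(prompt[:8])))  # holds two of them
cold = Worker("cold")
chosen = pick_worker([warm, cold], prompt)               # warm wins on overlap
```

The load penalty matters: without it, a router that only maximizes cache hits would keep sending traffic to one warm worker until it saturates. A real router would also need to track cache evictions and worker health, which this sketch ignores.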