
Ben Hamm (@BenHamm)

  • OctoML
  • Seattle, WA
@BenHamm
BenHamm / AIC_PREDICTION_MISMATCH_GIST.md
Last active December 1, 2025 23:24
AIConfigurator Prediction Mismatch: 7-8% vs 102-148% Disaggregated Serving Performance Gains

AIConfigurator Performance Prediction Mismatch

Summary

We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.

Key Finding: AIC predicts that disaggregated serving provides a 7-8% improvement, while the guide reports a 102-148% improvement - roughly a 13-21x gap in expected gains.
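As a quick sanity check on that multiplier, the ratio range between the guide's reported gains and AIC's predicted gains can be computed directly from the two figures quoted above (a minimal arithmetic sketch, nothing more):

```python
# Gains as percentages, taken from the summary above.
aic_low, aic_high = 7, 8          # AIC-predicted improvement (%)
guide_low, guide_high = 102, 148  # guide-reported improvement (%)

# Smallest and largest possible ratios between reported and predicted gains.
ratio_low = guide_low / aic_high    # 102 / 8 = 12.75
ratio_high = guide_high / aic_low   # 148 / 7 ≈ 21.1

print(f"gap: {ratio_low:.1f}x to {ratio_high:.1f}x")
```

Even the most favorable pairing of figures leaves more than an order of magnitude between prediction and report.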

Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).

@BenHamm
BenHamm / AIC_WALKTHROUGH_GUIDE.md
Last active December 19, 2025 17:02
AIConfigurator Walkthrough: Finding Optimal LLM Deployment Configurations

AIConfigurator: Fast-Track Your LLM Deployment on NVIDIA Dynamo

What is NVIDIA Dynamo?

NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, Dynamo solves the orchestration challenge of coordinating shards, routing requests, and transferring KV cache data across distributed systems.

Key capabilities:

  • Disaggregated serving — Separates prefill and decode phases for optimized GPU utilization
  • KV-aware routing — Routes requests to workers with the highest cache hit rate
  • KV Block Manager — Offloads KV cache to CPU, SSD, or remote memory (G2/G3/G4) for higher throughput
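The first bullet above is the core idea behind disaggregated serving: prefill is compute-bound (the whole prompt is processed at once), while decode is memory-bandwidth-bound (one token at a time), so running them on separately sized worker pools lets each pool be batched for its own bottleneck. A purely illustrative Python sketch of that split (this is not Dynamo's actual API; every name here is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: int
    kv_cache: list = field(default_factory=list)
    output: list = field(default_factory=list)

def prefill_worker(req: Request) -> Request:
    # Compute-bound phase: process the full prompt in one pass and
    # materialize the KV cache for every prompt token.
    req.kv_cache = [f"kv{i}" for i in range(req.prompt_tokens)]
    return req

def decode_worker(req: Request, max_new_tokens: int) -> Request:
    # Memory-bound phase: generate tokens one at a time, extending the
    # KV cache that was handed over from the prefill pool.
    for i in range(max_new_tokens):
        req.output.append(f"tok{i}")
        req.kv_cache.append(f"kv{req.prompt_tokens + i}")
    return req

# In a disaggregated deployment, the KV cache produced by a prefill
# worker is transferred (e.g. over NVLink/RDMA) to a decode worker.
req = decode_worker(prefill_worker(Request(prompt_tokens=4)), max_new_tokens=3)
print(len(req.kv_cache), len(req.output))
```

The payoff of the split is that prefill and decode pools can be scaled independently, which is exactly the knob the disaggregated-serving tuning guides discussed in these gists are turning.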
@BenHamm
BenHamm / BENCHMARK_RESULTS_GIST.md
Created December 17, 2025 23:35
Qwen3-32B Disaggregated Serving Benchmark Results - AIConfigurator vs Actual Performance

Qwen3-32B Disaggregated Serving Benchmark Results

Date: December 17, 2025
Model: Qwen/Qwen3-32B-FP8
Cluster: Nebius H200 (16 GPUs)
Framework: TensorRT-LLM via Dynamo


1. Cluster Configuration