You ran a single-shot throughput probe against a 120B model endpoint. Let’s decode it cleanly, then define what “good / better / best” actually means in this context.
- Prompt tokens: 33 → trivial input, not the bottleneck
- Completion tokens: 256 → fixed-size output sample
- Wall time: 14.78 s → end-to-end latency (includes overhead)
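- Throughput: 17.33 tokens/sec → end-to-end average (completion tokens / wall time), so it slightly understates steady-state generation speed
👉 Translation: Your box (gx10) is generating at least ~17 tokens per second. The rate once it's rolling is a bit higher, because the wall time also covers prompt processing and request overhead.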
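As a quick sanity check, the headline number can be reproduced from the other figures in the probe. This is only a sketch of the arithmetic; the variable names are illustrative, and the small difference from the reported 17.33 presumably comes from the wall time being rounded to two decimals.

```python
# Figures taken from the probe output above.
prompt_tokens = 33
completion_tokens = 256
wall_time_s = 14.78

# End-to-end rate: completion tokens divided by total wall time.
# Wall time includes prompt processing and request overhead,
# so this is a lower bound on the pure decode speed.
end_to_end_tps = completion_tokens / wall_time_s

# Average wall time spent per output token, in milliseconds.
ms_per_token = 1000 * wall_time_s / completion_tokens

print(f"{end_to_end_tps:.2f} tokens/sec, {ms_per_token:.1f} ms/token")
```

Separating decode speed from overhead needs a TTFT measurement, which this probe did not capture.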
What matters:
- Output tokens/sec → core generation throughput (GPU + model + quantization)
- Time-to-first-token (TTFT) → responsiveness (not shown here)
- Concurrency behavior → does it collapse under load?
What doesn't tell you much here:
- Prompt tokens (33 is too small to matter)
- Single-run wall time (no variance, no warm vs. cold separation)
These are realistic field ranges, not vendor fantasy numbers:
- < 10 tokens/sec
  - Symptoms:
    - Underpowered GPU or bad quantization
    - CPU offload choking
    - VRAM pressure / swapping
- 10–20 tokens/sec
  - This is where most self-hosted 70B–120B setups land
  - Your result (17.3 t/s) is here
- 20–35 tokens/sec
  - Requires:
    - High-end GPUs (A100/H100 class or well-optimized multi-GPU)
    - Efficient quantization (AWQ, GPTQ, etc.)
    - Clean pipeline (no I/O stalls)
- 35–60+ tokens/sec
  - Typically:
    - Multi-GPU sharding done right
    - Tensor parallelism tuned
    - Kernel-level optimizations
    - Minimal overhead stack
- 100+ tokens/sec
  - That’s:
    - Vendor infra (OpenAI, Anthropic, etc.)
    - Custom kernels + batching + speculative decoding
    - Not apples-to-apples with your setup
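If you want to tag benchmark runs automatically, the ranges above can be turned into a small lookup. This is a rough sketch, not a standard: the function name and `TIERS` table are made up for illustration, lower bounds are assumed to be inclusive, and anything between 60 and 100 tokens/sec is assumed to fall under the "35–60+" range since the list leaves that gap open.

```python
# Upper bound (exclusive) and label for each range listed above.
# Assumption: values from 60 up to 100 are folded into "35–60+".
TIERS = [
    (10.0, "< 10 tokens/sec"),
    (20.0, "10–20 tokens/sec"),
    (35.0, "20–35 tokens/sec"),
    (100.0, "35–60+ tokens/sec"),
]
TOP_TIER = "100+ tokens/sec"


def classify_throughput(tokens_per_sec: float) -> str:
    """Return the range label that a single-stream tokens/sec figure falls into."""
    if tokens_per_sec < 0:
        raise ValueError("tokens_per_sec must be non-negative")
    for upper_bound, label in TIERS:
        if tokens_per_sec < upper_bound:
            return label
    return TOP_TIER


print(classify_throughput(17.33))
```

Keep in mind the ranges are rough field observations for single-stream runs, so treat the label as a hint rather than a verdict.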
17.3 tokens/sec = solidly “good” (the 10–20 tokens/sec range)
Interpretation:
- Your system is healthy
- No obvious misconfiguration
- Likely not GPU-maxed, but also not broken
Based on that number alone:
- You’re probably:
  - Running quantized 120B (not full precision)
  - On mid-to-high tier GPU(s) but not fully saturated
- Bottlenecks could be:
  - Memory bandwidth
  - Interconnect (if multi-GPU)
  - Kernel efficiency (common with Qwen builds)
This is a smoke test, not a performance characterization.
You don’t yet know:
- TTFT (time to first token)
  - Could be 0.5s or 8s → huge UX difference
- Throughput under concurrency
  - 17 t/s single-user might become:
    - 9 t/s @ 2 users
    - 3 t/s @ 4 users
  - That’s where systems actually fail
- Token variance
  - Was this steady or bursty?
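Two of these unknowns, TTFT and token variance, fall out of the same measurement: timestamp every token as it arrives from a streaming response. Here's a minimal sketch of that idea. `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` is only a stand-in for your endpoint's streaming client, which isn't shown in the probe.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Consume a token stream and summarize its timing."""
    start = clock()
    # Arrival time of each token, relative to the start of the request.
    arrivals = [clock() - start for _ in tokens]
    if not arrivals:
        raise ValueError("stream produced no tokens")
    # Gaps between consecutive tokens; large or uneven gaps mean bursty output.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    decode_window = arrivals[-1] - arrivals[0]
    return {
        "ttft_s": arrivals[0],
        # Decode rate excludes the wait for the first token.
        "decode_tps": len(gaps) / decode_window if decode_window > 0 else 0.0,
        "gap_stdev_s": statistics.stdev(gaps) if len(gaps) > 1 else 0.0,
        "gap_max_s": max(gaps, default=0.0),
    }


def fake_stream(n: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming endpoint (assumption, not a real client)."""
    time.sleep(first_delay_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(n=20, first_delay_s=0.05, gap_s=0.005)))
```

To use it for real, swap `fake_stream` for whatever iterator your client yields per token or chunk. If the server streams multi-token chunks, the gap statistics describe chunks rather than individual tokens.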
High-impact levers:
- Move to AWQ / GPTQ tuned for your hardware
- Bad quantization can cost 30–50% throughput
- If supported → massive gains
- This is how vendors hit absurd speeds
- If GPU utilization is below 90% → you’re leaving performance on the table
- Often the hidden limiter on large models
Your number (17.3 t/s) is:
- Not slow
- Not elite
- Exactly where a sane 120B local deployment should be
If you told me this was broken, I’d push back. It isn’t.
If you want, the next step is to turn this into a proper benchmark harness:
- TTFT
- sustained throughput
- concurrency curve
- cost-per-token (the one that actually matters)
That’s where the real signal shows up.
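To give a feel for the last two items, here's a rough sketch of how a concurrency curve feeds into cost-per-token. The per-user rates are the illustrative "might become" figures from earlier, not measurements, and `HOURLY_COST_USD` is a placeholder assumption you'd replace with your own hardware and power cost. The function names are made up for this example.

```python
# Illustrative per-user rates from the hypothetical collapse scenario above
# (users -> tokens/sec seen by each user). These are NOT measured values.
per_user_tps = {1: 17.0, 2: 9.0, 4: 3.0}

# Placeholder assumption: what one hour of running the box costs you.
HOURLY_COST_USD = 1.00


def aggregate_tps(tps_per_user: float, users: int) -> float:
    """Total tokens/sec across all concurrent users."""
    return tps_per_user * users


def cost_per_million_tokens(hourly_cost: float, total_tps: float) -> float:
    """Cost of generating one million output tokens at a given aggregate rate."""
    if total_tps <= 0:
        raise ValueError("total_tps must be positive")
    tokens_per_hour = total_tps * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


for users, tps in per_user_tps.items():
    total = aggregate_tps(tps, users)
    cost = cost_per_million_tokens(HOURLY_COST_USD, total)
    print(f"{users} user(s): {total:.0f} t/s aggregate, ${cost:.2f} per 1M tokens")
```

The thing to watch is aggregate throughput: if it keeps rising as you add users, batching is working and cost per token drops; if it falls, as in the 4-user row of this made-up scenario, every extra user makes each token more expensive.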