bench.md

You ran a single-shot throughput probe against a 120B model endpoint. Let’s decode it cleanly, then define what “good / better / best” actually means in this context.

What you actually measured (no fluff)

Prompt tokens: 33 → trivial input, not the bottleneck
Completion tokens: 256 → fixed-size output sample
Wall time: 14.78 s → end-to-end latency (includes overhead)
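Throughput: 17.33 tokens/sec → 256 completion tokens ÷ 14.78 s, an end-to-end average (prefill and overhead included), so a floor on steady-state generation speed

👉 Translation: Your box (gx10) is generating ~17 tokens per second end to end, and at least that once it’s rolling.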
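
The reported throughput looks like completion tokens divided by wall time. Here is a quick check under that assumption (the function name is mine, not something from your probe script):

```python
# Figures copied from the probe output above.
PROMPT_TOKENS = 33
COMPLETION_TOKENS = 256
WALL_TIME_S = 14.78


def end_to_end_tps(completion_tokens: int, wall_time_s: float) -> float:
    """Average output tokens per second over the whole request.

    Wall time includes prefill and request overhead, so this average
    sits at or below the pure decode rate.
    """
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return completion_tokens / wall_time_s


tps = end_to_end_tps(COMPLETION_TOKENS, WALL_TIME_S)
print(f"{tps:.2f} tokens/sec end to end")
```

The result lands within rounding of the reported 17.33, which suggests the probe did not subtract time-to-first-token before dividing.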

What matters vs what doesn’t

What matters

Output tokens/sec → core generation throughput (GPU + model + quantization)
Time-to-first-token (TTFT) → responsiveness (not shown here)
Concurrency behavior → does it collapse under load?

What doesn’t (much here)

Prompt tokens (too small to matter)
Single-run wall time (no variance / no warm vs cold separation)

Ground truth ranges (120B-class models)

These are rough field ranges for single-stream generation, not vendor fantasy numbers:

🟥 Poor

< 10 tokens/sec
Symptoms:
- Underpowered GPU or bad quantization
- CPU offload choking
- VRAM pressure / swapping

🟨 Acceptable / “Good”

10–20 tokens/sec
This is where most self-hosted 70B–120B setups land
Your result (17.3 t/s) is here

🟩 Strong / “Better”

20–35 tokens/sec
Requires:
- High-end GPUs (A100/H100 class or well-optimized multi-GPU)
- Efficient quantization (AWQ, GPTQ, etc.)
- Clean pipeline (no I/O stalls)

🟦 Excellent / “Best-in-class self-hosted”

35–60+ tokens/sec
Typically:
- Multi-GPU sharding done right
- Tensor parallelism tuned
- Kernel-level optimizations
- Minimal overhead stack

🟪 “Unfair comparison” tier

100+ tokens/sec
That’s:
- Vendor infra (OpenAI, Anthropic, etc.)
- Custom kernels + batching + speculative decoding
- Not apples-to-apples with your setup

Where your result sits

17.3 tokens/sec = solidly “good”

Interpretation:

Your system is healthy
No obvious misconfiguration
Likely not GPU-maxed, but also not broken
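
If you want to bucket future runs the same way, here is a minimal sketch of the tiers above. Two things are my assumptions, not part of the ranges: lower bounds are inclusive, and the unlabelled 60–100 gap is folded into "best".

```python
def classify_tps(tps: float) -> str:
    """Map single-stream output tokens/sec to the tiers listed above.

    Assumptions: lower bounds are inclusive, and anything from 35 up to
    (but not including) 100 counts as "best", since the ranges leave the
    60-100 span unlabelled.
    """
    if tps < 0:
        raise ValueError("tps must be non-negative")
    if tps < 10:
        return "poor"
    if tps < 20:
        return "good"
    if tps < 35:
        return "better"
    if tps < 100:
        return "best"
    return "unfair comparison"


print(classify_tps(17.33))
```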

What this implies about gx10

Based on that number alone:

You’re probably:
- Running quantized 120B (not full precision)
- On mid-to-high tier GPU(s) but not fully saturated
Bottlenecks could be:
- Memory bandwidth
- Interconnect (if multi-GPU)
- Kernel efficiency (common with Qwen builds)
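
To sanity-check the memory-bandwidth suspicion, a common back-of-envelope bound says single-stream decoding can't go faster than memory bandwidth divided by the bytes of weights read per token. The sketch below uses purely hypothetical inputs (a dense 120B model at 4 bits per weight and 2000 GB/s of bandwidth). I don't know gx10's actual specs, so plug in your own.

```python
def decode_tps_ceiling(active_params: float,
                       bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode tokens/sec.

    Assumes every active weight is read from memory once per generated
    token and ignores KV-cache traffic and compute time, so measured
    numbers should land below this.
    """
    if min(active_params, bytes_per_param, bandwidth_gb_s) <= 0:
        raise ValueError("all inputs must be positive")
    weight_bytes = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes


# Hypothetical inputs, NOT measured values for gx10.
ceiling = decode_tps_ceiling(active_params=120e9,
                             bytes_per_param=0.5,
                             bandwidth_gb_s=2000.0)
print(f"ceiling ~ {ceiling:.1f} tokens/sec")
```

For a mixture-of-experts model, pass the parameters active per token rather than the total count. If your measured rate sits close to the ceiling for your real hardware, bandwidth is the likely limiter; if it sits far below, look at interconnect or kernels first.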

What’s missing (and more important than the number)

This is a smoke test, not a performance characterization.

You don’t yet know:

TTFT (time to first token)
- Could be 0.5s or 8s → huge UX difference
Throughput under concurrency
- 17 t/s single-user might become:
  - 9 t/s @ 2 users
  - 3 t/s @ 4 users
- That’s where systems actually fail
Token variance
- Was this steady or bursty?
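
TTFT and burstiness both fall out of per-token arrival timestamps, if your client streams the response. Here is a minimal sketch that assumes you have already recorded those timestamps. The function name, the dictionary keys, and the synthetic timings are illustrative, not measurements from your run.

```python
from statistics import mean, pstdev


def stream_metrics(request_start: float,
                   token_times: list[float]) -> dict[str, float]:
    """Summarize one streamed response from token arrival times (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    decode_time = token_times[-1] - token_times[0]
    if decode_time <= 0:
        raise ValueError("timestamps must be increasing")
    # Gap between each consecutive pair of tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_start,
        # Tokens after the first, divided by the time they took.
        "decode_tps": (len(token_times) - 1) / decode_time,
        "mean_gap_s": mean(gaps),
        # High spread relative to the mean gap means bursty output.
        "gap_stdev_s": pstdev(gaps),
    }


# Synthetic example: first token after 2 s, then one token every 50 ms.
synthetic = [2.0 + 0.05 * i for i in range(256)]
metrics = stream_metrics(0.0, synthetic)
print(metrics)
```

In this made-up stream the decode rate is about 20 t/s even though the end-to-end average is close to 17, which is why separating TTFT from decode speed matters. The same function can be run per stream when you fire several requests in parallel.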

If you want to move from “good” → “better”

High-impact levers:

1. Quantization tuning

Move to AWQ / GPTQ tuned for your hardware
Bad quantization can cost 30–50% throughput

2. Batch / speculative decoding

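Speculative decoding, if supported → can raise single-stream speed substantially
Batching → raises aggregate throughput across concurrent requests, not the speed of any one stream
This is how vendors hit absurd speeds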

3. GPU utilization check

If GPU utilization stays < 90% during generation → you’re likely leaving performance on the table

4. KV cache + memory layout

Often the hidden limiter on large models

Bottom line

Your number (17.3 t/s) is:
- Not slow
- Not elite
- Exactly where a sane 120B local deployment should be

If you told me this was broken, I’d push back. It isn’t.

If you want, the next step is to turn this into a proper benchmark harness:

TTFT
sustained throughput
concurrency curve
cost-per-token (the one that actually matters)

That’s where the real signal shows up.
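
As a taste of the cost-per-token piece, here is a minimal sketch. The hourly cost is a placeholder I made up, and the per-user rates are the illustrative concurrency figures from earlier in this note, not measurements.

```python
def cost_per_million_tokens(hourly_cost: float, aggregate_tps: float) -> float:
    """Cost of generating one million output tokens at a sustained rate."""
    if aggregate_tps <= 0:
        raise ValueError("aggregate_tps must be positive")
    tokens_per_hour = aggregate_tps * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


HOURLY_COST = 1.00  # hypothetical: power + amortized hardware, any currency

# Illustrative per-user tokens/sec at each concurrency level (not measured).
per_user_tps = {1: 17.0, 2: 9.0, 4: 3.0}

for users, tps in per_user_tps.items():
    aggregate = users * tps  # total tokens/sec across all streams
    cost = cost_per_million_tokens(HOURLY_COST, aggregate)
    print(f"{users} user(s): {aggregate:.0f} t/s aggregate, {cost:.2f} per 1M tokens")
```

Cost tracks aggregate throughput, not per-user speed, so the concurrency curve and the cost figure have to be measured together.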
